[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-35496: - Affects Version/s: 3.3.0 > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Yang Jie >Priority: Minor > > This issue aims to upgrade to Scala 2.13.7. > Scala 2.13.6 was released (https://github.com/scala/scala/releases/tag/v2.13.6). > However, we skip 2.13.6 because there is a breaking behavior change in 2.13.6 > that differs from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37172) Push down filters having both partitioning and non-partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-37172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437135#comment-17437135 ] Chungmin commented on SPARK-37172: -- I can work on this if the rationale seems okay. > Push down filters having both partitioning and non-partitioning columns > --- > > Key: SPARK-37172 > URL: https://issues.apache.org/jira/browse/SPARK-37172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chungmin >Priority: Major > > Currently, filters having both partitioning and non-partitioning columns are > lost during the creation of {{FileSourceScanExec}} and not pushed down to the > data source. However, theoretically and practically, there is no reason to > exclude such filters from {{dataFilters}}. For any partitioned source data > file, the values of partitioning columns are the same for all rows. They can > be stored physically (or reconstructed logically) along with statistics for > non-partitioning columns to allow more powerful data skipping. If a data > source doesn't know how to handle such filters, it can simply ignore such > filters. > Example: Suppose that there is a table {{MYTAB}} with two columns {{A}} and > {{B}}, partitioned by {{A}} (Hive partitioning). Currently, data skipping > cannot be applied to queries like {{select * from MYTAB where A < B + 7}} > because {{A < B + 7}} is included in neither {{partitionFilters}} nor > {{dataFilters}}. However, we could have included the filter in > {{dataFilters}} because data sources have no obligation to use > {{dataFilters}} and they could have ignored filters that they cannot use. > It's not obvious whether we can change the semantics of > {{FileSourceScanExec.dataFilters}} without breaking existing code. It is > passed to {{FileIndex.listFiles}} and > {{FileFormat.buildReaderWithPartitionValues}} and the contracts for the > methods are not clear enough. 
> If we should not change {{dataFilters}}, we might have to add a new member > variable to {{FileSourceScanExec}} (e.g. {{dataFiltersWithPartitionColumns}}) > and add an overload of {{listFiles}} to the {{FileIndex}} trait, which > defaults to the existing {{listFiles}} without using the filters. Both > {{dataFilters}} and {{dataFiltersWithPartitionColumns}} are optional; > implementations can ignore the filters if they can't utilize them.
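The classification the ticket proposes can be sketched in a few lines. This is an illustrative pure-Python stand-in, not Spark's actual `FileSourceScanExec` logic; the names `split_filters` and `referenced_columns`, and the `(text, columns)` tuple representation of a filter expression, are invented for the example.

```python
def referenced_columns(expr):
    """Columns referenced by a filter; here a filter is a (text, columns) tuple."""
    return set(expr[1])

def split_filters(filters, partition_cols):
    """Classify filters the way the ticket proposes.

    - partition_filters: reference partition columns only (used to prune partitions)
    - data_filters: everything else, INCLUDING mixed filters such as A < B + 7,
      which are currently dropped; sources that cannot use them may ignore them.
    """
    partition_filters, data_filters = [], []
    partition_set = set(partition_cols)
    for f in filters:
        cols = referenced_columns(f)
        if cols and cols <= partition_set:
            partition_filters.append(f)
        else:
            data_filters.append(f)
    return partition_filters, data_filters

# Table partitioned by A; the mixed filter A < B + 7 now lands in data_filters.
filters = [("A = 1", {"A"}), ("B > 10", {"B"}), ("A < B + 7", {"A", "B"})]
parts, data = split_filters(filters, ["A"])
```

Under this rule, `A < B + 7` reaches the data source, which can use the per-file partition-column values plus column statistics for data skipping, or simply ignore the filter.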
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437127#comment-17437127 ] frankli commented on SPARK-37051: - I know this SQL can work, but this behavior is different from MySQL and PostgreSQL. > The filter operator gets wrong results in char type > --- > > Key: SPARK-37051 > URL: https://issues.apache.org/jira/browse/SPARK-37051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.3.0 > Environment: Spark 3.1.2 > Scala 2.12 / Java 1.8 >Reporter: frankli >Priority: Critical > > When I try the following sample SQL on the TPCDS data, the filter operator > returns an empty row set (shown in the web UI). > _select * from item where i_category = 'Music' limit 100;_ > The table is in ORC format, and i_category is char(50) type. > Data is inserted by Hive and queried by Spark. > I guess that the char(50) type retains redundant blanks after the actual > word. > This affects the boolean value of "x.equals(Y)" and leads to wrong > results. > Luckily, the varchar type is OK. > > This bug can be reproduced in a few steps. > >>> desc t2_orc; > +--------+---------+-------+ > |col_name|data_type|comment| > +--------+---------+-------+ > |a |string |NULL | > |b |char(50) |NULL | > |c |int |NULL | > +--------+---------+-------+ > >>> select * from t2_orc where a='a'; > +-+-+-+ > |a|b|c| > +-+-+-+ > |a|b|1| > |a|b|2| > |a|b|3| > |a|b|4| > |a|b|5| > +-+-+-+ > >>> select * from t2_orc where b='b'; > +-+-+-+ > |a|b|c| > +-+-+-+ > +-+-+-+ > > By the way, Spark's tests should add more cases on the char type. 
> > == Physical Plan == > CollectLimit (3) > +- Filter (2) > +- Scan orc tpcds_bin_partitioned_orc_2.item (1) > (1) Scan orc tpcds_bin_partitioned_orc_2.item > Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Batched: false > Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item] > PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )] > ReadSchema: > struct > (2) Filter > Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Condition : (isnotnull(i_category#12) AND (i_category#12 = Music )) > (3) CollectLimit > Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Arguments: 100 >
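The suspected cause above (a CHAR(50) value stored with trailing blanks failing raw equality against the unpadded literal) can be simulated in plain Python. This is a model of the semantics, not Spark's or Hive's implementation; `char_store` is an invented name.

```python
CHAR_LEN = 50

def char_store(value, n=CHAR_LEN):
    """Model Hive writing into a CHAR(n) column: blank-pad to n characters."""
    return value.ljust(n)

stored = char_store("b")  # 'b' followed by 49 spaces

# Raw byte/character equality against the unpadded literal fails; this is
# the suspected cause of the empty result for `where b='b'`.
naive_match = (stored == "b")

# Correct CHAR comparison semantics ignore trailing blanks, e.g. by
# right-trimming both sides before comparing.
semantic_match = (stored.rstrip() == "b".rstrip())
```

Here `naive_match` is False while `semantic_match` is True, mirroring why the filter drops every row unless the reader either pads the literal or trims the column before comparison.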
[jira] [Commented] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
[ https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437126#comment-17437126 ] Apache Spark commented on SPARK-37193: -- User 'ekoifman' has created a pull request for this issue: https://github.com/apache/spark/pull/34464 > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer > joins > -- > > Key: SPARK-37193 > URL: https://issues.apache.org/jira/browse/SPARK-37193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Eugene Koifman >Priority: Major > > {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from > converting Sort merge join into a broadcast join because SMJ is faster when > the side that would be broadcast has a lot of empty partitions. > This makes sense for inner joins which can short circuit if one side is > empty. > For (left,right) outer join, the streaming side still has to be processed so > demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
[ https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37193: Assignee: Apache Spark > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer > joins > -- > > Key: SPARK-37193 > URL: https://issues.apache.org/jira/browse/SPARK-37193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Eugene Koifman >Assignee: Apache Spark >Priority: Major > > {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from > converting Sort merge join into a broadcast join because SMJ is faster when > the side that would be broadcast has a lot of empty partitions. > This makes sense for inner joins which can short circuit if one side is > empty. > For (left,right) outer join, the streaming side still has to be processed so > demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
[ https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37193: Assignee: (was: Apache Spark) > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer > joins > -- > > Key: SPARK-37193 > URL: https://issues.apache.org/jira/browse/SPARK-37193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Eugene Koifman >Priority: Major > > {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from > converting Sort merge join into a broadcast join because SMJ is faster when > the side that would be broadcast has a lot of empty partitions. > This makes sense for inner joins which can short circuit if one side is > empty. > For (left,right) outer join, the streaming side still has to be processed so > demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437122#comment-17437122 ] Yang Jie commented on SPARK-37051: -- Can you test {code:java} select * from t2_orc where b='b0'; {code} (one 'b' and 49 zeros, i.e. a 50-character literal matching char(50)) > The filter operator gets wrong results in char type > --- > > Key: SPARK-37051 > URL: https://issues.apache.org/jira/browse/SPARK-37051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.3.0 > Environment: Spark 3.1.2 > Scala 2.12 / Java 1.8 >Reporter: frankli >Priority: Critical > > When I try the following sample SQL on the TPCDS data, the filter operator > returns an empty row set (shown in the web UI). > _select * from item where i_category = 'Music' limit 100;_ > The table is in ORC format, and i_category is char(50) type. > Data is inserted by Hive and queried by Spark. > I guess that the char(50) type retains redundant blanks after the actual > word. > This affects the boolean value of "x.equals(Y)" and leads to wrong > results. > Luckily, the varchar type is OK. > > This bug can be reproduced in a few steps. > >>> desc t2_orc; > +--------+---------+-------+ > |col_name|data_type|comment| > +--------+---------+-------+ > |a |string |NULL | > |b |char(50) |NULL | > |c |int |NULL | > +--------+---------+-------+ > >>> select * from t2_orc where a='a'; > +-+-+-+ > |a|b|c| > +-+-+-+ > |a|b|1| > |a|b|2| > |a|b|3| > |a|b|4| > |a|b|5| > +-+-+-+ > >>> select * from t2_orc where b='b'; > +-+-+-+ > |a|b|c| > +-+-+-+ > +-+-+-+ > > By the way, Spark's tests should add more cases on the char type. 
[jira] [Created] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
Eugene Koifman created SPARK-37193: -- Summary: DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins Key: SPARK-37193 URL: https://issues.apache.org/jira/browse/SPARK-37193 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: Eugene Koifman {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from converting Sort merge join into a broadcast join because SMJ is faster when the side that would be broadcast has a lot of empty partitions. This makes sense for inner joins which can short circuit if one side is empty. For (left,right) outer join, the streaming side still has to be processed so demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
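The proposed rule can be sketched as follows. This is an illustrative pure-Python stand-in, not Spark's actual `DynamicJoinSelection` code; the set of short-circuiting join types and the 0.5 threshold are assumptions made for the example.

```python
# Join types that can skip work entirely when the broadcast side is empty
# (assumption for illustration; left_semi is included as another plausible case).
SHORT_CIRCUIT_JOIN_TYPES = {"inner", "left_semi"}

def should_demote_broadcast_hash_join(join_type, empty_partition_ratio,
                                      threshold=0.5):
    """Demote BHJ to SMJ only when an empty build side lets the join short-circuit.

    Many empty partitions favor sort-merge join for inner-like joins, but an
    outer join must scan the streamed side regardless, so demoting loses the
    broadcast-join advantage without gaining anything.
    """
    if join_type not in SHORT_CIRCUIT_JOIN_TYPES:
        return False  # e.g. left/right outer: keep the broadcast join
    return empty_partition_ratio > threshold

demote_inner = should_demote_broadcast_hash_join("inner", 0.8)
demote_left_outer = should_demote_broadcast_hash_join("left_outer", 0.8)
```

With 80% empty partitions, the sketch demotes the inner join but keeps the broadcast join for the left outer join, which is the behavior change the ticket argues for.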
[jira] [Commented] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437120#comment-17437120 ] Terry Kim commented on SPARK-37192: --- Yes, go for it! Thanks! > Migrate SHOW TBLPROPERTIES to use V2 command by default > --- > > Key: SPARK-37192 > URL: https://issues.apache.org/jira/browse/SPARK-37192 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW TBLPROPERTIES to use V2 command by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37124) Support RowToColumnarExec with Arrow
[ https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-37124: --- Description: This Jira aims to support the Arrow format in RowToColumnarExec. The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in Spark, so RowToColumnarExec does not support writing to the Arrow format so far. Since the Arrow API is now more stable, and a pandas UDF performs much better than a Python UDF, I am proposing to support RowToColumnarExec with Arrow. What I did in this PR is add a load API to ArrowColumnVector that loads an ArrowRecordBatch into ArrowColumnVector, which is then called inside RowToColumnarExec's doExecute. UTs are also added to test this new API and RowToColumnarExec with the Arrow format. was: This Jira aimed to add the Arrow format as an alternative for the ColumnVector solution. The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in Spark, and since the Arrow API is now more stable, and a pandas UDF performs much better than a Python UDF, I am proposing to fully support the Arrow format as an alternative to ColumnVector, just like the other two. What I did in this PR is create a new class in the same package as OnHeap/OffHeapColumnVector, extending WritableColumnVector to support all put APIs. UTs cover all data formats, testing writing to and reading from the column vector. I also added 3 UTs for loading from ArrowRecordBatch and allocateColumns. 
Summary: Support RowToColumnarExec with Arrow (was: Support Writable ArrowColumnarVector) > Support RowToColumnarExec with Arrow > > > Key: SPARK-37124 > URL: https://issues.apache.org/jira/browse/SPARK-37124 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chendi.Xue >Priority: Major > > This Jira aims to support the Arrow format in RowToColumnarExec. > The current ArrowColumnVector is not fully equivalent to > OnHeap/OffHeapColumnVector in Spark, so RowToColumnarExec does not support > writing to the Arrow format so far. > Since the Arrow API is now more stable, and a pandas UDF performs > much better than a Python UDF, > I am proposing to support RowToColumnarExec with Arrow. > What I did in this PR is add a load API to ArrowColumnVector that loads > an ArrowRecordBatch into ArrowColumnVector, which is then called inside > RowToColumnarExec's doExecute. > > UTs are also added to test this new API and RowToColumnarExec with the Arrow format.
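The core transformation RowToColumnarExec performs, pivoting row-oriented records into per-field column vectors that a columnar format like Arrow can then wrap, can be illustrated without Arrow itself. This is a pure-Python stand-in; `rows_to_columns` is an invented name, not Spark's or Arrow's API.

```python
def rows_to_columns(rows, schema):
    """Transpose row-oriented records into one value list ('column') per field.

    A columnar engine would back each list with a typed buffer (e.g. an Arrow
    array loaded from a RecordBatch); plain lists keep the sketch dependency-free.
    """
    columns = {name: [] for name in schema}
    for row in rows:
        for name, value in zip(schema, row):
            columns[name].append(value)
    return columns

rows = [(1, "a"), (2, "b"), (3, "c")]
cols = rows_to_columns(rows, ["id", "tag"])
```

Each output column holds the values of one field across all rows, which is the layout that lets vectorized readers and pandas UDFs process data in batches.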
[jira] [Commented] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437113#comment-17437113 ] PengLei commented on SPARK-37192: - [~imback82] [~wenchen] I want to try to fix it, okay? > Migrate SHOW TBLPROPERTIES to use V2 command by default > --- > > Key: SPARK-37192 > URL: https://issues.apache.org/jira/browse/SPARK-37192 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW TBLPROPERTIES to use V2 command by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default
PengLei created SPARK-37192: --- Summary: Migrate SHOW TBLPROPERTIES to use V2 command by default Key: SPARK-37192 URL: https://issues.apache.org/jira/browse/SPARK-37192 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: PengLei Fix For: 3.3.0 Migrate SHOW TBLPROPERTIES to use V2 command by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37159) Change HiveExternalCatalogVersionsSuite to be able to test with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37159. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34425 > Change HiveExternalCatalogVersionsSuite to be able to test with Java 17 > --- > > Key: SPARK-37159 > URL: https://issues.apache.org/jira/browse/SPARK-37159 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > SPARK-37105 seems to have fixed most of tests in `sql/hive` for Java 17 but > `HiveExternalCatalogVersionsSuite`. > {code} > [info] org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED > *** (42 seconds, 526 milliseconds) > [info] spark-submit returned with exit code 1. > [info] Command line: > '/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/test-spark-d86af275-0c40-4b47-9cab-defa92a5ffa7/spark-3.2.0/bin/spark-submit' > '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' > 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.hive.metastore.version=2.3' '--conf' > 'spark.sql.hive.metastore.jars=maven' '--conf' > 'spark.sql.warehouse.dir=/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/warehouse-69d9bdbc-54ce-443b-8677-a413663ddb62' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/warehouse-69d9bdbc-54ce-443b-8677-a413663ddb62' > > '/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/test15166225869206697603.py' > [info] > [info] 2021-10-28 06:07:18.486 - stderr> Using Spark's default 
log4j > profile: org/apache/spark/log4j-defaults.properties > [info] 2021-10-28 06:07:18.49 - stderr> 21/10/28 22:07:18 INFO > SparkContext: Running Spark version 3.2.0 > [info] 2021-10-28 06:07:18.537 - stderr> 21/10/28 22:07:18 WARN > NativeCodeLoader: Unable to load native-hadoop library for your platform... > using builtin-java classes where applicable > [info] 2021-10-28 06:07:18.616 - stderr> 21/10/28 22:07:18 INFO > ResourceUtils: == > [info] 2021-10-28 06:07:18.616 - stderr> 21/10/28 22:07:18 INFO > ResourceUtils: No custom resources configured for spark.driver. > [info] 2021-10-28 06:07:18.616 - stderr> 21/10/28 22:07:18 INFO > ResourceUtils: == > [info] 2021-10-28 06:07:18.617 - stderr> 21/10/28 22:07:18 INFO > SparkContext: Submitted application: prepare testing tables > [info] 2021-10-28 06:07:18.632 - stderr> 21/10/28 22:07:18 INFO > ResourceProfile: Default ResourceProfile created, executor resources: > Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: > memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: > 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) > [info] 2021-10-28 06:07:18.641 - stderr> 21/10/28 22:07:18 INFO > ResourceProfile: Limiting resource is cpu > [info] 2021-10-28 06:07:18.641 - stderr> 21/10/28 22:07:18 INFO > ResourceProfileManager: Added ResourceProfile id: 0 > [info] 2021-10-28 06:07:18.679 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing view acls to: kou > [info] 2021-10-28 06:07:18.679 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing modify acls to: kou > [info] 2021-10-28 06:07:18.68 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing view acls groups to: > [info] 2021-10-28 06:07:18.68 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing modify acls groups to: > [info] 2021-10-28 06:07:18.68 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: SecurityManager: authentication disabled; ui acls disabled; > users 
with view permissions: Set(kou); groups with view permissions: Set(); > users with modify permissions: Set(kou); groups with modify permissions: > Set() > [info] 2021-10-28 06:07:18.886 - stderr> 21/10/28 22:07:18 INFO Utils: > Successfully started service 'sparkDriver' on port 35867. > [info] 2021-10-28 06:07:18.906 - stderr> 21/10/28 22:07:18 INFO SparkEnv: > Registering MapOutputTracker > [info] 2021-10-28 06:07:18.93 - stderr> 21/10/28 22:07:18 INFO SparkEnv: > Registering BlockManagerMaster > [info] 2021-10-28 06:07:18.943 - stderr> 21/10/28
[jira] [Resolved] (SPARK-36554) Error message while trying to use spark sql functions directly on dataframe columns without using select expression
[ https://issues.apache.org/jira/browse/SPARK-36554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-36554. Fix Version/s: 3.3.0 Assignee: Nicolas Azrak Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34356 > Error message while trying to use spark sql functions directly on dataframe > columns without using select expression > --- > > Key: SPARK-36554 > URL: https://issues.apache.org/jira/browse/SPARK-36554 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples, PySpark >Affects Versions: 3.1.1 >Reporter: Lekshmi Ramachandran >Assignee: Nicolas Azrak >Priority: Minor > Labels: documentation, features, functions, spark-sql > Fix For: 3.3.0 > > Attachments: Screen Shot .png > > Original Estimate: 24h > Remaining Estimate: 24h > > The below code generates a dataframe successfully . Here make_date function > is used inside a select expression > > from pyspark.sql.functions import expr, make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select("*",expr("make_date(Y,M,D) as lk")).show() > > The below code fails with a message "cannot import name 'make_date' from > 'pyspark.sql.functions'" . Here the make_date function is directly called on > dataframe columns without select expression > > from pyspark.sql.functions import make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select(make_date(df.Y,df.M,df.D).alias("datefield")).show() > > The error message generated is misleading when it says "cannot import > make_date from pyspark.sql.functions" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437070#comment-17437070 ] Apache Spark commented on SPARK-37190: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34463 > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Improve error messages for casting under ANSI mode by making it more > actionable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37190: Assignee: (was: Apache Spark) > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Improve error messages for casting under ANSI mode by making it more > actionable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437068#comment-17437068 ] Apache Spark commented on SPARK-37190: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34463 > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Improve error messages for casting under ANSI mode by making them more > actionable.
[jira] [Assigned] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37190: Assignee: Apache Spark > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > Improve error messages for casting under ANSI mode by making them more > actionable.
[jira] [Commented] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437064#comment-17437064 ] Apache Spark commented on SPARK-37191: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/34462 > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
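The proposed relaxation can be modeled in plain Python. This is an illustrative sketch of the rule described in the issue (same scale: widen to the larger precision; different scale: still an error), not Spark's actual StructType.merge implementation:

```python
def merge_decimal_types(p1, s1, p2, s2):
    """Model the proposed decimal-merge rule (hypothetical, not Spark code).

    Same scale: widening to the larger precision is lossless, e.g.
    decimal(12,2) merged with decimal(17,2) yields decimal(17,2).
    Different scales: still rejected, since correctness would depend on
    the actual values being upcast.
    """
    if s1 == s2:
        return (max(p1, p2), s1)
    raise ValueError(
        f"Failed to merge decimal types with incompatible scale {s1} and {s2}"
    )
```

Under this rule, the repro above would succeed and produce decimal(17,2), while a merge of decimal(17,2) with decimal(12,3) would still fail.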
[jira] [Assigned] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37191: Assignee: Apache Spark > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
[jira] [Assigned] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37191: Assignee: (was: Apache Spark) > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
[jira] [Commented] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437061#comment-17437061 ] Ivan commented on SPARK-37191: -- This is somewhat related to https://issues.apache.org/jira/browse/SPARK-32317 although the issue is a bit different - they are trying to merge decimals of different scales. > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
[jira] [Updated] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan updated SPARK-37191: - Description: When merging DecimalTypes with different precision but the same scale, one would get the following error: {code:java} Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 17 and 12 at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) {code} We could allow merging DecimalType values with different precision if the scale is the same for both types since there should not be any data correctness issues as one of the types will be extended, for example, DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting when the scale is different - this would depend on the actual values. 
Repro code: {code:java} import org.apache.spark.sql.types._ val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) schema1.merge(schema2) {code} This also affects Parquet schema merge which is where this issue was discovered originally: {code:java} import java.math.BigDecimal import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet") spark.createDataFrame(data1, schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") // Reading the DataFrame fails spark.read.option("mergeSchema", "true").parquet("/tmp/decimal-test.parquet").show() >>> Failed merging schema: root |-- col: decimal(17,2) (nullable = true) Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 12 and 17 {code} was: When merging DecimalTypes with different precision but the same scale, one would get the following error: {code:java} Failed to merge fields 'col' and 'col'. 
Failed to merge decimal types with incompatible precision 17 and 12 at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) {code} We could allow merging DecimalType values with different precision if the scale is the same for both types since there should not be any data correctness issues as one of the types will be extended, for example, DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting when the scale is different - this would depend on the actual values. Repro code: {code:java} import org.apache.spark.sql.types._ val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) schema1.merge(schema2) {code} This also affects Parquet schema merge which is where this issue was discovered originally: {code:java} import java.math.BigDecimal import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet") spark.createDataFrame(data1,
[jira] [Updated] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan updated SPARK-37191: - Description: When merging DecimalTypes with different precision but the same scale, one would get the following error: {code:java} Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 17 and 12 at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) {code} We could allow merging DecimalType values with different precision if the scale is the same for both types since there should not be any data correctness issues as one of the types will be extended, for example, DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting when the scale is different - this would depend on the actual values. 
Repro code: {code:java} import org.apache.spark.sql.types._ val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) schema1.merge(schema2) {code} This also affects Parquet schema merge which is where this issue was discovered originally: {code:java} import java.math.BigDecimal import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet") spark.createDataFrame(data1, schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") // Reading the DataFrame fails spark.read.option("mergeSchema", "true").parquet("/mnt/ivan/decimal-test.parquet").show() >>> Failed merging schema: root |-- col: decimal(17,2) (nullable = true) Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 12 and 17 {code} > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. 
Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. > > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal
[jira] [Created] (SPARK-37191) Allow merging DecimalTypes with different precision values
Ivan created SPARK-37191: Summary: Allow merging DecimalTypes with different precision values Key: SPARK-37191 URL: https://issues.apache.org/jira/browse/SPARK-37191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.1, 3.1.0, 3.0.3 Reporter: Ivan Fix For: 3.3.0
[jira] [Created] (SPARK-37190) Improve error messages for casting under ANSI mode
Allison Wang created SPARK-37190: Summary: Improve error messages for casting under ANSI mode Key: SPARK-37190 URL: https://issues.apache.org/jira/browse/SPARK-37190 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Allison Wang Improve error messages for casting under ANSI mode by making them more actionable.
[jira] [Assigned] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37023: Assignee: Apache Spark > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Assignee: Apache Spark >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
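The fix described in the issue amounts to ignoring stale merge statuses when push-based shuffle is disabled for the dependency. A minimal model of that decision, as hypothetical Python, not the actual Scala logic in MapOutputTracker:

```python
def enable_batch_fetch(shuffle_merge_enabled, merge_statuses):
    """Sketch of the proposed batch-fetch decision (hypothetical model).

    On a stage retry, shuffleMergeEnabled can be false while the driver
    still holds merge statuses from the earlier attempt; those stale
    statuses should be ignored so batch fetch stays enabled.
    """
    if not shuffle_merge_enabled:
        return True  # ignore stale merge statuses from a previous attempt
    # Simplified stand-in: the real implementation also inspects which
    # blocks were actually merged before disabling batch fetch.
    return not any(merge_statuses)
```

The key branch is the first one: without it, the presence of any merge status disables batch fetch even when the dependency no longer uses push-based shuffle, which is the bug being reported.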
[jira] [Assigned] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37023: Assignee: (was: Apache Spark) > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Commented] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437042#comment-17437042 ] Apache Spark commented on SPARK-37023: -- User 'rmcyang' has created a pull request for this issue: https://github.com/apache/spark/pull/34461 > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Commented] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437041#comment-17437041 ] Apache Spark commented on SPARK-37023: -- User 'rmcyang' has created a pull request for this issue: https://github.com/apache/spark/pull/34461 > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Updated] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37189: -- Description: In pyspark.pandas, if you write a line like this {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- DeathsPer100k (<20)") {quote} it compiles and runs, but the plot does not respect the range; all the values are shown. The workaround is to create a new DataFrame that pre-selects just the rows you want, but the line above should work as well. was: In pyspark.pandas if you write a line like this {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") {quote} it compiles and runs, but the plot has no title. > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas, if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range; all the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but the line above should work as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
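The expected semantics of the `range` option can be seen with NumPy's `histogram`, which Matplotlib-backed `.plot.hist` ultimately relies on for binning. This sketch only illustrates the intended behavior, not pyspark.pandas internals.

```python
import numpy as np

values = np.array([1.0, 5.0, 12.0, 19.0, 25.0, 40.0])
# With range=(0, 20), values outside the range (25.0 and 40.0) are ignored,
# which is the behavior the report expects from DF.plot.hist(range=[0, 20]).
counts, edges = np.histogram(values, bins=4, range=(0, 20))
```

Only four of the six values land in a bin, and the bin edges span exactly [0, 20].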
[jira] [Created] (SPARK-37189) CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot
Chuck Connell created SPARK-37189: - Summary: CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot Key: SPARK-37189 URL: https://issues.apache.org/jira/browse/SPARK-37189 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In pyspark.pandas if you write a line like this {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") {quote} it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37189: -- Summary: pyspark.pandas histogram accepts the range option but does not use it (was: CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot) > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
Chuck Connell created SPARK-37188: - Summary: pyspark.pandas histogram accepts the title option but does not add a title to the plot Key: SPARK-37188 URL: https://issues.apache.org/jira/browse/SPARK-37188 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In pyspark.pandas if you write a line like this {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") {quote} it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37187) pyspark.pandas fails to create a histogram of one column from a large DataFrame
Chuck Connell created SPARK-37187: - Summary: pyspark.pandas fails to create a histogram of one column from a large DataFrame Key: SPARK-37187 URL: https://issues.apache.org/jira/browse/SPARK-37187 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell When trying to create a histogram from one column of a large DataFrame, pyspark.pandas fails. So this line {quote}DF.plot.hist(column="FullVaxPer100", bins=20) # there are many other columns {quote} yields this error {quote}cannot resolve 'least(min(EndDate), min(EndDeaths), min(`STATE-COUNTY`), min(StartDate), min(StartDeaths), min(POPESTIMATE2020), min(ST_ABBR), min(VaxStartDate), min(Series_Complete_Yes_Start), min(Administered_Dose1_Recip_Start), min(VaxEndDate), min(Series_Complete_Yes_End), min(Administered_Dose1_Recip_End), min(Deaths), min(Series_Complete_Yes_Mid), min(Administered_Dose1_Recip_Mid), min(FullVaxPer100), min(OnePlusVaxPer100), min(DeathsPer100k))' due to data type mismatch: The expressions should all have the same type, got LEAST(timestamp, bigint, string, timestamp, bigint, bigint, string, timestamp, bigint, bigint, timestamp, bigint, bigint, bigint, double, double, double, double, double).; {quote} The odd thing is that pyspark.pandas seems to be operating on all the columns when only one is needed. As a workaround, you can first create a one-column DataFrame that selects just the field you want, then make a histogram of that. But the command above should work also. I can supply the complete program and datasets that demonstrate the error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
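The workaround mentioned above, selecting the single column before plotting, can be sketched as follows. The toy frame and column names below merely mimic the report's mixed-type table and are not the actual dataset.

```python
import pandas as pd

# Toy frame mixing timestamps, strings, and numbers, like the report's table.
df = pd.DataFrame({
    "ST_ABBR": ["MA", "NY"],
    "EndDate": pd.to_datetime(["2021-10-01", "2021-10-01"]),
    "FullVaxPer100": [63.1, 70.4],
})
# Pre-select just the numeric column of interest before calling .plot.hist();
# min/max is then computed over a single type, so the mixed-type least()
# error described above cannot arise.
one_col = df[["FullVaxPer100"]]
```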
[jira] [Created] (SPARK-37186) pyspark.pandas should support tseries.offsets
Chuck Connell created SPARK-37186: - Summary: pyspark.pandas should support tseries.offsets Key: SPARK-37186 URL: https://issues.apache.org/jira/browse/SPARK-37186 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can use pandas.offsets to create a time delta. This allows a line like {quote}this_period_start = OVERALL_START_DATE + pd.offsets.Day(NN) {quote} But this does not work in pyspark.pandas. There are good workarounds, such as datetime.timedelta(days=NN), but pandas programmers would like to move code to pyspark without changing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
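The `datetime.timedelta` workaround mentioned above looks like this; `OVERALL_START_DATE` and `NN` are placeholder values standing in for the report's variables.

```python
from datetime import date, timedelta

OVERALL_START_DATE = date(2021, 1, 1)  # placeholder start date
NN = 30                                # placeholder day count
# Plain-Python equivalent of OVERALL_START_DATE + pd.offsets.Day(NN)
this_period_start = OVERALL_START_DATE + timedelta(days=NN)
```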
[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437009#comment-17437009 ] mathieu longtin commented on SPARK-37185: - Additional note: if there's a "group by" in the query, this is not an issue. > DataFrame.take() only uses one worker > - > > Key: SPARK-37185 > URL: https://issues.apache.org/jira/browse/SPARK-37185 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 > Environment: CentOS 7 >Reporter: mathieu longtin >Priority: Major > > Say you have a query: > {code:java} > >>> df = spark.sql("select * from mytable where x = 99"){code} > Now, out of billions of rows, there are only ten rows where x is 99. > If I do: > {code:java} > >>> df.limit(10).collect() > [Stage 1:> (0 + 1) / 1]{code} > It only uses one worker. This takes a really long time since one CPU is > reading the billions of rows. > However, if I do this: > {code:java} > >>> df.limit(10).rdd.collect() > [Stage 1:> (0 + 10) / 22]{code} > All the workers are running. > I think there's some optimization issue in DataFrame.take(...). > This did not use to be an issue, but I'm not sure whether it was working in 3.0 > or 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37185) DataFrame.take() only uses one worker
mathieu longtin created SPARK-37185: --- Summary: DataFrame.take() only uses one worker Key: SPARK-37185 URL: https://issues.apache.org/jira/browse/SPARK-37185 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.1 Environment: CentOS 7 Reporter: mathieu longtin Say you have a query: {code:java} >>> df = spark.sql("select * from mytable where x = 99"){code} Now, out of billions of rows, there are only ten rows where x is 99. If I do: {code:java} >>> df.limit(10).collect() [Stage 1:> (0 + 1) / 1]{code} It only uses one worker. This takes a really long time since one CPU is reading the billions of rows. However, if I do this: {code:java} >>> df.limit(10).rdd.collect() [Stage 1:> (0 + 10) / 22]{code} All the workers are running. I think there's some optimization issue in DataFrame.take(...). This did not use to be an issue, but I'm not sure whether it was working in 3.0 or 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
[ https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37182: -- Comment: was deleted (was: https://issues.apache.org/jira/browse/SPARK-36609) > pyspark.pandas.to_numeric() should support the errors option > > > Key: SPARK-37182 > URL: https://issues.apache.org/jira/browse/SPARK-37182 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say to_numeric(errors='coerce'). But the errors > option is not recognized by pyspark.pandas. > FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
[ https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell resolved SPARK-37182. --- Resolution: Duplicate https://issues.apache.org/jira/browse/SPARK-36609 > pyspark.pandas.to_numeric() should support the errors option > > > Key: SPARK-37182 > URL: https://issues.apache.org/jira/browse/SPARK-37182 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say to_numeric(errors='coerce'). But the errors > option is not recognized by pyspark.pandas. > FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
[ https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436992#comment-17436992 ] Chuck Connell commented on SPARK-37182: --- Duplicate of https://issues.apache.org/jira/browse/SPARK-36609 > pyspark.pandas.to_numeric() should support the errors option > > > Key: SPARK-37182 > URL: https://issues.apache.org/jira/browse/SPARK-37182 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say to_numeric(errors='coerce'). But the errors > option is not recognized by pyspark.pandas. > FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37184) pyspark.pandas should support DF["column"].str.split("some_suffix").str[0]
Chuck Connell created SPARK-37184: - Summary: pyspark.pandas should support DF["column"].str.split("some_suffix").str[0] Key: SPARK-37184 URL: https://issues.apache.org/jira/browse/SPARK-37184 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say {quote}DF["column"] = DF["column"].str.split("suffix").str[0] {quote} In order to strip off a suffix. With pyspark.pandas, this syntax does not work. You have to say something like {quote}DF["column"] = DF["column"].str.replace("suffix", '', 1) {quote} which works fine if the suffix only appears once at the end, but is not really the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
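The difference between the two idioms shows up whenever the suffix also occurs earlier in the string. Plain-Python strings behave the same way as the pandas `.str` accessors here:

```python
s = "abc_suffix_def_suffix"
# split(...)[0] keeps only the text before the FIRST occurrence of the suffix...
head = s.split("_suffix")[0]
# ...while replace(..., '', 1) removes the first occurrence but keeps the rest.
replaced = s.replace("_suffix", "", 1)
```

So the two forms agree only when the suffix appears exactly once, at the end, which is the limitation the report describes.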
[jira] [Created] (SPARK-37183) pyspark.pandas.DataFrame.map() should support .fillna()
Chuck Connell created SPARK-37183: - Summary: pyspark.pandas.DataFrame.map() should support .fillna() Key: SPARK-37183 URL: https://issues.apache.org/jira/browse/SPARK-37183 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say {quote}DF["new_column"] = DF["column"].map(some_map).fillna(DF["column"]) {quote} In order to use the existing value if the mapping key is not found. But this does not work in pyspark.pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
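In regular pandas the idiom behaves like a dictionary lookup with a fallback to the original value; a minimal reproduction of the expected behavior:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])
mapping = {"a": "A"}          # "b" and "c" have no mapping entry
# Unmapped values become NaN, then fillna restores the original values.
out = s.map(mapping).fillna(s)
```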
[jira] [Created] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
Chuck Connell created SPARK-37182: - Summary: pyspark.pandas.to_numeric() should support the errors option Key: SPARK-37182 URL: https://issues.apache.org/jira/browse/SPARK-37182 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say to_numeric(errors='coerce'). But the errors option is not recognized by pyspark.pandas. FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
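For reference, this is what `errors='coerce'` does in regular pandas: unparseable values become NaN instead of raising an exception.

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "oops"])
# "oops" cannot be parsed as a number; coerce turns it into NaN.
out = pd.to_numeric(raw, errors="coerce")
```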
[jira] [Created] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
Chuck Connell created SPARK-37181: - Summary: pyspark.pandas.read_csv() should support latin-1 encoding Key: SPARK-37181 URL: https://issues.apache.org/jira/browse/SPARK-37181 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding is not recognized in pyspark.pandas. You have to use Windows-1252 instead, which is almost the same but not identical. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
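The two encodings agree on most bytes but diverge in the 0x80-0x9F block, which is why they are "almost the same but not identical". A stdlib-only illustration:

```python
raw = b"\x93quoted\x94"  # Windows "smart quote" bytes
# cp1252 maps 0x93/0x94 to typographic quotes; latin-1 maps them to C1 controls.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")
```

Data containing bytes from that block decodes to different characters under the two encodings, so substituting Windows-1252 for latin-1 is not always safe.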
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436963#comment-17436963 ] Chao Sun commented on SPARK-37166: -- [~xkrogen] sure just linked. > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436959#comment-17436959 ] Erik Krogen commented on SPARK-37166: - [~csun] can you link the doc here? > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436898#comment-17436898 ] Wenchen Fan edited comment on SPARK-37034 at 11/1/21, 3:23 PM: --- This is a question, not a feature request. Please ask it in the dev-list instead of JIRA. A quick answer is: there is no plan to add vectorized execution into Spark in the near future (at least I haven't seen any proposal), but there are third-party libraries doing it via the columnar API in the Spark query plan. was (Author: cloud_fan): This is a question, not a feature request. Please ask it in the dev-list instead of JIRA. A quick answer is: there is no plan to add vectorized execution into Spark in the near future, but there are third-party libraries doing it via the columnar API in the Spark query plan. > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has supported vectorized reads for ORC and Parquet. What's the progress on > other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple > operators (string functions, math functions)? > Hive has supported vectorized execution since an early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is a replacement for Hive. I guess the reason Spark does > not support vectorized execution may be that the design or implementation difficulty > in Spark is greater. What's the main issue preventing Spark from supporting > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436898#comment-17436898 ] Wenchen Fan commented on SPARK-37034: - This is a question, not a feature request. Please ask it in the dev-list instead of JIRA. A quick answer is: there is no plan to add vectorized execution into Spark in the near future, but there are third-party libraries doing it via the columnar API in the Spark query plan. > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has supported vectorized reads for ORC and Parquet. What's the progress on > other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple > operators (string functions, math functions)? > Hive has supported vectorized execution since an early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is a replacement for Hive. I guess the reason Spark does > not support vectorized execution may be that the design or implementation difficulty > in Spark is greater. What's the main issue preventing Spark from supporting > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436896#comment-17436896 ] Apache Spark commented on SPARK-36566: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34460 > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36566: Assignee: Apache Spark > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36566: Assignee: (was: Apache Spark) > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436885#comment-17436885 ] Yikun Jiang edited comment on SPARK-36566 at 11/1/21, 3:11 PM: --- Yep, it's useful for me. Does it make sense if we also set driver label to app name? then we can find out all pods (driver/executor) list by using: {{k get pods -l spark.app.name=xxx}} was (Author: yikunkero): Yep, it's useful for me. Does it make sense if we also set driver label to app name? then we can find out all pods (driver/executor) list by using: {{k get pods -l spark-app}} > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436885#comment-17436885 ] Yikun Jiang commented on SPARK-36566: - Yep, it's useful for me. Does it make sense if we also set driver label to app name? then we can find out all pods (driver/executor) list by using: {{k get pods -l spark-app}} > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37180) PySpark.pandas should support __version__
[ https://issues.apache.org/jira/browse/SPARK-37180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37180: -- Description: In regular pandas you can say {quote}pd.__version__ {quote} to get the pandas version number. PySpark pandas should support the same. was: In regular pandas you can say {quote}{{pd.__version__ }}{quote} to get the pandas version number. PySpark pandas should support the same. > PySpark.pandas should support __version__ > - > > Key: SPARK-37180 > URL: https://issues.apache.org/jira/browse/SPARK-37180 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say > {quote}pd.__version__ > {quote} > to get the pandas version number. PySpark pandas should support the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
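In regular pandas this is just a module-level string attribute, which code often parses to branch on the installed version; the request above is for `pyspark.pandas` to expose the same attribute.

```python
import pandas as pd

# Regular pandas exposes its version as a plain string attribute.
version = pd.__version__
# A common use: extract the major version for feature checks.
major = int(version.split(".")[0])
```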
[jira] [Updated] (SPARK-37180) PySpark.pandas should support __version__
[ https://issues.apache.org/jira/browse/SPARK-37180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37180: -- Description: In regular pandas you can say {quote}{{pd.__version__ }}{quote} to get the pandas version number. PySpark pandas should support the same. was:In regular pandas you can say pd.__version__ to get the pandas version number. PySpark pandas should support the same. > PySpark.pandas should support __version__ > - > > Key: SPARK-37180 > URL: https://issues.apache.org/jira/browse/SPARK-37180 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say > {quote}{{pd.__version__ }}{quote} > to get the pandas version number. PySpark pandas should support the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37180) PySpark.pandas should support __version__
Chuck Connell created SPARK-37180: - Summary: PySpark.pandas should support __version__ Key: SPARK-37180 URL: https://issues.apache.org/jira/browse/SPARK-37180 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say pd.__version__ to get the pandas version number. PySpark pandas should support the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436861#comment-17436861 ] Apache Spark commented on SPARK-37179: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34459 > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow casting between Timestamp and Numeric types: > * As we did some data analysis, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math, e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straightforward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
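The Tableau-style expression quoted above is plain epoch arithmetic. A stdlib Python model of the two cast directions, illustrative only and not Spark's cast implementation:

```python
from datetime import datetime, timezone

seconds = 86400  # numeric value: one day after the Unix epoch
# "Cast(Numeric as Timestamp)": seconds since epoch -> timestamp (UTC here)
ts = datetime.fromtimestamp(seconds, tz=timezone.utc)
# "Cast(Timestamp as Numeric)": timestamp -> seconds since epoch
back = int(ts.timestamp())
```

The round trip is lossless for whole seconds, which is what DateTime math like `%2 * 86400` relies on.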
[jira] [Assigned] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37179: Assignee: Apache Spark (was: Gengliang Wang) > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436858#comment-17436858 ] Apache Spark commented on SPARK-37179: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34459 > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37179: Assignee: Gengliang Wang (was: Apache Spark) > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: We should allow the casting between Timestamp and Numeric types: * As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. * The Spark SQL connector for Tableau is using this feature for DateTime math. e.g. {code:java} CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS TIMESTAMP) {code} * In the current syntax, we specially allow Numeric <=> Boolean and String <=> Binary since they are straight forward and frequently used. I suggest we allow Timestamp <=> Numeric as well for better ANSI mode adoption. was: We should allow casting As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. 
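For context on what such a cast means: a numeric is treated as seconds since the Unix epoch in both directions, which is also what the Tableau expression above assumes when it multiplies days by 86400. A minimal round-trip sketch of those semantics in plain Python (not the Spark implementation):

```python
from datetime import datetime, timezone

def timestamp_to_long(ts: datetime) -> int:
    """Cast(Timestamp as Long): seconds since the Unix epoch, fraction truncated."""
    return int(ts.timestamp())

def long_to_timestamp(seconds: int) -> datetime:
    """Cast(Long as Timestamp): interpret the number as seconds since the epoch."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

ts = datetime(2021, 11, 1, tzinfo=timezone.utc)
n = timestamp_to_long(ts)          # 1635724800
assert long_to_timestamp(n) == ts  # round-trips exactly for whole seconds
```

Casting to a fractional numeric type would additionally preserve the sub-second part instead of truncating it.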
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-37179: -- Assignee: Gengliang Wang > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow casting > As we did some data science, we found that many Spark SQL users are actually > using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: We should allow casting As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. was:The casting between > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Priority: Major > > We should allow casting > As we did some data science, we found that many Spark SQL users are actually > using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: The casting between > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Priority: Major > > The casting between -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
Gengliang Wang created SPARK-37179: -- Summary: ANSI mode: Allow casting between Timestamp and Numeric Key: SPARK-37179 URL: https://issues.apache.org/jira/browse/SPARK-37179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37178) Add Target Encoding to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436844#comment-17436844 ] Apache Spark commented on SPARK-37178: -- User 'taosiyuan163' has created a pull request for this issue: https://github.com/apache/spark/pull/34457 > Add Target Encoding to ml.feature > - > > Key: SPARK-37178 > URL: https://issues.apache.org/jira/browse/SPARK-37178 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Simon Tao >Priority: Major > > Target Encoding is a mechanism of converting categorical features to > continuous features based on the posterior probability calculated from > values of the label (target) column. > > Target Encoding can help to improve the accuracy of machine learning algorithms > when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37178) Add Target Encoding to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37178: Assignee: (was: Apache Spark) > Add Target Encoding to ml.feature > - > > Key: SPARK-37178 > URL: https://issues.apache.org/jira/browse/SPARK-37178 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Simon Tao >Priority: Major > > Target Encoding is a mechanism of converting categorical features to > continuous features based on the posterior probability calculated from > values of the label (target) column. > > Target Encoding can help to improve the accuracy of machine learning algorithms > when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37178) Add Target Encoding to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37178: Assignee: Apache Spark > Add Target Encoding to ml.feature > - > > Key: SPARK-37178 > URL: https://issues.apache.org/jira/browse/SPARK-37178 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Simon Tao >Assignee: Apache Spark >Priority: Major > > Target Encoding is a mechanism of converting categorical features to > continuous features based on the posterior probability calculated from > values of the label (target) column. > > Target Encoding can help to improve the accuracy of machine learning algorithms > when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436825#comment-17436825 ] xiaoli commented on SPARK-37034: [~dongjoon] [~yumwang] [~cloud_fan] Sorry to ping you, but there have been no answers or comments on this question for a week. All of you have contributed to Spark's vectorized reads, so I guess you may know something about this question. Could you take a look at it? > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has supported vectorized reads for ORC and Parquet. What's the progress of > other vectorized execution, e.g. vectorized writes, joins, aggregations, simple > operators (string functions, math functions)? > Hive has supported vectorized execution since an early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is a replacement for Hive. I guess the reason why Spark does > not support vectorized execution may be that the difficulty of design or > implementation in Spark is greater. What's the main issue for Spark to support > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37178) Add Target Encoding to ml.feature
Simon Tao created SPARK-37178: - Summary: Add Target Encoding to ml.feature Key: SPARK-37178 URL: https://issues.apache.org/jira/browse/SPARK-37178 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.2.0 Reporter: Simon Tao Target Encoding is a mechanism of converting categorical features to continuous features based on the posterior probability calculated from values of the label (target) column. Target Encoding can help to improve the accuracy of machine learning algorithms when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
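The mechanism described above, replacing each category with a (smoothed) estimate of the label mean observed for that category, can be sketched in a few lines. The function name and the additive smoothing scheme here are illustrative assumptions, not the eventual ml.feature API:

```python
from collections import defaultdict

def target_encode(categories, labels, smoothing=1.0):
    """Replace each category with a smoothed per-category label mean:
    (sum_of_labels + smoothing * global_mean) / (count + smoothing).
    With smoothing=0 this is the plain posterior mean per category."""
    global_mean = sum(labels) / len(labels)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    enc = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
           for c in counts}
    return [enc[c] for c in categories]

cats = ["a", "a", "b", "b", "b"]
ys = [1, 1, 0, 0, 1]
encoded = target_encode(cats, ys, smoothing=0.0)
# with no smoothing these are the plain per-category label means
assert encoded[0] == 1.0 and abs(encoded[2] - 1 / 3) < 1e-9
```

The smoothing term is what makes high-cardinality columns safe to encode: rare categories are pulled toward the global label mean instead of memorizing their few training rows.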
[jira] [Created] (SPARK-37177) Support LONG argument to the Spark SQL LIMIT clause
Douglas Moore created SPARK-37177: - Summary: Support LONG argument to the Spark SQL LIMIT clause Key: SPARK-37177 URL: https://issues.apache.org/jira/browse/SPARK-37177 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Douglas Moore Big data sets may exceed INT max, so the LIMIT clause in Spark SQL should be expanded to accept LONG values. Thus a query like `SELECT * FROM <table> LIMIT 200` with a limit above INT max would run and not return an error. See docs: [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-limit.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
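The ceiling being hit is simply the signed 32-bit integer range; a quick illustration of the arithmetic behind the request (plain Python with an illustrative row count, not Spark code):

```python
INT_MAX = 2**31 - 1   # 2147483647: what a 32-bit LIMIT argument can express
LONG_MAX = 2**63 - 1  # what a 64-bit (LONG) LIMIT argument could express

row_count = 3_000_000_000  # e.g. a table with three billion rows
assert row_count > INT_MAX     # cannot be expressed as a 32-bit LIMIT today
assert row_count <= LONG_MAX   # a LONG argument would cover it comfortably
```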
[jira] [Assigned] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-37062: Assignee: Jungtaek Lim > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > The "rate" data source has been known to be used as a benchmark for streaming > queries. > While this helps to push the query to its limit (how many rows the query could > process per second), the rate data source doesn't provide a consistent number of rows > per batch in the stream, which makes two environments hard to compare. > For example, in many cases, you may want to compare the metrics of the > batches between test environments (like running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in the > next batch. > Also, when you test against streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministic), so that > you can plan how these input rows will be aggregated and how state rows will > be evicted, and craft the test query based on the plan. 
> The requirements for the new data source would be as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be configurable > as well (like a start timestamp & the amount of advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-37062. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34333 [https://github.com/apache/spark/pull/34333] > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.3.0 > > > The "rate" data source has been known to be used as a benchmark for streaming > queries. > While this helps to push the query to its limit (how many rows the query could > process per second), the rate data source doesn't provide a consistent number of rows > per batch in the stream, which makes two environments hard to compare. > For example, in many cases, you may want to compare the metrics of the > batches between test environments (like running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in the > next batch. > Also, when you test against streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministic), so that > you can plan how these input rows will be aggregated and how state rows will > be evicted, and craft the test query based on the plan. 
> The requirements for the new data source would be as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be configurable > as well (like a start timestamp & the amount of advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
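The requirement list above (a fixed row count per batch, plus an event-time timestamp derived from a configurable start and per-batch advance) can be sketched as a deterministic generator. Everything here, names and signature included, is an illustration of the requirements, not the actual Spark implementation:

```python
from datetime import datetime, timedelta, timezone

def batch_rows(batch_id, rows_per_batch, start, advance_per_batch):
    """Deterministically generate the rows of one micro-batch.

    Every run yields exactly `rows_per_batch` rows per batch, each carrying an
    event-time timestamp derived only from the batch id - no wall clock, so
    two runs of the same query see identical input distributions."""
    ts = start + batch_id * advance_per_batch        # same timestamp for the whole batch
    base = batch_id * rows_per_batch                 # values continue across batches
    return [(base + i, ts) for i in range(rows_per_batch)]

start = datetime(2021, 1, 1, tzinfo=timezone.utc)
b0 = batch_rows(0, 3, start, timedelta(seconds=1))
b1 = batch_rows(1, 3, start, timedelta(seconds=1))
assert [v for v, _ in b0] == [0, 1, 2] and [v for v, _ in b1] == [3, 4, 5]
assert b1[0][1] - b0[0][1] == timedelta(seconds=1)   # event time advances per batch
```

Because the output is a pure function of the batch id, metrics and state-eviction behavior become reproducible across test environments, which is exactly what the rate source cannot guarantee when a batch lags.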
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:48 AM: --- Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. It is sad that the magic committer does not work with dynamic partition overwrite, because it delivers amazing performance when writing loads of partitioned JSON data. was (Author: gumartinm): Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of JSON partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and Spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit; a failure of a task at any stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that the FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where the job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:35 AM: --- Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of JSON partitioned data. was (Author: gumartinm): Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:34 AM: --- Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. was (Author: gumartinm): Thank you ver much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin commented on SPARK-23977: Thank you ver much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. > * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. 
> The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
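The "no visible output until job commit" contract the issue relies on can be illustrated with a toy protocol. The class and its method names are hypothetical, and it mimics a rename-style v1 algorithm on a local filesystem, whereas the real S3A committers get the same semantics by completing multipart uploads at job commit instead of renaming:

```python
import os
import shutil
import tempfile

class StagingCommitProtocol:
    """Toy model of the commit contract: tasks write into private attempt
    directories, only job commit publishes data to the destination, and an
    abandoned task attempt leaves no visible trace."""

    def __init__(self, dest):
        self.dest = dest
        self.staging = tempfile.mkdtemp(prefix="job-staging-")
        self.committed = []

    def new_task_attempt(self, task_id, attempt):
        # Each attempt writes into its own private directory under staging.
        d = os.path.join(self.staging, f"task-{task_id}-attempt-{attempt}")
        os.makedirs(d)
        return d

    def commit_task(self, attempt_dir):
        # Record exactly one winning attempt per task; retries are harmless.
        self.committed.append(attempt_dir)

    def commit_job(self):
        # Only now does output become visible at the destination.
        os.makedirs(self.dest, exist_ok=True)
        for d in self.committed:
            for name in os.listdir(d):
                shutil.move(os.path.join(d, name), os.path.join(self.dest, name))
        shutil.rmtree(self.staging)  # uncommitted attempts vanish with staging
```

On HDFS the final moves are atomic renames, which is why the v1 algorithm is safe there; on S3 a "rename" is a copy-and-delete, which is both slow and non-atomic, the exact weakness the issue describes.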
[jira] [Assigned] (SPARK-36061) Create a PodGroup with user specified minimum resources required
[ https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36061: Assignee: (was: Apache Spark) > Create a PodGroup with user specified minimum resources required > > > Key: SPARK-36061 > URL: https://issues.apache.org/jira/browse/SPARK-36061 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major >
[jira] [Assigned] (SPARK-36061) Create a PodGroup with user specified minimum resources required
[ https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36061: Assignee: Apache Spark > Create a PodGroup with user specified minimum resources required > > > Key: SPARK-36061 > URL: https://issues.apache.org/jira/browse/SPARK-36061 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36061) Create a PodGroup with user specified minimum resources required
[ https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436725#comment-17436725 ] Apache Spark commented on SPARK-36061: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34456 > Create a PodGroup with user specified minimum resources required > > > Key: SPARK-36061 > URL: https://issues.apache.org/jira/browse/SPARK-36061 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major >
[jira] [Assigned] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37176: Assignee: (was: Apache Spark) > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Priority: Minor >
[jira] [Commented] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436675#comment-17436675 ] Apache Spark commented on SPARK-37176: -- User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/34455 > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Priority: Minor >
[jira] [Assigned] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37176: Assignee: Apache Spark > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
Xianjin YE created SPARK-37176: -- Summary: JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic Key: SPARK-37176 URL: https://issues.apache.org/jira/browse/SPARK-37176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.0.3 Reporter: Xianjin YE

JacksonParser's exception handling logic differs from that of org.apache.spark.sql.catalyst.json.JsonInferSchema#infer; the difference can be seen below:

{code:java}
// JacksonParser's parse
try {
  Utils.tryWithResource(createParser(factory, record)) { parser =>
    // a null first token is equivalent to testing for input.trim.isEmpty
    // but it works on any token stream and not just strings
    parser.nextToken() match {
      case null => None
      case _ => rootConverter.apply(parser) match {
        case null => throw QueryExecutionErrors.rootConverterReturnNullError()
        case rows => rows.toSeq
      }
    }
  }
} catch {
  case e: SparkUpgradeException => throw e
  case e @ (_: RuntimeException | _: JsonProcessingException | _: MalformedInputException) =>
    // JSON parser currently doesn't support partial results for corrupted records.
    // For such records, all fields other than the field configured by
    // `columnNameOfCorruptRecord` are set to `null`.
    throw BadRecordException(() => recordLiteral(record), () => None, e)
  case e: CharConversionException if options.encoding.isEmpty =>
    val msg =
      """JSON parser cannot handle a character in its input.
        |Specifying encoding as an input option explicitly might help to resolve the issue.
        |""".stripMargin + e.getMessage
    val wrappedCharException = new CharConversionException(msg)
    wrappedCharException.initCause(e)
    throw BadRecordException(() => recordLiteral(record), () => None, wrappedCharException)
  case PartialResultException(row, cause) =>
    throw BadRecordException(
      record = () => recordLiteral(record),
      partialResult = () => Some(row),
      cause)
}
{code}

vs.
{code:java}
// JsonInferSchema's infer logic
val mergedTypesFromPartitions = json.mapPartitions { iter =>
  val factory = options.buildJsonFactory()
  iter.flatMap { row =>
    try {
      Utils.tryWithResource(createParser(factory, row)) { parser =>
        parser.nextToken()
        Some(inferField(parser))
      }
    } catch {
      case e @ (_: RuntimeException | _: JsonProcessingException) =>
        parseMode match {
          case PermissiveMode =>
            Some(StructType(Seq(StructField(columnNameOfCorruptRecord, StringType))))
          case DropMalformedMode =>
            None
          case FailFastMode =>
            throw QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e)
        }
    }
  }.reduceOption(typeMerger).toIterator
}
{code}

They should have the same exception handling logic; otherwise the inconsistency may confuse users.
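One way to unify the two paths is to route both through a single shared handler. The self-contained sketch below uses simplified stand-in types (the ParseMode objects here mirror, but are not, Spark's actual org.apache.spark.sql.catalyst.util.ParseMode hierarchy):

```scala
// Simplified stand-ins for Spark's parse modes.
sealed trait ParseMode
case object PermissiveMode extends ParseMode
case object DropMalformedMode extends ParseMode
case object FailFastMode extends ParseMode

// One shared handler: both parse and schema inference would call this,
// so malformed records are treated identically in both code paths.
def withMalformedHandling[T](mode: ParseMode, corruptFallback: => Option[T])(body: => Option[T]): Option[T] =
  try body
  catch {
    case e: RuntimeException =>
      mode match {
        case PermissiveMode    => corruptFallback // keep a corrupt-record placeholder
        case DropMalformedMode => None            // silently drop the record
        case FailFastMode      => throw new IllegalArgumentException("Malformed record detected", e)
      }
  }
```

JacksonParser.parse and JsonInferSchema.infer would then differ only in what they supply as `corruptFallback` (a corrupt-record row vs. a corrupt-record schema).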
[jira] [Commented] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using Java 8 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436641#comment-17436641 ] Apache Spark commented on SPARK-37013: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/34454

> `select format_string('%0$s', 'Hello')` has different behavior when using
> Java 8 and Java 17
>
>                 Key: SPARK-37013
>                 URL: https://issues.apache.org/jira/browse/SPARK-37013
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>              Labels: release-notes
>             Fix For: 3.3.0
>
> {code:java}
> -- PostgreSQL throws ERROR: format specifies argument 0, but arguments are numbered from 1
> select format_string('%0$s', 'Hello');
> {code}
> Execute with Java 8
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct
> -- !query output
> Hello
> {code}
> Execute with Java 17
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct<>
> -- !query output
> java.util.IllegalFormatArgumentIndexException
> Illegal format argument index = 0
> {code}
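The divergence is easy to reproduce outside Spark, since format_string delegates to java.util.Formatter. This small sketch captures either outcome without assuming which JDK runs it:

```scala
// "%0$s" uses argument index 0. Per this issue, Java 8's Formatter tolerates
// it (yielding "Hello"), while newer JDKs reject it with
// java.util.IllegalFormatArgumentIndexException, a subtype of IllegalFormatException.
val outcome: Either[java.util.IllegalFormatException, String] =
  try Right("%0$s".format("Hello"))
  catch { case e: java.util.IllegalFormatException => Left(e) }

// On Java 8 this is Right("Hello"); on Java 17 it is Left(<exception>).
```

Spark's fix presumably normalizes this so the SQL function behaves the same regardless of the underlying JDK.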
[jira] [Assigned] (SPARK-37161) RowToColumnConverter support AnsiIntervalType
[ https://issues.apache.org/jira/browse/SPARK-37161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37161: Assignee: PengLei

> RowToColumnConverter support AnsiIntervalType
> --
>
>                 Key: SPARK-37161
>                 URL: https://issues.apache.org/jira/browse/SPARK-37161
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: PengLei
>            Assignee: PengLei
>            Priority: Major
>
> Currently, we have a RowToColumnConverter for all data types except AnsiIntervalType:
> {code:java}
> val core = dataType match {
>   case BinaryType => BinaryConverter
>   case BooleanType => BooleanConverter
>   case ByteType => ByteConverter
>   case ShortType => ShortConverter
>   case IntegerType | DateType => IntConverter
>   case FloatType => FloatConverter
>   case LongType | TimestampType => LongConverter
>   case DoubleType => DoubleConverter
>   case StringType => StringConverter
>   case CalendarIntervalType => CalendarConverter
>   case at: ArrayType => ArrayConverter(getConverterForType(at.elementType, at.containsNull))
>   case st: StructType => new StructConverter(st.fields.map(
>     (f) => getConverterForType(f.dataType, f.nullable)))
>   case dt: DecimalType => new DecimalConverter(dt)
>   case mt: MapType => MapConverter(getConverterForType(mt.keyType, nullable = false),
>     getConverterForType(mt.valueType, mt.valueContainsNull))
>   case unknown => throw QueryExecutionErrors.unsupportedDataTypeError(unknown.toString)
> }
> if (nullable) {
>   dataType match {
>     case CalendarIntervalType => new StructNullableTypeConverter(core)
>     case st: StructType => new StructNullableTypeConverter(core)
>     case _ => new BasicNullableTypeConverter(core)
>   }
> } else {
>   core
> }
> {code}
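Since ANSI year-month intervals are physically stored as an Int (months) and day-time intervals as a Long (microseconds), a fix can plausibly reuse the existing primitive converters. The following is a self-contained sketch with simplified stand-in types, not Spark's actual classes:

```scala
// Simplified stand-ins for Spark's DataType hierarchy (assumed names).
sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object YearMonthIntervalType extends DataType // physically an Int: number of months
case object DayTimeIntervalType extends DataType   // physically a Long: number of microseconds

// Route ANSI interval types to the same converters as their physical types,
// mirroring how DateType already reuses IntConverter and TimestampType LongConverter.
def converterFor(dt: DataType): String = dt match {
  case IntegerType | YearMonthIntervalType => "IntConverter"
  case LongType | DayTimeIntervalType      => "LongConverter"
}
```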
[jira] [Resolved] (SPARK-37161) RowToColumnConverter support AnsiIntervalType
[ https://issues.apache.org/jira/browse/SPARK-37161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37161. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34446 [https://github.com/apache/spark/pull/34446] > RowToColumnConverter support AnsiIntervalType > -- > > Key: SPARK-37161 > URL: https://issues.apache.org/jira/browse/SPARK-37161 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 >