[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7
[ https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-35496: - Affects Version/s: 3.3.0 > Upgrade Scala 2.13 to 2.13.7 > > > Key: SPARK-35496 > URL: https://issues.apache.org/jira/browse/SPARK-35496 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Yang Jie >Priority: Minor > > This issue aims to upgrade to Scala 2.13.7. > Scala 2.13.6 was released (https://github.com/scala/scala/releases/tag/v2.13.6). > However, we skip 2.13.6 because there is a breaking behavior change in 2.13.6 > that differs from both Scala 2.13.5 and Scala 3. > - https://github.com/scala/bug/issues/12403 > {code} > scala3-3.0.0:$ bin/scala > scala> Array.empty[Double].intersect(Array(0.0)) > val res0: Array[Double] = Array() > scala-2.13.6:$ bin/scala > Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292). > Type in expressions for evaluation. Or try :help. > scala> Array.empty[Double].intersect(Array(0.0)) > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D > ... 32 elided > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37172) Push down filters having both partitioning and non-partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-37172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437135#comment-17437135 ] Chungmin commented on SPARK-37172: -- I can work on this if the rationale seems okay. > Push down filters having both partitioning and non-partitioning columns > --- > > Key: SPARK-37172 > URL: https://issues.apache.org/jira/browse/SPARK-37172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chungmin >Priority: Major > > Currently, filters having both partitioning and non-partitioning columns are > lost during the creation of {{FileSourceScanExec}} and not pushed down to the > data source. However, theoretically and practically, there is no reason to > exclude such filters from {{dataFilters}}. For any partitioned source data > file, the values of partitioning columns are the same for all rows. They can > be stored physically (or reconstructed logically) along with statistics for > non-partitioning columns to allow more powerful data skipping. If a data > source doesn't know how to handle such filters, it can simply ignore such > filters. > Example: Suppose that there is a table {{MYTAB}} with two columns {{A}} and > {{B}}, partitioned by {{A}} (Hive partitioning). Currently, data skipping > cannot be applied to queries like {{select * from MYTAB where A < B + 7}} > because {{A < B + 7}} is included in neither {{partitionFilters}} nor > {{dataFilters}}. However, we could have included the filter in > {{dataFilters}} because data sources have no obligation to use > {{dataFilters}} and they could have ignored filters that they cannot use. > It's not obvious whether we can change the semantics of > {{FileSourceScanExec.dataFilters}} without breaking existing code. It is > passed to {{FileIndex.listFiles}} and > {{FileFormat.buildReaderWithPartitionValues}} and the contracts for the > methods are not clear enough. 
> If we should not change {{dataFilters}}, we might have to add a new member > variable to {{FileSourceScanExec}} (e.g. {{dataFiltersWithPartitionColumns}}) > and add an overload of {{listFiles}} to the {{FileIndex}} trait, which > defaults to the existing {{listFiles}} without using the filters. Both > {{dataFilters}} and {{dataFiltersWithPartitionColumns}} are optional; > implementations can ignore the filters if they can't utilize them.
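The classification the ticket proposes can be sketched in a few lines. This is an illustrative pure-Python stand-in, not Spark's actual `FileSourceScanExec` logic; the names `split_filters` and `referenced_columns`, and the `(text, columns)` tuple representation of a filter expression, are invented for the example.

```python
def referenced_columns(expr):
    """Columns referenced by a filter; here a filter is a (text, columns) tuple."""
    return set(expr[1])

def split_filters(filters, partition_cols):
    """Classify filters the way the ticket proposes.

    - partition_filters: reference partition columns only (used to prune partitions)
    - data_filters: everything else, INCLUDING mixed filters such as A < B + 7,
      which are currently dropped; sources that cannot use them may ignore them.
    """
    partition_filters, data_filters = [], []
    partition_set = set(partition_cols)
    for f in filters:
        cols = referenced_columns(f)
        if cols and cols <= partition_set:
            partition_filters.append(f)
        else:
            data_filters.append(f)
    return partition_filters, data_filters

# Table partitioned by A; the mixed filter A < B + 7 now lands in data_filters.
filters = [("A = 1", {"A"}), ("B > 10", {"B"}), ("A < B + 7", {"A", "B"})]
parts, data = split_filters(filters, ["A"])
```

Under this rule, `A < B + 7` reaches the data source, which can use the per-file partition-column values plus column statistics for data skipping, or simply ignore the filter.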
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437127#comment-17437127 ] frankli commented on SPARK-37051: - I know this SQL can work, but this behavior is different from MySQL and PostgreSQL. > The filter operator gets wrong results in char type > --- > > Key: SPARK-37051 > URL: https://issues.apache.org/jira/browse/SPARK-37051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.3.0 > Environment: Spark 3.1.2 > Scala 2.12 / Java 1.8 >Reporter: frankli >Priority: Critical > > When I try the following sample SQL on the TPCDS data, the filter operator > returns an empty row set (shown in the web UI). > _select * from item where i_category = 'Music' limit 100;_ > The table is in ORC format, and i_category is char(50) type. > Data is inserted by Hive and queried by Spark. > I guess that the char(50) type retains redundant blanks after the actual > word. > This affects the boolean value of "x.equals(Y)" and leads to wrong > results. > Luckily, the varchar type is OK. > > This bug can be reproduced in a few steps. > >>> desc t2_orc; > +--------+---------+-------+ > |col_name|data_type|comment| > +--------+---------+-------+ > |a |string |NULL | > |b |char(50) |NULL | > |c |int |NULL | > +--------+---------+-------+ > >>> select * from t2_orc where a='a'; > +-+-+-+ > |a|b|c| > +-+-+-+ > |a|b|1| > |a|b|2| > |a|b|3| > |a|b|4| > |a|b|5| > +-+-+-+ > >>> select * from t2_orc where b='b'; > +-+-+-+ > |a|b|c| > +-+-+-+ > +-+-+-+ > > By the way, Spark's tests should add more cases on the char type. 
> > == Physical Plan == > CollectLimit (3) > +- Filter (2) > +- Scan orc tpcds_bin_partitioned_orc_2.item (1) > (1) Scan orc tpcds_bin_partitioned_orc_2.item > Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Batched: false > Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item] > PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )] > ReadSchema: > struct > (2) Filter > Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Condition : (isnotnull(i_category#12) AND (i_category#12 = Music )) > (3) CollectLimit > Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Arguments: 100 >
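The suspected cause above (a CHAR(50) value stored with trailing blanks failing raw equality against the unpadded literal) can be simulated in plain Python. This is a model of the semantics, not Spark's or Hive's implementation; `char_store` is an invented name.

```python
CHAR_LEN = 50

def char_store(value, n=CHAR_LEN):
    """Model Hive writing into a CHAR(n) column: blank-pad to n characters."""
    return value.ljust(n)

stored = char_store("b")  # 'b' followed by 49 spaces

# Raw byte/character equality against the unpadded literal fails; this is
# the suspected cause of the empty result for `where b='b'`.
naive_match = (stored == "b")

# Correct CHAR comparison semantics ignore trailing blanks, e.g. by
# right-trimming both sides before comparing.
semantic_match = (stored.rstrip() == "b".rstrip())
```

Here `naive_match` is False while `semantic_match` is True, mirroring why the filter drops every row unless the reader either pads the literal or trims the column before comparison.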
[jira] [Commented] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
[ https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437126#comment-17437126 ] Apache Spark commented on SPARK-37193: -- User 'ekoifman' has created a pull request for this issue: https://github.com/apache/spark/pull/34464 > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer > joins > -- > > Key: SPARK-37193 > URL: https://issues.apache.org/jira/browse/SPARK-37193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Eugene Koifman >Priority: Major > > {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from > converting Sort merge join into a broadcast join because SMJ is faster when > the side that would be broadcast has a lot of empty partitions. > This makes sense for inner joins which can short circuit if one side is > empty. > For (left,right) outer join, the streaming side still has to be processed so > demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
[ https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37193: Assignee: Apache Spark > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer > joins > -- > > Key: SPARK-37193 > URL: https://issues.apache.org/jira/browse/SPARK-37193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Eugene Koifman >Assignee: Apache Spark >Priority: Major > > {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from > converting Sort merge join into a broadcast join because SMJ is faster when > the side that would be broadcast has a lot of empty partitions. > This makes sense for inner joins which can short circuit if one side is > empty. > For (left,right) outer join, the streaming side still has to be processed so > demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
[ https://issues.apache.org/jira/browse/SPARK-37193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37193: Assignee: (was: Apache Spark) > DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer > joins > -- > > Key: SPARK-37193 > URL: https://issues.apache.org/jira/browse/SPARK-37193 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Eugene Koifman >Priority: Major > > {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from > converting Sort merge join into a broadcast join because SMJ is faster when > the side that would be broadcast has a lot of empty partitions. > This makes sense for inner joins which can short circuit if one side is > empty. > For (left,right) outer join, the streaming side still has to be processed so > demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437122#comment-17437122 ] Yang Jie commented on SPARK-37051: -- Can you test {code:java} select * from t2_orc where b='b0'; {code} (one 'b' and 49 zeros, i.e. a 50-character literal matching char(50)) > The filter operator gets wrong results in char type > --- > > Key: SPARK-37051 > URL: https://issues.apache.org/jira/browse/SPARK-37051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.3.0 > Environment: Spark 3.1.2 > Scala 2.12 / Java 1.8 >Reporter: frankli >Priority: Critical > > When I try the following sample SQL on the TPCDS data, the filter operator > returns an empty row set (shown in the web UI). > _select * from item where i_category = 'Music' limit 100;_ > The table is in ORC format, and i_category is char(50) type. > Data is inserted by Hive and queried by Spark. > I guess that the char(50) type retains redundant blanks after the actual > word. > This affects the boolean value of "x.equals(Y)" and leads to wrong > results. > Luckily, the varchar type is OK. > > This bug can be reproduced in a few steps. > >>> desc t2_orc; > +--------+---------+-------+ > |col_name|data_type|comment| > +--------+---------+-------+ > |a |string |NULL | > |b |char(50) |NULL | > |c |int |NULL | > +--------+---------+-------+ > >>> select * from t2_orc where a='a'; > +-+-+-+ > |a|b|c| > +-+-+-+ > |a|b|1| > |a|b|2| > |a|b|3| > |a|b|4| > |a|b|5| > +-+-+-+ > >>> select * from t2_orc where b='b'; > +-+-+-+ > |a|b|c| > +-+-+-+ > +-+-+-+ > > By the way, Spark's tests should add more cases on the char type. 
[jira] [Created] (SPARK-37193) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins
Eugene Koifman created SPARK-37193: -- Summary: DynamicJoinSelection.shouldDemoteBroadcastHashJoin should not apply to outer joins Key: SPARK-37193 URL: https://issues.apache.org/jira/browse/SPARK-37193 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: Eugene Koifman {{DynamicJoinSelection.shouldDemoteBroadcastHashJoin}} will prevent AQE from converting Sort merge join into a broadcast join because SMJ is faster when the side that would be broadcast has a lot of empty partitions. This makes sense for inner joins which can short circuit if one side is empty. For (left,right) outer join, the streaming side still has to be processed so demoting broadcast join doesn't have the same advantage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
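The proposed rule can be sketched as follows. This is an illustrative pure-Python stand-in, not Spark's actual `DynamicJoinSelection` code; the set of short-circuiting join types and the 0.5 threshold are assumptions made for the example.

```python
# Join types that can skip work entirely when the broadcast side is empty
# (assumption for illustration; left_semi is included as another plausible case).
SHORT_CIRCUIT_JOIN_TYPES = {"inner", "left_semi"}

def should_demote_broadcast_hash_join(join_type, empty_partition_ratio,
                                      threshold=0.5):
    """Demote BHJ to SMJ only when an empty build side lets the join short-circuit.

    Many empty partitions favor sort-merge join for inner-like joins, but an
    outer join must scan the streamed side regardless, so demoting loses the
    broadcast-join advantage without gaining anything.
    """
    if join_type not in SHORT_CIRCUIT_JOIN_TYPES:
        return False  # e.g. left/right outer: keep the broadcast join
    return empty_partition_ratio > threshold

demote_inner = should_demote_broadcast_hash_join("inner", 0.8)
demote_left_outer = should_demote_broadcast_hash_join("left_outer", 0.8)
```

With 80% empty partitions, the sketch demotes the inner join but keeps the broadcast join for the left outer join, which is the behavior change the ticket argues for.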
[jira] [Commented] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437120#comment-17437120 ] Terry Kim commented on SPARK-37192: --- Yes, go for it! Thanks! > Migrate SHOW TBLPROPERTIES to use V2 command by default > --- > > Key: SPARK-37192 > URL: https://issues.apache.org/jira/browse/SPARK-37192 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW TBLPROPERTIES to use V2 command by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37124) Support RowToColumnarExec with Arrow
[ https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-37124: --- Description: This Jira aims to support the Arrow format in RowToColumnarExec. The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in Spark, so RowToColumnarExec does not support writing to the Arrow format so far. Since the Arrow API is now more stable, and a pandas UDF performs much better than a Python UDF, I am proposing to support RowToColumnarExec with Arrow. What I did in this PR is add a load API to ArrowColumnVector that loads an ArrowRecordBatch into ArrowColumnVector, which is then called inside RowToColumnarExec's doExecute. UTs are also added to test this new API and RowToColumnarExec with the Arrow format. was: This Jira aimed to add the Arrow format as an alternative for the ColumnVector solution. The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in Spark, and since the Arrow API is now more stable, and a pandas UDF performs much better than a Python UDF, I am proposing to fully support the Arrow format as an alternative to ColumnVector, just like the other two. What I did in this PR is create a new class in the same package as OnHeap/OffHeapColumnVector, extending WritableColumnVector to support all put APIs. UTs cover all data formats, testing writing to and reading from the column vector. I also added 3 UTs for loading from ArrowRecordBatch and allocateColumns. 
Summary: Support RowToColumnarExec with Arrow (was: Support Writable ArrowColumnarVector) > Support RowToColumnarExec with Arrow > > > Key: SPARK-37124 > URL: https://issues.apache.org/jira/browse/SPARK-37124 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chendi.Xue >Priority: Major > > This Jira aims to support the Arrow format in RowToColumnarExec. > The current ArrowColumnVector is not fully equivalent to > OnHeap/OffHeapColumnVector in Spark, so RowToColumnarExec does not support > writing to the Arrow format so far. > Since the Arrow API is now more stable, and a pandas UDF performs > much better than a Python UDF, > I am proposing to support RowToColumnarExec with Arrow. > What I did in this PR is add a load API to ArrowColumnVector that loads > an ArrowRecordBatch into ArrowColumnVector, which is then called inside > RowToColumnarExec's doExecute. > > UTs are also added to test this new API and RowToColumnarExec with the Arrow format.
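The core transformation RowToColumnarExec performs, pivoting row-oriented records into per-field column vectors that a columnar format like Arrow can then wrap, can be illustrated without Arrow itself. This is a pure-Python stand-in; `rows_to_columns` is an invented name, not Spark's or Arrow's API.

```python
def rows_to_columns(rows, schema):
    """Transpose row-oriented records into one value list ('column') per field.

    A columnar engine would back each list with a typed buffer (e.g. an Arrow
    array loaded from a RecordBatch); plain lists keep the sketch dependency-free.
    """
    columns = {name: [] for name in schema}
    for row in rows:
        for name, value in zip(schema, row):
            columns[name].append(value)
    return columns

rows = [(1, "a"), (2, "b"), (3, "c")]
cols = rows_to_columns(rows, ["id", "tag"])
```

Each output column holds the values of one field across all rows, which is the layout that lets vectorized readers and pandas UDFs process data in batches.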
[jira] [Commented] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437113#comment-17437113 ] PengLei commented on SPARK-37192: - [~imback82] [~wenchen] I want to try to fix it, okay? > Migrate SHOW TBLPROPERTIES to use V2 command by default > --- > > Key: SPARK-37192 > URL: https://issues.apache.org/jira/browse/SPARK-37192 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > Migrate SHOW TBLPROPERTIES to use V2 command by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37192) Migrate SHOW TBLPROPERTIES to use V2 command by default
PengLei created SPARK-37192: --- Summary: Migrate SHOW TBLPROPERTIES to use V2 command by default Key: SPARK-37192 URL: https://issues.apache.org/jira/browse/SPARK-37192 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: PengLei Fix For: 3.3.0 Migrate SHOW TBLPROPERTIES to use V2 command by default -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37159) Change HiveExternalCatalogVersionsSuite to be able to test with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37159. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34425 > Change HiveExternalCatalogVersionsSuite to be able to test with Java 17 > --- > > Key: SPARK-37159 > URL: https://issues.apache.org/jira/browse/SPARK-37159 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > SPARK-37105 seems to have fixed most of tests in `sql/hive` for Java 17 but > `HiveExternalCatalogVersionsSuite`. > {code} > [info] org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED > *** (42 seconds, 526 milliseconds) > [info] spark-submit returned with exit code 1. > [info] Command line: > '/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/test-spark-d86af275-0c40-4b47-9cab-defa92a5ffa7/spark-3.2.0/bin/spark-submit' > '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' > 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.hive.metastore.version=2.3' '--conf' > 'spark.sql.hive.metastore.jars=maven' '--conf' > 'spark.sql.warehouse.dir=/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/warehouse-69d9bdbc-54ce-443b-8677-a413663ddb62' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/warehouse-69d9bdbc-54ce-443b-8677-a413663ddb62' > > '/home/kou/work/oss/spark-java17/sql/hive/target/tmp/org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite/test15166225869206697603.py' > [info] > [info] 2021-10-28 06:07:18.486 - stderr> Using Spark's default 
log4j > profile: org/apache/spark/log4j-defaults.properties > [info] 2021-10-28 06:07:18.49 - stderr> 21/10/28 22:07:18 INFO > SparkContext: Running Spark version 3.2.0 > [info] 2021-10-28 06:07:18.537 - stderr> 21/10/28 22:07:18 WARN > NativeCodeLoader: Unable to load native-hadoop library for your platform... > using builtin-java classes where applicable > [info] 2021-10-28 06:07:18.616 - stderr> 21/10/28 22:07:18 INFO > ResourceUtils: == > [info] 2021-10-28 06:07:18.616 - stderr> 21/10/28 22:07:18 INFO > ResourceUtils: No custom resources configured for spark.driver. > [info] 2021-10-28 06:07:18.616 - stderr> 21/10/28 22:07:18 INFO > ResourceUtils: == > [info] 2021-10-28 06:07:18.617 - stderr> 21/10/28 22:07:18 INFO > SparkContext: Submitted application: prepare testing tables > [info] 2021-10-28 06:07:18.632 - stderr> 21/10/28 22:07:18 INFO > ResourceProfile: Default ResourceProfile created, executor resources: > Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: > memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: > 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) > [info] 2021-10-28 06:07:18.641 - stderr> 21/10/28 22:07:18 INFO > ResourceProfile: Limiting resource is cpu > [info] 2021-10-28 06:07:18.641 - stderr> 21/10/28 22:07:18 INFO > ResourceProfileManager: Added ResourceProfile id: 0 > [info] 2021-10-28 06:07:18.679 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing view acls to: kou > [info] 2021-10-28 06:07:18.679 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing modify acls to: kou > [info] 2021-10-28 06:07:18.68 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing view acls groups to: > [info] 2021-10-28 06:07:18.68 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: Changing modify acls groups to: > [info] 2021-10-28 06:07:18.68 - stderr> 21/10/28 22:07:18 INFO > SecurityManager: SecurityManager: authentication disabled; ui acls disabled; > users 
with view permissions: Set(kou); groups with view permissions: Set(); > users with modify permissions: Set(kou); groups with modify permissions: > Set() > [info] 2021-10-28 06:07:18.886 - stderr> 21/10/28 22:07:18 INFO Utils: > Successfully started service 'sparkDriver' on port 35867. > [info] 2021-10-28 06:07:18.906 - stderr> 21/10/28 22:07:18 INFO SparkEnv: > Registering MapOutputTracker > [info] 2021-10-28 06:07:18.93 - stderr> 21/10/28 22:07:18 INFO SparkEnv: > Registering BlockManagerMaster > [info] 2021-10-28 06:07:18.943 - stderr> 21/10/28
[jira] [Resolved] (SPARK-36554) Error message while trying to use spark sql functions directly on dataframe columns without using select expression
[ https://issues.apache.org/jira/browse/SPARK-36554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-36554. Fix Version/s: 3.3.0 Assignee: Nicolas Azrak Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34356 > Error message while trying to use spark sql functions directly on dataframe > columns without using select expression > --- > > Key: SPARK-36554 > URL: https://issues.apache.org/jira/browse/SPARK-36554 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples, PySpark >Affects Versions: 3.1.1 >Reporter: Lekshmi Ramachandran >Assignee: Nicolas Azrak >Priority: Minor > Labels: documentation, features, functions, spark-sql > Fix For: 3.3.0 > > Attachments: Screen Shot .png > > Original Estimate: 24h > Remaining Estimate: 24h > > The below code generates a dataframe successfully . Here make_date function > is used inside a select expression > > from pyspark.sql.functions import expr, make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select("*",expr("make_date(Y,M,D) as lk")).show() > > The below code fails with a message "cannot import name 'make_date' from > 'pyspark.sql.functions'" . Here the make_date function is directly called on > dataframe columns without select expression > > from pyspark.sql.functions import make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select(make_date(df.Y,df.M,df.D).alias("datefield")).show() > > The error message generated is misleading when it says "cannot import > make_date from pyspark.sql.functions" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437070#comment-17437070 ] Apache Spark commented on SPARK-37190: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34463 > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Improve error messages for casting under ANSI mode by making it more > actionable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37190: Assignee: (was: Apache Spark) > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Improve error messages for casting under ANSI mode by making it more > actionable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437068#comment-17437068 ] Apache Spark commented on SPARK-37190: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/34463 > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Improve error messages for casting under ANSI mode by making them more > actionable.
[jira] [Assigned] (SPARK-37190) Improve error messages for casting under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-37190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37190: Assignee: Apache Spark > Improve error messages for casting under ANSI mode > -- > > Key: SPARK-37190 > URL: https://issues.apache.org/jira/browse/SPARK-37190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > Improve error messages for casting under ANSI mode by making them more > actionable.
[jira] [Commented] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437064#comment-17437064 ] Apache Spark commented on SPARK-37191: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/34462 > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
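The proposed relaxation can be modeled in plain Python. This is an illustrative sketch of the rule described in the issue (same scale: widen to the larger precision; different scale: still an error), not Spark's actual StructType.merge implementation:

```python
def merge_decimal_types(p1, s1, p2, s2):
    """Model the proposed decimal-merge rule (hypothetical, not Spark code).

    Same scale: widening to the larger precision is lossless, e.g.
    decimal(12,2) merged with decimal(17,2) yields decimal(17,2).
    Different scales: still rejected, since correctness would depend on
    the actual values being upcast.
    """
    if s1 == s2:
        return (max(p1, p2), s1)
    raise ValueError(
        f"Failed to merge decimal types with incompatible scale {s1} and {s2}"
    )
```

Under this rule, the repro above would succeed and produce decimal(17,2), while a merge of decimal(17,2) with decimal(12,3) would still fail.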
[jira] [Assigned] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37191: Assignee: Apache Spark > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
[jira] [Assigned] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37191: Assignee: (was: Apache Spark) > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
[jira] [Commented] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437061#comment-17437061 ] Ivan commented on SPARK-37191: -- This is somewhat related to https://issues.apache.org/jira/browse/SPARK-32317 although the issue is a bit different - they are trying to merge decimals of different scales. > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. 
> > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > spark.createDataFrame(data2, > schema2).write.parquet("/tmp/decimal-test.parquet") > spark.createDataFrame(data1, > schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") > // Reading the DataFrame fails > spark.read.option("mergeSchema", > "true").parquet("/tmp/decimal-test.parquet").show() > >>> > Failed merging schema: > root > |-- col: decimal(17,2) (nullable = true) > Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal > types with incompatible precision 12 and 17 > {code} >
[jira] [Updated] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan updated SPARK-37191: - Description: When merging DecimalTypes with different precision but the same scale, one would get the following error: {code:java} Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 17 and 12 at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) {code} We could allow merging DecimalType values with different precision if the scale is the same for both types since there should not be any data correctness issues as one of the types will be extended, for example, DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting when the scale is different - this would depend on the actual values. 
Repro code: {code:java} import org.apache.spark.sql.types._ val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) schema1.merge(schema2) {code} This also affects Parquet schema merge which is where this issue was discovered originally: {code:java} import java.math.BigDecimal import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet") spark.createDataFrame(data1, schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") // Reading the DataFrame fails spark.read.option("mergeSchema", "true").parquet("/tmp/decimal-test.parquet").show() >>> Failed merging schema: root |-- col: decimal(17,2) (nullable = true) Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 12 and 17 {code} was: When merging DecimalTypes with different precision but the same scale, one would get the following error: {code:java} Failed to merge fields 'col' and 'col'. 
Failed to merge decimal types with incompatible precision 17 and 12 at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) {code} We could allow merging DecimalType values with different precision if the scale is the same for both types since there should not be any data correctness issues as one of the types will be extended, for example, DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting when the scale is different - this would depend on the actual values. Repro code: {code:java} import org.apache.spark.sql.types._ val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) schema1.merge(schema2) {code} This also affects Parquet schema merge which is where this issue was discovered originally: {code:java} import java.math.BigDecimal import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet") spark.createDataFrame(data1,
[jira] [Updated] (SPARK-37191) Allow merging DecimalTypes with different precision values
[ https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan updated SPARK-37191: - Description: When merging DecimalTypes with different precision but the same scale, one would get the following error: {code:java} Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 17 and 12 at org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) at scala.Option.map(Option.scala:230) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) {code} We could allow merging DecimalType values with different precision if the scale is the same for both types since there should not be any data correctness issues as one of the types will be extended, for example, DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting when the scale is different - this would depend on the actual values. 
Repro code: {code:java} import org.apache.spark.sql.types._ val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) schema1.merge(schema2) {code} This also affects Parquet schema merge which is where this issue was discovered originally: {code:java} import java.math.BigDecimal import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1) val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) spark.createDataFrame(data2, schema2).write.parquet("/tmp/decimal-test.parquet") spark.createDataFrame(data1, schema1).write.mode("append").parquet("/tmp/decimal-test.parquet") // Reading the DataFrame fails spark.read.option("mergeSchema", "true").parquet("/mnt/ivan/decimal-test.parquet").show() >>> Failed merging schema: root |-- col: decimal(17,2) (nullable = true) Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal types with incompatible precision 12 and 17 {code} > Allow merging DecimalTypes with different precision values > --- > > Key: SPARK-37191 > URL: https://issues.apache.org/jira/browse/SPARK-37191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0 >Reporter: Ivan >Priority: Major > Fix For: 3.3.0 > > > When merging DecimalTypes with different precision but the same scale, one > would get the following error: > {code:java} > Failed to merge fields 'col' and 'col'. 
Failed to merge decimal types with > incompatible precision 17 and 12 at > org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644) > at > org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) > {code} > > We could allow merging DecimalType values with different precision if the > scale is the same for both types since there should not be any data > correctness issues as one of the types will be extended, for example, > DECIMAL(12, 2) -> DECIMAL(17, 2); however, this is not the case for upcasting > when the scale is different - this would depend on the actual values. > > Repro code: > {code:java} > import org.apache.spark.sql.types._ > val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil) > val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil) > schema1.merge(schema2) {code} > > This also affects Parquet schema merge which is where this issue was > discovered originally: > {code:java} > import java.math.BigDecimal
[jira] [Created] (SPARK-37191) Allow merging DecimalTypes with different precision values
Ivan created SPARK-37191: Summary: Allow merging DecimalTypes with different precision values Key: SPARK-37191 URL: https://issues.apache.org/jira/browse/SPARK-37191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.1, 3.1.0, 3.0.3 Reporter: Ivan Fix For: 3.3.0
[jira] [Created] (SPARK-37190) Improve error messages for casting under ANSI mode
Allison Wang created SPARK-37190: Summary: Improve error messages for casting under ANSI mode Key: SPARK-37190 URL: https://issues.apache.org/jira/browse/SPARK-37190 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Allison Wang Improve error messages for casting under ANSI mode by making them more actionable.
[jira] [Assigned] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37023: Assignee: Apache Spark > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Assignee: Apache Spark >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
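The fix described in the issue amounts to ignoring stale merge statuses when push-based shuffle is disabled for the dependency. A minimal model of that decision, as hypothetical Python, not the actual Scala logic in MapOutputTracker:

```python
def enable_batch_fetch(shuffle_merge_enabled, merge_statuses):
    """Sketch of the proposed batch-fetch decision (hypothetical model).

    On a stage retry, shuffleMergeEnabled can be false while the driver
    still holds merge statuses from the earlier attempt; those stale
    statuses should be ignored so batch fetch stays enabled.
    """
    if not shuffle_merge_enabled:
        return True  # ignore stale merge statuses from a previous attempt
    # Simplified stand-in: the real implementation also inspects which
    # blocks were actually merged before disabling batch fetch.
    return not any(merge_statuses)
```

The key branch is the first one: without it, the presence of any merge status disables batch fetch even when the dependency no longer uses push-based shuffle, which is the bug being reported.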
[jira] [Assigned] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37023: Assignee: (was: Apache Spark) > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Commented] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437042#comment-17437042 ] Apache Spark commented on SPARK-37023: -- User 'rmcyang' has created a pull request for this issue: https://github.com/apache/spark/pull/34461 > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Commented] (SPARK-37023) Avoid fetching merge status when shuffleMergeEnabled is false for a shuffleDependency during retry
[ https://issues.apache.org/jira/browse/SPARK-37023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437041#comment-17437041 ] Apache Spark commented on SPARK-37023: -- User 'rmcyang' has created a pull request for this issue: https://github.com/apache/spark/pull/34461 > Avoid fetching merge status when shuffleMergeEnabled is false for a > shuffleDependency during retry > -- > > Key: SPARK-37023 > URL: https://issues.apache.org/jira/browse/SPARK-37023 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Ye Zhou >Priority: Major > > The assertion below in MapOutputTracker.getMapSizesByExecutorId is not > guaranteed: > {code:java} > assert(mapSizesByExecutorId.enableBatchFetch == true){code} > The reason is that during some stage retry cases, > shuffleDependency.shuffleMergeEnabled is set to false, but merge statuses > still exist because the Driver has collected them for the shuffle > dependency. In that case, the current implementation sets > enableBatchFetch to false, since merge statuses are present. > Details can be found here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1492] > We should improve the implementation here.
[jira] [Updated] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37189: -- Description: In pyspark.pandas, if you write a line like this {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- DeathsPer100k (<20)") {quote} it compiles and runs, but the plot does not respect the range; all the values are shown. The workaround is to create a new DataFrame that pre-selects just the rows you want, but the line above should work as well. was: In pyspark.pandas if you write a line like this {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") {quote} it compiles and runs, but the plot has no title. > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas, if you write a line like this > {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- > DeathsPer100k (<20)") > {quote} > it compiles and runs, but the plot does not respect the range; all the values > are shown. > The workaround is to create a new DataFrame that pre-selects just the rows > you want, but the line above should work as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
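The expected semantics of the `range` option can be seen with NumPy's `histogram`, which Matplotlib-backed `.plot.hist` ultimately relies on for binning. This sketch only illustrates the intended behavior, not pyspark.pandas internals.

```python
import numpy as np

values = np.array([1.0, 5.0, 12.0, 19.0, 25.0, 40.0])
# With range=(0, 20), values outside the range (25.0 and 40.0) are ignored,
# which is the behavior the report expects from DF.plot.hist(range=[0, 20]).
counts, edges = np.histogram(values, bins=4, range=(0, 20))
```

Only four of the six values land in a bin, and the bin edges span exactly [0, 20].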
[jira] [Created] (SPARK-37189) CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot
Chuck Connell created SPARK-37189: - Summary: CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot Key: SPARK-37189 URL: https://issues.apache.org/jira/browse/SPARK-37189 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In pyspark.pandas if you write a line like this {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") {quote} it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it
[ https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37189: -- Summary: pyspark.pandas histogram accepts the range option but does not use it (was: CLONE - pyspark.pandas histogram accepts the title option but does not add a title to the plot) > pyspark.pandas histogram accepts the range option but does not use it > - > > Key: SPARK-37189 > URL: https://issues.apache.org/jira/browse/SPARK-37189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In pyspark.pandas if you write a line like this > {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") > {quote} > it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37188) pyspark.pandas histogram accepts the title option but does not add a title to the plot
Chuck Connell created SPARK-37188: - Summary: pyspark.pandas histogram accepts the title option but does not add a title to the plot Key: SPARK-37188 URL: https://issues.apache.org/jira/browse/SPARK-37188 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In pyspark.pandas if you write a line like this {quote}DF.plot.hist(bins=20, title="US Counties -- FullVaxPer100") {quote} it compiles and runs, but the plot has no title. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37187) pyspark.pandas fails to create a histogram of one column from a large DataFrame
Chuck Connell created SPARK-37187: - Summary: pyspark.pandas fails to create a histogram of one column from a large DataFrame Key: SPARK-37187 URL: https://issues.apache.org/jira/browse/SPARK-37187 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell When trying to create a histogram from one column of a large DataFrame, pyspark.pandas fails. So this line {quote}DF.plot.hist(column="FullVaxPer100", bins=20) # there are many other columns {quote} yields this error {quote}cannot resolve 'least(min(EndDate), min(EndDeaths), min(`STATE-COUNTY`), min(StartDate), min(StartDeaths), min(POPESTIMATE2020), min(ST_ABBR), min(VaxStartDate), min(Series_Complete_Yes_Start), min(Administered_Dose1_Recip_Start), min(VaxEndDate), min(Series_Complete_Yes_End), min(Administered_Dose1_Recip_End), min(Deaths), min(Series_Complete_Yes_Mid), min(Administered_Dose1_Recip_Mid), min(FullVaxPer100), min(OnePlusVaxPer100), min(DeathsPer100k))' due to data type mismatch: The expressions should all have the same type, got LEAST(timestamp, bigint, string, timestamp, bigint, bigint, string, timestamp, bigint, bigint, timestamp, bigint, bigint, bigint, double, double, double, double, double).; {quote} The odd thing is that pyspark.pandas seems to be operating on all the columns when only one is needed. As a workaround, you can first create a one-column DataFrame that selects just the field you want, then make a histogram of that. But the command above should work also. I can supply the complete program and datasets that demonstrate the error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
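The workaround mentioned above, selecting the single column before plotting, can be sketched as follows. The toy frame and column names below merely mimic the report's mixed-type table and are not the actual dataset.

```python
import pandas as pd

# Toy frame mixing timestamps, strings, and numbers, like the report's table.
df = pd.DataFrame({
    "ST_ABBR": ["MA", "NY"],
    "EndDate": pd.to_datetime(["2021-10-01", "2021-10-01"]),
    "FullVaxPer100": [63.1, 70.4],
})
# Pre-select just the numeric column of interest before calling .plot.hist();
# min/max is then computed over a single type, so the mixed-type least()
# error described above cannot arise.
one_col = df[["FullVaxPer100"]]
```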
[jira] [Created] (SPARK-37186) pyspark.pandas should support tseries.offsets
Chuck Connell created SPARK-37186: - Summary: pyspark.pandas should support tseries.offsets Key: SPARK-37186 URL: https://issues.apache.org/jira/browse/SPARK-37186 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can use pandas.offsets to create a time delta. This allows a line like {quote}this_period_start = OVERALL_START_DATE + pd.offsets.Day(NN) {quote} But this does not work in pyspark.pandas. There are good workarounds, such as datetime.timedelta(days=NN), but pandas programmers would like to move code to pyspark without changing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
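The `datetime.timedelta` workaround mentioned above looks like this; `OVERALL_START_DATE` and `NN` are placeholder values standing in for the report's variables.

```python
from datetime import date, timedelta

OVERALL_START_DATE = date(2021, 1, 1)  # placeholder start date
NN = 30                                # placeholder day count
# Plain-Python equivalent of OVERALL_START_DATE + pd.offsets.Day(NN)
this_period_start = OVERALL_START_DATE + timedelta(days=NN)
```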
[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437009#comment-17437009 ] mathieu longtin commented on SPARK-37185: - Additional note: if there's a "group by" in the query, this is not an issue. > DataFrame.take() only uses one worker > - > > Key: SPARK-37185 > URL: https://issues.apache.org/jira/browse/SPARK-37185 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 > Environment: CentOS 7 >Reporter: mathieu longtin >Priority: Major > > Say you have a query: > {code:java} > >>> df = spark.sql("select * from mytable where x = 99"){code} > Now, out of billions of rows, there are only ten rows where x is 99. > If I do: > {code:java} > >>> df.limit(10).collect() > [Stage 1:> (0 + 1) / 1]{code} > It only uses one worker. This takes a really long time since one CPU is > reading the billions of rows. > However, if I do this: > {code:java} > >>> df.limit(10).rdd.collect() > [Stage 1:> (0 + 10) / 22]{code} > All the workers are running. > I think there's some optimization issue in DataFrame.take(...). > This did not use to be an issue, but I'm not sure whether it was working in 3.0 > or 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37185) DataFrame.take() only uses one worker
mathieu longtin created SPARK-37185: --- Summary: DataFrame.take() only uses one worker Key: SPARK-37185 URL: https://issues.apache.org/jira/browse/SPARK-37185 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.1 Environment: CentOS 7 Reporter: mathieu longtin Say you have a query: {code:java} >>> df = spark.sql("select * from mytable where x = 99"){code} Now, out of billions of rows, there are only ten rows where x is 99. If I do: {code:java} >>> df.limit(10).collect() [Stage 1:> (0 + 1) / 1]{code} It only uses one worker. This takes a really long time since one CPU is reading the billions of rows. However, if I do this: {code:java} >>> df.limit(10).rdd.collect() [Stage 1:> (0 + 10) / 22]{code} All the workers are running. I think there's some optimization issue in DataFrame.take(...). This did not use to be an issue, but I'm not sure whether it was working in 3.0 or 2.4. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
[ https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37182: -- Comment: was deleted (was: https://issues.apache.org/jira/browse/SPARK-36609) > pyspark.pandas.to_numeric() should support the errors option > > > Key: SPARK-37182 > URL: https://issues.apache.org/jira/browse/SPARK-37182 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say to_numeric(errors='coerce'). But the errors > option is not recognized by pyspark.pandas. > FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
[ https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell resolved SPARK-37182. --- Resolution: Duplicate https://issues.apache.org/jira/browse/SPARK-36609 > pyspark.pandas.to_numeric() should support the errors option > > > Key: SPARK-37182 > URL: https://issues.apache.org/jira/browse/SPARK-37182 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say to_numeric(errors='coerce'). But the errors > option is not recognized by pyspark.pandas. > FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
[ https://issues.apache.org/jira/browse/SPARK-37182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436992#comment-17436992 ] Chuck Connell commented on SPARK-37182: --- Duplicate of https://issues.apache.org/jira/browse/SPARK-36609 > pyspark.pandas.to_numeric() should support the errors option > > > Key: SPARK-37182 > URL: https://issues.apache.org/jira/browse/SPARK-37182 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say to_numeric(errors='coerce'). But the errors > option is not recognized by pyspark.pandas. > FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37184) pyspark.pandas should support DF["column"].str.split("some_suffix").str[0]
Chuck Connell created SPARK-37184: - Summary: pyspark.pandas should support DF["column"].str.split("some_suffix").str[0] Key: SPARK-37184 URL: https://issues.apache.org/jira/browse/SPARK-37184 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say {quote}DF["column"] = DF["column"].str.split("suffix").str[0] {quote} In order to strip off a suffix. With pyspark.pandas, this syntax does not work. You have to say something like {quote}DF["column"] = DF["column"].str.replace("suffix", '', 1) {quote} which works fine if the suffix only appears once at the end, but is not really the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
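The difference between the two idioms shows up whenever the suffix also occurs earlier in the string. Plain-Python strings behave the same way as the pandas `.str` accessors here:

```python
s = "abc_suffix_def_suffix"
# split(...)[0] keeps only the text before the FIRST occurrence of the suffix...
head = s.split("_suffix")[0]
# ...while replace(..., '', 1) removes the first occurrence but keeps the rest.
replaced = s.replace("_suffix", "", 1)
```

So the two forms agree only when the suffix appears exactly once, at the end, which is the limitation the report describes.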
[jira] [Created] (SPARK-37183) pyspark.pandas.DataFrame.map() should support .fillna()
Chuck Connell created SPARK-37183: - Summary: pyspark.pandas.DataFrame.map() should support .fillna() Key: SPARK-37183 URL: https://issues.apache.org/jira/browse/SPARK-37183 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say {quote}DF["new_column"] = DF["column"].map(some_map).fillna(DF["column"]) {quote} In order to use the existing value if the mapping key is not found. But this does not work in pyspark.pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
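In regular pandas the idiom behaves like a dictionary lookup with a fallback to the original value; a minimal reproduction of the expected behavior:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])
mapping = {"a": "A"}          # "b" and "c" have no mapping entry
# Unmapped values become NaN, then fillna restores the original values.
out = s.map(mapping).fillna(s)
```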
[jira] [Created] (SPARK-37182) pyspark.pandas.to_numeric() should support the errors option
Chuck Connell created SPARK-37182: - Summary: pyspark.pandas.to_numeric() should support the errors option Key: SPARK-37182 URL: https://issues.apache.org/jira/browse/SPARK-37182 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say to_numeric(errors='coerce'). But the errors option is not recognized by pyspark.pandas. FYI, the errors option is recognized by pyspark.pandas.to_datetime() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
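For reference, this is what `errors='coerce'` does in regular pandas: unparseable values become NaN instead of raising an exception.

```python
import pandas as pd

raw = pd.Series(["1", "2.5", "oops"])
# "oops" cannot be parsed as a number; coerce turns it into NaN.
out = pd.to_numeric(raw, errors="coerce")
```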
[jira] [Created] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
Chuck Connell created SPARK-37181: - Summary: pyspark.pandas.read_csv() should support latin-1 encoding Key: SPARK-37181 URL: https://issues.apache.org/jira/browse/SPARK-37181 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas, you can say {{read_csv(encoding='latin-1')}}. This encoding is not recognized in pyspark.pandas. You have to use Windows-1252 instead, which is almost the same but not identical. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
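The two encodings agree on most bytes but diverge in the 0x80-0x9F block, which is why they are "almost the same but not identical". A stdlib-only illustration:

```python
raw = b"\x93quoted\x94"  # Windows "smart quote" bytes
# cp1252 maps 0x93/0x94 to typographic quotes; latin-1 maps them to C1 controls.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")
```

Data containing bytes from that block decodes to different characters under the two encodings, so substituting Windows-1252 for latin-1 is not always safe.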
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436963#comment-17436963 ] Chao Sun commented on SPARK-37166: -- [~xkrogen] sure just linked. > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37166) SPIP: Storage Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436959#comment-17436959 ] Erik Krogen commented on SPARK-37166: - [~csun] can you link the doc here? > SPIP: Storage Partitioned Join > -- > > Key: SPARK-37166 > URL: https://issues.apache.org/jira/browse/SPARK-37166 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > This JIRA tracks the SPIP for storage partitioned join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436898#comment-17436898 ] Wenchen Fan edited comment on SPARK-37034 at 11/1/21, 3:23 PM: --- This is a question, not a feature request. Please ask it in the dev-list instead of JIRA. A quick answer is: there is no plan to add vectorized execution into Spark in the near future (at least I haven't seen any proposal), but there are third-party libraries doing it via the columnar API in the Spark query plan. was (Author: cloud_fan): This is a question, not a feature request. Please ask it in the dev-list instead of JIRA. A quick answer is: there is no plan to add vectorized execution into Spark in the near future, but there are third-party libraries doing it via the columnar API in the Spark query plan. > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has supported vectorized reads for ORC and Parquet. What's the progress on > other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple > operators (string functions, math functions)? > Hive has supported vectorized execution since an early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is a replacement for Hive. I guess the reason Spark does > not support vectorized execution may be that the design or implementation difficulty > in Spark is greater. What's the main issue preventing Spark from supporting > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436898#comment-17436898 ] Wenchen Fan commented on SPARK-37034: - This is a question, not a feature request. Please ask it in the dev-list instead of JIRA. A quick answer is: there is no plan to add vectorized execution into Spark in the near future, but there are third-party libraries doing it via the columnar API in the Spark query plan. > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has supported vectorized reads for ORC and Parquet. What's the progress on > other vectorized execution, e.g. vectorized writes, joins, aggregations, and simple > operators (string functions, math functions)? > Hive has supported vectorized execution since an early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is a replacement for Hive. I guess the reason Spark does > not support vectorized execution may be that the design or implementation difficulty > in Spark is greater. What's the main issue preventing Spark from supporting > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436896#comment-17436896 ] Apache Spark commented on SPARK-36566: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34460 > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36566: Assignee: Apache Spark > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36566: Assignee: (was: Apache Spark) > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436885#comment-17436885 ] Yikun Jiang edited comment on SPARK-36566 at 11/1/21, 3:11 PM: --- Yep, it's useful for me. Does it make sense if we also set driver label to app name? then we can find out all pods (driver/executor) list by using: {{k get pods -l spark.app.name=xxx}} was (Author: yikunkero): Yep, it's useful for me. Does it make sense if we also set driver label to app name? then we can find out all pods (driver/executor) list by using: {{k get pods -l spark-app}} > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36566) Add Spark appname as a label to the executor pods
[ https://issues.apache.org/jira/browse/SPARK-36566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436885#comment-17436885 ] Yikun Jiang commented on SPARK-36566: - Yep, it's useful for me. Does it make sense if we also set driver label to app name? then we can find out all pods (driver/executor) list by using: {{k get pods -l spark-app}} > Add Spark appname as a label to the executor pods > - > > Key: SPARK-36566 > URL: https://issues.apache.org/jira/browse/SPARK-36566 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Holden Karau >Priority: Trivial > > Adding the appName as a label to the executor pods could simplify debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37180) PySpark.pandas should support __version__
[ https://issues.apache.org/jira/browse/SPARK-37180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37180: -- Description: In regular pandas you can say {quote}pd.__version__ {quote} to get the pandas version number. PySpark pandas should support the same. was: In regular pandas you can say {quote}{{pd.__version__ }}{quote} to get the pandas version number. PySpark pandas should support the same. > PySpark.pandas should support __version__ > - > > Key: SPARK-37180 > URL: https://issues.apache.org/jira/browse/SPARK-37180 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say > {quote}pd.__version__ > {quote} > to get the pandas version number. PySpark pandas should support the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
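In regular pandas this is just a module-level string attribute, which code often parses to branch on the installed version; the request above is for `pyspark.pandas` to expose the same attribute.

```python
import pandas as pd

# Regular pandas exposes its version as a plain string attribute.
version = pd.__version__
# A common use: extract the major version for feature checks.
major = int(version.split(".")[0])
```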
[jira] [Updated] (SPARK-37180) PySpark.pandas should support __version__
[ https://issues.apache.org/jira/browse/SPARK-37180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuck Connell updated SPARK-37180: -- Description: In regular pandas you can say {quote}{{pd.__version__ }}{quote} to get the pandas version number. PySpark pandas should support the same. was:In regular pandas you can say pd.__version__ to get the pandas version number. PySpark pandas should support the same. > PySpark.pandas should support __version__ > - > > Key: SPARK-37180 > URL: https://issues.apache.org/jira/browse/SPARK-37180 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > In regular pandas you can say > {quote}{{pd.__version__ }}{quote} > to get the pandas version number. PySpark pandas should support the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37180) PySpark.pandas should support __version__
Chuck Connell created SPARK-37180: - Summary: PySpark.pandas should support __version__ Key: SPARK-37180 URL: https://issues.apache.org/jira/browse/SPARK-37180 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.0 Reporter: Chuck Connell In regular pandas you can say pd.__version__ to get the pandas version number. PySpark pandas should support the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436861#comment-17436861 ] Apache Spark commented on SPARK-37179: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34459 > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow casting between Timestamp and Numeric types: > * As we did some data analysis, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math, e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straightforward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
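The Tableau-style expression quoted above is plain epoch arithmetic. A stdlib Python model of the two cast directions, illustrative only and not Spark's cast implementation:

```python
from datetime import datetime, timezone

seconds = 86400  # numeric value: one day after the Unix epoch
# "Cast(Numeric as Timestamp)": seconds since epoch -> timestamp (UTC here)
ts = datetime.fromtimestamp(seconds, tz=timezone.utc)
# "Cast(Timestamp as Numeric)": timestamp -> seconds since epoch
back = int(ts.timestamp())
```

The round trip is lossless for whole seconds, which is what DateTime math like `%2 * 86400` relies on.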
[jira] [Assigned] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37179: Assignee: Apache Spark (was: Gengliang Wang) > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436858#comment-17436858 ] Apache Spark commented on SPARK-37179: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34459 > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37179: Assignee: Gengliang Wang (was: Apache Spark) > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: We should allow the casting between Timestamp and Numeric types: * As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. * The Spark SQL connector for Tableau is using this feature for DateTime math. e.g. {code:java} CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS TIMESTAMP) {code} * In the current syntax, we specially allow Numeric <=> Boolean and String <=> Binary since they are straight forward and frequently used. I suggest we allow Timestamp <=> Numeric as well for better ANSI mode adoption. was: We should allow casting As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow the casting between Timestamp and Numeric types: > * As we did some data science, we found that many Spark SQL users are > actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. > * The Spark SQL connector for Tableau is using this feature for DateTime > math. e.g. > {code:java} > CAST(FROM_UNIXTIME(CAST(CAST(%1 AS BIGINT) + (%2 * 86400) AS BIGINT)) AS > TIMESTAMP) > {code} > * In the current syntax, we specially allow Numeric <=> Boolean and String > <=> Binary since they are straight forward and frequently used. I suggest we > allow Timestamp <=> Numeric as well for better ANSI mode adoption. 
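For context on what such a cast means: a numeric is treated as seconds since the Unix epoch in both directions, which is also what the Tableau expression above assumes when it multiplies days by 86400. A minimal round-trip sketch of those semantics in plain Python (not the Spark implementation):

```python
from datetime import datetime, timezone

def timestamp_to_long(ts: datetime) -> int:
    """Cast(Timestamp as Long): seconds since the Unix epoch, fraction truncated."""
    return int(ts.timestamp())

def long_to_timestamp(seconds: int) -> datetime:
    """Cast(Long as Timestamp): interpret the number as seconds since the epoch."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

ts = datetime(2021, 11, 1, tzinfo=timezone.utc)
n = timestamp_to_long(ts)          # 1635724800
assert long_to_timestamp(n) == ts  # round-trips exactly for whole seconds
```

Casting to a fractional numeric type would additionally preserve the sub-second part instead of truncating it.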
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-37179: -- Assignee: Gengliang Wang > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > We should allow casting > As we did some data science, we found that many Spark SQL users are actually > using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: We should allow casting As we did some data science, we found that many Spark SQL users are actually using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. was:The casting between > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Priority: Major > > We should allow casting > As we did some data science, we found that many Spark SQL users are actually > using `Cast(Timestamp as Numeric)` and `Cast(Numeric as Timestamp)`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
[ https://issues.apache.org/jira/browse/SPARK-37179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-37179: --- Description: The casting between > ANSI mode: Allow casting between Timestamp and Numeric > -- > > Key: SPARK-37179 > URL: https://issues.apache.org/jira/browse/SPARK-37179 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Priority: Major > > The casting between -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37179) ANSI mode: Allow casting between Timestamp and Numeric
Gengliang Wang created SPARK-37179: -- Summary: ANSI mode: Allow casting between Timestamp and Numeric Key: SPARK-37179 URL: https://issues.apache.org/jira/browse/SPARK-37179 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37178) Add Target Encoding to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436844#comment-17436844 ] Apache Spark commented on SPARK-37178: -- User 'taosiyuan163' has created a pull request for this issue: https://github.com/apache/spark/pull/34457 > Add Target Encoding to ml.feature > - > > Key: SPARK-37178 > URL: https://issues.apache.org/jira/browse/SPARK-37178 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Simon Tao >Priority: Major > > Target Encoding is a mechanism of converting categorical features to > continuous features based on the posterior probability calculated from > values of the label (target) column. > > Target Encoding can help to improve the accuracy of machine learning algorithms > when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37178) Add Target Encoding to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37178: Assignee: (was: Apache Spark) > Add Target Encoding to ml.feature > - > > Key: SPARK-37178 > URL: https://issues.apache.org/jira/browse/SPARK-37178 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Simon Tao >Priority: Major > > Target Encoding is a mechanism of converting categorical features to > continuous features based on the posterior probability calculated from > values of the label (target) column. > > Target Encoding can help to improve the accuracy of machine learning algorithms > when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37178) Add Target Encoding to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37178: Assignee: Apache Spark > Add Target Encoding to ml.feature > - > > Key: SPARK-37178 > URL: https://issues.apache.org/jira/browse/SPARK-37178 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.2.0 >Reporter: Simon Tao >Assignee: Apache Spark >Priority: Major > > Target Encoding is a mechanism of converting categorical features to > continuous features based on the posterior probability calculated from > values of the label (target) column. > > Target Encoding can help to improve the accuracy of machine learning algorithms > when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?
[ https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436825#comment-17436825 ] xiaoli commented on SPARK-37034: [~dongjoon] [~yumwang] [~cloud_fan] Sorry to ping you, but there have been no answers or comments on this question for a week. All of you have contributed to Spark's vectorized reads, so I guess you may know something about this question. Could you take a look at it? > What's the progress of vectorized execution for spark? > -- > > Key: SPARK-37034 > URL: https://issues.apache.org/jira/browse/SPARK-37034 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: xiaoli >Priority: Major > > Spark has supported vectorized reads for ORC and Parquet. What's the progress of > other vectorized execution, e.g. vectorized writes, joins, aggregations, simple > operators (string functions, math functions)? > Hive has supported vectorized execution since an early version > (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution) > As we know, Spark is a replacement for Hive. I guess the reason why Spark does > not support vectorized execution may be that the difficulty of design or > implementation in Spark is greater. What's the main issue for Spark to support > vectorized execution? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37178) Add Target Encoding to ml.feature
Simon Tao created SPARK-37178: - Summary: Add Target Encoding to ml.feature Key: SPARK-37178 URL: https://issues.apache.org/jira/browse/SPARK-37178 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.2.0 Reporter: Simon Tao Target Encoding is a mechanism of converting categorical features to continuous features based on the posterior probability calculated from values of the label (target) column. Target Encoding can help to improve the accuracy of machine learning algorithms when columns with high cardinality are used as features during the training phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
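The mechanism described above, replacing each category with a (smoothed) estimate of the label mean observed for that category, can be sketched in a few lines. The function name and the additive smoothing scheme here are illustrative assumptions, not the eventual ml.feature API:

```python
from collections import defaultdict

def target_encode(categories, labels, smoothing=1.0):
    """Replace each category with a smoothed per-category label mean:
    (sum_of_labels + smoothing * global_mean) / (count + smoothing).
    With smoothing=0 this is the plain posterior mean per category."""
    global_mean = sum(labels) / len(labels)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    enc = {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
           for c in counts}
    return [enc[c] for c in categories]

cats = ["a", "a", "b", "b", "b"]
ys = [1, 1, 0, 0, 1]
encoded = target_encode(cats, ys, smoothing=0.0)
# with no smoothing these are the plain per-category label means
assert encoded[0] == 1.0 and abs(encoded[2] - 1 / 3) < 1e-9
```

The smoothing term is what makes high-cardinality columns safe to encode: rare categories are pulled toward the global label mean instead of memorizing their few training rows.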
[jira] [Created] (SPARK-37177) Support LONG argument to the Spark SQL LIMIT clause
Douglas Moore created SPARK-37177: - Summary: Support LONG argument to the Spark SQL LIMIT clause Key: SPARK-37177 URL: https://issues.apache.org/jira/browse/SPARK-37177 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Douglas Moore Big data sets may exceed INT max, so the LIMIT clause in Spark SQL should be expanded to accept LONG values. Thus a query like `SELECT * FROM <table> LIMIT 200` with a limit above INT max would run and not return an error. See docs: [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-limit.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
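The ceiling being hit is simply the signed 32-bit integer range; a quick illustration of the arithmetic behind the request (plain Python with an illustrative row count, not Spark code):

```python
INT_MAX = 2**31 - 1   # 2147483647: what a 32-bit LIMIT argument can express
LONG_MAX = 2**63 - 1  # what a 64-bit (LONG) LIMIT argument could express

row_count = 3_000_000_000  # e.g. a table with three billion rows
assert row_count > INT_MAX     # cannot be expressed as a 32-bit LIMIT today
assert row_count <= LONG_MAX   # a LONG argument would cover it comfortably
```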
[jira] [Assigned] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-37062: Assignee: Jungtaek Lim > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > The "rate" data source has been known to be used as a benchmark for streaming > queries. > While this helps to push the query to its limit (how many rows the query could > process per second), the rate data source doesn't provide a consistent number of rows > per batch in the stream, which makes two environments hard to compare. > For example, in many cases, you may want to compare the metrics of the > batches between test environments (like running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in the > next batch. > Also, when you test against streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministic), so that > you can plan how these input rows will be aggregated and how state rows will > be evicted, and craft the test query based on the plan. 
> The requirements for the new data source would be as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be configurable > as well (like a start timestamp & the amount of advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-37062. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34333 [https://github.com/apache/spark/pull/34333] > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.3.0 > > > The "rate" data source has been known to be used as a benchmark for streaming > queries. > While this helps to push the query to its limit (how many rows the query could > process per second), the rate data source doesn't provide a consistent number of rows > per batch in the stream, which makes two environments hard to compare. > For example, in many cases, you may want to compare the metrics of the > batches between test environments (like running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in the > next batch. > Also, when you test against streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministic), so that > you can plan how these input rows will be aggregated and how state rows will > be evicted, and craft the test query based on the plan. 
> The requirements for the new data source would be as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be configurable > as well (like a start timestamp & the amount of advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
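The requirement list above (a fixed row count per batch, plus an event-time timestamp derived from a configurable start and per-batch advance) can be sketched as a deterministic generator. Everything here, names and signature included, is an illustration of the requirements, not the actual Spark implementation:

```python
from datetime import datetime, timedelta, timezone

def batch_rows(batch_id, rows_per_batch, start, advance_per_batch):
    """Deterministically generate the rows of one micro-batch.

    Every run yields exactly `rows_per_batch` rows per batch, each carrying an
    event-time timestamp derived only from the batch id - no wall clock, so
    two runs of the same query see identical input distributions."""
    ts = start + batch_id * advance_per_batch        # same timestamp for the whole batch
    base = batch_id * rows_per_batch                 # values continue across batches
    return [(base + i, ts) for i in range(rows_per_batch)]

start = datetime(2021, 1, 1, tzinfo=timezone.utc)
b0 = batch_rows(0, 3, start, timedelta(seconds=1))
b1 = batch_rows(1, 3, start, timedelta(seconds=1))
assert [v for v, _ in b0] == [0, 1, 2] and [v for v, _ in b1] == [3, 4, 5]
assert b1[0][1] - b0[0][1] == timedelta(seconds=1)   # event time advances per batch
```

Because the output is a pure function of the batch id, metrics and state-eviction behavior become reproducible across test environments, which is exactly what the rate source cannot guarantee when a batch lags.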
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:48 AM: --- Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. It is sad that the magic committer does not work with dynamic partition overwrite, because it delivers amazing performance when writing loads of partitioned JSON data. was (Author: gumartinm): Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of JSON partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and Spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit; a failure of a task at any stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that the FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where the job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:35 AM: --- Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of JSON partitioned data. was (Author: gumartinm): Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin edited comment on SPARK-23977 at 11/1/21, 10:34 AM: --- Thank you very much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. was (Author: gumartinm): Thank you ver much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436740#comment-17436740 ] Gustavo Martin commented on SPARK-23977: Thank you ver much [~ste...@apache.org] for your explanations. I am experiencing the same problems as [~danzhi] and your comments helped me a lot. Sad magic committer does not work with partition overwrite because it has an amazing performance when writing loads of partitioned data. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. > * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. 
> The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
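The "no visible output until job commit" contract the issue relies on can be illustrated with a toy protocol. The class and its method names are hypothetical, and it mimics a rename-style v1 algorithm on a local filesystem, whereas the real S3A committers get the same semantics by completing multipart uploads at job commit instead of renaming:

```python
import os
import shutil
import tempfile

class StagingCommitProtocol:
    """Toy model of the commit contract: tasks write into private attempt
    directories, only job commit publishes data to the destination, and an
    abandoned task attempt leaves no visible trace."""

    def __init__(self, dest):
        self.dest = dest
        self.staging = tempfile.mkdtemp(prefix="job-staging-")
        self.committed = []

    def new_task_attempt(self, task_id, attempt):
        # Each attempt writes into its own private directory under staging.
        d = os.path.join(self.staging, f"task-{task_id}-attempt-{attempt}")
        os.makedirs(d)
        return d

    def commit_task(self, attempt_dir):
        # Record exactly one winning attempt per task; retries are harmless.
        self.committed.append(attempt_dir)

    def commit_job(self):
        # Only now does output become visible at the destination.
        os.makedirs(self.dest, exist_ok=True)
        for d in self.committed:
            for name in os.listdir(d):
                shutil.move(os.path.join(d, name), os.path.join(self.dest, name))
        shutil.rmtree(self.staging)  # uncommitted attempts vanish with staging
```

On HDFS the final moves are atomic renames, which is why the v1 algorithm is safe there; on S3 a "rename" is a copy-and-delete, which is both slow and non-atomic, the exact weakness the issue describes.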
[jira] [Assigned] (SPARK-36061) Create a PodGroup with user specified minimum resources required
[ https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36061: Assignee: (was: Apache Spark) > Create a PodGroup with user specified minimum resources required > > > Key: SPARK-36061 > URL: https://issues.apache.org/jira/browse/SPARK-36061 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major >
[jira] [Assigned] (SPARK-36061) Create a PodGroup with user specified minimum resources required
[ https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36061: Assignee: Apache Spark > Create a PodGroup with user specified minimum resources required > > > Key: SPARK-36061 > URL: https://issues.apache.org/jira/browse/SPARK-36061 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-36061) Create a PodGroup with user specified minimum resources required
[ https://issues.apache.org/jira/browse/SPARK-36061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436725#comment-17436725 ] Apache Spark commented on SPARK-36061: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34456 > Create a PodGroup with user specified minimum resources required > > > Key: SPARK-36061 > URL: https://issues.apache.org/jira/browse/SPARK-36061 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Holden Karau >Priority: Major >
[jira] [Assigned] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37176: Assignee: (was: Apache Spark) > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Priority: Minor >
[jira] [Commented] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436675#comment-17436675 ] Apache Spark commented on SPARK-37176: -- User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/34455 > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Priority: Minor >
[jira] [Assigned] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
[ https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37176: Assignee: Apache Spark > JsonSource's infer should have the same exception handle logic as > JacksonParser's parse logic > - > > Key: SPARK-37176 > URL: https://issues.apache.org/jira/browse/SPARK-37176 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic
Xianjin YE created SPARK-37176: -- Summary: JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic Key: SPARK-37176 URL: https://issues.apache.org/jira/browse/SPARK-37176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.0.3 Reporter: Xianjin YE

JacksonParser's exception handling logic differs from that of org.apache.spark.sql.catalyst.json.JsonInferSchema#infer; the difference can be seen below:

{code:java}
// JacksonParser's parse
try {
  Utils.tryWithResource(createParser(factory, record)) { parser =>
    // a null first token is equivalent to testing for input.trim.isEmpty
    // but it works on any token stream and not just strings
    parser.nextToken() match {
      case null => None
      case _ => rootConverter.apply(parser) match {
        case null => throw QueryExecutionErrors.rootConverterReturnNullError()
        case rows => rows.toSeq
      }
    }
  }
} catch {
  case e: SparkUpgradeException => throw e
  case e @ (_: RuntimeException | _: JsonProcessingException | _: MalformedInputException) =>
    // JSON parser currently doesn't support partial results for corrupted records.
    // For such records, all fields other than the field configured by
    // `columnNameOfCorruptRecord` are set to `null`.
    throw BadRecordException(() => recordLiteral(record), () => None, e)
  case e: CharConversionException if options.encoding.isEmpty =>
    val msg =
      """JSON parser cannot handle a character in its input.
        |Specifying encoding as an input option explicitly might help to resolve the issue.
        |""".stripMargin + e.getMessage
    val wrappedCharException = new CharConversionException(msg)
    wrappedCharException.initCause(e)
    throw BadRecordException(() => recordLiteral(record), () => None, wrappedCharException)
  case PartialResultException(row, cause) =>
    throw BadRecordException(
      record = () => recordLiteral(record),
      partialResult = () => Some(row),
      cause)
}
{code}

vs.
{code:java}
// JsonInferSchema's infer logic
val mergedTypesFromPartitions = json.mapPartitions { iter =>
  val factory = options.buildJsonFactory()
  iter.flatMap { row =>
    try {
      Utils.tryWithResource(createParser(factory, row)) { parser =>
        parser.nextToken()
        Some(inferField(parser))
      }
    } catch {
      case e @ (_: RuntimeException | _: JsonProcessingException) =>
        parseMode match {
          case PermissiveMode =>
            Some(StructType(Seq(StructField(columnNameOfCorruptRecord, StringType))))
          case DropMalformedMode =>
            None
          case FailFastMode =>
            throw QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e)
        }
    }
  }.reduceOption(typeMerger).toIterator
}
{code}

They should have the same exception handling logic; otherwise the inconsistency may confuse users.
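One way to unify the two paths is to route both through a single shared handler. The self-contained sketch below uses simplified stand-in types (the ParseMode objects here mirror, but are not, Spark's actual org.apache.spark.sql.catalyst.util.ParseMode hierarchy):

```scala
// Simplified stand-ins for Spark's parse modes.
sealed trait ParseMode
case object PermissiveMode extends ParseMode
case object DropMalformedMode extends ParseMode
case object FailFastMode extends ParseMode

// One shared handler: both parse and schema inference would call this,
// so malformed records are treated identically in both code paths.
def withMalformedHandling[T](mode: ParseMode, corruptFallback: => Option[T])(body: => Option[T]): Option[T] =
  try body
  catch {
    case e: RuntimeException =>
      mode match {
        case PermissiveMode    => corruptFallback // keep a corrupt-record placeholder
        case DropMalformedMode => None            // silently drop the record
        case FailFastMode      => throw new IllegalArgumentException("Malformed record detected", e)
      }
  }
```

JacksonParser.parse and JsonInferSchema.infer would then differ only in what they supply as `corruptFallback` (a corrupt-record row vs. a corrupt-record schema).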
[jira] [Commented] (SPARK-37013) `select format_string('%0$s', 'Hello')` has different behavior when using Java 8 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-37013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436641#comment-17436641 ] Apache Spark commented on SPARK-37013: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/34454

> `select format_string('%0$s', 'Hello')` has different behavior when using
> Java 8 and Java 17
>
>                 Key: SPARK-37013
>                 URL: https://issues.apache.org/jira/browse/SPARK-37013
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>              Labels: release-notes
>             Fix For: 3.3.0
>
> {code:java}
> -- PostgreSQL throws ERROR: format specifies argument 0, but arguments are numbered from 1
> select format_string('%0$s', 'Hello');
> {code}
> Execute with Java 8
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct
> -- !query output
> Hello
> {code}
> Execute with Java 17
> {code:java}
> -- !query
> select format_string('%0$s', 'Hello')
> -- !query schema
> struct<>
> -- !query output
> java.util.IllegalFormatArgumentIndexException
> Illegal format argument index = 0
> {code}
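The divergence is easy to reproduce outside Spark, since format_string delegates to java.util.Formatter. This small sketch captures either outcome without assuming which JDK runs it:

```scala
// "%0$s" uses argument index 0. Per this issue, Java 8's Formatter tolerates
// it (yielding "Hello"), while newer JDKs reject it with
// java.util.IllegalFormatArgumentIndexException, a subtype of IllegalFormatException.
val outcome: Either[java.util.IllegalFormatException, String] =
  try Right("%0$s".format("Hello"))
  catch { case e: java.util.IllegalFormatException => Left(e) }

// On Java 8 this is Right("Hello"); on Java 17 it is Left(<exception>).
```

Spark's fix presumably normalizes this so the SQL function behaves the same regardless of the underlying JDK.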
[jira] [Assigned] (SPARK-37161) RowToColumnConverter support AnsiIntervalType
[ https://issues.apache.org/jira/browse/SPARK-37161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37161: Assignee: PengLei

> RowToColumnConverter support AnsiIntervalType
> --
>
>                 Key: SPARK-37161
>                 URL: https://issues.apache.org/jira/browse/SPARK-37161
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: PengLei
>            Assignee: PengLei
>            Priority: Major
>
> Currently, we have a RowToColumnConverter for all data types except AnsiIntervalType:
> {code:java}
> val core = dataType match {
>   case BinaryType => BinaryConverter
>   case BooleanType => BooleanConverter
>   case ByteType => ByteConverter
>   case ShortType => ShortConverter
>   case IntegerType | DateType => IntConverter
>   case FloatType => FloatConverter
>   case LongType | TimestampType => LongConverter
>   case DoubleType => DoubleConverter
>   case StringType => StringConverter
>   case CalendarIntervalType => CalendarConverter
>   case at: ArrayType => ArrayConverter(getConverterForType(at.elementType, at.containsNull))
>   case st: StructType => new StructConverter(st.fields.map(
>     (f) => getConverterForType(f.dataType, f.nullable)))
>   case dt: DecimalType => new DecimalConverter(dt)
>   case mt: MapType => MapConverter(getConverterForType(mt.keyType, nullable = false),
>     getConverterForType(mt.valueType, mt.valueContainsNull))
>   case unknown => throw QueryExecutionErrors.unsupportedDataTypeError(unknown.toString)
> }
> if (nullable) {
>   dataType match {
>     case CalendarIntervalType => new StructNullableTypeConverter(core)
>     case st: StructType => new StructNullableTypeConverter(core)
>     case _ => new BasicNullableTypeConverter(core)
>   }
> } else {
>   core
> }
> {code}
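Since ANSI year-month intervals are physically stored as an Int (months) and day-time intervals as a Long (microseconds), a fix can plausibly reuse the existing primitive converters. The following is a self-contained sketch with simplified stand-in types, not Spark's actual classes:

```scala
// Simplified stand-ins for Spark's DataType hierarchy (assumed names).
sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object YearMonthIntervalType extends DataType // physically an Int: number of months
case object DayTimeIntervalType extends DataType   // physically a Long: number of microseconds

// Route ANSI interval types to the same converters as their physical types,
// mirroring how DateType already reuses IntConverter and TimestampType LongConverter.
def converterFor(dt: DataType): String = dt match {
  case IntegerType | YearMonthIntervalType => "IntConverter"
  case LongType | DayTimeIntervalType      => "LongConverter"
}
```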
[jira] [Resolved] (SPARK-37161) RowToColumnConverter support AnsiIntervalType
[ https://issues.apache.org/jira/browse/SPARK-37161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37161. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34446 [https://github.com/apache/spark/pull/34446] > RowToColumnConverter support AnsiIntervalType > -- > > Key: SPARK-37161 > URL: https://issues.apache.org/jira/browse/SPARK-37161 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: PengLei >Priority: Major > Fix For: 3.3.0 >