[jira] [Created] (SPARK-34134) LDAP authentication of spark thrift server support user id mapping

2021-01-15 Thread Timothy Zhang (Jira)
Timothy Zhang created SPARK-34134:
-

 Summary: LDAP authentication of spark thrift server support user 
id mapping
 Key: SPARK-34134
 URL: https://issues.apache.org/jira/browse/SPARK-34134
 Project: Spark
  Issue Type: Improvement
  Components: Security
Affects Versions: 3.0.1
Reporter: Timothy Zhang


I'm trying to configure LDAP authentication for the Spark Thrift Server, and 
would like to map user ids to mail addresses.

In our scenario, "uid" is the key of our LDAP system and "mail" (the email 
address) is one of its attributes. We want users to enter their email address, 
i.e. "mail", when they log in to the Thrift client. That is, the "username" 
input should be mapped to a query on the mail attribute, e.g.:

{code:none}
hive.server2.authentication.ldap.customLDAPQuery="(&(objectClass=person)(mail=${uid}))"
{code}
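The desired mapping can be illustrated with a small sketch: the value typed at login (an email address) is substituted into the custom LDAP filter template, so the directory is searched by the "mail" attribute instead of "uid". The substitution helper below is hypothetical and only mimics the placeholder expansion; it is not HiveServer2's actual implementation.

```python
# Sketch of the mapping implied by the customLDAPQuery setting above.
# The ${uid} placeholder is filled with whatever the user typed at login.
from string import Template

CUSTOM_LDAP_QUERY = Template("(&(objectClass=person)(mail=${uid}))")

def build_search_filter(login_name: str) -> str:
    """Expand the filter template with the name the user typed at login."""
    return CUSTOM_LDAP_QUERY.substitute(uid=login_name)

print(build_search_filter("alice@example.com"))
# (&(objectClass=person)(mail=alice@example.com))
```

With this expansion, a user who enters an email address is matched against the `mail` attribute rather than the directory key.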



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266522#comment-17266522
 ] 

Apache Spark commented on SPARK-34119:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/31205

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Stats are missing if the filter conditions contain dynamicpruning; we should 
> keep these stats after partition pruning:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
> {noformat}
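The reported behavior can be illustrated with a toy sketch (not Spark's optimizer code; all names below are invented): when a Filter's condition carries a dynamic-pruning subquery, the statistics computed before partition pruning should be kept on the rewritten node instead of being dropped.

```python
# Toy model of the proposed fix: preserve pre-pruning statistics on a
# Filter whose condition contains a dynamicpruning predicate.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Statistics:
    size_in_bytes: int
    row_count: Optional[int] = None

@dataclass
class FilterNode:
    condition: str
    stats: Optional[Statistics]

def prune_partitions(node: FilterNode, pre_pruning_stats: Statistics) -> FilterNode:
    """Rewrite the filter after partition pruning. The buggy path would
    leave stats=None here; the fix carries the earlier stats forward
    whenever the condition references dynamic pruning."""
    keep = "dynamicpruning" in node.condition
    return FilterNode(node.condition,
                      pre_pruning_stats if keep else node.stats)

f = FilterNode("isnotnull(ss_item_sk#30) AND dynamicpruning#168", stats=None)
pruned = prune_partitions(f, Statistics(size_in_bytes=623417589760, row_count=None))
assert pruned.stats is not None  # stats survive partition pruning
```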

[jira] [Assigned] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34119:


Assignee: Apache Spark

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>

[jira] [Assigned] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34119:


Assignee: (was: Apache Spark)

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>

[jira] [Commented] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266521#comment-17266521
 ] 

Apache Spark commented on SPARK-34119:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/31205

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>

[jira] [Issue Comment Deleted] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-15 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34119:

Comment: was deleted

(was: [~leoluan] Would you like work on this?)

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>

[jira] [Resolved] (SPARK-34110) Upgrade ZooKeeper to 3.6.2

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34110.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31177
[https://github.com/apache/spark/pull/31177]

> Upgrade ZooKeeper to 3.6.2
> --
>
> Key: SPARK-34110
> URL: https://issues.apache.org/jira/browse/SPARK-34110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> When running Spark on JDK 14:
> {noformat}
> 21/01/13 20:25:32,533 WARN 
> [Driver-SendThread(apache-spark-zk-3.vip.hadoop.com:2181)] 
> zookeeper.ClientCnxn:1164 : Session 0x0 for server 
> apache-spark-zk-3.vip.hadoop.com/:2181, unexpected error, closing 
> socket connection and attempting reconnect
> java.lang.IllegalArgumentException: Unable to canonicalize address 
> apache-spark-zk-3.vip.hadoop.com/:2181 because it's not resolvable
>   at 
> org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65)
>   at 
> org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41)
>   at 
> org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060)
> {noformat}
> Please see ZOOKEEPER-3779 for more details.
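The failure mode can be sketched in a few lines (this is not ZooKeeper's Java code): building the SASL server principal canonicalizes the ZooKeeper host via DNS, and an unresolvable name raised `IllegalArgumentException`, aborting the connection attempt. Per ZOOKEEPER-3779 the canonicalization behavior became configurable in newer releases; the sketch below simply falls back to the configured hostname instead of failing.

```python
# Sketch of resilient hostname canonicalization: return the canonical
# name when DNS resolution works, otherwise keep the configured name
# rather than raising and killing the connection attempt.
import socket

def server_principal_host(host: str) -> str:
    """Canonical hostname if resolvable, else the original name."""
    try:
        # getaddrinfo + getnameinfo approximates Java's
        # InetAddress.getCanonicalHostName().
        infos = socket.getaddrinfo(host, None)
        return socket.getnameinfo(infos[0][4], 0)[0]
    except socket.gaierror:
        return host  # unresolvable: do not fail the handshake

# A guaranteed-unresolvable name (.invalid TLD) no longer aborts:
assert server_principal_host("no-such-host.invalid") == "no-such-host.invalid"
```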






[jira] [Assigned] (SPARK-34110) Upgrade ZooKeeper to 3.6.2

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34110:
-

Assignee: Yuming Wang

> Upgrade ZooKeeper to 3.6.2
> --
>
> Key: SPARK-34110
> URL: https://issues.apache.org/jira/browse/SPARK-34110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>






[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266500#comment-17266500
 ] 

Apache Spark commented on SPARK-26399:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/31204

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_failedTasks.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
> task metric quantiles, and adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified
> *. *. *
> Note that the above description is too brief to be clear.  [~angerszhuuu] and 
> [~ron8hu] discussed a generic and consistent way for endpoint 
> /application/\{app-id}/stages.  It can be:
> /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]&taskStatus=[RUNNING|SUCCESS|FAILED|PENDING]
> where
>  * query parameter details=true is to show the detailed task information 
> within each stage.  The default value is details=false;
>  * query parameter status can select those stages with the specified status.  
> When the status parameter is not specified, a list of all stages is generated.  
>  * query parameter withSummaries=true is to show both task summary 
> information in percentile distribution and executor summary information in 
> percentile distribution.  The default value is withSummaries=false.
>  * query parameter taskStatus is to show only those tasks with the specified 
> status within their corresponding stages.  This parameter will be set when 
> details=true (i.e. this parameter will be ignored when details=false).
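The proposed parameters can be combined as sketched below. The endpoint shape and parameter names come from the discussion above and were not final at the time of writing, so treat this URL builder as an illustration, not the shipped API.

```python
# Build a stages query URL per the proposal: details, status,
# withSummaries, and taskStatus (honored only with details=true).
from typing import Optional
from urllib.parse import urlencode

def stages_url(base: str, app_id: str, *, details: bool = False,
               status: Optional[str] = None, with_summaries: bool = False,
               task_status: Optional[str] = None) -> str:
    params = {"details": str(details).lower(),
              "withSummaries": str(with_summaries).lower()}
    if status:
        params["status"] = status
    # Per the proposal, taskStatus is ignored when details=false.
    if task_status and details:
        params["taskStatus"] = task_status
    return f"{base}/api/v1/applications/{app_id}/stages?{urlencode(params)}"

url = stages_url("http://localhost:18080", "app-1",
                 details=True, status="FAILED", task_status="FAILED")
```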






[jira] [Resolved] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-15 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-34080.

Fix Version/s: 3.2.0
   3.1.1
   Resolution: Fixed

Issue resolved by pull request 31160
[https://github.com/apache/spark/pull/31160]

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Assignee: Huaxin Gao
>Priority: Critical
> Fix For: 3.1.1, 3.2.0
>
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params and are named after the underlying test, which 
> requires users to understand the test to find the matching scenario. It would 
> be nice to introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all the other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is more friendly to ask users to specify the feature and label types 
> (continuous or categorical), and we set a default score function for each 
> combo. We can also detect the types from feature metadata if given. Advanced 
> users can overwrite it (if there are multiple score functions that are 
> compatible with the feature and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]
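The defaulting idea above can be sketched in plain Python (the combo-to-score-function mapping below is an illustrative assumption based on common statistical practice, not the finalized API or param names):

```python
# Illustrative sketch of picking a default score function from the
# feature/label type combo. The mapping is an assumption, not Spark's API.
DEFAULT_SCORE_FUNCTIONS = {
    ("categorical", "categorical"): "chi-squared",
    ("continuous", "categorical"): "anova-f",   # ANOVA F-test
    ("continuous", "continuous"): "f-value",    # F-value (regression)
}

def default_score_function(feature_type: str, label_type: str) -> str:
    """Return a default score function for a feature/label type combo."""
    combo = (feature_type, label_type)
    if combo not in DEFAULT_SCORE_FUNCTIONS:
        raise ValueError(
            f"No default score function for featureType={feature_type!r}, "
            f"labelType={label_type!r}; please specify one explicitly.")
    return DEFAULT_SCORE_FUNCTIONS[combo]

print(default_score_function("categorical", "categorical"))  # chi-squared
```

Advanced users would then override the returned default whenever several score functions fit the same combo.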






[jira] [Assigned] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-15 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-34080:
--

Assignee: Huaxin Gao

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Assignee: Huaxin Gao
>Priority: Critical
>
> In SPARK-26111, we introduced a few univariate feature selectors that share 
> a common set of params. They are named after the underlying statistical test, 
> which requires users to understand the test to find the matching scenario. It 
> would be nicer to introduce a single class called UnivariateFeatureSelector 
> that accepts a selection criterion and a score method (as string names); then 
> we can deprecate all other univariate selectors.
> For the params, instead of asking users to provide the score function to use, 
> it is friendlier to ask users to specify the feature and label types 
> (continuous or categorical) and set a default score function for each combo. 
> We can also detect the types from feature metadata when given. Advanced users 
> can override the default (if there are multiple score functions compatible 
> with the feature/label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]






[jira] [Commented] (SPARK-33507) Improve and fix cache behavior in v1 and v2

2021-01-15 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266479#comment-17266479
 ] 

Anton Okolnychyi commented on SPARK-33507:
--

[~csun], shall we also handle streaming writes?

> Improve and fix cache behavior in v1 and v2
> ---
>
> Key: SPARK-33507
> URL: https://issues.apache.org/jira/browse/SPARK-33507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Critical
>
> This is an umbrella JIRA to track fixes & improvements for caching behavior 
> in Spark datasource v1 and v2, which includes:
>   - fixing existing cache behavior in v1 and v2;
>   - fixing inconsistent cache behavior between v1 and v2;
>   - implementing missing features in v2 to align with those in v1.






[jira] [Assigned] (SPARK-34132) Update Roxygen version references to 7.1.1

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34132:
-

Assignee: Maciej Szymkiewicz

> Update Roxygen version references to 7.1.1
> --
>
> Key: SPARK-34132
> URL: https://issues.apache.org/jira/browse/SPARK-34132
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment the R docs and the {{DESCRIPTION}} file reference roxygen 
> 5.0.1, but all servers run 7.1.1 (SPARK-30747) and GitHub Actions installs 
> the latest package version (7.1.1 at the moment).
> The docs and the {{DESCRIPTION}} file should be updated to reflect that.






[jira] [Resolved] (SPARK-34132) Update Roxygen version references to 7.1.1

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34132.
---
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31200
[https://github.com/apache/spark/pull/31200]

> Update Roxygen version references to 7.1.1
> --
>
> Key: SPARK-34132
> URL: https://issues.apache.org/jira/browse/SPARK-34132
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.1.1
>
>
> At the moment the R docs and the {{DESCRIPTION}} file reference roxygen 
> 5.0.1, but all servers run 7.1.1 (SPARK-30747) and GitHub Actions installs 
> the latest package version (7.1.1 at the moment).
> The docs and the {{DESCRIPTION}} file should be updated to reflect that.






[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266422#comment-17266422
 ] 

Apache Spark commented on SPARK-33212:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31203

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, 
> Spark can evolve without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars on the class path; otherwise, 
> conflicts such as those from Guava could occur if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include third-party 
> dependencies, users who used to depend on these transitively now need to put 
> the jars on their class path explicitly.
> Ideally the above should go into the release notes.
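The class-path ordering advice above can be sketched as a small helper (the jar file names below are hypothetical examples, not the exact artifacts of any particular deployment):

```python
# Sketch: put the shaded Hadoop client jars before all other Hadoop jars on
# the class path. Jar names below are hypothetical examples.
def order_classpath(jars):
    """Return jars with hadoop-client-api/runtime first; order otherwise kept."""
    shaded = [j for j in jars
              if j.startswith(("hadoop-client-api", "hadoop-client-runtime"))]
    others = [j for j in jars if j not in shaded]
    return shaded + others

jars = [
    "hadoop-common-3.2.2.jar",
    "hadoop-client-runtime-3.2.2.jar",
    "guava-14.0.1.jar",
    "hadoop-client-api-3.2.2.jar",
]
print(":".join(order_classpath(jars)))
```

With the shaded jars first, Guava and other third-party classes resolve from the shaded copies rather than from any non-shaded Hadoop jars later on the class path.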






[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266420#comment-17266420
 ] 

Apache Spark commented on SPARK-33212:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31203

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, 
> Spark can evolve without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars on the class path; otherwise, 
> conflicts such as those from Guava could occur if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include third-party 
> dependencies, users who used to depend on these transitively now need to put 
> the jars on their class path explicitly.
> Ideally the above should go into the release notes.






[jira] [Commented] (SPARK-34133) [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching and enhance error messages

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266394#comment-17266394
 ] 

Apache Spark commented on SPARK-34133:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/31201

> [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field 
> matching and enhance error messages
> --
>
> Key: SPARK-34133
> URL: https://issues.apache.org/jira/browse/SPARK-34133
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Erik Krogen
>Priority: Major
>
> Spark SQL is case-insensitive by default, but currently, when 
> {{AvroSerializer}} and {{AvroDeserializer}} perform matching between Catalyst 
> schemas and Avro schemas, the matching is done in a case-sensitive manner. So, 
> for example, the following will fail:
> {code}
>   val avroSchema =
> """
>   |{
>   |  "type" : "record",
>   |  "name" : "test_schema",
>   |  "fields" : [
>   |{"name": "foo", "type": "int"},
>   |{"name": "BAR", "type": "int"}
>   |  ]
>   |}
>   """.stripMargin
>   val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
>   df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
> {code}
> The same is true on the read path: if we assume {{testAvro}} has been written 
> using the schema above, the code below will fail to match the fields:
> {code}
> df.read.schema(new StructType().add("FOO", IntegerType).add("bar", 
> IntegerType))
>   .format("avro").load(testAvro)
> {code}
> In addition, the error messages in this type of failure scenario carry very 
> little information on the write path ({{AvroSerializer}}); we can make 
> them much more helpful for users debugging schema issues.






[jira] [Assigned] (SPARK-34133) [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching and enhance error messages

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34133:


Assignee: (was: Apache Spark)

> [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field 
> matching and enhance error messages
> --
>
> Key: SPARK-34133
> URL: https://issues.apache.org/jira/browse/SPARK-34133
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Erik Krogen
>Priority: Major
>
> Spark SQL is case-insensitive by default, but currently, when 
> {{AvroSerializer}} and {{AvroDeserializer}} perform matching between Catalyst 
> schemas and Avro schemas, the matching is done in a case-sensitive manner. So, 
> for example, the following will fail:
> {code}
>   val avroSchema =
> """
>   |{
>   |  "type" : "record",
>   |  "name" : "test_schema",
>   |  "fields" : [
>   |{"name": "foo", "type": "int"},
>   |{"name": "BAR", "type": "int"}
>   |  ]
>   |}
>   """.stripMargin
>   val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
>   df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
> {code}
> The same is true on the read path: if we assume {{testAvro}} has been written 
> using the schema above, the code below will fail to match the fields:
> {code}
> df.read.schema(new StructType().add("FOO", IntegerType).add("bar", 
> IntegerType))
>   .format("avro").load(testAvro)
> {code}
> In addition, the error messages in this type of failure scenario carry very 
> little information on the write path ({{AvroSerializer}}); we can make 
> them much more helpful for users debugging schema issues.






[jira] [Commented] (SPARK-34133) [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching and enhance error messages

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266392#comment-17266392
 ] 

Apache Spark commented on SPARK-34133:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/31201

> [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field 
> matching and enhance error messages
> --
>
> Key: SPARK-34133
> URL: https://issues.apache.org/jira/browse/SPARK-34133
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Erik Krogen
>Priority: Major
>
> Spark SQL is case-insensitive by default, but currently, when 
> {{AvroSerializer}} and {{AvroDeserializer}} perform matching between Catalyst 
> schemas and Avro schemas, the matching is done in a case-sensitive manner. So, 
> for example, the following will fail:
> {code}
>   val avroSchema =
> """
>   |{
>   |  "type" : "record",
>   |  "name" : "test_schema",
>   |  "fields" : [
>   |{"name": "foo", "type": "int"},
>   |{"name": "BAR", "type": "int"}
>   |  ]
>   |}
>   """.stripMargin
>   val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
>   df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
> {code}
> The same is true on the read path: if we assume {{testAvro}} has been written 
> using the schema above, the code below will fail to match the fields:
> {code}
> df.read.schema(new StructType().add("FOO", IntegerType).add("bar", 
> IntegerType))
>   .format("avro").load(testAvro)
> {code}
> In addition, the error messages in this type of failure scenario carry very 
> little information on the write path ({{AvroSerializer}}); we can make 
> them much more helpful for users debugging schema issues.






[jira] [Assigned] (SPARK-34133) [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching and enhance error messages

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34133:


Assignee: Apache Spark

> [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field 
> matching and enhance error messages
> --
>
> Key: SPARK-34133
> URL: https://issues.apache.org/jira/browse/SPARK-34133
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL is case-insensitive by default, but currently, when 
> {{AvroSerializer}} and {{AvroDeserializer}} perform matching between Catalyst 
> schemas and Avro schemas, the matching is done in a case-sensitive manner. So, 
> for example, the following will fail:
> {code}
>   val avroSchema =
> """
>   |{
>   |  "type" : "record",
>   |  "name" : "test_schema",
>   |  "fields" : [
>   |{"name": "foo", "type": "int"},
>   |{"name": "BAR", "type": "int"}
>   |  ]
>   |}
>   """.stripMargin
>   val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")
>   df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
> {code}
> The same is true on the read path: if we assume {{testAvro}} has been written 
> using the schema above, the code below will fail to match the fields:
> {code}
> df.read.schema(new StructType().add("FOO", IntegerType).add("bar", 
> IntegerType))
>   .format("avro").load(testAvro)
> {code}
> In addition, the error messages in this type of failure scenario carry very 
> little information on the write path ({{AvroSerializer}}); we can make 
> them much more helpful for users debugging schema issues.






[jira] [Assigned] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34064:
-

Assignee: (was: Lantao Jin)

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 






[jira] [Created] (SPARK-34133) [AVRO] Respect case sensitivity when performing Catalyst-to-Avro field matching and enhance error messages

2021-01-15 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-34133:
---

 Summary: [AVRO] Respect case sensitivity when performing 
Catalyst-to-Avro field matching and enhance error messages
 Key: SPARK-34133
 URL: https://issues.apache.org/jira/browse/SPARK-34133
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Affects Versions: 2.4.0, 3.2.0
Reporter: Erik Krogen


Spark SQL is case-insensitive by default, but currently, when 
{{AvroSerializer}} and {{AvroDeserializer}} perform matching between Catalyst 
schemas and Avro schemas, the matching is done in a case-sensitive manner. So, 
for example, the following will fail:
{code}
  val avroSchema =
"""
  |{
  |  "type" : "record",
  |  "name" : "test_schema",
  |  "fields" : [
  |{"name": "foo", "type": "int"},
  |{"name": "BAR", "type": "int"}
  |  ]
  |}
  """.stripMargin
  val df = Seq((1, 3), (2, 4)).toDF("FOO", "bar")

  df.write.option("avroSchema", avroSchema).format("avro").save(savePath)
{code}
The same is true on the read path: if we assume {{testAvro}} has been written 
using the schema above, the code below will fail to match the fields:
{code}
df.read.schema(new StructType().add("FOO", IntegerType).add("bar", IntegerType))
  .format("avro").load(testAvro)
{code}

In addition, the error messages in this type of failure scenario carry very 
little information on the write path ({{AvroSerializer}}); we can make them 
much more helpful for users debugging schema issues.
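The intended matching behavior can be modeled outside Spark with a short sketch (a simplified stand-in for the AvroSerializer/AvroDeserializer field lookup; the function name and error text are illustrative, not the actual Spark code):

```python
# Simplified model of Catalyst-to-Avro field matching that honors a
# case-sensitivity flag (mirroring spark.sql.caseSensitive). Illustrative only.
def match_field(catalyst_name, avro_fields, case_sensitive=False):
    """Return the Avro field matching a Catalyst field name, or None."""
    if case_sensitive:
        matches = [f for f in avro_fields if f == catalyst_name]
    else:
        matches = [f for f in avro_fields
                   if f.lower() == catalyst_name.lower()]
    if len(matches) > 1:
        # This is also where a descriptive error message helps users most.
        raise ValueError(
            f"Ambiguous match for Catalyst field {catalyst_name!r}: {matches}")
    return matches[0] if matches else None

avro_fields = ["foo", "BAR"]
print(match_field("FOO", avro_fields))        # case-insensitive: "foo"
print(match_field("FOO", avro_fields, True))  # exact case required: None
```

Under case-insensitive matching, the FOO/bar DataFrame above would line up with the foo/BAR Avro schema, and ambiguities (e.g. both foo and FOO present) would surface as a clear error rather than a silent mismatch.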






[jira] [Resolved] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33212.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30701
[https://github.com/apache/spark/pull/30701]

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, 
> Spark can evolve without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars on the class path; otherwise, 
> conflicts such as those from Guava could occur if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include third-party 
> dependencies, users who used to depend on these transitively now need to put 
> the jars on their class path explicitly.
> Ideally the above should go into the release notes.






[jira] [Updated] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33212:
--
Summary: Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x 
profile  (was: Move to shaded clients for Hadoop 3.x profile)

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, 
> Spark can evolve without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars on the class path; otherwise, 
> conflicts such as those from Guava could occur if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include third-party 
> dependencies, users who used to depend on these transitively now need to put 
> the jars on their class path explicitly.
> Ideally the above should go into the release notes.






[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33212:
-

Assignee: Chao Sun

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and 
> hadoop-client-runtime, which shade third-party dependencies such as Guava, 
> protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client, etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve 
> Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, 
> Spark can evolve without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars on the class path; otherwise, 
> conflicts such as those from Guava could occur if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include third-party 
> dependencies, users who used to depend on these transitively now need to put 
> the jars on their class path explicitly.
> Ideally the above should go into the release notes.






[jira] [Updated] (SPARK-33262) Keep pending pods in account while scheduling new pods

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33262:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Improvement)

> Keep pending pods in account while scheduling new pods
> --
>
> Key: SPARK-33262
> URL: https://issues.apache.org/jira/browse/SPARK-33262
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Comment Edited] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-01-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266350#comment-17266350
 ] 

Dongjoon Hyun edited comment on SPARK-33288 at 1/15/21, 9:17 PM:
-

Hi, [~tgraves]. To give more visibility to this big feature, I collected this 
into SPARK-33005 (K8s GA) umbrella.


was (Author: dongjoon):
Hi, [~tgraves]. To give more visibility to this big issue, I collected this 
into SPARK-33005 (K8s GA) umbrella.

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add 
> support for stage level scheduling when that is turned on.
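For context on the feature interaction described above, a hypothetical submit-time configuration might look like the sketch below; the master URL, container image, and jar path are placeholders, not values taken from this issue:

```shell
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.kubernetes.container.image=<spark-image> \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0.jar
```

With shuffle tracking enabled, executors that hold shuffle data are kept alive, which is what lets dynamic allocation work on Kubernetes without an external shuffle service.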






[jira] [Commented] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-01-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266350#comment-17266350
 ] 

Dongjoon Hyun commented on SPARK-33288:
---

Hi, [~tgraves]. To give more visibility to this big issue, I collected this 
into SPARK-33005 (K8s GA) umbrella.

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add 
> support for stage level scheduling when that is turned on.






[jira] [Updated] (SPARK-33288) Support k8s cluster manager with stage level scheduling

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33288:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: New Feature)

> Support k8s cluster manager with stage level scheduling
> ---
>
> Key: SPARK-33288
> URL: https://issues.apache.org/jira/browse/SPARK-33288
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> Kubernetes supports dynamic allocation via the 
> {{spark.dynamicAllocation.shuffleTracking.enabled}} config; we can add 
> support for stage level scheduling when that is turned on.






[jira] [Updated] (SPARK-33668) Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties."

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33668:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> Fix flaky test "Verify logging configuration is picked from the provided 
> SPARK_CONF_DIR/log4j.properties."
> --
>
> Key: SPARK-33668
> URL: https://issues.apache.org/jira/browse/SPARK-33668
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Major
> Fix For: 3.1.0
>
>
> The test is flaky, with multiple failed runs - the reason for the 
> failure has been similar to:
> {code:java}
>   The code passed to eventually never returned normally. Attempted 109 times 
> over 3.007988241397 minutes. Last failure message: Failure executing: GET 
> at: 
> https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false.
>  Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: 
> Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, 
> kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, 
> uid=null, additionalProperties={}), kind=Status, message=pods 
> "spark-pi-97a9bc76308e7fe3-exec-1" not found, 
> metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=NotFound, status=Failure, additionalProperties={}).. 
> (KubernetesSuite.scala:402)
> {code}
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console
> From the above failures, it seems that the executor finishes too quickly and is 
> removed by Spark before the test can complete. 
> So, in order to mitigate this situation, one way is to turn on the flag
> {code}
>"spark.kubernetes.executor.deleteOnTermination"
> {code}
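A sketch of that mitigation as a spark-defaults.conf entry (or an equivalent --conf flag). The issue text does not spell out the value; keeping executor pods around so their logs remain readable implies disabling deletion, so the `false` below is an assumption:

```
spark.kubernetes.executor.deleteOnTermination  false
```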






[jira] [Updated] (SPARK-33727) `gpg: keyserver receive failed: No name` during K8s IT

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33727:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Task)

> `gpg: keyserver receive failed: No name` during K8s IT
> --
>
> Key: SPARK-33727
> URL: https://issues.apache.org/jira/browse/SPARK-33727
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Project Infra, Tests
>Affects Versions: 3.0.2, 3.1.0, 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> K8s IT fails with {{gpg: keyserver receive failed: No name}}. This seems to be 
> consistent on the new Jenkins server.
> {code}
> Executing: /tmp/apt-key-gpghome.gGqC9RwptN/gpg.1.sh --keyserver 
> keys.gnupg.net --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
> gpg: keyserver receive failed: No name
> The command '/bin/sh -c echo "deb http://cloud.r-project.org/bin/linux/debian 
> buster-cran35/" >> /etc/apt/sources.list &&   apt install -y gnupg &&   
> apt-key adv --keyserver keys.gnupg.net --recv-key 
> 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' &&   apt-get update &&   apt 
> install -y -t buster-cran35 r-base r-base-dev &&   rm -rf /var/cache/apt/*' 
> returned a non-zero code: 2
> {code}
> It locally works on Mac.
> {code}
> $ gpg1 --keyserver keys.gnupg.net --recv-key 
> E19F5F87128899B192B1A2C2AD5F960A256A04AF
> gpg: requesting key 256A04AF from hkp server keys.gnupg.net
> gpg: key 256A04AF: public key "Johannes Ranke (Wissenschaftlicher Berater) 
> " imported
> gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
> gpg: depth: 0  valid:   2  signed:   1  trust: 0-, 0q, 0n, 0m, 0f, 2u
> gpg: depth: 1  valid:   1  signed:   0  trust: 1-, 0q, 0n, 0m, 0f, 0u
> gpg: Total number processed: 1
> gpg:   imported: 1  (RSA: 1)
> {code}
> It happens multiple times.
> - https://github.com/apache/spark/pull/30693
> - https://github.com/apache/spark/pull/30694






[jira] [Updated] (SPARK-33732) Kubernetes integration tests doesn't work with Minikube 1.9+

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33732:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> Kubernetes integration tests doesn't work with Minikube 1.9+
> 
>
> Key: SPARK-33732
> URL: https://issues.apache.org/jira/browse/SPARK-33732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Kubernetes integration tests don't work with Minikube 1.9+.
> This is because the location of apiserver.crt and apiserver.key has changed.






[jira] [Updated] (SPARK-33754) Update kubernetes/integration-tests/README.md to follow the default Hadoop profile updated

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33754:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Improvement)

> Update kubernetes/integration-tests/README.md to follow the default Hadoop 
> profile updated
> --
>
> Key: SPARK-33754
> URL: https://issues.apache.org/jira/browse/SPARK-33754
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> kubernetes/integration-tests/README.md describes how to run the integration 
> tests for Kubernetes as follows.
> {code}
> To run tests with Hadoop 3.2 instead of Hadoop 2.7, use `--hadoop-profile`.
> ./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-2.7
> {code}
> In the current master, the default Hadoop profile is hadoop-3.2, so it's 
> better to update the document.






[jira] [Updated] (SPARK-33874) Spark may report PodRunning if there is a sidecar that has not exited

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33874:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> Spark may report PodRunning if there is a sidecar that has not exited
> -
>
> Key: SPARK-33874
> URL: https://issues.apache.org/jira/browse/SPARK-33874
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.2, 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.1
>
>
> This is a continuation of SPARK-30821, which handles the situation where Spark 
> is still running but may have sidecar containers that have exited.






[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33711:
--
Fix Version/s: 3.0.2

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes, which could wrongly lead to the executor POD lifecycle manager 
> detecting missing PODs (PODs known by the scheduler backend but missing from 
> the POD snapshots).
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing POD detection even when only a 
> single POD has changed, without having a full, consistent snapshot of all the 
> PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.






[jira] [Updated] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33711:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes, which could wrongly lead to the executor POD lifecycle manager 
> detecting missing PODs (PODs known by the scheduler backend but missing from 
> the POD snapshots).
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing POD detection even when only a 
> single POD has changed, without having a full, consistent snapshot of all the 
> PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.






[jira] [Assigned] (SPARK-34132) Update Roxygen version references to 7.1.1

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34132:


Assignee: (was: Apache Spark)

> Update Roxygen version references to 7.1.1
> --
>
> Key: SPARK-34132
> URL: https://issues.apache.org/jira/browse/SPARK-34132
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment, the R docs and the {{DESCRIPTION}} file reference roxygen 5.0.1, 
> but all servers run 7.1.1 (SPARK-30747) and GitHub Actions installs the latest 
> package version (7.1.1 at the moment).
> The docs and the {{DESCRIPTION}} file should be updated to reflect that.






[jira] [Assigned] (SPARK-34132) Update Roxygen version references to 7.1.1

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34132:


Assignee: Apache Spark

> Update Roxygen version references to 7.1.1
> --
>
> Key: SPARK-34132
> URL: https://issues.apache.org/jira/browse/SPARK-34132
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> At the moment, the R docs and the {{DESCRIPTION}} file reference roxygen 5.0.1, 
> but all servers run 7.1.1 (SPARK-30747) and GitHub Actions installs the latest 
> package version (7.1.1 at the moment).
> The docs and the {{DESCRIPTION}} file should be updated to reflect that.






[jira] [Commented] (SPARK-34132) Update Roxygen version references to 7.1.1

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266303#comment-17266303
 ] 

Apache Spark commented on SPARK-34132:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/31200

> Update Roxygen version references to 7.1.1
> --
>
> Key: SPARK-34132
> URL: https://issues.apache.org/jira/browse/SPARK-34132
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> At the moment, the R docs and the {{DESCRIPTION}} file reference roxygen 5.0.1, 
> but all servers run 7.1.1 (SPARK-30747) and GitHub Actions installs the latest 
> package version (7.1.1 at the moment).
> The docs and the {{DESCRIPTION}} file should be updated to reflect that.






[jira] [Created] (SPARK-34132) Update Roxygen version references to 7.1.1

2021-01-15 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-34132:
--

 Summary: Update Roxygen version references to 7.1.1
 Key: SPARK-34132
 URL: https://issues.apache.org/jira/browse/SPARK-34132
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, R
Affects Versions: 3.2.0, 3.1.1
Reporter: Maciej Szymkiewicz


At the moment, the R docs and the {{DESCRIPTION}} file reference roxygen 5.0.1, 
but all servers run 7.1.1 (SPARK-30747) and GitHub Actions installs the latest 
package version (7.1.1 at the moment).

The docs and the {{DESCRIPTION}} file should be updated to reflect that.






[jira] [Commented] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266282#comment-17266282
 ] 

Apache Spark commented on SPARK-33212:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30701

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Priority: Major
>  Labels: releasenotes
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, Spark 
> can evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.
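The class-path ordering advice above can be sketched for a `hadoop-provided` deployment as follows; the jar paths and Hadoop version are placeholder assumptions, and `SPARK_DIST_CLASSPATH` is the usual hook for supplying Hadoop jars to such a build:

```shell
# Placeholder paths; the point is only the ordering: shaded client jars first.
HADOOP_SHADED_JARS="/opt/hadoop/share/hadoop/client/hadoop-client-api-3.2.2.jar:/opt/hadoop/share/hadoop/client/hadoop-client-runtime-3.2.2.jar"
# Fall back to a literal path if the `hadoop` CLI is not on this machine.
export SPARK_DIST_CLASSPATH="${HADOOP_SHADED_JARS}:$(hadoop classpath 2>/dev/null || echo '/opt/hadoop/share/hadoop/common/*')"
echo "$SPARK_DIST_CLASSPATH"
```

Because the JVM resolves classes in class-path order, listing the shaded jars first ensures their relocated Guava/protobuf classes win over any copies in non-shaded Hadoop jars.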









[jira] [Resolved] (SPARK-34037) Remove unnecessary upcasting for Avg & Sum which handle by themself internally

2021-01-15 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34037.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31079
[https://github.com/apache/spark/pull/31079]

> Remove unnecessary upcasting for Avg & Sum which handle by themself internally
> --
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Assigned] (SPARK-34037) Remove unnecessary upcasting for Avg & Sum which handle by themself internally

2021-01-15 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-34037:
---

Assignee: Kent Yao

> Remove unnecessary upcasting for Avg & Sum which handle by themself internally
> --
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Created] (SPARK-34131) NPE when driver.podTemplateFile defines no containers

2021-01-15 Thread Jacek Laskowski (Jira)
Jacek Laskowski created SPARK-34131:
---

 Summary: NPE when driver.podTemplateFile defines no containers
 Key: SPARK-34131
 URL: https://issues.apache.org/jira/browse/SPARK-34131
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1
Reporter: Jacek Laskowski


An empty pod template leads to the following NPE:

{code}
21/01/15 18:44:32 ERROR KubernetesUtils: Encountered exception while attempting 
to load initial pod spec from file
java.lang.NullPointerException
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.selectSparkContainer(KubernetesUtils.scala:108)
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.loadPodFromTemplate(KubernetesUtils.scala:88)
at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$1(KubernetesDriverBuilder.scala:36)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:32)
at 
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

{code:java}
$> cat empty-template.yml
spec:
{code}

{code}
$> ./bin/run-example \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.podTemplateFile=empty-template.yml \
  --name $POD_NAME \
  --jars local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar \
  --conf spark.kubernetes.container.image=spark:v3.0.1 \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --verbose \
   SparkPi 10
{code}

It appears that the implicit requirement is that there's at least one 
well-defined container of any name (not necessarily 
{{spark.kubernetes.driver.podTemplateContainerName}}).
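Given that observation, a workaround sketch is to declare at least one container in the template; the container name below is arbitrary (it deliberately does not need to match {{spark.kubernetes.driver.podTemplateContainerName}}):

```shell
# Write a minimal non-empty driver pod template; Spark uses the template as a
# starting point and patches it, so one named container is enough here.
cat > nonempty-template.yml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: driver
EOF
```

Passing this file via {{spark.kubernetes.driver.podTemplateFile}} instead of the empty template avoids the NPE in {{selectSparkContainer}}.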






[jira] [Updated] (SPARK-34037) Remove unnecessary upcasting for Avg & Sum which handle by themself internally

2021-01-15 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34037:

Affects Version/s: (was: 3.1.0)
   3.2.0

> Remove unnecessary upcasting for Avg & Sum which handle by themself internally
> --
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Comment Edited] (SPARK-34037) Remove unnecessary upcasting for Avg & Sum which handle by themself internally

2021-01-15 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260270#comment-17260270
 ] 

Kent Yao edited comment on SPARK-34037 at 1/15/21, 5:43 PM:


Oh, I mean that when users query for 4 attributes, for example, we may return 5 
(or maybe more) instead, with one attribute named `aggOrder`.

This attribute is added by Spark's analyzer to push *SortOrder* into 
aggregation for internal logic only, and it should not affect the final output.

*The above comment is outdated as we changed the JIRA title and description.*


was (Author: qin yao):
Oh, I mean that when users query for 4 attributes, for example, we may return 5 
(or maybe more) instead, with one attribute named `aggOrder`.

This attribute is added by Spark's analyzer to push *SortOrder* into 
aggregation for internal logic only, and it should not affect the final output.

> Remove unnecessary upcasting for Avg & Sum which handle by themself internally
> --
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Updated] (SPARK-34037) Remove unnecessary upcasting for Avg & Sum, which handle it themselves internally

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34037:
-
Summary: Remove unnecessary upcasting for Avg & Sum, which handle it 
themselves internally  (was: aggOrder should not be output as an auxiliary 
internal attribute)

> Remove unnecessary upcasting for Avg & Sum, which handle it themselves internally
> --
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Updated] (SPARK-34037) aggOrder should not be output as an auxiliary internal attribute

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34037:
-
Issue Type: Improvement  (was: Bug)

> aggOrder should not be output as an auxiliary internal attribute
> ---
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Updated] (SPARK-34037) aggOrder should not be output as an auxiliary internal attribute

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34037:
-
Description: 
The type-coercion for numeric types of average and sum is not necessary at all, 
as the resultType and sumType can prevent the overflow.



  was:
TakeOrderedAndProject 
[ca_state,cd_gender,cd_marital_status,{color:red}aggOrder{color},cd_dep_employed_count,cd_dep_college_count,cnt1,min(cd_dep_count),max(cd_dep_count),avg(cd_dep_count),cnt2,min(cd_dep_employed_count),max(cd_dep_employed_count),avg(cd_dep_employed_count),cnt3,min(cd_dep_college_count),max(cd_dep_college_count),avg(cd_dep_college_count)]


The TPC-DS plan results for q35 and q85 are messed up by this attribute


> aggOrder should not be output as an auxiliary internal attribute
> ---
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Updated] (SPARK-34037) aggOrder should not be output as an auxiliary internal attribute

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34037:
-
Priority: Major  (was: Blocker)

> aggOrder should not be output as an auxiliary internal attribute
> ---
>
> Key: SPARK-34037
> URL: https://issues.apache.org/jira/browse/SPARK-34037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> The type-coercion for numeric types of average and sum is not necessary at 
> all, as the resultType and sumType can prevent the overflow.






[jira] [Assigned] (SPARK-34130) Improve performance for char/varchar padding and length check with StaticInvoke

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34130:


Assignee: (was: Apache Spark)

> Improve performance for char/varchar padding and length check with StaticInvoke
> -
>
> Key: SPARK-34130
> URL: https://issues.apache.org/jira/browse/SPARK-34130
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> This could reduce the `generate.java` size to prevent the codegen fallback that 
> causes a performance regression.
> Here is a case from TPC-DS that could be fixed by this improvement:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/
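For context, a hedged Python sketch of the write-side semantics being optimized (the helper name is hypothetical; Spark implements this check in generated Java code, which the StaticInvoke change shrinks): CHAR(n) pads values to the declared length, and both CHAR(n) and VARCHAR(n) reject longer input.

```python
# Hypothetical helper sketching CHAR(n)/VARCHAR(n) write-side semantics,
# not Spark's generated code.
def char_write(value: str, n: int) -> str:
    if len(value) > n:
        raise ValueError(
            f"input string of length {len(value)} exceeds "
            f"char/varchar type length limitation: {n}")
    return value.ljust(n)  # space-pad to the declared length

print(repr(char_write("ab", 5)))  # 'ab   '
```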






[jira] [Commented] (SPARK-34130) Improve performance for char/varchar padding and length check with StaticInvoke

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266227#comment-17266227
 ] 

Apache Spark commented on SPARK-34130:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31199

> Improve performance for char/varchar padding and length check with StaticInvoke
> -
>
> Key: SPARK-34130
> URL: https://issues.apache.org/jira/browse/SPARK-34130
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> This could reduce the `generate.java` size to prevent the codegen fallback that 
> causes a performance regression.
> Here is a case from TPC-DS that could be fixed by this improvement:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/






[jira] [Assigned] (SPARK-34130) Improve performance for char/varchar padding and length check with StaticInvoke

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34130:


Assignee: Apache Spark

> Improve performance for char/varchar padding and length check with StaticInvoke
> -
>
> Key: SPARK-34130
> URL: https://issues.apache.org/jira/browse/SPARK-34130
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> This could reduce the `generate.java` size to prevent the codegen fallback that 
> causes a performance regression.
> Here is a case from TPC-DS that could be fixed by this improvement:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/






[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266217#comment-17266217
 ] 

Dongjoon Hyun commented on SPARK-33711:
---

Thank you so much!

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a pod (ExecutorPodsWatchSnapshotSource) informs about single-pod 
> changes, which can wrongly lead the executor pod lifecycle manager to detect 
> pods as missing (pods known by the scheduler backend but absent from the pod 
> snapshots).
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing-pod detection even when only a 
> single pod has changed, without a full, consistent snapshot of all the pods 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor pod lifecycle manager and the scheduler backend.
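The shape of the bug can be sketched in a few lines (hypothetical names, not Spark's actual classes): diffing the scheduler's known pods against a partial single-pod watch update falsely reports the other pods as missing, while diffing against a full polled snapshot does not.

```python
# Hypothetical model: missing-pod detection is only sound against a full
# snapshot, not against a single-pod watch update.
known_to_scheduler = {"exec-A", "exec-B"}

watch_update = {"exec-A"}                  # single-pod change, partial view
polled_snapshot = {"exec-A", "exec-B"}     # full, consistent view

def detect_missing(snapshot, known, snapshot_is_full):
    # Only a full snapshot can prove absence; skip detection otherwise.
    return known - snapshot if snapshot_is_full else set()

# Correct behavior: no false positives from the partial view.
assert detect_missing(watch_update, known_to_scheduler, False) == set()
assert detect_missing(polled_snapshot, known_to_scheduler, True) == set()
# The buggy behavior amounts to treating the watch update as full:
assert detect_missing(watch_update, known_to_scheduler, True) == {"exec-B"}
```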






[jira] [Created] (SPARK-34130) Improve performance for char/varchar padding and length check with StaticInvoke

2021-01-15 Thread Kent Yao (Jira)
Kent Yao created SPARK-34130:


 Summary: Improve performance for char/varchar padding and length 
check with StaticInvoke
 Key: SPARK-34130
 URL: https://issues.apache.org/jira/browse/SPARK-34130
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


This could reduce the `generate.java` size to prevent the codegen fallback that 
causes a performance regression.

Here is a case from TPC-DS that could be fixed by this improvement:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/






[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265724#comment-17265724
 ] 

dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:28 PM:
--

[https://github.com/scala/bug/issues/10436]

 


was (Author: dzcxzl):
Thread stack when not working.
 PID 117049 0x1c939

[^top.png]

[^jstack.png]

 

 

[https://github.com/scala/bug/issues/10436]

 

 

 

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to get the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.
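The shape of this optimization, as a hedged Python sketch (hypothetical names, not Spark's API): the reader accepts the status already obtained during the directory listing and only falls back to a fetch when it is absent, so the per-file status RPC disappears.

```python
# Hypothetical model of the change: the listing already produced a status per
# file; passing it into the reader avoids one getFileStatus RPC per log file.
class SingleFileReader:
    def __init__(self, path, fetch_status, status=None):
        self.path = path
        self._fetch_status = fetch_status   # stands in for the filesystem RPC
        self._status = status               # reused from the directory listing
        self.rpc_calls = 0

    def file_size(self):
        if self._status is None:            # old behavior: re-fetch per file
            self.rpc_calls += 1
            self._status = self._fetch_status(self.path)
        return self._status["len"]

listing = {"/logs/app-1": {"len": 1024}}    # from the single listing call
fetch = lambda p: listing[p]

old = SingleFileReader("/logs/app-1", fetch)                          # re-fetches
new = SingleFileReader("/logs/app-1", fetch, listing["/logs/app-1"])  # reuses
assert old.file_size() == 1024 and old.rpc_calls == 1
assert new.file_size() == 1024 and new.rpc_calls == 0
```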






[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265724#comment-17265724
 ] 

dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:27 PM:
--

Thread stack when not working.
 PID 117049 0x1c939

[^top.png]

[^jstack.png]

 

 

[https://github.com/scala/bug/issues/10436]

 

 

 


was (Author: dzcxzl):
Thread stack when not working.
 PID 117049 0x1c939

!top.png!

!jstack.png!  

 

 

[https://github.com/scala/bug/issues/10436]

 

 

 

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to get the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.






[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265724#comment-17265724
 ] 

dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:26 PM:
--

Thread stack when not working.
 PID 117049 0x1c939

!top.png!

!jstack.png!  

 

 

[https://github.com/scala/bug/issues/10436]

 

 

 


was (Author: dzcxzl):
Thread stack when not working.
PID 117049 0x1c939

!top.png!

 

!jstack.png!

 

[https://github.com/scala/bug/issues/10436]

 

 

 

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to get the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.






[jira] [Comment Edited] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265724#comment-17265724
 ] 

dzcxzl edited comment on SPARK-33790 at 1/15/21, 4:25 PM:
--

Thread stack when not working.
PID 117049 0x1c939

!top.png!

 

!jstack.png!

 

[https://github.com/scala/bug/issues/10436]

 

 

 


was (Author: dzcxzl):
Thread stack when not working
!http://git.dev.sh.ctripcorp.com/framework-di/spark-2.2.0/uploads/9cfa9662f563ac64f77f4d4ee6fd9243/image.png!

 

[https://github.com/scala/bug/issues/10436]

 

 

 

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to get the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.






[jira] [Updated] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33790:
-
Fix Version/s: 3.1.1

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to get the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.






[jira] [Commented] (SPARK-33245) Add built-in UDF - GETBIT

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266127#comment-17266127
 ] 

Apache Spark commented on SPARK-33245:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/31198

> Add built-in UDF - GETBIT 
> --
>
> Key: SPARK-33245
> URL: https://issues.apache.org/jira/browse/SPARK-33245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Teradata, Impala, Snowflake and Yellowbrick support this function:
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w
> https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit
> https://docs.snowflake.com/en/sql-reference/functions/getbit.html
> https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html
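In the linked vendor docs, the semantics boil down to extracting a single bit, with position 0 as the least significant bit. A hedged Python sketch of that behavior (not the proposed Spark implementation):

```python
def getbit(value: int, pos: int) -> int:
    """Return the bit of `value` at position `pos` (0 = least significant)."""
    return (value >> pos) & 1

# 11 = 0b1011
assert getbit(11, 0) == 1
assert getbit(11, 2) == 0
assert getbit(11, 3) == 1
```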






[jira] [Resolved] (SPARK-32598) Not able to see driver logs in spark history server in standalone mode

2021-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32598.
--
Fix Version/s: 3.1.1
   3.0.2
 Assignee: Kevin Wang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29644

> Not able to see driver logs in spark history server in standalone mode
> --
>
> Key: SPARK-32598
> URL: https://issues.apache.org/jira/browse/SPARK-32598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Sriram Ganesh
>Assignee: Kevin Wang
>Priority: Minor
> Fix For: 3.0.2, 3.1.1
>
> Attachments: image-2020-08-12-11-50-01-899.png
>
>
> Driver logs do not appear in the history server in Spark standalone mode. 
> I checked the Spark event logs and they are not there. Is this by design, or 
> can I fix it by creating a patch? I was not able to find any proper 
> documentation regarding this.
>  
> !image-2020-08-12-11-50-01-899.png!






[jira] [Updated] (SPARK-32598) Not able to see driver logs in spark history server in standalone mode

2021-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-32598:
-
Priority: Minor  (was: Major)

> Not able to see driver logs in spark history server in standalone mode
> --
>
> Key: SPARK-32598
> URL: https://issues.apache.org/jira/browse/SPARK-32598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.4
>Reporter: Sriram Ganesh
>Priority: Minor
> Attachments: image-2020-08-12-11-50-01-899.png
>
>
> Driver logs do not appear in the history server in Spark standalone mode. 
> I checked the Spark event logs and they are not there. Is this by design, or 
> can I fix it by creating a patch? I was not able to find any proper 
> documentation regarding this.
>  
> !image-2020-08-12-11-50-01-899.png!






[jira] [Assigned] (SPARK-33346) Change the never changed var to val

2021-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-33346:


Assignee: Yang Jie

> Change the never changed var to val
> ---
>
> Key: SPARK-33346
> URL: https://issues.apache.org/jira/browse/SPARK-33346
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Some local variables are declared as "var", but they are never reassigned and 
> should be declared as "val".






[jira] [Resolved] (SPARK-33346) Change the never changed var to val

2021-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-33346.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31142
[https://github.com/apache/spark/pull/31142]

> Change the never changed var to val
> ---
>
> Key: SPARK-33346
> URL: https://issues.apache.org/jira/browse/SPARK-33346
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Some local variables are declared as "var", but they are never reassigned and 
> should be declared as "val".






[jira] [Commented] (SPARK-34060) ALTER TABLE .. DROP PARTITION uncaches Hive table while updating table stats

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266072#comment-17266072
 ] 

Apache Spark commented on SPARK-34060:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31197

> ALTER TABLE .. DROP PARTITION uncaches Hive table while updating table stats
> 
>
> Key: SPARK-34060
> URL: https://issues.apache.org/jira/browse/SPARK-34060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> The example below illustrates the issue:
> {code:scala}
> scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
> scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY 
> (part)")
> 21/01/10 13:19:59 WARN HiveMetaStore: Location: 
> file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl specified for 
> non-external table:tbl
> res12: org.apache.spark.sql.DataFrame = []
> scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
> res13: org.apache.spark.sql.DataFrame = []
> scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
> res14: org.apache.spark.sql.DataFrame = []
> scala> sql("CACHE TABLE tbl")
> res15: org.apache.spark.sql.DataFrame = []
> scala> sql("SELECT * FROM tbl").show(false)
> +---++
> |id |part|
> +---++
> |0  |0   |
> |1  |1   |
> +---++
> scala> spark.catalog.isCached("tbl")
> res17: Boolean = true
> scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
> res18: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res19: Boolean = false
> {code}






[jira] [Commented] (SPARK-34060) ALTER TABLE .. DROP PARTITION uncaches Hive table while updating table stats

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266071#comment-17266071
 ] 

Apache Spark commented on SPARK-34060:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31197

> ALTER TABLE .. DROP PARTITION uncaches Hive table while updating table stats
> 
>
> Key: SPARK-34060
> URL: https://issues.apache.org/jira/browse/SPARK-34060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.2, 3.2.0, 3.1.1
>
>
> The example below illustrates the issue:
> {code:scala}
> scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
> scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY 
> (part)")
> 21/01/10 13:19:59 WARN HiveMetaStore: Location: 
> file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl specified for 
> non-external table:tbl
> res12: org.apache.spark.sql.DataFrame = []
> scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
> res13: org.apache.spark.sql.DataFrame = []
> scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
> res14: org.apache.spark.sql.DataFrame = []
> scala> sql("CACHE TABLE tbl")
> res15: org.apache.spark.sql.DataFrame = []
> scala> sql("SELECT * FROM tbl").show(false)
> +---++
> |id |part|
> +---++
> |0  |0   |
> |1  |1   |
> +---++
> scala> spark.catalog.isCached("tbl")
> res17: Boolean = true
> scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
> res18: org.apache.spark.sql.DataFrame = []
> scala> spark.catalog.isCached("tbl")
> res19: Boolean = false
> {code}






[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2021-01-15 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266063#comment-17266063
 ] 

Peter Toth commented on SPARK-29890:


[~imback82], [~cloud_fan], due to this change `fill` started to throw an 
exception when `cols` contains a column that can't be resolved. I wonder 
whether this behaviour change of `fill` is intended or more of a bug. I'm 
happy to open a fix PR if you think this side effect is unintended.

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3
>Reporter: sandeshyapuram
>Assignee: Terry Kim
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  
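The exception arises because name-based resolution cannot choose between duplicate output columns. A minimal sketch of that rule in plain Python (not Spark's resolver):

```python
# With duplicate output names, lookup by name alone must fail rather than
# silently pick one of the candidates.
def resolve(schema, name):
    matches = [i for i, col in enumerate(schema) if col == name]
    if len(matches) > 1:
        raise ValueError(
            f"Reference '{name}' is ambiguous, could be: "
            + ", ".join(name for _ in matches) + ".")
    if not matches:
        raise ValueError(f"Cannot resolve '{name}'.")
    return matches[0]

schema = ["nums", "abc", "abc"]   # post-join schema from the repro above
assert resolve(schema, "nums") == 0
try:
    resolve(schema, "abc")
except ValueError as e:
    assert "ambiguous" in str(e)
```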






[jira] [Updated] (SPARK-34129) Add table name to LogicalRelation.simpleString

2021-01-15 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34129:

Description: 
Current:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter ((((d_year#58 >= 1999) AND 
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
 parquet, Statistics(sizeInBytes=580.6 GiB)
:  : : +- Project [i_item_sk#7, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
rowCount=3.69E+5)
:  : :+- Filter (((isnotnull(i_brand_id#14) AND 
isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
:  : :   +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
:  : +- Project [d_date_sk#52], Statistics(sizeInBytes=8.6 KiB, 
rowCount=731)
:  :+- Filter ((((d_year#58 >= 1999) AND (d_year#58 <= 
2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
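SPARK-34129 proposes putting the table name right next to the `Relation` node so plans like the one above are easier to attribute to a table. A minimal sketch of that idea, using illustrative stand-in types (`SimpleRelation`, `TableIdentifier`) rather than Spark's actual `LogicalRelation`/`CatalogTable` classes:

```scala
// Sketch of SPARK-34129's idea: surface the table identifier in the
// one-line plan string. SimpleRelation and TableIdentifier are illustrative
// stand-ins, not Spark's actual LogicalRelation/CatalogTable classes.
final case class TableIdentifier(database: Option[String], table: String) {
  def unquotedString: String = database.map(_ + ".").getOrElse("") + table
}

final case class SimpleRelation(
    output: Seq[String],
    format: String,
    catalogTable: Option[TableIdentifier]) {

  // Before: Relation[i_item_sk#7,...] parquet
  // After:  Relation tpcds.item[i_item_sk#7,...] parquet
  def simpleString: String = {
    val name = catalogTable.map(id => " " + id.unquotedString).getOrElse("")
    s"Relation$name[${output.mkString(",")}] $format"
  }
}
```

With the name attached, the node prints as `Relation tpcds.item[i_item_sk#7,...] parquet` instead of an anonymous `Relation[...] parquet`.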

[jira] [Assigned] (SPARK-34129) Add table name to LogicalRelation.simpleString

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34129:


Assignee: Apache Spark

> Add table name to LogicalRelation.simpleString
> --
>
> Key: SPARK-34129
> URL: https://issues.apache.org/jira/browse/SPARK-34129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
> :  : :   +- 
> 

[jira] [Commented] (SPARK-34129) Add table name to LogicalRelation.simpleString

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266057#comment-17266057
 ] 

Apache Spark commented on SPARK-34129:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/31196

> Add table name to LogicalRelation.simpleString
> --
>
> Key: SPARK-34129
> URL: https://issues.apache.org/jira/browse/SPARK-34129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
> :  : :   

[jira] [Assigned] (SPARK-34129) Add table name to LogicalRelation.simpleString

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34129:


Assignee: (was: Apache Spark)

> Add table name to LogicalRelation.simpleString
> --
>
> Key: SPARK-34129
> URL: https://issues.apache.org/jira/browse/SPARK-34129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
> :  : :   +- 
> 

[jira] [Assigned] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34064:


Assignee: Lantao Jin  (was: Apache Spark)

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 
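The description above can be sketched as a small model: because each broadcast exchange gets an unrelated random `runId`, cancelling the statement has no handle on its broadcast sub-jobs. Grouping the runIds under the statement's execution id (an illustrative fix, not necessarily what the actual PR does) makes cancellation reach all of them:

```scala
import java.util.UUID
import scala.collection.mutable

// Illustrative model only -- not Spark's actual scheduler code.
// If every broadcast exchange gets an unrelated random runId, cancelling a
// statement cannot find its sub-jobs; registering runIds under the owning
// statement's executionId makes cancellation total.
final class BroadcastJobRegistry {
  private val jobsByStatement = mutable.Map.empty[Long, mutable.Set[UUID]]
  private val cancelled = mutable.Set.empty[UUID]

  def register(executionId: Long): UUID = {
    val runId = UUID.randomUUID() // still unique per exchange...
    jobsByStatement.getOrElseUpdate(executionId, mutable.Set.empty) += runId
    runId // ...but now discoverable via the owning statement
  }

  // Cancelling the statement cancels every broadcast sub-job as a whole.
  def cancelStatement(executionId: Long): Int = {
    val runIds = jobsByStatement.remove(executionId).getOrElse(mutable.Set.empty)
    runIds.foreach(cancelled += _)
    runIds.size
  }

  def isCancelled(runId: UUID): Boolean = cancelled(runId)
}
```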






[jira] [Assigned] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34064:


Assignee: Apache Spark  (was: Lantao Jin)

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 






[jira] [Reopened] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-34064:
-

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Minor
> Fix For: 3.1.1
>
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 






[jira] [Updated] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-01-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-34064:

Fix Version/s: (was: 3.1.1)

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.0, 3.1.1
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 






[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34128:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server will print annoying 
> TTransportExceptions. For example, the current thrift server module test in 
> the GitHub Actions workflow outputs more than 200MB of data for this error 
> alone, while the total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in the maven center,  
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.] 
>  It still uses the 0.9.3 version. 
>  
> Due to THRIFT-5274, it looks like we need to either wait for Thrift 0.14.0 to 
> be released or downgrade to 0.9.3, whichever is appropriate, to fix this issue.
>  
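Until the dependency is fixed, a common stop-gap is to mute the offending loggers in `conf/log4j.properties`. The logger names below are an assumption about where the noise originates; adjust them to whatever class actually appears in your logs:

```properties
# Hypothetical mitigation for the THRIFT-4805 noise: raise the level of the
# Thrift transport/server loggers so per-connection TTransportExceptions are
# dropped. Verify the emitting class in your own logs before relying on this.
log4j.logger.org.apache.thrift.transport=FATAL
log4j.logger.org.apache.thrift.server.TThreadPoolServer=FATAL
```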






[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266040#comment-17266040
 ] 

Apache Spark commented on SPARK-33711:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/31195

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) informs about single POD 
> changes, which could wrongly lead to the detection of missing PODs (PODs known 
> by the scheduler backend but missing from POD snapshots) by the executor POD 
> lifecycle manager.
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a 
> single pod has changed, without having a full, consistent snapshot of all the 
> PODs (see ExecutorPodsPollingSnapshotSource). The other could be a race 
> between the executor POD lifecycle manager and the scheduler backend.
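The first problem described above can be sketched as a guard: missing-POD detection is only sound against a full snapshot, never against the partial view produced by a single watch event. An illustrative model (not Spark's actual `ExecutorPodsLifecycleManager`):

```scala
// Sketch: only flag executors as missing on *full* snapshots.
// A watch event updates one pod at a time; only a full resync (e.g. from the
// polling source) gives a complete view, so a partial snapshot must never
// drive missing-pod deletion. Illustrative model, not Spark's actual code.
final case class Snapshot(podIds: Set[Long], fullSync: Boolean)

def findMissingExecutors(knownIds: Set[Long], snapshot: Snapshot): Set[Long] =
  if (snapshot.fullSync) knownIds.diff(snapshot.podIds)
  else Set.empty // partial view: cannot conclude any executor is missing
```

Under this guard, a single-pod watch event can no longer mark other executors as "not found in the cluster".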






[jira] [Created] (SPARK-34129) Add table name to LogicalRelation.simpleString

2021-01-15 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34129:
---

 Summary: Add table name to LogicalRelation.simpleString
 Key: SPARK-34129
 URL: https://issues.apache.org/jira/browse/SPARK-34129
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Current:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter ((((d_year#58 >= 1999) AND 
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
 parquet, Statistics(sizeInBytes=580.6 GiB)
:  : : +- Project [i_item_sk#7, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
rowCount=3.69E+5)
:  : :+- Filter (((isnotnull(i_brand_id#14) AND 
isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
:  : :   +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
:  : +- Project [d_date_sk#52], 

[jira] [Commented] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266025#comment-17266025
 ] 

Kent Yao commented on SPARK-34128:
--

cc [~dongjoon] [~srowen] [~hyukjin.kwon]

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server will print annoying 
> TTransportExceptions. For example, the current thrift server module test in 
> the GitHub Actions workflow outputs more than 200MB of data for this error 
> alone, while the total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in the maven center,  
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.] 
>  It still uses the 0.9.3 version. 
>  
> Due to THRIFT-5274, it looks like we need to either wait for Thrift 0.14.0 to 
> be released or downgrade to 0.9.3, whichever is appropriate, to fix this issue.
>  






[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34128:
-
Description: 
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to THRIFT-4805, the Spark Thrift Server will print annoying TExceptions. For 
example, the current thrift server module test in the GitHub Actions workflow 
outputs more than 200MB of data for this error alone, while the total size of 
the test log is only about 1GB.

 

I checked the latest `hive-service-rpc` module in Maven Central,  
[https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]  
It still uses the 0.9.3 version.

 

Due to THRIFT-5274, it looks like we need to wait for the thrift 0.14.0 release 
or downgrade to 0.9.3 to fix this issue, whichever is appropriate.

 

  was:
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to THRIFT-4805, The SparkThrift Server will print annoying TExceptions. For 
example, the current thrift server module test in Github action workflow 
outputs more than 200MB of data for this error only, while the total size of 
the test log only about 1GB.

 

I checked the latest `hive-service-rpc` module in the maven center,  
[https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]  
It still use the 0.9.3 version.

 

Due to THRIFT-5274 , It looks like we need to 

 


>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server will print annoying TExceptions. 
> For example, the current thrift server module test in the GitHub Actions 
> workflow outputs more than 200MB of data for this error alone, while the 
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,  
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.] 
>  It still uses the 0.9.3 version. 
>  
> Due to THRIFT-5274, it looks like we need to wait for the thrift 0.14.0 
> release or downgrade to 0.9.3 to fix this issue, whichever is appropriate.
>  






[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34128:
-
Description: 
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to THRIFT-4805, the Spark Thrift Server will print annoying TExceptions. For 
example, the current thrift server module test in the GitHub Actions workflow 
outputs more than 200MB of data for this error alone, while the total size of 
the test log is only about 1GB.

 

I checked the latest `hive-service-rpc` module in Maven Central,  
[https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]  
It still uses the 0.9.3 version.

 

Due to THRIFT-5274 , It looks like we need to 

 

  was:
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to THRIFT-4805, The SparkThrift Server will print annoying TExceptions. For 
example, the current thrift server module test in Github action workflow 
outputs more than 200MB of data for this error only, while the total size of 
the test log only about 1GB.

 

I checked the latest `hive-service-rpc` module in the maven center,  
[https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]  
It still use the 0.9.3 version.

 

Due to https://issues.apache.org/jira/browse/THRIFT-5274

 


>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server will print annoying TExceptions. 
> For example, the current thrift server module test in the GitHub Actions 
> workflow outputs more than 200MB of data for this error alone, while the 
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,  
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.] 
>  It still uses the 0.9.3 version.
>  
> Due to THRIFT-5274 , It looks like we need to 
>  






[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34128:
-
Description: 
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to THRIFT-4805, the Spark Thrift Server will print annoying TExceptions. For 
example, the current thrift server module test in the GitHub Actions workflow 
outputs more than 200MB of data for this error alone, while the total size of 
the test log is only about 1GB.

 

I checked the latest `hive-service-rpc` module in Maven Central,  
[https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]  
It still uses the 0.9.3 version.

 

Due to https://issues.apache.org/jira/browse/THRIFT-5274

 

  was:
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to, The SparkThrift Server will print annoying  THRIFT-4805

THRIFT-4805


>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server will print annoying TExceptions. 
> For example, the current thrift server module test in the GitHub Actions 
> workflow outputs more than 200MB of data for this error alone, while the 
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,  
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.] 
>  It still uses the 0.9.3 version.
>  
> Due to https://issues.apache.org/jira/browse/THRIFT-5274
>  






[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34128:
-
Description: 
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to, The SparkThrift Server will print annoying  THRIFT-4805

THRIFT-4805

  was:
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to, The SparkThrift Server will print annoying  

THRIFT-4805


>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to, The SparkThrift Server will print annoying  THRIFT-4805
> THRIFT-4805






[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34128:
-
Description: 
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

Due to, The SparkThrift Server will print annoying  

THRIFT-4805

  was:
Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

 

because of to Thrift


>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.
> Due to, The SparkThrift Server will print annoying  
> THRIFT-4805






[jira] [Created] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-01-15 Thread Kent Yao (Jira)
Kent Yao created SPARK-34128:


 Summary:  Suppress excessive logging of TTransportExceptions in 
Spark ThriftServer
 Key: SPARK-34128
 URL: https://issues.apache.org/jira/browse/SPARK-34128
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.1.0
Reporter: Kent Yao


Since Spark 3.0, the `libthrift` has been bumped up from 0.9.3 to 0.12.0.

 

because of to Thrift






[jira] [Updated] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-15 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-33790:
-
Fix Version/s: 3.0.2

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.0.2, 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This avoids a large number of RPC calls and improves the speed of the history 
> server.
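The shape of the optimization can be sketched outside Spark: the caller hands the FileStatus it already fetched during listing to the reader, which only falls back to an RPC when no status was supplied (class and method names below are illustrative, not Spark's actual API):

```python
class FakeFileSystem:
    """Stand-in for an HDFS client that counts RPC round trips."""
    def __init__(self):
        self.rpc_calls = 0

    def get_file_status(self, path):
        self.rpc_calls += 1  # each call would be one RPC to the NameNode
        return {"path": path, "length": 1024}

class SingleFileReader:
    """Reader that reuses a FileStatus fetched by the caller when available."""
    def __init__(self, fs, path, status=None):
        self.fs = fs
        self.path = path
        self._status = status  # reuse the listing result instead of re-fetching

    def file_size(self):
        if self._status is None:  # fall back to an RPC only when necessary
            self._status = self.fs.get_file_status(self.path)
        return self._status["length"]

fs = FakeFileSystem()
status = fs.get_file_status("/history/app-1")    # fetched once during listing
reader = SingleFileReader(fs, "/history/app-1", status)
size = reader.file_size()                        # served from the cached status
```

With thousands of event log files, saving one round trip per file during each scan is exactly the kind of win the ticket describes.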






[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34118:
-
Fix Version/s: 3.1.1
   3.0.2

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.1
>
>
> Semantic consistency and code simplification
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  
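The same refactoring can be checked in Python, where any/all play the role of Scala's exists/forall and, unlike the filter forms, short-circuit instead of materializing an intermediate collection:

```python
seq = [1, 2, 3, 4]
p = lambda x: x > 2  # sample predicate

# Each "before" form builds a filtered list just to test emptiness;
# the "after" form stops at the first witness.
assert (len([x for x in seq if p(x)]) == 0) == (not any(p(x) for x in seq))
assert (len([x for x in seq if p(x)]) > 0) == any(p(x) for x in seq)
assert (len([x for x in seq if not p(x)]) == 0) == all(p(x) for x in seq)
assert (len([x for x in seq if not p(x)]) > 0) == (not all(p(x) for x in seq))
```

The equivalences hold for the empty sequence too: `all` over nothing is vacuously true and `any` over nothing is false, matching Scala's forall/exists.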






[jira] [Resolved] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34118.
--
Fix Version/s: (was: 3.1.1)
   (was: 3.2.0)
   2.4.8
   Resolution: Fixed

Issue resolved by pull request 31192
[https://github.com/apache/spark/pull/31192]

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 2.4.8
>
>
> Semantic consistency and code simplification
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  






[jira] [Commented] (SPARK-34127) Support table valued command

2021-01-15 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265934#comment-17265934
 ] 

Yuming Wang commented on SPARK-34127:
-

+1

> Support table valued command
> 
>
> Key: SPARK-34127
> URL: https://issues.apache.org/jira/browse/SPARK-34127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Some commands are used to display metadata, such as SHOW TABLES, SHOW TABLE 
> EXTENDED, and SHOW TBLPROPERTIES.
> If the output has more rows than the screen height, the output is very 
> unfriendly to developers.
> So we should have a way to filter the output, like the behavior of SELECT ... 
> FROM ... WHERE.
> We could adopt the implementation of table-valued functions.






[jira] [Commented] (SPARK-34098) HadoopVersionInfoSuite failed when running maven test in Scala 2.13

2021-01-15 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265897#comment-17265897
 ] 

Yang Jie commented on SPARK-34098:
--

Working on this

> HadoopVersionInfoSuite failed when running maven test in Scala 2.13
> --
>
> Key: SPARK-34098
> URL: https://issues.apache.org/jira/browse/SPARK-34098
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
>  
> {code:java}
> mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -pl sql/hive -Pscala-2.13  
> -DwildcardSuites=org.apache.spark.sql.hive.client.HadoopVersionInfoSuite 
> -Dtest=none 
> {code}
> When HadoopVersionInfoSuite is run independently, all cases pass, but executing
>  
> {code:java}
> mvn clean test -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn 
> -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive 
> -pl sql/hive -Pscala-2.13 
> {code}
>  
> HadoopVersionInfoSuite failed as follows:
> {code:java}
> HadoopVersionInfoSuite: 22:32:30.310 WARN 
> org.apache.spark.sql.hive.client.IsolatedClientLoader: Failed to resolve 
> Hadoop artifacts for the version 2.7.4. We will change the hadoop version 
> from 2.7.4 to 2.7.4 and try again. It is recommended to set jars used by Hive 
> metastore client through spark.sql.hive.metastore.jars in the production 
> environment. - SPARK-32256: Hadoop VersionInfo should be preloaded *** FAILED 
> *** java.lang.RuntimeException: [unresolved dependency: 
> org.apache.hive#hive-metastore;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-exec;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-common;2.0.1: not found, unresolved dependency: 
> org.apache.hive#hive-serde;2.0.1: not found, unresolved dependency: 
> com.google.guava#guava;14.0.1: not found, unresolved dependency: 
> org.apache.hadoop#hadoop-client;2.7.4: not found] at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1423)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
>  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:75)
>  at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63)
>  at 
> org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:46)
>  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
> This needs some investigation.
>  
>  
>  






[jira] [Created] (SPARK-34127) Support table valued command

2021-01-15 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-34127:
--

 Summary: Support table valued command
 Key: SPARK-34127
 URL: https://issues.apache.org/jira/browse/SPARK-34127
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: jiaan.geng


Some commands are used to display metadata, such as SHOW TABLES, SHOW TABLE 
EXTENDED, and SHOW TBLPROPERTIES.

If the output has more rows than the screen height, the output is very 
unfriendly to developers.
So we should have a way to filter the output, like the behavior of SELECT ... 
FROM ... WHERE.

We could adopt the implementation of table-valued functions.
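The requested behaviour can be illustrated with a small sketch: treat the command output as rows and apply a WHERE-style predicate to them, which is what exposing the command as a table-valued function would enable in SQL (the row shape below is illustrative, not Spark's actual SHOW TABLES schema):

```python
# Rows a metadata command such as SHOW TABLES might return, as
# hypothetical (database, tableName, isTemporary) tuples.
rows = [
    ("default", "sales",       False),
    ("default", "tmp_scratch", True),
    ("marts",   "orders",      False),
]

def where(rows, predicate):
    """Filter command output with a WHERE-style predicate."""
    return [row for row in rows if predicate(row)]

# Keep only permanent tables, as `... WHERE isTemporary = false` would.
permanent = where(rows, lambda row: not row[2])
```

The point is that once the command output is modeled as a relation, the ordinary relational operators (filter, project, sort) come for free instead of scrolling past pages of output.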






[jira] [Commented] (SPARK-34127) Support table valued command

2021-01-15 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265874#comment-17265874
 ] 

jiaan.geng commented on SPARK-34127:


I'm working on it.

> Support table valued command
> 
>
> Key: SPARK-34127
> URL: https://issues.apache.org/jira/browse/SPARK-34127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> Some commands are used to display metadata, such as SHOW TABLES, SHOW TABLE 
> EXTENDED, and SHOW TBLPROPERTIES.
> If the output has more rows than the screen height, the output is very 
> unfriendly to developers.
> So we should have a way to filter the output, like the behavior of SELECT ... 
> FROM ... WHERE.
> We could adopt the implementation of table-valued functions.






[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-15 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265863#comment-17265863
 ] 

Attila Zsolt Piros commented on SPARK-33711:


Sure. Working on that.

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) reports single-POD changes, 
> which could wrongly lead the executor POD lifecycle manager to detect missing 
> PODs (PODs known to the scheduler backend but missing from POD snapshots).
> A key indicator of this is seeing this log msg:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is running the missing-POD detection even when only a 
> single pod changes, without a fully consistent snapshot of all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.






[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34118:
-
Fix Version/s: 3.1.1

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0, 3.1.1
>
>
> Semantic consistency and code simplification
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  






[jira] [Assigned] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34118:


Assignee: Yang Jie

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplification
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  






[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-15 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Affects Version/s: 3.0.1

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7, 3.0.1
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines it was running perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the cost adds up.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to not make it that expensive.
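The suggested lazy val can be mimicked in Python with a cached function: the environment is inspected on the first call only, and every later call is answered from the cache (a sketch of the caching idea only, not Spark's actual Utils.isTesting):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=1)
def is_testing() -> bool:
    """Check the environment once and reuse the answer, like a Scala `lazy val`."""
    return "SPARK_TESTING" in os.environ

is_testing()  # first call inspects os.environ
is_testing()  # later calls are answered from the cache
```

Since the environment of a JVM (or Python) process cannot change behind its back in the scenario described, caching the result once is safe and removes the per-call cost entirely.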






[jira] [Comment Edited] (SPARK-34115) Long runtime on many environment variables

2021-01-15 Thread Norbert Schultz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265834#comment-17265834
 ] 

Norbert Schultz edited comment on SPARK-34115 at 1/15/21, 8:58 AM:
---

Added demonstration code based upon Spark 2.4.7

Call
 * show_fast.sh for regular running time
 * show_slow.sh for a lot of environment variables

Running time (locally)
 * fast: 4000ms
 * slow: 11303ms

The calculation done is completely useless but should give Spark SQL something 
to optimize

(Also tried Spark 3.0.1, showing the same behaviour) 


was (Author: nob13):
Added demonstration code based upon Spark 2.4.7

Call
 * show_fast.sh for regular running time
 * show_slow.sh for a lot of environment variables

Running time (locally)
 * fast: 4000ms
 * slow: 11303ms

The calculation done is completely useless but should give Spark SQL something 
to optimize

 

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines it was running perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the cost adds up.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to not make it that expensive.






[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-15 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Component/s: SQL

> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 2.4.7
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
> Attachments: spark-bug-34115.tar.gz
>
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines it was running perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the cost adds up.
>  
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
>  
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
>  
> to not make it that expensive.





