[jira] [Created] (SPARK-34124) Upgrade jackson version to fix CVE-2020-36179 in Spark 2.4

2021-01-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-34124:


 Summary: Upgrade jackson version to fix CVE-2020-36179 in Spark 2.4
 Key: SPARK-34124
 URL: https://issues.apache.org/jira/browse/SPARK-34124
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.7
Reporter: Yang Jie


 
{code:java}
FasterXML jackson-databind 2.x before 2.9.10.8 mishandles the interaction 
between serialization gadgets and typing, related to 
oadd.org.apache.commons.dbcp.cpdsadapter.DriverAdapterCPDS.{code}
 

[CVE-2020-36179|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-36179]

Spark 2.4.7 is still using Jackson 2.6.7.
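
For downstream builds that cannot wait for an upgraded Spark release, here is a minimal sketch (assuming an sbt build; Spark's own Maven build would instead bump its Jackson version property) of forcing the patched artifact. The version shown is the first one containing the fix per the CVE text; whether Spark 2.4 can adopt it directly is the subject of this ticket.
{code:scala}
// build.sbt -- sketch only: force jackson-databind to a release that contains the
// CVE-2020-36179 fix (2.9.10.8 or later) in a project that depends on Spark 2.4.x.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.10.8"
{code}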



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34117:


Assignee: (was: Apache Spark)

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
> handle [this case (rewritten from 
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d

[jira] [Commented] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265743#comment-17265743
 ] 

Apache Spark commented on SPARK-34117:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/31193

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
> handle [this case (rewritten from 
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 K

[jira] [Assigned] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34117:


Assignee: Apache Spark

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
> handle [this case (rewritten from 
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter ((((d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52

[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265735#comment-17265735
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31192

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplification.
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  
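
As a quick sanity check of the rewrites above, a minimal self-contained Scala snippet (the `seq` and `p` values are placeholders for the example):
{code:scala}
// Each original form on the left is equivalent to the rewritten form on the right.
val seq = Seq(1, 2, 3, 4)
val p: Int => Boolean = _ % 2 == 0

assert((seq.filter(p).size == 0)  == !seq.exists(p))  // no element matches
assert((seq.filter(p).length > 0) == seq.exists(p))   // at least one element matches
assert(seq.filterNot(p).isEmpty   == seq.forall(p))   // every element matches
assert(seq.filterNot(p).nonEmpty  == !seq.forall(p))  // some element does not match
{code}
Besides reading more directly, exists/forall short-circuit and avoid building the intermediate filtered collection.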



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34123) Faster way to display/render entries in HistoryPage (Spark history server summary page)

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34123:


Assignee: (was: Apache Spark)

> Faster way to display/render entries in HistoryPage (Spark history server 
> summary page)
> ---
>
> Key: SPARK-34123
> URL: https://issues.apache.org/jira/browse/SPARK-34123
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Mohanad Elsafty
>Priority: Major
> Attachments: Screenshot 2021-01-15 at 1.21.40 PM.png
>
>
> For a long time my team/company has suffered from the history server being very 
> slow to display/search entries, especially when they grow beyond 50k entries; 
> even though that page already has pagination, it is still very slow to display 
> the entries.
>   
> Currently, *Mustache.js* is used to render the entries and *datatables* is used 
> to manipulate them (sort by column and search).
>  
> Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
> using *datatables* to display them proved to be faster.
>  
> Displaying > 100k entries (my case):
> The existing page takes at least 30 to 40 seconds to display the entries, 
> searching takes at least 20 seconds, and the page stops responding until it 
> finishes.
> The improved page takes ~3 seconds to display the entries, searching is very 
> fast, and the page stays responsive.
> *(These numbers will be different for others since the JS is executed in your 
> browser)*
>  
> I am not sure why *Mustache* is used to display the data, since *datatables* 
> can do the job.
> [~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
> *Mustache*? What are the drawbacks if it's no longer used to display the 
> entries (only this part)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265731#comment-17265731
 ] 

Jungtaek Lim commented on SPARK-33790:
--

Oh, OK. I haven't encountered the issue, but Scala's mutable HashMap looks to have 
the issue...

Would you mind filing a separate JIRA issue and raising a PR against branch-2.4? 
2.4.x is still a supported version, so the PR would be reviewed and accepted even 
though it does not apply to 3.x.

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.
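
A simplified sketch of the idea (class and method names below are illustrative, not the actual Spark code; only the Hadoop FileSystem calls are real API): reuse the FileStatus that the directory listing already returned instead of issuing another getFileStatus RPC per log file.
{code:scala}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical reader that is handed the already-fetched FileStatus, so asking
// for the file size does not need a second round trip to the NameNode.
class EventLogReaderSketch(status: FileStatus) {
  def fileSize: Long = status.getLen   // reuse the listing result, no extra RPC
}

def scanLogs(fs: FileSystem, logDir: Path): Seq[EventLogReaderSketch] =
  fs.listStatus(logDir).toSeq          // one RPC for the whole directory
    .filter(_.isFile)
    .map(st => new EventLogReaderSketch(st))
{code}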



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34123) Faster way to display/render entries in HistoryPage (Spark history server summary page)

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265730#comment-17265730
 ] 

Apache Spark commented on SPARK-34123:
--

User 'mohan3d' has created a pull request for this issue:
https://github.com/apache/spark/pull/31191

> Faster way to display/render entries in HistoryPage (Spark history server 
> summary page)
> ---
>
> Key: SPARK-34123
> URL: https://issues.apache.org/jira/browse/SPARK-34123
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Mohanad Elsafty
>Priority: Major
> Attachments: Screenshot 2021-01-15 at 1.21.40 PM.png
>
>
> For a long time my team/company has suffered from the history server being very 
> slow to display/search entries, especially when they grow beyond 50k entries; 
> even though that page already has pagination, it is still very slow to display 
> the entries.
>   
> Currently, *Mustache.js* is used to render the entries and *datatables* is used 
> to manipulate them (sort by column and search).
>  
> Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
> using *datatables* to display them proved to be faster.
>  
> Displaying > 100k entries (my case):
> The existing page takes at least 30 to 40 seconds to display the entries, 
> searching takes at least 20 seconds, and the page stops responding until it 
> finishes.
> The improved page takes ~3 seconds to display the entries, searching is very 
> fast, and the page stays responsive.
> *(These numbers will be different for others since the JS is executed in your 
> browser)*
>  
> I am not sure why *Mustache* is used to display the data, since *datatables* 
> can do the job.
> [~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
> *Mustache*? What are the drawbacks if it's no longer used to display the 
> entries (only this part)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34123) Faster way to display/render entries in HistoryPage (Spark history server summary page)

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34123:


Assignee: Apache Spark

> Faster way to display/render entries in HistoryPage (Spark history server 
> summary page)
> ---
>
> Key: SPARK-34123
> URL: https://issues.apache.org/jira/browse/SPARK-34123
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Mohanad Elsafty
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screenshot 2021-01-15 at 1.21.40 PM.png
>
>
> For a long time my team/company has suffered from the history server being very 
> slow to display/search entries, especially when they grow beyond 50k entries; 
> even though that page already has pagination, it is still very slow to display 
> the entries.
>   
> Currently, *Mustache.js* is used to render the entries and *datatables* is used 
> to manipulate them (sort by column and search).
>  
> Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
> using *datatables* to display them proved to be faster.
>  
> Displaying > 100k entries (my case):
> The existing page takes at least 30 to 40 seconds to display the entries, 
> searching takes at least 20 seconds, and the page stops responding until it 
> finishes.
> The improved page takes ~3 seconds to display the entries, searching is very 
> fast, and the page stays responsive.
> *(These numbers will be different for others since the JS is executed in your 
> browser)*
>  
> I am not sure why *Mustache* is used to display the data, since *datatables* 
> can do the job.
> [~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
> *Mustache*? What are the drawbacks if it's no longer used to display the 
> entries (only this part)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265724#comment-17265724
 ] 

dzcxzl commented on SPARK-33790:


Thread stack when not working
!http://git.dev.sh.ctripcorp.com/framework-di/spark-2.2.0/uploads/9cfa9662f563ac64f77f4d4ee6fd9243/image.png!

 

[https://github.com/scala/bug/issues/10436]

 

 

 

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34123) Faster way to display/render entries in HistoryPage (Spark history server summary page)

2021-01-14 Thread Mohanad Elsafty (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohanad Elsafty updated SPARK-34123:

Attachment: Screenshot 2021-01-15 at 1.21.40 PM.png

> Faster way to display/render entries in HistoryPage (Spark history server 
> summary page)
> ---
>
> Key: SPARK-34123
> URL: https://issues.apache.org/jira/browse/SPARK-34123
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Mohanad Elsafty
>Priority: Major
> Attachments: Screenshot 2021-01-15 at 1.21.40 PM.png
>
>
> For a long time my team/company has suffered from the history server being very 
> slow to display/search entries, especially when they grow beyond 50k entries; 
> even though that page already has pagination, it is still very slow to display 
> the entries.
>   
> Currently, *Mustache.js* is used to render the entries and *datatables* is used 
> to manipulate them (sort by column and search).
>  
> Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
> using *datatables* to display them proved to be faster.
>  
> Displaying > 100k entries (my case):
> The existing page takes at least 30 to 40 seconds to display the entries, 
> searching takes at least 20 seconds, and the page stops responding until it 
> finishes.
> The improved page takes ~3 seconds to display the entries, searching is very 
> fast, and the page stays responsive.
> *(These numbers will be different for others since the JS is executed in your 
> browser)*
>  
> I am not sure why *Mustache* is used to display the data, since *datatables* 
> can do the job.
> [~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
> *Mustache*? What are the drawbacks if it's no longer used to display the 
> entries (only this part)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34123) Faster way to display/render entries in HistoryPage (Spark history server summary page)

2021-01-14 Thread Mohanad Elsafty (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohanad Elsafty updated SPARK-34123:

Description: 
For a long time my team/company has suffered from the history server being very 
slow to display/search entries, especially when they grow beyond 50k entries; even 
though that page already has pagination, it is still very slow to display the 
entries.

Currently, *Mustache.js* is used to render the entries and *datatables* is used to 
manipulate them (sort by column and search).

Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
using *datatables* to display them proved to be faster.

Displaying > 100k entries (my case):

The existing page takes at least 30 to 40 seconds to display the entries, searching 
takes at least 20 seconds, and the page stops responding until it finishes.

The improved page takes ~3 seconds to display the entries, searching is very fast, 
and the page stays responsive.

*(These numbers will be different for others since the JS is executed in your 
browser)*

I am not sure why *Mustache* is used to display the data, since *datatables* can do 
the job.

[~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
*Mustache*? What are the drawbacks if it's no longer used to display the entries 
(only this part)?

  was:
For a long time my team/company has suffered from the history server being very 
slow to display/search entries, especially when they grow beyond 50k entries; even 
though that page already has pagination, it is still very slow to display the 
entries.

  !image-2021-01-15-13-53-16-446.png|width=844,height=151!

Currently, *Mustache.js* is used to render the entries and *datatables* is used to 
manipulate them (sort by column and search).

Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
using *datatables* to display them proved to be faster.

Displaying > 100k entries (my case):

The existing page takes at least 30 to 40 seconds to display the entries, searching 
takes at least 20 seconds, and the page stops responding until it finishes.

The improved page takes ~3 seconds to display the entries, searching is very fast, 
and the page stays responsive.

*(These numbers will be different for others since the JS is executed in your 
browser)*

I am not sure why *Mustache* is used to display the data, since *datatables* can do 
the job.

[~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
*Mustache*? What are the drawbacks if it's no longer used to display the entries 
(only this part)?


> Faster way to display/render entries in HistoryPage (Spark history server 
> summary page)
> ---
>
> Key: SPARK-34123
> URL: https://issues.apache.org/jira/browse/SPARK-34123
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Mohanad Elsafty
>Priority: Major
>
> For a long time my team/company has suffered from the history server being very 
> slow to display/search entries, especially when they grow beyond 50k entries; 
> even though that page already has pagination, it is still very slow to display 
> the entries.
>   
> Currently, *Mustache.js* is used to render the entries and *datatables* is used 
> to manipulate them (sort by column and search).
>  
> Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
> using *datatables* to display them proved to be faster.
>  
> Displaying > 100k entries (my case):
> The existing page takes at least 30 to 40 seconds to display the entries, 
> searching takes at least 20 seconds, and the page stops responding until it 
> finishes.
> The improved page takes ~3 seconds to display the entries, searching is very 
> fast, and the page stays responsive.
> *(These numbers will be different for others since the JS is executed in your 
> browser)*
>  
> I am not sure why *Mustache* is used to display the data, since *datatables* 
> can do the job.
> [~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
> *Mustache*? What are the drawbacks if it's no longer used to display the 
> entries (only this part)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34123) Faster way to display/render entries in HistoryPage (Spark history server summary page)

2021-01-14 Thread Mohanad Elsafty (Jira)
Mohanad Elsafty created SPARK-34123:
---

 Summary: Faster way to display/render entries in HistoryPage 
(Spark history server summary page)
 Key: SPARK-34123
 URL: https://issues.apache.org/jira/browse/SPARK-34123
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.2.0
Reporter: Mohanad Elsafty


For a long time my team/company has suffered from the history server being very 
slow to display/search entries, especially when they grow beyond 50k entries; even 
though that page already has pagination, it is still very slow to display the 
entries.

  !image-2021-01-15-13-53-16-446.png|width=844,height=151!

Currently, *Mustache.js* is used to render the entries and *datatables* is used to 
manipulate them (sort by column and search).

Getting rid of *Mustache* (no longer rendering the entries with *Mustache*) and 
using *datatables* to display them proved to be faster.

Displaying > 100k entries (my case):

The existing page takes at least 30 to 40 seconds to display the entries, searching 
takes at least 20 seconds, and the page stops responding until it finishes.

The improved page takes ~3 seconds to display the entries, searching is very fast, 
and the page stays responsive.

*(These numbers will be different for others since the JS is executed in your 
browser)*

I am not sure why *Mustache* is used to display the data, since *datatables* can do 
the job.

[~ajbozarth] [~sowen], could you please elaborate: what is the reason to use 
*Mustache*? What are the drawbacks if it's no longer used to display the entries 
(only this part)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265708#comment-17265708
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31190

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplification.
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265707#comment-17265707
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31190

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplification.
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32864) Support ORC forced positional evolution

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32864.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 29737
[https://github.com/apache/spark/pull/29737]

> Support ORC forced positional evolution
> ---
>
> Key: SPARK-32864
> URL: https://issues.apache.org/jira/browse/SPARK-32864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Minor
> Fix For: 3.2.0
>
>
> Hive respects the "orc.force.positional.evolution" config; Spark should do so as 
> well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32864) Support ORC forced positional evolution

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32864:
-

Assignee: Peter Toth

> Support ORC forced positional evolution
> ---
>
> Key: SPARK-32864
> URL: https://issues.apache.org/jira/browse/SPARK-32864
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Minor
>
> Hive respects the "orc.force.positional.evolution" config; Spark should do so as 
> well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34122) Remove duplicated branches in case when

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34122:


Assignee: Apache Spark

> Remove duplicated branches in case when
> ---
>
> Key: SPARK-34122
> URL: https://issues.apache.org/jira/browse/SPARK-34122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
>
> CaseWhen with duplicated branches could be deduplicated, for example:
> {code}
> SELECT CASE WHEN key = 1 THEN 1 WHEN key = 1 THEN 1 WHEN key = 1 THEN 1
> ELSE 2 END FROM testData WHERE key = 1 group by key
> {code}
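
A minimal sketch of the deduplication over a simplified model (the case classes below are illustrative stand-ins, not the actual Catalyst Expression API): keep only the first occurrence of each branch, since a later identical branch can never fire.
{code:scala}
// Simplified stand-ins for Catalyst expressions, for illustration only.
case class Branch(condition: String, value: String)
case class CaseWhenSketch(branches: Seq[Branch], elseValue: Option[String])

// Removing an exact duplicate of an earlier branch preserves semantics
// (assuming deterministic conditions): the earlier copy always wins.
def dedupBranches(cw: CaseWhenSketch): CaseWhenSketch =
  cw.copy(branches = cw.branches.distinct)

val cw = CaseWhenSketch(
  Seq(Branch("key = 1", "1"), Branch("key = 1", "1"), Branch("key = 1", "1")),
  Some("2"))
assert(dedupBranches(cw) == CaseWhenSketch(Seq(Branch("key = 1", "1")), Some("2")))
{code}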



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34122) Remove duplicated branches in case when

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34122:


Assignee: (was: Apache Spark)

> Remove duplicated branches in case when
> ---
>
> Key: SPARK-34122
> URL: https://issues.apache.org/jira/browse/SPARK-34122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lantao Jin
>Priority: Minor
>
> CaseWhen with duplicated branches could be deduplicated, for example:
> {code}
> SELECT CASE WHEN key = 1 THEN 1 WHEN key = 1 THEN 1 WHEN key = 1 THEN 1
> ELSE 2 END FROM testData WHERE key = 1 group by key
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34122) Remove duplicated branches in case when

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265702#comment-17265702
 ] 

Apache Spark commented on SPARK-34122:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/31189

> Remove duplicated branches in case when
> ---
>
> Key: SPARK-34122
> URL: https://issues.apache.org/jira/browse/SPARK-34122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lantao Jin
>Priority: Minor
>
> CaseWhen with duplicated branches could be deduplicated, for example:
> {code}
> SELECT CASE WHEN key = 1 THEN 1 WHEN key = 1 THEN 1 WHEN key = 1 THEN 1
> ELSE 2 END FROM testData WHERE key = 1 group by key
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265700#comment-17265700
 ] 

Jungtaek Lim commented on SPARK-33790:
--

{quote}
The following is my case 2.x version EventLoggingListener.codecMap is of type 
mutable.HashMap, which is not thread-safe and may hang.
{quote}

Could you please elaborate on the situation in which the hang can occur?

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34122) Remove duplicated branches in case when

2021-01-14 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-34122:
--

 Summary: Remove duplicated branches in case when
 Key: SPARK-34122
 URL: https://issues.apache.org/jira/browse/SPARK-34122
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Lantao Jin


CaseWhen with duplicated branches could be deduplicated, for example:
{code}
SELECT CASE WHEN key = 1 THEN 1 WHEN key = 1 THEN 1 WHEN key = 1 THEN 1
ELSE 2 END FROM testData WHERE key = 1 group by key
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34095) Spark 3.0.1 is giving WARN while running job

2021-01-14 Thread Sachit Murarka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachit Murarka updated SPARK-34095:
---
Priority: Minor  (was: Major)

> Spark 3.0.1 is giving WARN while running job
> 
>
> Key: SPARK-34095
> URL: https://issues.apache.org/jira/browse/SPARK-34095
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.1
>Reporter: Sachit Murarka
>Priority: Minor
>
> I am running Spark 3.0.1 with Java 11 and getting the following WARN:
>  
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> ([file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar|file://opt/spark/jars/spark-unsafe_2.12-3.0.1.jar])
>  to constructor java.nio.DirectByteBuffer(long,int)
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265697#comment-17265697
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31188

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplification.
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265695#comment-17265695
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31187

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265693#comment-17265693
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31187

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread dzcxzl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265691#comment-17265691
 ] 

dzcxzl commented on SPARK-33790:


This is indeed a performance regression problem.

The following is my case: in the 2.x version, EventLoggingListener.codecMap is of 
type mutable.HashMap, which is not thread-safe and may hang.

In the 3.x version, EventLogFileReader.codecMap was changed to ConcurrentHashMap.

In the 2.x version, the history server may not work.

I tried the 3.x version and found that a scan round slowed down a lot: 7 min rose 
to about 23 min.

In addition, do I need to fix the thread-safety issue in version 2.x?

[~kabhwan]
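
For context, a minimal sketch of the difference being described (the names below are illustrative, not the actual Spark fields): a plain mutable.HashMap shared across history-server threads is unsafe, while a ConcurrentHashMap is not.
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

// 2.x-style sketch: concurrent updates to an unsynchronized mutable.HashMap can
// corrupt its internal table and make lookups spin (the "hang" mentioned above).
val unsafeCodecCache = mutable.HashMap.empty[String, String]

// 3.x-style sketch: ConcurrentHashMap provides atomic, thread-safe get-or-insert.
val safeCodecCache = new ConcurrentHashMap[String, String]()
safeCodecCache.putIfAbsent("lz4", "placeholder-codec-instance")
{code}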

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34114:
---

Assignee: Kent Yao

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34114) should not trim right for read-side char length check and padding

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34114.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31181
[https://github.com/apache/spark/pull/31181]

> should not trim right for read-side char length check and padding
> -
>
> Key: SPARK-34114
> URL: https://issues.apache.org/jira/browse/SPARK-34114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Critical
> Fix For: 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265679#comment-17265679
 ] 

Apache Spark commented on SPARK-33790:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/31186

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has a FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again in SingleFileEventLogFileReader#fileSizeForLastIndex.
> This can eliminate a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34101) Make spark-sql CLI configurable for the behavior of printing header by SET command

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34101.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31173
[https://github.com/apache/spark/pull/31173]

> Make spark-sql CLI configurable for the behavior of printing header by SET 
> command
> --
>
> Key: SPARK-34101
> URL: https://issues.apache.org/jira/browse/SPARK-34101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> Like the Hive CLI, the spark-sql CLI accepts the hive.cli.print.header property, 
> which changes whether the header is printed.
> But the spark-sql CLI doesn't allow users to change Hive-specific configurations 
> dynamically with the SET command.
> So it's better to support changing this behavior via the SET command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265674#comment-17265674
 ] 

Hyukjin Kwon commented on SPARK-12890:
--

[~nchammas], shall we file a new JIRA? Your description looks much better than 
the current one and is easier to follow.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which references only partition fields. The query ends up 
> scanning all the data, which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34111:


Assignee: (was: Apache Spark)

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> After SPARK-33705, we now happen to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> It can potentially cause an issue, and we should remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265668#comment-17265668
 ] 

Apache Spark commented on SPARK-34111:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31185

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> After SPARK-33705, we now happen to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> This can potentially cause an issue, so we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34111:


Assignee: Apache Spark

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Blocker
>
> After SPARK-33705, we now happen to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> This can potentially cause an issue, so we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34111) Deconflict the jars jakarta.servlet-api-4.0.3.jar and javax.servlet-api-3.1.0.jar

2021-01-14 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265666#comment-17265666
 ] 

Kent Yao commented on SPARK-34111:
--

I have found the problem, LOL

In Hadoop 3.2 (*should be*):

 javax.servlet : javax.servlet-api

In Hadoop 2.7:

 javax.servlet : servlet-api

> Deconflict the jars jakarta.servlet-api-4.0.3.jar and 
> javax.servlet-api-3.1.0.jar
> -
>
> Key: SPARK-34111
> URL: https://issues.apache.org/jira/browse/SPARK-34111
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> After SPARK-33705, we now happen to have two jars in the release artifact 
> with Hadoop 3:
> {{dev/deps/spark-deps-hadoop-3.2-hive-2.3}}:
> {code}
> ...
> jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
> ...
> javax.servlet-api/3.1.0//javax.servlet-api-3.1.0.jar
> ...
> {code}
> This can potentially cause an issue, so we had better remove 
> {{javax.servlet-api-3.1.0.jar}}, which is apparently only required for YARN 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265662#comment-17265662
 ] 

Jungtaek Lim commented on SPARK-33790:
--

I've revisited this and realized it is a performance regression for event log 
v1. (SPARK-28869 caused the regression.)

I'll submit PRs for the branches below; this should be fixed in 3.1.x / 3.0.x as 
well.

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has the FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again when SingleFileEventLogFileReader#fileSizeForLastIndex is called.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.
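> A rough sketch of the idea (the class below is a simplified stand-in, not the 
> actual Spark internals): reuse the FileStatus the caller already holds instead 
> of issuing another getFileStatus RPC.
> {code:scala}
> import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
>
> // Hypothetical reader: accepts an optional, already-fetched FileStatus.
> class EventLogReaderSketch(fs: FileSystem, path: Path, cached: Option[FileStatus]) {
>   // Only fall back to an RPC when no status was supplied by the caller.
>   private lazy val status: FileStatus = cached.getOrElse(fs.getFileStatus(path))
>
>   def fileSizeForLastIndex: Long = status.getLen
> }
> {code}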



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34121) Intersect operator missing rowCount when CBO enabled

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34121:
---

 Summary: Intersect operator missing rowCount when CBO enabled
 Key: SPARK-34121
 URL: https://issues.apache.org/jira/browse/SPARK-34121
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang
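
A possible repro sketch, mirroring the setup used in the sibling CBO sub-tasks 
later in this thread (table names ta/tb are illustrative):

{code:scala}
spark.sql("CREATE TABLE ta USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE tb USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE ta COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE tb COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
// Check whether the Intersect node carries a rowCount in its Statistics.
spark.sql("SELECT * FROM ta INTERSECT SELECT * FROM tb").explain("cost")
{code}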






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34099) Refactor table caching in `DataSourceV2Strategy`

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34099:
---

Assignee: Maxim Gekk

> Refactor table caching in `DataSourceV2Strategy`
> 
>
> Key: SPARK-34099
> URL: https://issues.apache.org/jira/browse/SPARK-34099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, invalidateCache() performs 3 calls to the Cache Manager to refresh 
> the table cache. We can perform only one call, recacheByPlan().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34099) Refactor table caching in `DataSourceV2Strategy`

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34099.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31172
[https://github.com/apache/spark/pull/31172]

> Refactor table caching in `DataSourceV2Strategy`
> 
>
> Key: SPARK-34099
> URL: https://issues.apache.org/jira/browse/SPARK-34099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, invalidateCache() performs 3 calls to the Cache Manager to refresh 
> the table cache. We can perform only one call, recacheByPlan().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34119:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Stats are missing if the filter conditions contain a dynamicpruning expression; 
> we should keep these stats after partition pruning:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
>

[jira] [Updated] (SPARK-33956) Add rowCount for Range operator

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33956:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Add rowCount for Range operator
> ---
>
> Key: SPARK-33956
> URL: https://issues.apache.org/jira/browse/SPARK-33956
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("select id from range(100)").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, 
> rowCount=100)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33954) Some operator missing rowCount when enable CBO

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33954:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Some operator missing rowCount when enable CBO
> --
>
> Key: SPARK-33954
> URL: https://issues.apache.org/jira/browse/SPARK-33954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Some operators are missing rowCount when CBO is enabled, for example:
> {code:scala}
> spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("set spark.sql.cbo.planStats.enabled=true")
> spark.sql("select * from (select * from t1 distribute by a limit 100) 
> distribute by b").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
> +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
>+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
>   +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
>  +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 
> KiB, rowCount=1.00E+3)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, 
> rowCount=100)
> +- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
>+- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
>   +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, 
> rowCount=1.00E+3)
>  +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 
> KiB, rowCount=1.00E+3)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34031) Union operator missing rowCount when enable CBO

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34031:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Union operator missing rowCount when enable CBO
> ---
>
> Key: SPARK-34031
> URL: https://issues.apache.org/jira/browse/SPARK-34031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
> spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Union false, false, Statistics(sizeInBytes=320.0 B)
> :- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> +- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> {noformat}
> Expected
> {noformat}
> == Optimized Logical Plan ==
> Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
> :- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> +- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33959) Improve the statistics estimation of the Tail

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33959:

Parent: SPARK-34120
Issue Type: Sub-task  (was: Improvement)

> Improve the statistics estimation of the Tail
> -
>
> Key: SPARK-33959
> URL: https://issues.apache.org/jira/browse/SPARK-33959
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:scala}
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as 
> e").write.saveAsTable("t1")
> println(Tail(Literal(5), spark.sql("SELECT * FROM 
> t1").queryExecution.logical).queryExecution.explainString(org.apache.spark.sql.execution.CostMode))
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=3.8 KiB)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
> +- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34120) Improve the statistics estimation

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34120:
---

 Summary: Improve the statistics estimation
 Key: SPARK-34120
 URL: https://issues.apache.org/jira/browse/SPARK-34120
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


This umbrella ticket aims to track improvements in statistics estimation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265631#comment-17265631
 ] 

Yuming Wang commented on SPARK-34117:
-

I'm working on it.

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
> handle [this case (rewritten from 
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#

[jira] [Commented] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-14 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265657#comment-17265657
 ] 

Yuming Wang commented on SPARK-34119:
-

[~leoluan] Would you like to work on this?

> Keep necessary stats after partition pruning
> 
>
> Key: SPARK-34119
> URL: https://issues.apache.org/jira/browse/SPARK-34119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Stats are missing if the filter conditions contain a dynamicpruning expression; 
> we should keep these stats after partition pruning:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
>  4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
> :  : : : +- 
> Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
>  parquet, Statistics(sizeInBytes=580.6 GiB)
> :  : : +- Project [i_item_sk#7, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
> rowCount=3.69E+5)
> :  : :+- Filter (((isnotnull(i_brand_id#14) AND 
> isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
> isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 

[jira] [Created] (SPARK-34119) Keep necessary stats after partition pruning

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34119:
---

 Summary: Keep necessary stats after partition pruning
 Key: SPARK-34119
 URL: https://issues.apache.org/jira/browse/SPARK-34119
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


Stats are missing if the filter conditions contain a dynamicpruning expression; 
we should keep these stats after partition pruning:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter d_year#58 >= 1999) AND 
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_amt#42,ss_ext_sales_price#43,ss_ext_wholesale_cost#44,ss_ext_list_price#45,ss_ext_tax#46,ss_coupon_amt#47,ss_net_paid#48,ss_net_paid_inc_tax#49,ss_net_profit#50,ss_sold_date_sk#51]
 parquet, Statistics(sizeInBytes=580.6 GiB)
:  : : +- Project [i_item_sk#7, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=8.5 MiB, 
rowCount=3.69E+5)
:  : :+- Filter (((isnotnull(i_brand_id#14) AND 
isnotnull(i_class_id#16)) AND isnotnull(i_category_id#18)) AND 
isnotnull(i_item_sk#7)), Statistics(sizeInBytes=150.0 MiB, rowCount=3.69E+5)
:  : :   +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, S

[jira] [Updated] (SPARK-33790) Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

2021-01-14 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-33790:
-
Priority: Critical  (was: Trivial)

> Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader
> 
>
> Key: SPARK-33790
> URL: https://issues.apache.org/jira/browse/SPARK-33790
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Critical
> Fix For: 3.2.0
>
>
> FsHistoryProvider#checkForLogs already has the FileStatus when constructing 
> SingleFileEventLogFileReader, so there is no need to fetch the FileStatus 
> again when SingleFileEventLogFileReader#fileSizeForLastIndex is called.
> This can avoid a lot of RPC calls and improve the speed of the history 
> server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34118:
-
Fix Version/s: 3.2.0

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.2.0
>
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-34118:
-
Summary: Replaces filter and check for emptiness with exists or forall  
(was: Replaces filter and check for emptiness with exists or forall.)

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34118) Replaces filter and check for emptiness with exists or forall.

2021-01-14 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-34118:
-
Description: 
Semantic consistency and code simplifications

before:
{code:java}
seq.filter(p).size == 0
seq.filter(p).length > 0
seq.filterNot(p).isEmpty
seq.filterNot(p).nonEmpty
{code}
after:
{code:java}
!seq.exists(p)
seq.exists(p)
seq.forall(p)
!seq.forall(p)
{code}
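
A quick sanity check of the rewrites (a self-contained sketch; `nums` and `p` 
are made-up values, not from the patch):

{code:scala}
val nums = Seq(1, 2, 3)
val p = (n: Int) => n > 2

// Each pair below is semantically equivalent; the right-hand side avoids
// materializing an intermediate filtered collection.
assert((nums.filter(p).size == 0)  == !nums.exists(p))
assert((nums.filter(p).length > 0) == nums.exists(p))
assert(nums.filterNot(p).isEmpty   == nums.forall(p))
assert(nums.filterNot(p).nonEmpty  == !nums.forall(p))
{code}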
 

> Replaces filter and check for emptiness with exists or forall.
> --
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34118) Replaces filter and check for emptiness with exists or forall.

2021-01-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-34118:


 Summary: Replaces filter and check for emptiness with exists or 
forall.
 Key: SPARK-34118
 URL: https://issues.apache.org/jira/browse/SPARK-34118
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34117:
---

 Summary: Disable LeftSemi/LeftAnti push down over Aggregate
 Key: SPARK-34117
 URL: https://issues.apache.org/jira/browse/SPARK-34117
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
handle [this case (rewritten from 
q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
{code:sql}
SELECT i_item_sk ss_item_sk
  FROM item,
(SELECT
  distinct
  iss.i_brand_id brand_id,
  iss.i_class_id class_id,
  iss.i_category_id category_id
FROM store_sales, item iss, date_dim d1
WHERE ss_item_sk = iss.i_item_sk
  AND ss_sold_date_sk = d1.d_date_sk
  AND d1.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  ics.i_brand_id,
  ics.i_class_id,
  ics.i_category_id
FROM catalog_sales, item ics, date_dim d2
WHERE cs_item_sk = ics.i_item_sk
  AND cs_sold_date_sk = d2.d_date_sk
  AND d2.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  iws.i_brand_id,
  iws.i_class_id,
  iws.i_category_id
FROM web_sales, item iws, date_dim d3
WHERE ws_item_sk = iws.i_item_sk
  AND ws_sold_date_sk = d3.d_date_sk
  AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
  WHERE i_brand_id = brand_id
AND i_class_id = class_id
AND i_category_id = category_id;
{code}

Optimized Logical Plan:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter d_year#58 >= 1999) AND 
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk

[jira] [Updated] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34117:

Description: 
SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
handle [this case (rewritten from 
q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
{code:sql}
SELECT i_item_sk ss_item_sk
  FROM item,
(SELECT
  distinct
  iss.i_brand_id brand_id,
  iss.i_class_id class_id,
  iss.i_category_id category_id
FROM store_sales, item iss, date_dim d1
WHERE ss_item_sk = iss.i_item_sk
  AND ss_sold_date_sk = d1.d_date_sk
  AND d1.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  ics.i_brand_id,
  ics.i_class_id,
  ics.i_category_id
FROM catalog_sales, item ics, date_dim d2
WHERE cs_item_sk = ics.i_item_sk
  AND cs_sold_date_sk = d2.d_date_sk
  AND d2.d_year BETWEEN 1999 AND 1999 + 2
INTERSECT
SELECT
distinct
  iws.i_brand_id,
  iws.i_class_id,
  iws.i_category_id
FROM web_sales, item iws, date_dim d3
WHERE ws_item_sk = iws.i_item_sk
  AND ws_sold_date_sk = d3.d_date_sk
  AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
  WHERE i_brand_id = brand_id
AND i_class_id = class_id
AND i_category_id = category_id;
{code}

Optimized Logical Plan:
{noformat}
== Optimized Logical Plan ==
Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
+- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
class_id#160)) AND (i_category_id#18 = category_id#161)), 
Statistics(sizeInBytes=2.42E+28 B)
   :- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
   :  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
rowCount=3.69E+5)
   : +- 
Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
 parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
   +- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
  +- Aggregate [brand_id#159, class_id#160, category_id#161], 
[brand_id#159, class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 
B)
 +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND (class_id#160 
<=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
(class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> i_category_id#18)), 
Statistics(sizeInBytes=2.73E+21 B)
:  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
class_id#160, i_category_id#18 AS category_id#161], 
Statistics(sizeInBytes=2.73E+21 B)
:  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
Statistics(sizeInBytes=3.83E+21 B)
:  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
:  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
Statistics(sizeInBytes=516.5 PiB)
:  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
Statistics(sizeInBytes=61.1 GiB)
:  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
Statistics(sizeInBytes=580.6 GiB)
:  : : : :  +- Project [d_date_sk#52], 
Statistics(sizeInBytes=8.6 KiB, rowCount=731)
:  : : : : +- Filter d_year#58 >= 1999) AND 
(d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
Statistics(sizeInBytes=175.6 KiB, rowCount=731)
:  : : : :+- 
Relation[d_date_sk#52,d_date_id#53,d_date#54,d_month_seq#55,d_week_seq#56,d_quarter_seq#57,d_year#58,d_dow#59,d_moy#60,d_dom#61,d_qoy#62,d_fy_year#63,d_fy_quarter_seq#64,d_fy_week_seq#65,d_day_name#66,d_quarter_name#67,d_holiday#68,d_weekend#69,d_following_holiday#70,d_first_dom#71,d_last_dom#72,d_same_day_ly#73,d_same_day_lq#74,d_current_day#75,...
 4 more fields] parquet, Statistics(sizeInBytes=17.1 MiB, rowCount=7.30E+4)
:  : : : +- 
Relation[ss_sold_time_sk#29,ss_item_sk#30,ss_customer_sk#31,ss_cdemo_sk#32,ss_hdemo_sk#33,ss_addr_sk#34,ss_store_sk#35,ss_promo_sk#36,ss_ticket_number#37L,ss_quantity#38,ss_wholesale_cost#39,ss_list_price#40,ss_sales_price#41,ss_ext_discount_a

[jira] [Updated] (SPARK-34117) Disable LeftSemi/LeftAnti push down over Aggregate

2021-01-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34117:

Attachment: disable_pushdown.jpg
current.jpg

> Disable LeftSemi/LeftAnti push down over Aggregate
> --
>
> Key: SPARK-34117
> URL: https://issues.apache.org/jira/browse/SPARK-34117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: current.jpg, disable_pushdown.jpg
>
>
> SPARK-34081 improves TPC-DS q38 and q87, but it still cannot 
> handle [this case (rewritten from 
> q14b)|https://github.com/apache/spark/blob/a78d6ce376edf2a8836e01f47b9dff5371058d4c/sql/core/src/test/resources/tpcds/q14b.sql#L2-L32]:
> {code:sql}
> SELECT i_item_sk ss_item_sk
>   FROM item,
> (SELECT
>   distinct
>   iss.i_brand_id brand_id,
>   iss.i_class_id class_id,
>   iss.i_category_id category_id
> FROM store_sales, item iss, date_dim d1
> WHERE ss_item_sk = iss.i_item_sk
>   AND ss_sold_date_sk = d1.d_date_sk
>   AND d1.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   ics.i_brand_id,
>   ics.i_class_id,
>   ics.i_category_id
> FROM catalog_sales, item ics, date_dim d2
> WHERE cs_item_sk = ics.i_item_sk
>   AND cs_sold_date_sk = d2.d_date_sk
>   AND d2.d_year BETWEEN 1999 AND 1999 + 2
> INTERSECT
> SELECT
> distinct
>   iws.i_brand_id,
>   iws.i_class_id,
>   iws.i_category_id
> FROM web_sales, item iws, date_dim d3
> WHERE ws_item_sk = iws.i_item_sk
>   AND ws_sold_date_sk = d3.d_date_sk
>   AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
>   WHERE i_brand_id = brand_id
> AND i_class_id = class_id
> AND i_category_id = category_id;
> {code}
> Optimized Logical Plan:
> {noformat}
> == Optimized Logical Plan ==
> Project [i_item_sk#7 AS ss_item_sk#162], Statistics(sizeInBytes=8.07E+27 B)
> +- Join Inner, (((i_brand_id#14 = brand_id#159) AND (i_class_id#16 = 
> class_id#160)) AND (i_category_id#18 = category_id#161)), 
> Statistics(sizeInBytes=2.42E+28 B)
>:- Project [i_item_sk#7, i_brand_id#14, i_class_id#16, i_category_id#18], 
> Statistics(sizeInBytes=8.5 MiB, rowCount=3.69E+5)
>:  +- Filter ((isnotnull(i_brand_id#14) AND isnotnull(i_class_id#16)) AND 
> isnotnull(i_category_id#18)), Statistics(sizeInBytes=150.0 MiB, 
> rowCount=3.69E+5)
>: +- 
> Relation[i_item_sk#7,i_item_id#8,i_rec_start_date#9,i_rec_end_date#10,i_item_desc#11,i_current_price#12,i_wholesale_cost#13,i_brand_id#14,i_brand#15,i_class_id#16,i_class#17,i_category_id#18,i_category#19,i_manufact_id#20,i_manufact#21,i_size#22,i_formulation#23,i_color#24,i_units#25,i_container#26,i_manager_id#27,i_product_name#28]
>  parquet, Statistics(sizeInBytes=151.1 MiB, rowCount=3.72E+5)
>+- Aggregate [brand_id#159, class_id#160, category_id#161], [brand_id#159, 
> class_id#160, category_id#161], Statistics(sizeInBytes=2.73E+21 B)
>   +- Aggregate [brand_id#159, class_id#160, category_id#161], 
> [brand_id#159, class_id#160, category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
>  +- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :- Join LeftSemi, (((brand_id#159 <=> i_brand_id#14) AND 
> (class_id#160 <=> i_class_id#16)) AND (category_id#161 <=> 
> i_category_id#18)), Statistics(sizeInBytes=2.73E+21 B)
> :  :- Project [i_brand_id#14 AS brand_id#159, i_class_id#16 AS 
> class_id#160, i_category_id#18 AS category_id#161], 
> Statistics(sizeInBytes=2.73E+21 B)
> :  :  +- Join Inner, (ss_sold_date_sk#51 = d_date_sk#52), 
> Statistics(sizeInBytes=3.83E+21 B)
> :  : :- Project [ss_sold_date_sk#51, i_brand_id#14, 
> i_class_id#16, i_category_id#18], Statistics(sizeInBytes=387.3 PiB)
> :  : :  +- Join Inner, (ss_item_sk#30 = i_item_sk#7), 
> Statistics(sizeInBytes=516.5 PiB)
> :  : : :- Project [ss_item_sk#30, ss_sold_date_sk#51], 
> Statistics(sizeInBytes=61.1 GiB)
> :  : : :  +- Filter ((isnotnull(ss_item_sk#30) AND 
> isnotnull(ss_sold_date_sk#51)) AND dynamicpruning#168 [ss_sold_date_sk#51]), 
> Statistics(sizeInBytes=580.6 GiB)
> :  : : : :  +- Project [d_date_sk#52], 
> Statistics(sizeInBytes=8.6 KiB, rowCount=731)
> :  : : : : +- Filter d_year#58 >= 1999) AND 
> (d_year#58 <= 2001)) AND isnotnull(d_year#58)) AND isnotnull(d_date_sk#52)), 
> Statistics(sizeInBytes=175.6 KiB, rowCount=731)
> :  : : : :+- 
> Relation[d_date_sk#52,d_date

[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265648#comment-17265648
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31184

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34118:


Assignee: (was: Apache Spark)

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34118:


Assignee: Apache Spark

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34118) Replaces filter and check for emptiness with exists or forall

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265647#comment-17265647
 ] 

Apache Spark commented on SPARK-34118:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31184

> Replaces filter and check for emptiness with exists or forall
> -
>
> Key: SPARK-34118
> URL: https://issues.apache.org/jira/browse/SPARK-34118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> Semantic consistency and code simplifications
> before:
> {code:java}
> seq.filter(p).size == 0
> seq.filter(p).length > 0
> seq.filterNot(p).isEmpty
> seq.filterNot(p).nonEmpty
> {code}
> after:
> {code:java}
> !seq.exists(p)
> seq.exists(p)
> seq.forall(p)
> !seq.forall(p)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265611#comment-17265611
 ] 

Hyukjin Kwon commented on SPARK-34080:
--

I am okay with having this in Spark 3.1.1.

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. They are named after the underlying test, which 
> requires users to understand the test to find the matching scenarios. It would 
> be nice to introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to specify which score function to 
> use, it is friendlier to ask them to specify the feature and label types 
> (continuous or categorical) and set a default score function for each combo. 
> We can also detect the types from feature metadata if given. Advanced users 
> can override it (if there are multiple score functions compatible with the 
> feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]
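
As a rough sketch only (the class, param names and the exact mapping below are assumptions, not the finalized API), the "type combo picks a default score function" idea could look like this in Scala:

{code:java}
// Illustrative sketch; not the actual Spark ML API.
sealed trait ScoreFunction
case object ChiSquared  extends ScoreFunction // categorical feature vs. categorical label
case object AnovaFTest  extends ScoreFunction // one side continuous, the other categorical
case object FRegression extends ScoreFunction // continuous feature vs. continuous label

def defaultScoreFunction(featureType: String, labelType: String): ScoreFunction =
  (featureType, labelType) match {
    case ("categorical", "categorical") => ChiSquared
    case ("continuous", "categorical")  => AnovaFTest
    case ("categorical", "continuous")  => AnovaFTest
    case ("continuous", "continuous")   => FRegression
    case other =>
      throw new IllegalArgumentException(s"Unsupported feature/label type combo: $other")
  }
{code}

Advanced users would then only override the default when several score functions fit the same combo.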



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34080:
-
Priority: Critical  (was: Major)

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Critical
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. They are named after the underlying test, which 
> requires users to understand the test to find the matching scenarios. It would 
> be nice to introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to specify which score function to 
> use, it is friendlier to ask them to specify the feature and label types 
> (continuous or categorical) and set a default score function for each combo. 
> We can also detect the types from feature metadata if given. Advanced users 
> can override it (if there are multiple score functions compatible with the 
> feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34080) Add UnivariateFeatureSelector to deprecate existing selectors

2021-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34080:
-
Target Version/s: 3.1.1  (was: 3.2.0)

> Add UnivariateFeatureSelector to deprecate existing selectors
> -
>
> Key: SPARK-34080
> URL: https://issues.apache.org/jira/browse/SPARK-34080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> In SPARK-26111, we introduced a few univariate feature selectors, which share 
> a common set of params. They are named after the underlying test, which 
> requires users to understand the test to find the matching scenarios. It would 
> be nice to introduce a single class called UnivariateFeatureSelector that 
> accepts a selection criterion and a score method (string names). Then we can 
> deprecate all other univariate selectors.
> For the params, instead of asking users to specify which score function to 
> use, it is friendlier to ask them to specify the feature and label types 
> (continuous or categorical) and set a default score function for each combo. 
> We can also detect the types from feature metadata if given. Advanced users 
> can override it (if there are multiple score functions compatible with the 
> feature type and label type combo). Example (param names 
> are not finalized):
> {code}
> selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], 
> labelCol=["target"], featureType="categorical", labelType="continuous", 
> select="bestK", k=100)
> {code}
> cc: [~huaxingao] [~ruifengz] [~weichenxu123]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265604#comment-17265604
 ] 

Dongjoon Hyun commented on SPARK-33711:
---

+1 for having this in branch-3.0. Could you make a backport, [~attilapiros]?

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) reports single POD changes, 
> which can wrongly lead the executor POD lifecycle manager to detect missing 
> PODs (PODs known by the scheduler backend but missing from POD snapshots).
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is that the missing-POD detection runs even when only 
> a single pod has changed, without a full, consistent snapshot of all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.
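
A deliberately simplified sketch (not Spark's actual classes) of why comparing the executors known to the scheduler backend against a snapshot derived from a single watch event can mark healthy executors as missing:

{code:java}
// Hypothetical simplification in Scala: a snapshot built from one watch event
// only contains the pod that changed, not the full cluster state.
val knownToScheduler    = Set("exec-1", "exec-2", "exec-3")
val singleEventSnapshot = Set("exec-2")

// Running missing-POD detection against the partial snapshot:
val wronglyMarkedMissing = knownToScheduler -- singleEventSnapshot
// wronglyMarkedMissing == Set("exec-1", "exec-3"), although both are still alive.
{code}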



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30747) Update roxygen2 to 7.0.1

2021-01-14 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265566#comment-17265566
 ] 

Maciej Szymkiewicz commented on SPARK-30747:


Thanks for working on this [~shaneknapp]!

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Shane Knapp
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old 
> (2015-11-11) so it could be a good idea to use current R updates to update it 
> as well.
> At crude inspection:
> * SPARK-22430 has been resolved a while ago.
> * SPARK-30737 and SPARK-27262,  https://github.com/apache/spark/pull/27437 and 
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved persistent warnings
> * Documentation builds and CRAN checks pass
> * Generated HTML docs are identical to 5.0.1
> Since {{roxygen2}} shares some potentially unstable dependencies with 
> {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in 
> sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being 
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34116.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31183
[https://github.com/apache/spark/pull/31183]

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34116:
--
Fix Version/s: (was: 3.1.0)
   3.1.1

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.1
>
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34116:
--
Affects Version/s: 3.1.1

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33711) Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns

2021-01-14 Thread Shashi Kangayam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265426#comment-17265426
 ] 

Shashi Kangayam commented on SPARK-33711:
-

[~attilapiros]

We have our jobs on Spark-3.0.1 that are reflecting the same behavior. 

Can you please back port this fix to Spark-3.0

>  Race condition in Spark k8s Pod lifecycle manager that leads to shutdowns
> --
>
> Key: SPARK-33711
> URL: https://issues.apache.org/jira/browse/SPARK-33711
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.0, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
>
> Watching a POD (ExecutorPodsWatchSnapshotSource) reports single POD changes, 
> which can wrongly lead the executor POD lifecycle manager to detect missing 
> PODs (PODs known by the scheduler backend but missing from POD snapshots).
> A key indicator of this is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't 
> get a reason why. Marking the executor as failed. The executor may have been 
> deleted but the driver missed the deletion event."
> So one of the problems is that the missing-POD detection runs even when only 
> a single pod has changed, without a full, consistent snapshot of all the PODs 
> (see ExecutorPodsPollingSnapshotSource). The other could be a race between 
> the executor POD lifecycle manager and the scheduler backend.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30747) Update roxygen2 to 7.0.1

2021-01-14 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-30747.
-
Resolution: Fixed

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Shane Knapp
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old 
> (2015-11-11) so it could be a good idea to use current R updates to update it 
> as well.
> At crude inspection:
> * SPARK-22430 has been resolved a while ago.
> * SPARK-30737 and SPARK-27262,  https://github.com/apache/spark/pull/27437 and 
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved persistent warnings
> * Documentation builds and CRAN checks pass
> * Generated HTML docs are identical to 5.0.1
> Since {{roxygen2}} shares some potentially unstable dependencies with 
> {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in 
> sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being 
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30747) Update roxygen2 to 7.0.1

2021-01-14 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265142#comment-17265142
 ] 

Shane Knapp commented on SPARK-30747:
-

we're currently running 7.1.1 on all of the workers.

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Shane Knapp
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old 
> (2015-11-11) so it could be a good idea to use current R updates to update it 
> as well.
> At crude inspection:
> * SPARK-22430 has been resolved a while ago.
> * SPARK-30737 and SPARK-27262,  https://github.com/apache/spark/pull/27437 and 
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved persistent warnings
> * Documentation builds and CRAN checks pass
> * Generated HTML docs are identical to 5.0.1
> Since {{roxygen2}} shares some potentially unstable dependencies with 
> {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in 
> sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being 
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6

2021-01-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265140#comment-17265140
 ] 

Dongjoon Hyun commented on SPARK-29183:
---

Yey! It will be great. Thank you for sharing that, [~shaneknapp]. I'm looking 
forward to seeing it.

> Upgrade JDK 11 Installation to 11.0.6
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression 
> fixes. We had better upgrade it to the latest 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2021-01-14 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp closed SPARK-31693.
---

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6

2021-01-14 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265139#comment-17265139
 ] 

Shane Knapp commented on SPARK-29183:
-

in the next ~month or so, we'll be reimaging the remaining ubuntu workers and 
java11 will be at 11.0.9+.  the new ubuntu 20 workers (r-j-w-01..06) all 
currently have 11.0.9 installed.

> Upgrade JDK 11 Installation to 11.0.6
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression 
> fixes. We had better upgrade it to the latest 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2021-01-14 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-31693.
-
Resolution: Fixed

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34116:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34116:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34116:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265110#comment-17265110
 ] 

Apache Spark commented on SPARK-34116:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31183

> Separate state store numKeys metric test
> 
>
> Key: SPARK-34116
> URL: https://issues.apache.org/jira/browse/SPARK-34116
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed 
> with the numKeys metric test. I found it flaky when I was testing with another 
> StateStore implementation. Specifically, we are also able to check these 
> metrics after the state store is updated (committed). So I think we can 
> refactor the test a little bit to make it easier to incorporate other 
> StateStore implementations externally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34116) Separate state store numKeys metric test

2021-01-14 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-34116:
---

 Summary: Separate state store numKeys metric test
 Key: SPARK-34116
 URL: https://issues.apache.org/jira/browse/SPARK-34116
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Right now in StateStoreSuite, the tests of get/put/remove/commit are mixed with 
the numKeys metric test. I found it flaky when I was testing with another 
StateStore implementation. Specifically, we are also able to check these 
metrics after the state store is updated (committed). So I think we can refactor 
the test a little bit to make it easier to incorporate other StateStore 
implementations externally.
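
A rough sketch of the intended split (the test names and helpers below are illustrative, not the actual suite code):

{code:java}
test("get, put, remove, commit") {
  val store = newStoreForTest()   // hypothetical helper
  put(store, "a", 1)              // hypothetical helper
  remove(store, "a")              // hypothetical helper
  store.commit()
  // only behavioural assertions here
}

test("numKeys metric after commit") {
  val store = newStoreForTest()   // hypothetical helper
  put(store, "a", 1)
  put(store, "b", 2)
  store.commit()
  // the metric is checked only after the store has been committed
  assert(store.metrics.numKeys === 2)
}
{code}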



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253044#comment-17253044
 ] 

Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:


I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/65724151/877069]. But it seems like Spark 
should handle this situation better.


was (Author: nchammas):
I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark 
should handle this situation better.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.
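
For reference, a minimal sketch of the catalog-based workaround mentioned in the comment above, assuming the data is registered as a partitioned table (the table and partition column names are illustrative):

{code:java}
// Answer "max(file_date)" from partition metadata instead of scanning the data files.
val maxFileDate = spark.sql("SHOW PARTITIONS some_table")
  .collect()
  .map(_.getString(0))                 // e.g. "file_date=2018-10-15"
  .map(_.stripPrefix("file_date="))
  .max                                 // lexicographic max works for ISO-formatted dates

println(maxFileDate)
{code}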



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265049#comment-17265049
 ] 

Apache Spark commented on SPARK-34108:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31182

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra Project operator, which makes the comparison of canonicalized 
> plans during cache lookup fail.
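
One way to observe whether the second SELECT actually picks up the cached view (a minimal sketch using the tables from the description; the check itself is an assumption, not part of the ticket):

{code:java}
import org.apache.spark.sql.execution.columnar.InMemoryRelation

spark.sql("CACHE TABLE v1")
val planWithCache = spark.sql("SELECT key FROM t ORDER BY key")
  .queryExecution.withCachedData

// If the cached view is used, the plan contains an InMemoryRelation node.
val usesCache = planWithCache.collect { case r: InMemoryRelation => r }.nonEmpty
println(s"uses cache: $usesCache")
{code}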



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34108:


Assignee: (was: Apache Spark)

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra Project operator, which makes the comparison of canonicalized 
> plans during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34108:


Assignee: Apache Spark

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra Project operator, which makes the comparison of canonicalized 
> plans during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34108) Caching with permanent view doesn't work in certain cases

2021-01-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265048#comment-17265048
 ] 

Apache Spark commented on SPARK-34108:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31182

> Caching with permanent view doesn't work in certain cases
> -
>
> Key: SPARK-34108
> URL: https://issues.apache.org/jira/browse/SPARK-34108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently, caching a permanent view doesn't work in certain cases. For 
> instance, in the following:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t
> CACHE TABLE v1
> SELECT key FROM t
> {code}
> The last SELECT query will hit the cached {{v1}}. On the other hand:
> {code:sql}
> CREATE TABLE t (key bigint, value string) USING parquet
> CREATE VIEW v1 AS SELECT key FROM t ORDER by key
> CACHE TABLE v1
> SELECT key FROM t ORDER BY key
> {code}
> The SELECT won't hit the cache.
> It seems this is related to {{EliminateView}}. In the second case, it will 
> insert an extra Project operator, which makes the comparison of canonicalized 
> plans during cache lookup fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34042) Column pruning is not working as expected for PERMISSIVE mode

2021-01-14 Thread Marius Butan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265022#comment-17265022
 ] 

Marius Butan commented on SPARK-34042:
--

As I said in the description, I made some tests in 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]:

 

All the tests run on a file with 3 columns.

withPruningEnabledAndMapSingleColumn -> with pruning enabled, when we have one 
column in the schema and use it in the select query, the expected result is 2, 
not 0

withPruningEnabledAndMap2ColumnsButUse1InSql -> with pruning enabled, when we 
have 2 columns in the schema and use only 1 in the select, the expected result 
is 2 and it is correct

withPruningDisableAndMap2ColumnsButUse1InSql -> with pruning disabled, it works 
correctly

> Column pruning is not working as expected for PERMISSIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode
> Given a CSV with multiple columns per row, if your file schema has a single 
> column and you are doing a SELECT in SQL with a condition like 
> ' is null', the row is marked as corrupted.
>  
> BUT if you add an extra column in the file schema and do not put that 
> column in the SQL SELECT, the row is not marked as corrupted.
>  
> PS. I don't know exactly what the right behavior is; I couldn't find it 
> documented for PERMISSIVE mode.
> What I found is: As an example, CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  
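
A minimal standalone sketch of the comparison being described (the file contents, schema and column names are illustrative, not taken from the linked test):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// rows.csv has three columns per row (e.g. "1,a,x"), but the schema maps only
// one data column plus the corrupt-record column.
val schema = StructType(Seq(
  StructField("c1", StringType),
  StructField("_corrupt_record", StringType)))

// Toggle this to compare the two behaviours described in the report.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

val rows = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("rows.csv")

rows.filter("_corrupt_record is null").show()
{code}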



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34095) Spark 3.0.1 is giving WARN while running job

2021-01-14 Thread Sachit Murarka (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachit Murarka updated SPARK-34095:
---
Summary: Spark 3.0.1 is giving WARN while running job  (was: Spark 3 is 
giving WARN while running job)

> Spark 3.0.1 is giving WARN while running job
> 
>
> Key: SPARK-34095
> URL: https://issues.apache.org/jira/browse/SPARK-34095
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.1
>Reporter: Sachit Murarka
>Priority: Major
>
> I am running Spark 3.0.1 with Java 11 and am getting the following WARN:
>  
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> ([file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar|file://opt/spark/jars/spark-unsafe_2.12-3.0.1.jar])
>  to constructor java.nio.DirectByteBuffer(long,int)
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34095) Spark 3 is giving WARN while running job

2021-01-14 Thread Sachit Murarka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265019#comment-17265019
 ] 

Sachit Murarka commented on SPARK-34095:


[~sowen] [~holden] -> Can you please suggest something on this?

> Spark 3 is giving WARN while running job
> 
>
> Key: SPARK-34095
> URL: https://issues.apache.org/jira/browse/SPARK-34095
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.1
>Reporter: Sachit Murarka
>Priority: Major
>
> I am running Spark 3.0.1 with Java 11 and am getting the following WARN:
>  
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> ([file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar|file://opt/spark/jars/spark-unsafe_2.12-3.0.1.jar])
>  to constructor java.nio.DirectByteBuffer(long,int)
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28263) Spark-submit can not find class (ClassNotFoundException)

2021-01-14 Thread Jack LIN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack LIN updated SPARK-28263:
-
Comment: was deleted

(was: Just guessing, this probably relates to the settings of `sbt`)

> Spark-submit can not find class (ClassNotFoundException)
> 
>
> Key: SPARK-28263
> URL: https://issues.apache.org/jira/browse/SPARK-28263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: Zhiyuan
>Priority: Major
>  Labels: beginner
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> I am trying to run the Main class in my folder using the following command 
> in the script:
>  
> {code:java}
> spark-shell --class com.navercorp.Main /target/node2vec-0.0.1-SNAPSHOT.jar 
> --cmd node2vec ../graph/karate.edgelist --output ../walk/walk.txt {code}
> But it raises the error:
> {code:java}
> 19/07/05 14:39:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 19/07/05 14:39:20 WARN deploy.SparkSubmit$$anon$2: Failed to load 
> com.navercorp.Main.
> java.lang.ClassNotFoundException: com.navercorp.Main
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:810)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
> I have jar file in my folder, this is the structure:
> {code:java}
> 1node2vec
>      2node2vec_spark
>            3main
>                  4resources
>                  4com
>                         5novercorp
>                                6lib
>                                       7Main
>                                       7Node2vec
>                                       7Word2vec
>      2target
>             3lib
>             3classes
>             3maven-archiver
>             3node2vec-0.0.1-SNAPSHOT.jar       
>      2graph
>             3---karate.edgelist
>      2walk
>             3walk.txt
> {code}
> Also, I attach the structure of jar file:
> {code:java}
> ```META-INF/
> META-INF/MANIFEST.MF
> log4j2.properties
> com/
> com/navercorp/
> com/navercorp/Node2vec$.class
> com/navercorp/Main$Params$$typecreator1$1.class
> com/navercorp/Main$$anon$1$$anonfun$11.class
> com/navercorp/Word2vec$.class
> com/navercorp/Main$$anon$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$randomWalk$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$indexingGraph$4.class
> com/navercorp/Node2vec$$anonfun$initTransitionProb$1.class
> com/navercorp/Main$.class
> com/navercorp/Node2vec$$anonfun$loadNode2Id$1.class
> com/navercorp/Node2vec$$anonfun$14.class
> com/navercorp/Node2vec$$anonfun$readIndexedGraph$2$$anonfun$1.class
> ```{code}
> Could someone give me the advice on how to connect Main class?
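
For comparison, a hedged sketch of the usual way such a jar is launched: an application jar is normally run with spark-submit rather than spark-shell, with the application's own arguments placed after the jar path (paths below are illustrative):

{code:java}
spark-submit \
  --class com.navercorp.Main \
  target/node2vec-0.0.1-SNAPSHOT.jar \
  --cmd node2vec ../graph/karate.edgelist --output ../walk/walk.txt
{code}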



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28263) Spark-submit can not find class (ClassNotFoundException)

2021-01-14 Thread Jack LIN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264990#comment-17264990
 ] 

Jack LIN commented on SPARK-28263:
--

Just guessing, this probably relates to the settings of `sbt`

> Spark-submit can not find class (ClassNotFoundException)
> 
>
> Key: SPARK-28263
> URL: https://issues.apache.org/jira/browse/SPARK-28263
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.4.3
>Reporter: Zhiyuan
>Priority: Major
>  Labels: beginner
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> I am trying to run the Main class in my folder using the following command 
> in the script:
>  
> {code:java}
> spark-shell --class com.navercorp.Main /target/node2vec-0.0.1-SNAPSHOT.jar 
> --cmd node2vec ../graph/karate.edgelist --output ../walk/walk.txt {code}
> But it raises the error:
> {code:java}
> 19/07/05 14:39:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 19/07/05 14:39:20 WARN deploy.SparkSubmit$$anon$2: Failed to load 
> com.navercorp.Main.
> java.lang.ClassNotFoundException: com.navercorp.Main
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:810)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
> I have jar file in my folder, this is the structure:
> {code:java}
> 1node2vec
>      2node2vec_spark
>            3main
>                  4resources
>                  4com
>                         5novercorp
>                                6lib
>                                       7Main
>                                       7Node2vec
>                                       7Word2vec
>      2target
>             3lib
>             3classes
>             3maven-archiver
>             3node2vec-0.0.1-SNAPSHOT.jar       
>      2graph
>             3---karate.edgelist
>      2walk
>             3walk.txt
> {code}
> Also, I attach the structure of jar file:
> {code:java}
> ```META-INF/
> META-INF/MANIFEST.MF
> log4j2.properties
> com/
> com/navercorp/
> com/navercorp/Node2vec$.class
> com/navercorp/Main$Params$$typecreator1$1.class
> com/navercorp/Main$$anon$1$$anonfun$11.class
> com/navercorp/Word2vec$.class
> com/navercorp/Main$$anon$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$randomWalk$1$$anonfun$8.class
> com/navercorp/Node2vec$$anonfun$indexingGraph$4.class
> com/navercorp/Node2vec$$anonfun$initTransitionProb$1.class
> com/navercorp/Main$.class
> com/navercorp/Node2vec$$anonfun$loadNode2Id$1.class
> com/navercorp/Node2vec$$anonfun$14.class
> com/navercorp/Node2vec$$anonfun$readIndexedGraph$2$$anonfun$1.class
> ```{code}
> Could someone give me the advice on how to connect Main class?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33989) Strip auto-generated cast when using Cast.sql

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33989.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31034
[https://github.com/apache/spark/pull/31034]

> Strip auto-generated cast when using Cast.sql
> -
>
> Key: SPARK-33989
> URL: https://issues.apache.org/jira/browse/SPARK-33989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> During analysis we may implicitly introduce a Cast when a type cast is 
> needed. That makes the assigned name unclear.
> Let's say we have a SQL query `select id == null` where id is of int type; 
> then the output field name will be `(id = CAST(null as int))`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33989) Strip auto-generated cast when using Cast.sql

2021-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33989:
---

Assignee: ulysses you

> Strip auto-generated cast when using Cast.sql
> -
>
> Key: SPARK-33989
> URL: https://issues.apache.org/jira/browse/SPARK-33989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> During analysis we may implicitly introduce a Cast if a type cast is needed. 
> That makes the assigned field name unclear.
> Let's say we have the SQL `select id == null` where id is of int type; then 
> the output field name will be `(id = CAST(null as int))`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34102) Spark SQL cannot escape both \ and other special characters

2021-01-14 Thread Noah Kawasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noah Kawasaki updated SPARK-34102:
--
Labels: escaping filter string-manipulation  (was: )

> Spark SQL cannot escape both \ and other special characters 
> 
>
> Key: SPARK-34102
> URL: https://issues.apache.org/jira/browse/SPARK-34102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.0, 2.4.5, 3.0.1
>Reporter: Noah Kawasaki
>Priority: Minor
>  Labels: escaping, filter, string-manipulation
>
> Spark literal string parsing cannot correctly handle escaped backslashes and 
> other escaped special characters at the same time. This is an extension of 
> this issue: 
> https://issues.apache.org/jira/browse/SPARK-17647
>  
> The issue is that depending on how spark.sql.parser.escapedStringLiterals is 
> set, you will either get correctly escaped backslashes in a string literal 
> but not other special characters, OR correctly escaped other special 
> characters but not backslashes.
> So you have to choose which configuration you care about more.
> I have tested Spark versions 2.1, 2.2, 2.3, 2.4, and 3.0 and they all 
> experience the issue:
> {code:java}
> # These do not return the expected backslash
> SET spark.sql.parser.escapedStringLiterals=false;
> SELECT '\\';
> > \
> (should return \\)
> SELECT 'hi\hi';
> > hihi
> (should return hi\hi) 
> # These are correctly escaped
> SELECT '\"';
> > "
>  SELECT '\'';
> > '{code}
> If I switch this: 
> {code:java}
> # These now work
> SET spark.sql.parser.escapedStringLiterals=true;
> SELECT '\\';
> > \\
> SELECT 'hi\hi';
> > hi\hi
> # These are now not correctly escaped
> SELECT '\"';
> > \"
> (should return ")
> SELECT '\'';
> > \'
> (should return ' ){code}
>  So basically we have to choose:
> SET spark.sql.parser.escapedStringLiterals=true; if we want backslashes 
> correctly escaped but not other special characters
> SET spark.sql.parser.escapedStringLiterals=false; if we want other special 
> characters correctly escaped but not backslashes
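> A minimal reproduction sketch from the Scala side, assuming a running 
> SparkSession named spark (the comments only restate the observations above):
> {code:java}
> // Escape sequences interpreted: '\\' collapses to a single backslash.
> spark.sql("SET spark.sql.parser.escapedStringLiterals=false")
> spark.sql("SELECT '\\\\' AS s").show()   // value shown: \
> // Literals taken verbatim: '\\' stays as two backslashes.
> spark.sql("SET spark.sql.parser.escapedStringLiterals=true")
> spark.sql("SELECT '\\\\' AS s").show()   // value shown: \\
> {code}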



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Description: 
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the integration tests 
on our build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

{code:java}
sys.env.contains("SPARK_TESTING") {code}

so that it is not that expensive.
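
A hedged sketch of that caching, using a hypothetical helper object rather 
than the actual org.apache.spark.util.Utils source:

{code:java}
object TestingFlag {
  // Hypothetical stand-in for Utils.isTesting: the lazy val is evaluated once
  // on first access instead of scanning every environment variable per call.
  lazy val isTesting: Boolean = sys.env.contains("SPARK_TESTING")
}
{code}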

  was:
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

{code:java}
sys.env.contains("SPARK_TESTING") {code}

so that it is not that expensive.


> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the integration 
> tests on our build machine were much slower than expected.
> On local machines it was running perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the traversal adds up.
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
> so that it is not that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Description: 
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

```
sys.env.contains("SPARK_TESTING")
```

so that it is not that expensive.

  was:
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for 
"sys.env.contains("SPARK_TESTING")" so that it is not that expensive.


> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the unit tests on 
> our build machine were much slower than expected.
> On local machines it was running perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the traversal adds up.
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
> ```
> sys.env.contains("SPARK_TESTING")
> ```
> so that it is not that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Norbert Schultz updated SPARK-34115:

Description: 
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

{code:java}
sys.env.contains("SPARK_TESTING") {code}

so that it is not that expensive.

  was:
I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for

```
sys.env.contains("SPARK_TESTING")
```

so that it is not that expensive.


> Long runtime on many environment variables
> --
>
> Key: SPARK-34115
> URL: https://issues.apache.org/jira/browse/SPARK-34115
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
>Reporter: Norbert Schultz
>Priority: Major
>
> I am not sure if this is a bug report or a feature request. The code is 
> the same in current versions of Spark, and maybe this ticket saves someone 
> some time for debugging.
> We migrated some older code to Spark 2.4.0, and suddenly the unit tests on 
> our build machine were much slower than expected.
> On local machines it was running perfectly.
> In the end it turned out that Spark was wasting CPU cycles during DataFrame 
> analysis in the following functions:
>  * AnalysisHelper.assertNotAnalysisRule calling
>  * Utils.isTesting
> Utils.isTesting traverses all environment variables.
> The offending build machine was a Kubernetes Pod which automatically exposed 
> all services as environment variables, so it had more than 3000 environment 
> variables.
> Utils.isTesting is called very often through 
> AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
> transformUp), so the traversal adds up.
> Of course we will restrict the number of environment variables; on the other 
> hand, Utils.isTesting could also use a lazy val for
> {code:java}
> sys.env.contains("SPARK_TESTING") {code}
> so that it is not that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34115) Long runtime on many environment variables

2021-01-14 Thread Norbert Schultz (Jira)
Norbert Schultz created SPARK-34115:
---

 Summary: Long runtime on many environment variables
 Key: SPARK-34115
 URL: https://issues.apache.org/jira/browse/SPARK-34115
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment: Spark 2.4.0 local[2] on a Kubernetes Pod
Reporter: Norbert Schultz


I am not sure if this is a bug report or a feature request. The code is the 
same in current versions of Spark, and maybe this ticket saves someone some 
time for debugging.

We migrated some older code to Spark 2.4.0, and suddenly the unit tests on our 
build machine were much slower than expected.

On local machines it was running perfectly.

In the end it turned out that Spark was wasting CPU cycles during DataFrame 
analysis in the following functions:
 * AnalysisHelper.assertNotAnalysisRule calling
 * Utils.isTesting

Utils.isTesting traverses all environment variables.

The offending build machine was a Kubernetes Pod which automatically exposed 
all services as environment variables, so it had more than 3000 environment 
variables.

Utils.isTesting is called very often through 
AnalysisHelper.assertNotAnalysisRule (via AnalysisHelper.transformDown and 
transformUp), so the traversal adds up.

Of course we will restrict the number of environment variables; on the other 
hand, Utils.isTesting could also use a lazy val for 
"sys.env.contains("SPARK_TESTING")" so that it is not that expensive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2021-01-14 Thread jiulongzhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiulongzhu updated SPARK-26385:
---
Attachment: (was: SPARK-26385.0001.patch)

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecuto

[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2021-01-14 Thread jiulongzhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264772#comment-17264772
 ] 

jiulongzhu commented on SPARK-26385:


https://issues.apache.org/jira/browse/HDFS-9276

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job which is running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution
