[jira] [Updated] (SPARK-36823) Support broadcast nested loop join hint for equi-join

2021-09-21 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-36823:
--
Description: 
For a join where one side is small and the other side is large, the shuffle overhead 
is still very big. Due to the 
broadcast hash join (BHJ) limitation, we can only broadcast the right side for a left join and the left side 
for a right join. For the other cases, we can try to use 
`BroadcastNestedLoopJoin` as the join strategy.


  was:
For the join if one side is small and other side is large, the shuffle overhead 
is also very big. Due to the limitation, we can only broadcast right side for 
left join and left side for right join. So for the other case, we can try to 
use `BroadcastNestedLoopJoin` as the join strategy.



> Support broadcast nested loop join hint for equi-join
> -
>
> Key: SPARK-36823
> URL: https://issues.apache.org/jira/browse/SPARK-36823
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> For a join where one side is small and the other side is large, the shuffle 
> overhead is still very big. Due to the 
> broadcast hash join (BHJ) limitation, we can only broadcast the right side for a left join and the left side 
> for a right join. For the other cases, we can try to use 
> `BroadcastNestedLoopJoin` as the join strategy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36824) Add sec and csc as R functions

2021-09-21 Thread Yuto Akutsu (Jira)
Yuto Akutsu created SPARK-36824:
---

 Summary: Add sec and csc as R functions
 Key: SPARK-36824
 URL: https://issues.apache.org/jira/browse/SPARK-36824
 Project: Spark
  Issue Type: Improvement
  Components: R
Affects Versions: 3.1.2
Reporter: Yuto Akutsu


Add secant and cosecant as R functions.
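For reference, secant and cosecant are the reciprocals of cosine and sine. A minimal sketch of the identities using existing Spark SQL functions (shown in Scala here; the SparkR signatures are what this ticket would add):
{code:scala}
// sec(x) = 1 / cos(x), csc(x) = 1 / sin(x), computed with existing functions
spark.sql("SELECT 1.0 / cos(0.5) AS sec_val, 1.0 / sin(0.5) AS csc_val").show()
{code}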



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-36815:

Fix Version/s: (was: 3.0.2)

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
>
> We are using Spark 3.0.2 in production; some ETLs contain 
> multi-level CTEs, and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
>   a as (
>     select name, get_json_object(json, '$.id') id, n
>     from (
>       select get_json_object(json, '$.name') name, json
>       from values ('{"name":"a", "id": 1}') people(json)
>     ) LATERAL VIEW explode(array(1, 1, 2)) num as n
>   ),
>   b as (
>     select a1.name, a1.id, a1.n
>     from a a1
>     left join (select name, count(1) c from a group by name) a2
>       on a1.name = a2.name
>   )
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> While debugging I found that an attribute referenced by the root Project existed in both 
> subqueries. When `ResolveReferences` resolved the conflict, a `rewrite` 
> occurred in both subqueries, each producing a new attrMapping entry, and both were 
> eventually propagated to the root Project, leading to this error.
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tuple2@17769} "(id#219,id#233)"
>  1 = {Tuple2@17770} "(id#219,id#234)"
> {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36823) Support broadcast nested loop join hint for equi-join

2021-09-21 Thread XiDuo You (Jira)
XiDuo You created SPARK-36823:
-

 Summary: Support broadcast nested loop join hint for equi-join
 Key: SPARK-36823
 URL: https://issues.apache.org/jira/browse/SPARK-36823
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


For a join where one side is small and the other side is large, the shuffle overhead 
is still very big. Due to the broadcast hash join (BHJ) limitation, we can only broadcast the right side for 
a left join and the left side for a right join. For the other cases, we can try to 
use `BroadcastNestedLoopJoin` as the join strategy.
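For illustration, a sketch of the motivating case (table names are illustrative, not from the ticket): for a LEFT OUTER JOIN, BHJ can only broadcast the right side, so a small *left* side still forces a shuffle today even when hinted; the proposal is to let such an equi-join fall back to `BroadcastNestedLoopJoin`.
{code:scala}
// Hypothetical example: small_table / large_table are illustrative names.
// Today a broadcast-style hint on the left side of a LEFT JOIN cannot take
// effect via broadcast hash join; the ticket proposes a nested-loop fallback.
spark.sql("""
  SELECT /*+ BROADCAST(s) */ s.id, l.value
  FROM small_table s
  LEFT JOIN large_table l
    ON s.id = l.id
""").explain()
{code}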




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418402#comment-17418402
 ] 

Apache Spark commented on SPARK-36820:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34066

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418401#comment-17418401
 ] 

Apache Spark commented on SPARK-36820:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34066

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36781) The log could not get the correct line number

2021-09-21 Thread chenxusheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenxusheng updated SPARK-36781:

Affects Version/s: 3.1.2

> The log could not get the correct line number
> -
>
> Key: SPARK-36781
> URL: https://issues.apache.org/jira/browse/SPARK-36781
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.3, 3.1.2
>Reporter: chenxusheng
>Priority: Major
>
> INFO 18:16:46 [Thread-1] org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) MemoryStore cleared
> INFO 18:16:46 [Thread-1] org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) BlockManager stopped
> INFO 18:16:46 [Thread-1] org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) BlockManagerMaster stopped
> INFO 18:16:46 [dispatcher-event-loop-0] org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) OutputCommitCoordinator stopped!
> INFO 18:16:46 [Thread-1] org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) Successfully stopped SparkContext
> INFO 18:16:46 [Thread-1] org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54) Shutdown hook called
> Every entry reports the same location: Logging.scala:54
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36763) Pull out ordering expressions

2021-09-21 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418386#comment-17418386
 ] 

Yuming Wang commented on SPARK-36763:
-

Benchmark and benchmark result:
{code:scala}
import org.apache.spark.benchmark.Benchmark

val numRows = 1024 * 1024 * 10
spark.sql(s"CREATE TABLE t1 using parquet AS select id AS a, id AS b FROM range(${numRows}L)")
val benchmark = new Benchmark("Benchmark pull out ordering expressions", numRows, minNumIters = 5)

Seq(false, true).foreach { pullOutEnabled =>
  val name = s"Pull out ordering expressions ${if (pullOutEnabled) "(Enabled)" else "(Disabled)"}"
  benchmark.addCase(name) { _ =>
    withSQLConf("spark.sql.pullOutOrderingExpressions" -> s"$pullOutEnabled") {
      spark.sql("SELECT t1.* FROM t1 ORDER BY translate(t1.a, '123', 'abc')").write.format("noop").mode("Overwrite").save()
    }
  }
}
benchmark.run()
{code}
{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark pull out ordering expressions:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Pull out ordering expressions (Disabled)            9232           9753         867         1.1         880.4       1.0X
Pull out ordering expressions (Enabled)             7084           7462         370         1.5         675.5       1.3X
{noformat}

> Pull out ordering expressions
> -
>
> Key: SPARK-36763
> URL: https://issues.apache.org/jira/browse/SPARK-36763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Similar to 
> [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48].
>  We can pull out ordering expressions to improve order performance. For 
> example:
> {code:scala}
> sql("create table t1(a int, b int) using parquet")
> sql("insert into t1 values (1, 2)")
> sql("insert into t1 values (3, 4)")
> sql("select * from t1 order by a - b").explain
> {code}
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0
>+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), 
> ENSURE_REQUIREMENTS, [id=#39]
>   +- FileScan parquet default.t1[a#12,b#13]
> {noformat}
> The {{Subtract}} will be evaluated 4 times.
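A sketch of what the proposed rule would roughly do (not the actual implementation; the `_sortkey` alias is illustrative): project the ordering expression once per row, sort on the pre-computed column, then drop it.
{code:scala}
// Hypothetical manual equivalent of pulling out the ordering expression:
// `a - b` is computed once in a Project, and the Sort/Exchange reference the
// pre-computed column instead of re-evaluating the expression.
spark.sql("""
  SELECT a, b
  FROM (SELECT a, b, a - b AS _sortkey FROM t1) tmp
  ORDER BY _sortkey
""").explain()
{code}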



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36771) Fix `pop` of Categorical Series

2021-09-21 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-36771:
--
Fix Version/s: 3.2.0

> Fix `pop` of Categorical Series 
> 
>
> Key: SPARK-36771
> URL: https://issues.apache.org/jira/browse/SPARK-36771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0, 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36822) BroadcastNestedLoopJoinExec should use all condition instead of non-equi condition

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418373#comment-17418373
 ] 

Apache Spark commented on SPARK-36822:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34065

> BroadcastNestedLoopJoinExec should use all condition instead of non-equi 
> condition
> --
>
> Key: SPARK-36822
> URL: https://issues.apache.org/jira/browse/SPARK-36822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> At JoinSelection, with ExtractEquiJoinKeys, we use `nonEquiCond` as the join 
> condition. This is wrong, since there must also be an equi condition.
> {code:java}
> Seq(joins.BroadcastNestedLoopJoinExec(
>   planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
> {code}
> However, this should not surface as a user-visible bug, since we always use sort-merge 
> join (SMJ) as the default join strategy for ExtractEquiJoinKeys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36822) BroadcastNestedLoopJoinExec should use all condition instead of non-equi condition

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36822:


Assignee: Apache Spark

> BroadcastNestedLoopJoinExec should use all condition instead of non-equi 
> condition
> --
>
> Key: SPARK-36822
> URL: https://issues.apache.org/jira/browse/SPARK-36822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> At JoinSelection, with ExtractEquiJoinKeys, we use `nonEquiCond` as the join 
> condition. This is wrong, since there must also be an equi condition.
> {code:java}
> Seq(joins.BroadcastNestedLoopJoinExec(
>   planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
> {code}
> However, this should not surface as a user-visible bug, since we always use sort-merge 
> join (SMJ) as the default join strategy for ExtractEquiJoinKeys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36822) BroadcastNestedLoopJoinExec should use all condition instead of non-equi condition

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36822:


Assignee: (was: Apache Spark)

> BroadcastNestedLoopJoinExec should use all condition instead of non-equi 
> condition
> --
>
> Key: SPARK-36822
> URL: https://issues.apache.org/jira/browse/SPARK-36822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> At JoinSelection, with ExtractEquiJoinKeys, we use `nonEquiCond` as the join 
> condition. This is wrong, since there must also be an equi condition.
> {code:java}
> Seq(joins.BroadcastNestedLoopJoinExec(
>   planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
> {code}
> However, this should not surface as a user-visible bug, since we always use sort-merge 
> join (SMJ) as the default join strategy for ExtractEquiJoinKeys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36822) BroadcastNestedLoopJoinExec should use all condition instead of non-equi condition

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418372#comment-17418372
 ] 

Apache Spark commented on SPARK-36822:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34065

> BroadcastNestedLoopJoinExec should use all condition instead of non-equi 
> condition
> --
>
> Key: SPARK-36822
> URL: https://issues.apache.org/jira/browse/SPARK-36822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> At JoinSelection, with ExtractEquiJoinKeys, we use `nonEquiCond` as the join 
> condition. This is wrong, since there must also be an equi condition.
> {code:java}
> Seq(joins.BroadcastNestedLoopJoinExec(
>   planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
> {code}
> However, this should not surface as a user-visible bug, since we always use sort-merge 
> join (SMJ) as the default join strategy for ExtractEquiJoinKeys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36822) BroadcastNestedLoopJoinExec should use all condition instead of non-equi condition

2021-09-21 Thread XiDuo You (Jira)
XiDuo You created SPARK-36822:
-

 Summary: BroadcastNestedLoopJoinExec should use all condition 
instead of non-equi condition
 Key: SPARK-36822
 URL: https://issues.apache.org/jira/browse/SPARK-36822
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: XiDuo You


At JoinSelection, with ExtractEquiJoinKeys, we use `nonEquiCond` as the join 
condition. This is wrong, since there must also be an equi condition.

{code:java}
Seq(joins.BroadcastNestedLoopJoinExec(
  planLater(left), planLater(right), buildSide, joinType, nonEquiCond))
{code}

However, this should not surface as a user-visible bug, since we always use sort-merge 
join (SMJ) as the default join strategy for ExtractEquiJoinKeys.
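A minimal sketch (assumed shape, not the actual patch) of rebuilding the full join condition from the extracted equi-join keys plus the remaining non-equi predicate before creating BroadcastNestedLoopJoinExec:
{code:scala}
// Sketch only: combine the equi-join keys returned by ExtractEquiJoinKeys with
// the leftover non-equi predicate into a single condition.
import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Expression}

def fullCondition(
    leftKeys: Seq[Expression],
    rightKeys: Seq[Expression],
    nonEquiCond: Option[Expression]): Option[Expression] = {
  val equiConds = leftKeys.zip(rightKeys).map { case (l, r) => EqualTo(l, r) }
  (equiConds ++ nonEquiCond).reduceOption(And)
}
{code}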



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-21 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418366#comment-17418366
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 9/22/21, 1:04 AM:
-

Hi [~mauzhang],

Thanks a lot for your answers, and sorry for the late reply. I think I now understand 
better why you are making this change: the registeredConnections metric 
added in ExternalShuffleBlockHandler was not used.

However, the one added to TransportContext is used; see YarnShuffleService.java:
{code:java}
// register metrics on the block handler into the Node Manager's metrics system.
blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
    shuffleServer.getRegisteredConnections());
YarnShuffleServiceMetrics serviceMetrics =
    new YarnShuffleServiceMetrics(blockHandler.getAllMetrics());
MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance();
metricsSystem.register(
    "sparkShuffleService", "Metrics on the Spark Shuffle Service", serviceMetrics);
logger.info("Registered metrics with Hadoop's DefaultMetricsSystem");
{code}
The TransportContext version of registeredConnections is retrieved by 
"shuffleServer.getRegisteredConnections()" in the code above. That means both 
activeConnections and registeredConnections are still available with your 
change. Is that your expectation?

If my understanding is correct, we can either derive "registeredConnections - 
activeConnections" as the backlogged connections, or add a new metric, say 
backloggedConnections, whose value is "registeredConnections - activeConnections".

What do you think?

Thanks!


was (Author: yzhangal):
HI [~mauzhang],

Thanks a lot for your answers and sorry for late reply. I think I understand it 
better now why you are doing this change: the registeredConnections metrics 
added in ExternalShuffleBlockHandler was not used. 

However, the one added to TransportContext is used, see in 
YarnShuffleService.java:
{code:java}
 // register metrics on the block handler into the Node Manager's metrics 
system.
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
  shuffleServer.getRegisteredConnections());
  YarnShuffleServiceMetrics serviceMetrics =
  new YarnShuffleServiceMetrics(blockHandler.getAllMetrics());  
MetricsSystemImpl metricsSystem = (MetricsSystemImpl) 
DefaultMetricsSystem.instance();
  metricsSystem.register(
  "sparkShuffleService", "Metrics on the Spark Shuffle Service", 
serviceMetrics);
  logger.info("Registered metrics with Hadoop's DefaultMetricsSystem");
 {code}
The TransportContext version of registeredConnections is retrieved by 
"shuffleServer.getRegisteredConnections())" in the above code. That means both 
the activeConnections and registeredConnections are still available with your 
change. Is that your expectation?

If my understanding is correct, we can either derive "registeredConnections - 
activeConnections" as the backlogged connections, or we can add a new metrics 
as backloggedConnection to have the value of "registeredConnections - 
activeConnections" .

What do you think?

Thanks!

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-21 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418366#comment-17418366
 ] 

Yongjun Zhang commented on SPARK-31646:
---

Hi [~mauzhang],

Thanks a lot for your answers, and sorry for the late reply. I think I now understand 
better why you are making this change: the registeredConnections metric 
added in ExternalShuffleBlockHandler was not used.

However, the one added to TransportContext is used; see YarnShuffleService.java:
{code:java}
// register metrics on the block handler into the Node Manager's metrics system.
blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
    shuffleServer.getRegisteredConnections());
YarnShuffleServiceMetrics serviceMetrics =
    new YarnShuffleServiceMetrics(blockHandler.getAllMetrics());
MetricsSystemImpl metricsSystem = (MetricsSystemImpl) DefaultMetricsSystem.instance();
metricsSystem.register(
    "sparkShuffleService", "Metrics on the Spark Shuffle Service", serviceMetrics);
logger.info("Registered metrics with Hadoop's DefaultMetricsSystem");
{code}
The TransportContext version of registeredConnections is retrieved by 
"shuffleServer.getRegisteredConnections()" in the code above. That means both 
activeConnections and registeredConnections are still available with your 
change. Is that your expectation?

If my understanding is correct, we can either derive "registeredConnections - 
activeConnections" as the backlogged connections, or add a new metric, say 
backloggedConnections, whose value is "registeredConnections - activeConnections".

What do you think?

Thanks!
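For concreteness, a minimal sketch (not from the thread) of the second option, expressed with the Dropwizard/Codahale metrics API; the metric names are illustrative:
{code:scala}
// Hypothetical sketch: derive a backlogged-connections gauge from the two
// existing counters instead of adding a third counter.
import com.codahale.metrics.{Counter, Gauge, MetricRegistry}

val registry = new MetricRegistry
val registered: Counter = registry.counter("numRegisteredConnections")
val active: Counter = registry.counter("numActiveConnections")

registry.register("numBackloggedConnections", new Gauge[Long] {
  // recomputed each time the metrics system reports
  override def getValue: Long = registered.getCount - active.getCount
})
{code}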

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36821) Create a test to extend ColumnarBatch

2021-09-21 Thread Yufei Gu (Jira)
Yufei Gu created SPARK-36821:


 Summary: Create a test to extend ColumnarBatch
 Key: SPARK-36821
 URL: https://issues.apache.org/jira/browse/SPARK-36821
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yufei Gu


As a follow-up to SPARK-36814, create a test that extends ColumnarBatch, to 
prevent future changes from breaking it.
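A minimal sketch (hypothetical test body, not the actual follow-up) of what such a test could assert once ColumnarBatch is extendable:
{code:scala}
// Hypothetical test: subclassing ColumnarBatch should compile and work, e.g. so
// a data source can attach delete-tracking state to a batch.
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

class DeleteAwareColumnarBatch(columns: Array[ColumnVector], numRows: Int)
    extends ColumnarBatch(columns, numRows) {
  // extra state an extension might carry, e.g. how many rows were filtered out
  var deletedRowCount: Int = 0
}

val batch = new DeleteAwareColumnarBatch(Array.empty[ColumnVector], 0)
assert(batch.isInstanceOf[ColumnarBatch] && batch.numRows() == 0)
{code}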



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418340#comment-17418340
 ] 

Apache Spark commented on SPARK-36820:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34064

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36820:


Assignee: Apache Spark

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36820:


Assignee: (was: Apache Spark)

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418339#comment-17418339
 ] 

Apache Spark commented on SPARK-36820:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34064

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36771) Fix `pop` of Categorical Series

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418338#comment-17418338
 ] 

Apache Spark commented on SPARK-36771:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34063

> Fix `pop` of Categorical Series 
> 
>
> Key: SPARK-36771
> URL: https://issues.apache.org/jira/browse/SPARK-36771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Chao Sun (Jira)
Chao Sun created SPARK-36820:


 Summary: Disable LZ4 test for Hadoop 2.7 profile
 Key: SPARK-36820
 URL: https://issues.apache.org/jira/browse/SPARK-36820
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
{{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36820) Disable LZ4 test for Hadoop 2.7 profile

2021-09-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36820:
-
Issue Type: Test  (was: Bug)

> Disable LZ4 test for Hadoop 2.7 profile
> ---
>
> Key: SPARK-36820
> URL: https://issues.apache.org/jira/browse/SPARK-36820
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Hadoop 2.7 doesn't support lz4-java yet, so we should disable the test in 
> {{FileSourceCodecSuite}}. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36771) Fix `pop` of Categorical Series

2021-09-21 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36771.
---
Fix Version/s: 3.3.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 34052
https://github.com/apache/spark/pull/34052

> Fix `pop` of Categorical Series 
> 
>
> Key: SPARK-36771
> URL: https://issues.apache.org/jira/browse/SPARK-36771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36819) DPP: Don't insert redundant filters in case static partition pruning can be done

2021-09-21 Thread Swinky Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swinky Mann updated SPARK-36819:

Description: 
Don't insert dynamic partition pruning filters in case the filters already 
referred statically. In case the filtering predicate on dimension table is in 
joinKey, no need to insert DPP filter in that case.

DPP is not required in this Sample query:
{code:java}
SELECT f.date_id, f.pid, f.sid FROM
 (select date_id, product_id as pid, store_id as sid from fact_stats) as f
 JOIN dim_stats s
 ON f.sid = s.store_id WHERE s.store_id = 3{code}

  was:
Don't insert dynamic partition pruning filters in case the filters already 
referred statically. In case the filtering predicate on dimension table is in 
joinKey, no need to insert DPP filter in that case.

DPP is not required in this Sample query:
 {{SELECT f.date_id, f.pid, f.sid FROM
(select date_id, product_id as pid, store_id as sid from fact_stats) as f
JOIN dim_stats s
ON f.sid = s.store_id WHERE s.store_id = 3}}


> DPP: Don't insert redundant filters in case static partition pruning can be 
> done
> 
>
> Key: SPARK-36819
> URL: https://issues.apache.org/jira/browse/SPARK-36819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Swinky Mann
>Priority: Minor
>
> Don't insert dynamic partition pruning filters when the filter is already 
> applied statically. When the filtering predicate on the dimension table is on 
> the join key, there is no need to insert a DPP filter.
> DPP is not required in this sample query:
> {code:java}
> SELECT f.date_id, f.pid, f.sid FROM
>  (select date_id, product_id as pid, store_id as sid from fact_stats) as f
>  JOIN dim_stats s
>  ON f.sid = s.store_id WHERE s.store_id = 3{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36819) DPP: Don't insert redundant filters in case static partition pruning can be done

2021-09-21 Thread Swinky Mann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418300#comment-17418300
 ] 

Swinky Mann commented on SPARK-36819:
-

https://github.com/apache/spark/pull/34062

> DPP: Don't insert redundant filters in case static partition pruning can be 
> done
> 
>
> Key: SPARK-36819
> URL: https://issues.apache.org/jira/browse/SPARK-36819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Swinky Mann
>Priority: Minor
>
> Don't insert dynamic partition pruning filters when the filter is already 
> applied statically. When the filtering predicate on the dimension table is on 
> the join key, there is no need to insert a DPP filter.
> DPP is not required in this sample query:
>  {{SELECT f.date_id, f.pid, f.sid FROM
> (select date_id, product_id as pid, store_id as sid from fact_stats) as f
> JOIN dim_stats s
> ON f.sid = s.store_id WHERE s.store_id = 3}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36819) DPP: Don't insert redundant filters in case static partition pruning can be done

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418301#comment-17418301
 ] 

Apache Spark commented on SPARK-36819:
--

User 'Swinky' has created a pull request for this issue:
https://github.com/apache/spark/pull/34062

> DPP: Don't insert redundant filters in case static partition pruning can be 
> done
> 
>
> Key: SPARK-36819
> URL: https://issues.apache.org/jira/browse/SPARK-36819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Swinky Mann
>Priority: Minor
>
> Don't insert dynamic partition pruning filters when the filter is already 
> applied statically. When the filtering predicate on the dimension table is on 
> the join key, there is no need to insert a DPP filter.
> DPP is not required in this sample query:
>  {{SELECT f.date_id, f.pid, f.sid FROM
> (select date_id, product_id as pid, store_id as sid from fact_stats) as f
> JOIN dim_stats s
> ON f.sid = s.store_id WHERE s.store_id = 3}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36819) DPP: Don't insert redundant filters in case static partition pruning can be done

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36819:


Assignee: Apache Spark

> DPP: Don't insert redundant filters in case static partition pruning can be 
> done
> 
>
> Key: SPARK-36819
> URL: https://issues.apache.org/jira/browse/SPARK-36819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Swinky Mann
>Assignee: Apache Spark
>Priority: Minor
>
> Don't insert dynamic partition pruning filters when the filter is already 
> applied statically. When the filtering predicate on the dimension table is on 
> the join key, there is no need to insert a DPP filter.
> DPP is not required in this sample query:
>  {{SELECT f.date_id, f.pid, f.sid FROM
> (select date_id, product_id as pid, store_id as sid from fact_stats) as f
> JOIN dim_stats s
> ON f.sid = s.store_id WHERE s.store_id = 3}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36819) DPP: Don't insert redundant filters in case static partition pruning can be done

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36819:


Assignee: (was: Apache Spark)

> DPP: Don't insert redundant filters in case static partition pruning can be 
> done
> 
>
> Key: SPARK-36819
> URL: https://issues.apache.org/jira/browse/SPARK-36819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Swinky Mann
>Priority: Minor
>
> Don't insert dynamic partition pruning filters when the filter is already 
> applied statically. When the filtering predicate on the dimension table is on 
> the join key, there is no need to insert a DPP filter.
> DPP is not required in this sample query:
>  {{SELECT f.date_id, f.pid, f.sid FROM
> (select date_id, product_id as pid, store_id as sid from fact_stats) as f
> JOIN dim_stats s
> ON f.sid = s.store_id WHERE s.store_id = 3}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36819) DPP: Don't insert redundant filters in case static partition pruning can be done

2021-09-21 Thread Swinky Mann (Jira)
Swinky Mann created SPARK-36819:
---

 Summary: DPP: Don't insert redundant filters in case static 
partition pruning can be done
 Key: SPARK-36819
 URL: https://issues.apache.org/jira/browse/SPARK-36819
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Swinky Mann


Don't insert dynamic partition pruning filters when the filter is already 
applied statically. When the filtering predicate on the dimension table is on 
the join key, there is no need to insert a DPP filter.

DPP is not required in this sample query:
 {{SELECT f.date_id, f.pid, f.sid FROM
(select date_id, product_id as pid, store_id as sid from fact_stats) as f
JOIN dim_stats s
ON f.sid = s.store_id WHERE s.store_id = 3}}
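For illustration (not part of the ticket): because the dimension filter is on the join key itself, the partition filter on the fact table can be derived statically, so the query is logically equivalent to the form below and no runtime DPP subquery is needed.
{code:scala}
// Sketch of the statically-pruned equivalent; `f.sid = 3` is the filter that
// static pruning can already derive from `f.sid = s.store_id` and
// `s.store_id = 3`, which makes an extra DPP filter redundant.
spark.sql("""
  SELECT f.date_id, f.pid, f.sid
  FROM (SELECT date_id, product_id AS pid, store_id AS sid FROM fact_stats) AS f
  JOIN dim_stats s
    ON f.sid = s.store_id
  WHERE s.store_id = 3
    AND f.sid = 3
""")
{code}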



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36818) Fix filtering a Series by a boolean Series

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418271#comment-17418271
 ] 

Apache Spark commented on SPARK-36818:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34061

> Fix filtering a Series by a boolean Series
> --
>
> Key: SPARK-36818
> URL: https://issues.apache.org/jira/browse/SPARK-36818
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36818) Fix filtering a Series by a boolean Series

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418269#comment-17418269
 ] 

Apache Spark commented on SPARK-36818:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34061

> Fix filtering a Series by a boolean Series
> --
>
> Key: SPARK-36818
> URL: https://issues.apache.org/jira/browse/SPARK-36818
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36818) Fix filtering a Series by a boolean Series

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36818:


Assignee: Apache Spark

> Fix filtering a Series by a boolean Series
> --
>
> Key: SPARK-36818
> URL: https://issues.apache.org/jira/browse/SPARK-36818
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36818) Fix filtering a Series by a boolean Series

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36818:


Assignee: (was: Apache Spark)

> Fix filtering a Series by a boolean Series
> --
>
> Key: SPARK-36818
> URL: https://issues.apache.org/jira/browse/SPARK-36818
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36818) Fix filtering a Series by a boolean Series

2021-09-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36818:


 Summary: Fix filtering a Series by a boolean Series
 Key: SPARK-36818
 URL: https://issues.apache.org/jira/browse/SPARK-36818
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36615) SparkContext should register shutdown hook earlier

2021-09-21 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-36615.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33869
[https://github.com/apache/spark/pull/33869]

> SparkContext should register shutdown hook earlier
> --
>
> Key: SPARK-36615
> URL: https://issues.apache.org/jira/browse/SPARK-36615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> SparkContext should register shutdown hook earlier



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36615) SparkContext should register shutdown hook earlier

2021-09-21 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36615:
---

Assignee: angerszhu

> SparkContext should register shutdown hook earlier
> --
>
> Key: SPARK-36615
> URL: https://issues.apache.org/jira/browse/SPARK-36615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> SparkContext should register shutdown hook earlier



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36817) Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Abhishek Shakya (Jira)
Abhishek Shakya created SPARK-36817:
---

 Summary: Does Apache Spark 3 support GPU usage for Spark RDDs?
 Key: SPARK-36817
 URL: https://issues.apache.org/jira/browse/SPARK-36817
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Abhishek Shakya


I am currently trying to run genomic analysis pipelines using 
[Hail|https://hail.is/] (a library for genomic analyses written in Python and 
Scala). Recently, Apache Spark 3 was released with support for GPU usage.

I tried the [spark-rapids|https://nvidia.github.io/spark-rapids/] library to start an 
on-premise Slurm cluster with GPU nodes. I was able to initialise the cluster; 
however, when I tried running Hail tasks, the executors kept getting killed.

When I asked on the Hail forum, I got the response that
{quote}That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any 
Spark-SQL interfaces, only the RDD interfaces.
{quote}
So, does Spark 3 not support GPU usage for the RDD interfaces?
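For context, a minimal sketch (not from the ticket): Spark 3 does not run RDD operations on GPUs automatically, but its resource scheduling can assign GPUs to tasks so that user code inside an RDD closure can use them. The config names below are the standard Spark 3 resource-scheduling settings; the discovery script path is illustrative.
{code:scala}
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.task.resource.gpu.amount", "1")
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh") // illustrative path

val spark = SparkSession.builder.config(conf).getOrCreate()

spark.sparkContext.parallelize(1 to 4, 4).map { i =>
  // addresses of the GPU(s) assigned to this task; the closure itself decides
  // how to use them (e.g. via a native or Python GPU library)
  val gpus = TaskContext.get().resources()("gpu").addresses
  s"partition $i -> GPU ${gpus.mkString(",")}"
}.collect().foreach(println)
{code}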



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36769) Improve `filter` of single-indexed DataFrame

2021-09-21 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36769.
---
Fix Version/s: 3.3.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 33998
https://github.com/apache/spark/pull/33998

> Improve `filter` of single-indexed DataFrame
> 
>
> Key: SPARK-36769
> URL: https://issues.apache.org/jira/browse/SPARK-36769
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36814) Make class ColumnarBatch extendable

2021-09-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36814:
---

Assignee: Yufei Gu

> Make class ColumnarBatch extendable
> ---
>
> Key: SPARK-36814
> URL: https://issues.apache.org/jira/browse/SPARK-36814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 3.3.0
>
>
> To support better vectorized reading in multiple data sources, ColumnarBatch 
> needs to be extendable. For example, to support row-level deletes 
> ([https://github.com/apache/iceberg/issues/3141]) in Iceberg's vectorized 
> read, we need to filter out deleted rows in a batch, which requires 
> ColumnarBatch to be extendable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36814) Make class ColumnarBatch extendable

2021-09-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36814.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34054
[https://github.com/apache/spark/pull/34054]

> Make class ColumnarBatch extendable
> ---
>
> Key: SPARK-36814
> URL: https://issues.apache.org/jira/browse/SPARK-36814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Priority: Major
> Fix For: 3.3.0
>
>
> To support better vectorized reading in multiple data sources, ColumnarBatch 
> needs to be extendable. For example, to support row-level deletes 
> ([https://github.com/apache/iceberg/issues/3141]) in Iceberg's vectorized 
> read, we need to filter out deleted rows in a batch, which requires 
> ColumnarBatch to be extendable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36670) Add end-to-end codec test cases for main datasources

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418167#comment-17418167
 ] 

Apache Spark commented on SPARK-36670:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34059

> Add end-to-end codec test cases for main datasources
> 
>
> Key: SPARK-36670
> URL: https://issues.apache.org/jira/browse/SPARK-36670
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We found there are no e2e test cases available for main datasources like 
> Parquet and ORC. This makes it harder for developers to identify possible bugs 
> early. We should add such tests in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36811) Add bitwise functions for the BINARY data type

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418162#comment-17418162
 ] 

Apache Spark commented on SPARK-36811:
--

User 'mkaravel' has created a pull request for this issue:
https://github.com/apache/spark/pull/34056

> Add bitwise functions for the BINARY data type
> --
>
> Key: SPARK-36811
> URL: https://issues.apache.org/jira/browse/SPARK-36811
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelais Karavelas
>Priority: Major
> Fix For: 3.3.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add four new SQL functions operating on the `BINARY` data type for performing 
> bitwise operations: BITAND, BITOR, BITXOR, and BITNOT.
> The BITAND, BITOR, and BITXOR functions take two byte strings as input and 
> return the bitwise AND, OR, or XOR of the two input byte strings. The byte 
> size of the result is the maximum of the byte sizes of the inputs, while the 
> result is computed by aligning the two strings with respect to their least 
> significant byte, and left-padding with zeros the shorter of the two inputs.
> The BITNOT function is a unary function and returns the input byte string 
> with all the bits negated.
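The semantics above are precise enough to sketch. A minimal standalone illustration of BITAND under those rules (not the actual SQL expression implementation):
{code:scala}
// Sketch: bitwise AND of two byte strings -- align on the least significant
// byte, treat missing high bytes of the shorter input as zero, and size the
// result as the longer input.
def bitAnd(a: Array[Byte], b: Array[Byte]): Array[Byte] = {
  val n = math.max(a.length, b.length)
  val out = new Array[Byte](n)
  for (i <- 1 to n) {
    // i counts from the least significant (rightmost) byte
    val ai: Byte = if (i <= a.length) a(a.length - i) else 0
    val bi: Byte = if (i <= b.length) b(b.length - i) else 0
    out(n - i) = (ai & bi).toByte
  }
  out
}

// 0x0F AND 0x01FF -> 0x000F (shorter input left-padded with zeros)
assert(bitAnd(Array(0x0F.toByte), Array(0x01.toByte, 0xFF.toByte)).toSeq ==
  Seq(0x00.toByte, 0x0F.toByte))
{code}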



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36811) Add bitwise functions for the BINARY data type

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36811:


Assignee: (was: Apache Spark)

> Add bitwise functions for the BINARY data type
> --
>
> Key: SPARK-36811
> URL: https://issues.apache.org/jira/browse/SPARK-36811
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelais Karavelas
>Priority: Major
> Fix For: 3.3.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add four new SQL functions operating on the `BINARY` data type for performing 
> bitwise operations: BITAND, BITOR, BITXOR, and BITNOT.
> The BITAND, BITOR, and BITXOR functions take two byte strings as input and 
> return the bitwise AND, OR, or XOR of the two input byte strings. The byte 
> size of the result is the maximum of the byte sizes of the inputs; the result 
> is computed by aligning the two strings on their least significant byte and 
> left-padding the shorter input with zeros.
> The BITNOT function is a unary function and returns the input byte string 
> with all the bits negated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36811) Add bitwise functions for the BINARY data type

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36811:


Assignee: Apache Spark

> Add bitwise functions for the BINARY data type
> --
>
> Key: SPARK-36811
> URL: https://issues.apache.org/jira/browse/SPARK-36811
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Menelais Karavelas
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add four new SQL functions operating on the `BINARY` data type for performing 
> bitwise operations: BITAND, BITOR, BITXOR, and BITNOT.
> The BITAND, BITOR, and BITXOR functions take two byte strings as input and 
> return the bitwise AND, OR, or XOR of the two input byte strings. The byte 
> size of the result is the maximum of the byte sizes of the inputs; the result 
> is computed by aligning the two strings on their least significant byte and 
> left-padding the shorter input with zeros.
> The BITNOT function is a unary function and returns the input byte string 
> with all the bits negated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36816) Introduce a config variable for the incrementalCollects row batch size

2021-09-21 Thread Ole (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ole updated SPARK-36816:

Description: 
After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
execute queries in batches (as intended). Unfortunately the batch size cannot 
be configured as it seems to be hardcoded 
[here|https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404].
 It would be useful to configure that value to be able to adjust it to your 
environment.

 

  was:
After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
execute queries in batches (as intended). Unfortunately the batch size cannot 
be configured as it seems to be hardcoded 
[here|[https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404]].
 It would be useful to configure that value to be able to adjust it to your 
environment.

edit: the link does not work - 
https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404

 


> Introduce a config variable for the incrementalCollects row batch size
> --
>
> Key: SPARK-36816
> URL: https://issues.apache.org/jira/browse/SPARK-36816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Ole
>Priority: Minor
>
> After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
> execute queries in batches (as intended). Unfortunately the batch size cannot 
> be configured as it seems to be hardcoded 
> [here|https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404].
>  It would be useful to configure that value to be able to adjust it to your 
> environment.
>  
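
Purely as an illustration of the requested knob, the sketch below shows how such 
a setting might be supplied at session start. The property names are copied from 
the ticket text; the batch-size key is hypothetical and does not exist in Spark.

{code:python}
# Hypothetical usage sketch. The batch-size property below is the setting this
# ticket proposes and does NOT exist in Spark today; the incremental-collect
# flag name is copied verbatim from the ticket text.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.thriftServer.incrementalCollects", "true")
    .config("spark.sql.thriftServer.incrementalCollects.batchSize", "5000")  # proposed
    .getOrCreate()
)
{code}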



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36816) Introduce a config variable for the incrementalCollects row batch size

2021-09-21 Thread Ole (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ole updated SPARK-36816:

Description: 
After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
execute queries in batches (as intended). Unfortunately the batch size cannot 
be configured as it seems to be hardcoded 
[here|[https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404]].
 It would be useful to configure that value to be able to adjust it to your 
environment.

edit: the link does not work - 
https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404

 

  was:
After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
execute queries in batches (as intended). Unfortunately the batch size cannot 
be configured as it seems to be hardcoded 
[here|[https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404]].
 It would be useful to configure that value to be able to adjust it to your 
environment.

 


> Introduce a config variable for the incrementalCollects row batch size
> --
>
> Key: SPARK-36816
> URL: https://issues.apache.org/jira/browse/SPARK-36816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Ole
>Priority: Minor
>
> After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
> execute queries in batches (as intended). Unfortunately the batch size cannot 
> be configured as it seems to be hardcoded 
> [here|[https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404]].
>  It would be useful to configure that value to be able to adjust it to your 
> environment.
> edit: the link does not work - 
> https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36816) Introduce a config variable for the incrementalCollects row batch size

2021-09-21 Thread Ole (Jira)
Ole created SPARK-36816:
---

 Summary: Introduce a config variable for the incrementalCollects 
row batch size
 Key: SPARK-36816
 URL: https://issues.apache.org/jira/browse/SPARK-36816
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Ole


After enabling *_spark.sql.thriftServer.incrementalCollects_* Thrift will 
execute queries in batches (as intended). Unfortunately the batch size cannot 
be configured as it seems to be hardcoded 
[here|[https://github.com/apache/spark/blob/6699f76fe2afa7f154b4ba424f3fe048fcee46df/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIServiceClient.java#L404]].
 It would be useful to configure that value to be able to adjust it to your 
environment.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36711) Support multi-index in new syntax

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36711:


Assignee: Apache Spark

> Support multi-index in new syntax
> -
>
> Key: SPARK-36711
> URL: https://issues.apache.org/jira/browse/SPARK-36711
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Support multi-index in the new syntax SPARK-36709



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36711) Support multi-index in new syntax

2021-09-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36711:


Assignee: (was: Apache Spark)

> Support multi-index in new syntax
> -
>
> Key: SPARK-36711
> URL: https://issues.apache.org/jira/browse/SPARK-36711
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Support multi-index in the new syntax SPARK-36709



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36711) Support multi-index in new syntax

2021-09-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418150#comment-17418150
 ] 

Apache Spark commented on SPARK-36711:
--

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/34058

> Support multi-index in new syntax
> -
>
> Key: SPARK-36711
> URL: https://issues.apache.org/jira/browse/SPARK-36711
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Support multi-index in the new syntax SPARK-36709



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418064#comment-17418064
 ] 

gaoyajun02 edited comment on SPARK-36815 at 9/21/21, 2:19 PM:
--

https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the fix 
is not in the Spark 3.0.2 branch.

Hi [~cloud_fan], can we open backport PRs for 3.0?


was (Author: gaoyajun02):
https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the fix 
is not in the Spark 3.0.2 branch.

Hi [~cloud_fan], can we open backport PRs for 3.0.2?

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
> Fix For: 3.0.2
>
>
> We are using Spark version 3.0.2 in production and some ETLs contain 
> multi-level CTEs, and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
> a as ( select name, get_json_object(json, '$.id') id, n from (
> select get_json_object(json, '$.name') name, json from values 
> ('{"name":"a", "id": 1}' ) people(json)
> ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
> b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) 
> c from a group by name) a2 on a1.name = a2.name)
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> In debugging I found that a reference to the root Project existed in both 
> subqueries. When `ResolveReferences` resolved the conflict, `rewrite` occurred 
> in both subqueries, producing two new attrMapping entries, and both were 
> eventually passed to the root Project, leading to this error.
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tuple2@17769} 

[jira] [Comment Edited] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418064#comment-17418064
 ] 

gaoyajun02 edited comment on SPARK-36815 at 9/21/21, 12:17 PM:
---

https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the fix 
is not in the Spark 3.0.2 branch.

Hi [~cloud_fan], can we open backport PRs for 3.0.2?


was (Author: gaoyajun02):
https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the fix 
is not in the Spark 3.0.2 branch

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
> Fix For: 3.0.2
>
>
> We are using Spark version 3.0.2 in production and some ETLs contain 
> multi-level CTEs, and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
> a as ( select name, get_json_object(json, '$.id') id, n from (
> select get_json_object(json, '$.name') name, json from values 
> ('{"name":"a", "id": 1}' ) people(json)
> ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
> b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) 
> c from a group by name) a2 on a1.name = a2.name)
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> In debugging I found that a reference to the root Project existed in both 
> subqueries. When `ResolveReferences` resolved the conflict, `rewrite` occurred 
> in both subqueries, producing two new attrMapping entries, and both were 
> eventually passed to the root Project, leading to this error.
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tuple2@17769} "(id#219,id#233)"
>  1 = {Tuple2@17770} "(id#219,id#234)"

[jira] [Commented] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418064#comment-17418064
 ] 

gaoyajun02 commented on SPARK-36815:


https://issues.apache.org/jira/browse/SPARK-33272 fixes this issue, but the fix 
is not in the Spark 3.0.2 branch

> Found duplicate rewrite attributes
> --
>
> Key: SPARK-36815
> URL: https://issues.apache.org/jira/browse/SPARK-36815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: gaoyajun02
>Priority: Major
> Fix For: 3.0.2
>
>
> We are using Spark version 3.0.2 in production and some ETLs contain 
> multi-level CTEs, and the following error occurs when we join them.
> {code:java}
> java.lang.AssertionError: assertion failed: Found duplicate rewrite 
> attributes at scala.Predef$.assert(Predef.scala:223) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) 
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
> {code}
> I reproduced the problem with a simplified SQL as follows:
> {code:java}
> -- SQL
> with
> a as ( select name, get_json_object(json, '$.id') id, n from (
> select get_json_object(json, '$.name') name, json from values 
> ('{"name":"a", "id": 1}' ) people(json)
> ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
> b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) 
> c from a group by name) a2 on a1.name = a2.name)
> select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
> In debugging I found that a reference to the root Project existed in both 
> subqueries. When `ResolveReferences` resolved the conflict, `rewrite` occurred 
> in both subqueries, producing two new attrMapping entries, and both were 
> eventually passed to the root Project, leading to this error.
> plan:
> {code:java}
> Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#219, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
>  newPlan:
> {code:java}
> !Project [name#218, id#219, n#229]
> +- Join LeftOuter, (name#218 = name#232)
>:- SubqueryAlias a1
>:  +- SubqueryAlias a
>: +- Project [name#218, get_json_object(json#225, $.id) AS id#233, 
> n#229]
>:+- Generate explode(array(1, 1, 2)), false, num, [n#229]
>:   +- SubqueryAlias __auto_generated_subquery_name
>:  +- Project [get_json_object(json#225, $.name) AS name#218, 
> json#225]
>: +- SubqueryAlias people
>:+- LocalRelation [json#225]
>+- SubqueryAlias a2
>   +- Aggregate [name#232], [name#232, count(1) AS c#220L]
>  +- SubqueryAlias a
> +- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
> n#230]
>+- Generate explode(array(1, 1, 2)), false, num, [n#230]
>   +- SubqueryAlias __auto_generated_subquery_name
>  +- Project [get_json_object(json#226, $.name) AS 
> name#232, json#226]
> +- SubqueryAlias people
>+- LocalRelation [json#226]
> {code}
> attrMapping:
> {code:java}
> attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
>  0 = {Tuple2@17769} "(id#219,id#233)"
>  1 = {Tuple2@17770} "(id#219,id#234)"
> {code}
>  
>  
>  
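
For anyone trying to reproduce this locally, a minimal driver is sketched below. 
The SQL text is the reproduction from this report; the session settings are 
assumptions, not part of the original report.

{code:python}
# Minimal reproduction driver (sketch). The SQL text is the reproduction from
# this report; on an affected 3.0.2 build, analysis should fail with the
# "Found duplicate rewrite attributes" AssertionError.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("spark-36815-repro").getOrCreate()

repro_sql = """
with
a as ( select name, get_json_object(json, '$.id') id, n from (
        select get_json_object(json, '$.name') name, json from values
        ('{"name":"a", "id": 1}' ) people(json)
    ) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
b as ( select a1.name, a1.id, a1.n from a a1 left join
       (select name, count(1) c from a group by name) a2 on a1.name = a2.name)
select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name
"""

spark.sql(repro_sql).show()
{code}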



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaoyajun02 updated SPARK-36815:
---
Description: 
We are using Spark version 3.0.2 in production and some ETLs contain 
multi-level CTEs, and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.name') name, json from values 
('{"name":"a", "id": 1}' ) people(json)
) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) c 
from a group by name) a2 on a1.name = a2.name)
select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
In debugging I found that a reference to the root Project existed in both 
subqueries. When `ResolveReferences` resolved the conflict, `rewrite` occurred in 
both subqueries, producing two new attrMapping entries, and both were eventually 
passed to the root Project, leading to this error.

plan:
{code:java}
Project [name#218, id#219, n#229]
+- Join LeftOuter, (name#218 = name#232)
   :- SubqueryAlias a1
   :  +- SubqueryAlias a
   : +- Project [name#218, get_json_object(json#225, $.id) AS id#219, n#229]
   :+- Generate explode(array(1, 1, 2)), false, num, [n#229]
   :   +- SubqueryAlias __auto_generated_subquery_name
   :  +- Project [get_json_object(json#225, $.name) AS name#218, 
json#225]
   : +- SubqueryAlias people
   :+- LocalRelation [json#225]
   +- SubqueryAlias a2
  +- Aggregate [name#232], [name#232, count(1) AS c#220L]
 +- SubqueryAlias a
+- Project [name#232, get_json_object(json#226, $.id) AS id#219, 
n#230]
   +- Generate explode(array(1, 1, 2)), false, num, [n#230]
  +- SubqueryAlias __auto_generated_subquery_name
 +- Project [get_json_object(json#226, $.name) AS name#232, 
json#226]
+- SubqueryAlias people
   +- LocalRelation [json#226]

{code}
 newPlan:
{code:java}

!Project [name#218, id#219, n#229]
+- Join LeftOuter, (name#218 = name#232)
   :- SubqueryAlias a1
   :  +- SubqueryAlias a
   : +- Project [name#218, get_json_object(json#225, $.id) AS id#233, n#229]
   :+- Generate explode(array(1, 1, 2)), false, num, [n#229]
   :   +- SubqueryAlias __auto_generated_subquery_name
   :  +- Project [get_json_object(json#225, $.name) AS name#218, 
json#225]
   : +- SubqueryAlias people
   :+- LocalRelation [json#225]
   +- SubqueryAlias a2
  +- Aggregate [name#232], [name#232, count(1) AS c#220L]
 +- SubqueryAlias a
+- Project [name#232, get_json_object(json#226, $.id) AS id#234, 
n#230]
   +- Generate explode(array(1, 1, 2)), false, num, [n#230]
  +- SubqueryAlias __auto_generated_subquery_name
 +- Project [get_json_object(json#226, $.name) AS name#232, 
json#226]
+- SubqueryAlias people
   +- LocalRelation [json#226]

{code}
attrMapping:
{code:java}
attrMapping = {ArrayBuffer@9099} "ArrayBuffer" size = 2
 0 = {Tuple2@17769} "(id#219,id#233)"
 1 = {Tuple2@17770} "(id#219,id#234)"
{code}
 

 

 

  was:
We are using Spark version 3.0.2 in production and some ETLs contain 
multi-level CTEs, and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, 

[jira] [Commented] (SPARK-36600) reduces memory consumption win Pyspark CreateDataFrame

2021-09-21 Thread Anton Kholodkov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418052#comment-17418052
 ] 

Anton Kholodkov commented on SPARK-36600:
-

Hello! I would like to help with the task. Please assign it to me, if possible 

> reduces memory consumption win Pyspark CreateDataFrame
> --
>
> Key: SPARK-36600
> URL: https://issues.apache.org/jira/browse/SPARK-36600
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.2
>Reporter: Philippe Prados
>Priority: Trivial
>  Labels: easyfix
> Attachments: optimize_memory_pyspark.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The Python method {{SparkSession._createFromLocal()}} starts by materializing 
> the data, creating a list if it is not already an instance of list. But this is 
> necessary only if the schema is not present.
> {quote}# make sure data could consumed multiple times
>  if not isinstance(data, list):
>   data = list(data)
> {quote}
> If you use {{createDataFrame(data=_a_generator_,...)}}, all the data is saved 
> in memory in a list, then converted to rows in memory, then converted to a 
> buffer in pickle format, etc.
> Two lists are present in memory at the same time: the list created by 
> _createFromLocal() and the list created later with
> {quote}# convert python objects to sql data
> data = [schema.toInternal(row) for row in data]
> {quote}
> The purpose of using a generator is to reduce the memory footprint when the 
> data are built dynamically.
> {quote}def _createFromLocal(self, data, schema):
>   """
>   Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>   the RDD and schema.
>   """
>   if schema is None or isinstance(schema, (list, tuple)):
>     *# make sure data could consumed multiple times*
>     *if inspect.isgeneratorfunction(data):* 
>       *data = list(data)*
>     struct = self._inferSchemaFromList(data, names=schema)
>     converter = _create_converter(struct)
>     data = map(converter, data)
>     if isinstance(schema, (list, tuple)):
>       for i, name in enumerate(schema):
>         struct.fields[i].name = name
>         struct.names[i] = name
>       schema = struct
>     elif not isinstance(schema, StructType):
>       raise TypeError("schema should be StructType or list or None, but got: 
> %s" % schema)
>   # convert python objects to sql data
>   data = [schema.toInternal(row) for row in data]
>   return self._sc.parallelize(data), schema{quote}
> It is therefore worthwhile to support a generator here.
>  
> {quote}The patch:
> diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
> index 57c680fd04..0dba590451 100644
> --- a/python/pyspark/sql/session.py
> +++ b/python/pyspark/sql/session.py
> @@ -15,6 +15,7 @@
>  # limitations under the License.
>  #
>  
> +import inspect
>  import sys
>  import warnings
>  from functools import reduce
> @@ -504,11 +505,11 @@ class SparkSession(SparkConversionMixin):
>  Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>  the RDD and schema.
>  """
> - # make sure data could consumed multiple times
> - if not isinstance(data, list):
> - data = list(data)
>  
>  if schema is None or isinstance(schema, (list, tuple)):
> + # make sure data could consumed multiple times
> + if inspect.isgeneratorfunction(data): # PPR
> + data = list(data)
>  struct = self._inferSchemaFromList(data, names=schema)
>  converter = _create_converter(struct)
>  data = map(converter, data)
> {quote}
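
To illustrate the idea outside of Spark internals, here is a small self-contained 
sketch; the names and helpers are illustrative, not Spark source. The input is 
materialized only when it has to be scanned twice, i.e. when the schema must be 
inferred first.

{code:python}
# Illustrative sketch only -- not Spark source code. A generator is materialized
# only when the data must be consumed twice (schema inference + conversion);
# with an explicit schema, a single streaming pass suffices.
from collections.abc import Iterator

def rows():
    for i in range(1_000_000):
        yield (i, f"name-{i}")

def prepare(data, schema=None):
    if schema is None:
        # Inference needs one full pass and conversion needs another,
        # so generator input has to be turned into a list here.
        if isinstance(data, Iterator):
            data = list(data)
        schema = ["_1", "_2"]  # stand-in for real schema inference
    # Single conversion pass: no intermediate list is built for generator input.
    converted = (tuple(row) for row in data)
    return converted, schema

converted, schema = prepare(rows(), schema=["id", "name"])
print(schema, next(converted))   # ['id', 'name'] (0, 'name-0')
{code}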



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36815) Found duplicate rewrite attributes

2021-09-21 Thread gaoyajun02 (Jira)
gaoyajun02 created SPARK-36815:
--

 Summary: Found duplicate rewrite attributes
 Key: SPARK-36815
 URL: https://issues.apache.org/jira/browse/SPARK-36815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2
Reporter: gaoyajun02
 Fix For: 3.0.2


We are using Spark version 3.0.2 in production and some ETLs contain 
multi-level CTEs, and the following error occurs when we join them.
{code:java}
java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes 
at scala.Predef$.assert(Predef.scala:223) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:207) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:405)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:403)
{code}
I reproduced the problem with a simplified SQL as follows:
{code:java}
-- SQL
with
a as ( select name, get_json_object(json, '$.id') id, n from (
select get_json_object(json, '$.name') name, json from values 
('{"name":"a", "id": 1}' ) people(json)
) LATERAL VIEW explode(array(1, 1, 2)) num as n ),
b as ( select a1.name, a1.id, a1.n from a a1 left join (select name, count(1) c 
from a group by name) a2 on a1.name = a2.name)
select b1.name, b1.n, b1.id from b b1 join b b2 on b1.name = b2.name;{code}
In debugging I found that a reference to the root Project existed in both 
subqueries. When `ResolveReferences` resolved the conflict, `rewrite` occurred in 
both subqueries, producing two new attrMapping entries, and both were eventually 
passed to the root Project, leading to this error.

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36807) Merge ANSI interval types to a tightest common type

2021-09-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36807.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34049
[https://github.com/apache/spark/pull/34049]

> Merge ANSI interval types to a tightest common type
> ---
>
> Key: SPARK-36807
> URL: https://issues.apache.org/jira/browse/SPARK-36807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Implement merging of ANSI interval types with different interval fields. Need 
> to change StructType to support schema merging. For instance, this will allow 
> to merge schemas loaded from parquet files.
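
A rough illustration of what a tightest common type means for interval fields is 
sketched below in plain Python; this is a conceptual sketch under that reading of 
the ticket, not Spark's implementation.

{code:python}
# Conceptual sketch only -- not Spark's implementation. Year-month interval
# fields are ordered (YEAR < MONTH); the tightest common type spans from the
# smallest start field to the largest end field of the two inputs.
FIELDS = {"YEAR": 0, "MONTH": 1}

def merge_interval_fields(a, b):
    """a and b are (startField, endField) pairs, e.g. ("YEAR", "YEAR")."""
    start = min(a[0], b[0], key=FIELDS.get)
    end = max(a[1], b[1], key=FIELDS.get)
    return (start, end)

# INTERVAL YEAR merged with INTERVAL MONTH widens to INTERVAL YEAR TO MONTH.
assert merge_interval_fields(("YEAR", "YEAR"), ("MONTH", "MONTH")) == ("YEAR", "MONTH")
{code}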



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org