[jira] [Created] (SPARK-31661) Document usage of blockSize
zhengruifeng created SPARK-31661: Summary: Document usage of blockSize Key: SPARK-31661 URL: https://issues.apache.org/jira/browse/SPARK-31661 Project: Spark Issue Type: Sub-task Components: Documentation, ML Affects Versions: 3.1.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31620) TreeNodeException: Binding attribute, tree: sum#19L
[ https://issues.apache.org/jira/browse/SPARK-31620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102216#comment-17102216 ] angerszhu commented on SPARK-31620: --- cc [~cloud_fan] > TreeNodeException: Binding attribute, tree: sum#19L > --- > > Key: SPARK-31620 > URL: https://issues.apache.org/jira/browse/SPARK-31620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > scala> spark.sql("create temporary view t1 as select * from values (1, 2) as > t1(a, b)") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("create temporary view t2 as select * from values (3, 4) as > t2(c, d)") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select sum(if(c > (select a from t1), d, 0)) as csum from > t2").show > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: sum#19L > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:368) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:427) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:427) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at 
scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doConsumeWithoutKeys$4(HashAggregateExec.scala:348) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithoutKeys(HashAggregateExec.scala:347) > at >
[jira] [Assigned] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31652: Assignee: Huaxin Gao > Add ANOVASelector and FValueSelector to PySpark > --- > > Key: SPARK-31652 > URL: https://issues.apache.org/jira/browse/SPARK-31652 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Add ANOVASelector and FValueSelector to PySpark
[jira] [Resolved] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31652. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28464 [https://github.com/apache/spark/pull/28464] > Add ANOVASelector and FValueSelector to PySpark > --- > > Key: SPARK-31652 > URL: https://issues.apache.org/jira/browse/SPARK-31652 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Add ANOVASelector and FValueSelector to PySpark
[jira] [Assigned] (SPARK-31659) Add VarianceThresholdSelector examples and doc
[ https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31659: Assignee: Huaxin Gao > Add VarianceThresholdSelector examples and doc > -- > > Key: SPARK-31659 > URL: https://issues.apache.org/jira/browse/SPARK-31659 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Add VarianceThresholdSelector examples and doc
[jira] [Resolved] (SPARK-31659) Add VarianceThresholdSelector examples and doc
[ https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31659. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28478 [https://github.com/apache/spark/pull/28478] > Add VarianceThresholdSelector examples and doc > -- > > Key: SPARK-31659 > URL: https://issues.apache.org/jira/browse/SPARK-31659 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add VarianceThresholdSelector examples and doc
[jira] [Commented] (SPARK-31588) merge small files may need more common setting
[ https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102195#comment-17102195 ] Hyukjin Kwon commented on SPARK-31588: -- repartition won't set a hard limit on the size. You should rather control the block size in HDFS. > merge small files may need more common setting > -- > > Key: SPARK-31588 > URL: https://issues.apache.org/jira/browse/SPARK-31588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 > Environment: spark:2.4.5 > hdp:2.7 >Reporter: philipse >Priority: Major > > Hi, > Spark SQL now allows us to use repartition or coalesce to manually control > small files, like the following > /*+ REPARTITION(1) */ > /*+ COALESCE(1) */ > But it can only be tuned case by case: we need to decide whether to > use COALESCE or REPARTITION. Can we try a more common way to remove this > decision by setting a target size, as Hive did? > *Good points:* > 1) we will also know the new partition number > 2) with an ON-OFF parameter provided, users can disable it if needed > 3) the parameter can be set at cluster level instead of on the user side, making it > easier to control small files > 4) greatly reduces the pressure on the NameNode > > *Not good points:* > 1) it will add a new task to calculate the target number by computing statistics on the output > files. > > I don't know whether we have planned this in future. > > Thanks
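The size-based merge the reporter asks for (as Hive does with its merge-small-files settings) reduces to deriving an output partition count from the total output size and a target file size. A minimal sketch of that calculation; `target_partitions` is a hypothetical helper for illustration, not a Spark API:

```python
import math


def target_partitions(total_output_bytes: int, target_file_bytes: int) -> int:
    """Derive how many output files to COALESCE/REPARTITION to so that each
    file lands near target_file_bytes (hypothetical helper, not Spark code)."""
    return max(1, math.ceil(total_output_bytes / target_file_bytes))


mb = 1024 * 1024
# e.g. ~1.3 GB of output with a 128 MB target -> 11 files
n = target_partitions(int(1.3 * 1024**3), 128 * mb)
```

A cluster-level setting like this would replace the per-query choice between `COALESCE` and `REPARTITION` hints with a single target-size knob, at the cost of an extra statistics pass over the output, as the issue notes.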
[jira] [Resolved] (SPARK-30660) LinearRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30660. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28471 [https://github.com/apache/spark/pull/28471] > LinearRegression blockify input vectors > --- > > Key: SPARK-30660 > URL: https://issues.apache.org/jira/browse/SPARK-30660 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0
[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102118#comment-17102118 ] Takeshi Yamamuro edited comment on SPARK-31583 at 5/8/20, 12:09 AM: > the order they were first seen in the specified grouping sets. Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently decides the order where Spark sees columns in a grouping-set clause if no column selected in a group-by clause: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555] I think the most promising approach to sort them in a predictable order is that you define them in a grouping-by clause, e.g., {code:java} select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc group by a, b, c, d -- selected in a preferable order GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) {code} The suggested approach based on ordinal positions in a select clause looks fine for simple cases, but how about the case where partial columns specified in a select clause? e.g., {code:java} select d, a, -- partially selected count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) {code} I personally think this makes the resolution logic complicated and more unpredictable. Btw, any other DBMS-like systems following the suggested one? If we change the behaviour, we'd better follow them. was (Author: maropu): > the order they were first seen in the specified grouping sets. Ah, I got it. Thanks for the explanation. 
Yea, as you imagined, Spark currently decides the order where Spark sees columns in a grouping-set clause if no column selected in a group-by clause: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555] I think the most promising approach to sort them in a predictable order is that you define them in a grouping-by clause, e.g., {code:java} select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc group by a, b, c, d -- selected in a preferable order GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) {code} The suggested approach based on ordinal positions in a select clause looks fine for simple cases, but how about the case where partial columns specified in a select clause? e.g., {code:java} select d, a, -- partially selected count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) {code} I personally think this makes the resolution logic complicated and more unpredictable. Btw, any other DBMS-like systems following your suggestion? > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to be happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. 
> > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c = 2; expected gid=2. received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually
[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102118#comment-17102118 ] Takeshi Yamamuro commented on SPARK-31583: -- > the order they were first seen in the specified grouping sets. Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently decides the order where Spark sees columns in a grouping-set clause if no column selected in a group-by clause: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555] I think the most promising approach to sort them in a predictable order is that you define them in a grouping-by clause, e.g., {code:java} select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc group by a, b, c, d -- selected in a preferable order GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) {code} The suggested approach based on ordinal positions in a select clause looks fine for simple cases, but how about the case where partial columns specified in a select clause? e.g., {code:java} select d, a, count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) {code} I personally think this makes the resolution logic complicated and a bit unpredictable. Btw, any other DBMS-like systems following your suggestion? 
> grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c = 2; expected gid=2. 
received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming grouping_id seem to be created as the grouping sets are > identified rather than by ordinal position in the parent query. > I'd like to at least point out that grouping_id is documented in many other > RDBMSs and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However many RDBMSs that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets.
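The bit-assignment disagreement in this thread can be reproduced in a few lines of plain Python. This is an illustrative sketch, not Spark internals: `grouping_id` below is a hypothetical helper that applies the SPARK-21858 exclusion convention against a chosen column ordering, showing how the ordinal (select-clause) basis and the first-seen (grouping-set) basis yield the reporter's expected vs. received gids.

```python
def grouping_id(grouping_set, column_order):
    """gid with bit i (MSB-first over column_order) set when that column is
    EXCLUDED from the grouping set (the SPARK-21858 exclusion convention)."""
    n = len(column_order)
    gid = 0
    for i, col in enumerate(column_order):
        if col not in grouping_set:
            gid |= 1 << (n - 1 - i)
    return gid


select_order = ["a", "b", "c", "d"]  # ordinal basis the reporter expects: a=8, b=4, c=2, d=1
seen_order = ["a", "b", "d", "c"]    # first-appearance order in the grouping sets: a=8, b=4, d=2, c=1

# The observed vs expected gids in the issue fall out of the two bases:
#   (a,b,d): expected 2 under select_order, received 1 under seen_order
#   (a,d):   expected 6 under select_order, received 5 under seen_order
```

The fix the reporter proposes amounts to always using `select_order` (or an explicit group-by list, as suggested in the comment) as the bit basis instead of `seen_order`.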
[jira] [Assigned] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31646: - Assignee: Manu Zhang > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor
[jira] [Resolved] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31646. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28457 [https://github.com/apache/spark/pull/28457] > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0
[jira] [Updated] (SPARK-31655) Upgrade snappy to version 1.1.7.5
[ https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31655: -- Component/s: (was: Spark Core) Build > Upgrade snappy to version 1.1.7.5 > - > > Key: SPARK-31655 > URL: https://issues.apache.org/jira/browse/SPARK-31655 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.1.0 > > > Upgrade snappy to version 1.1.7.5
[jira] [Resolved] (SPARK-31655) Upgrade snappy to version 1.1.7.5
[ https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31655. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28472 [https://github.com/apache/spark/pull/28472] > Upgrade snappy to version 1.1.7.5 > - > > Key: SPARK-31655 > URL: https://issues.apache.org/jira/browse/SPARK-31655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.1.0 > > > Upgrade snappy to version 1.1.7.5
[jira] [Assigned] (SPARK-31655) Upgrade snappy to version 1.1.7.5
[ https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31655: - Assignee: angerszhu > Upgrade snappy to version 1.1.7.5 > - > > Key: SPARK-31655 > URL: https://issues.apache.org/jira/browse/SPARK-31655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > > Upgrade snappy to version 1.1.7.5
[jira] [Updated] (SPARK-31654) sequence producing inconsistent intervals for month step
[ https://issues.apache.org/jira/browse/SPARK-31654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roman Yalki updated SPARK-31654: Description: Taking an example from [https://spark.apache.org/docs/latest/api/sql/] {code:java} > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 > month);{code} [2018-01-01,2018-02-01,2018-03-01] if one is to expand `stop` till the end of the year some intervals are returned as the last day of the month whereas first day of the month is expected {code:java} > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 > month){code} [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 2019-01-01] was: Taking an example from [https://spark.apache.org/docs/latest/api/sql/] {code:java} > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 > month);{code} [2018-01-01,2018-02-01,2018-03-01] if one is to expand `stop` till the end of the year some intervals are returned as the last day of the month whereas fist day of the month is expected {code:java} > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 > month){code} [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 2019-01-01] > sequence producing inconsistent intervals for month step > > > Key: SPARK-31654 > URL: https://issues.apache.org/jira/browse/SPARK-31654 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Roman Yalki >Priority: Major > > Taking an example from [https://spark.apache.org/docs/latest/api/sql/] > {code:java} > > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 > > month);{code} > [2018-01-01,2018-02-01,2018-03-01] > if one is to expand `stop` till the end of the year some intervals are > returned as the last day of the 
month whereas first day of the month is > expected > {code:java} > > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 > > month){code} > [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, > 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, > 2019-01-01] >
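The behavior the reporter expects can be sketched outside Spark: stepping by whole calendar months, anchored to the start date, always lands on the start's day-of-month (clamped to the target month's length). The `add_months` helper below is a hypothetical illustration of those semantics, not Spark's implementation of `sequence`.

```python
import calendar
from datetime import date

def add_months(d, n):
    """Hypothetical month stepping anchored to the start date: keep the
    start's day-of-month, clamped to the target month's length."""
    y, m = divmod(d.month - 1 + n, 12)
    y, m = d.year + y, m + 1
    day = min(d.day, calendar.monthrange(y, m)[1])  # 31 Jan + 1 mo -> 28/29 Feb
    return date(y, m, day)

# sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 month)
seq = [add_months(date(2018, 1, 1), i) for i in range(13)]
assert all(x.day == 1 for x in seq)   # every element on the 1st
assert seq[-1] == date(2019, 1, 1)    # ends exactly at `stop`
```

Under these anchored semantics no element of the sequence can drift to an end-of-month date, which is what makes the reported output surprising.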
[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-24266: - Target Version/s: 3.0.0, 3.1.0, 2.4.7 (was: 2.4.6, 3.0.0, 3.1.0) > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0, 3.0.0 >Reporter: Chun Chen >Priority: Major > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: 
docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, terminated=null, > waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >
[jira] [Resolved] (SPARK-31543) Backport SPARK-26306 More memory to de-flake SorterSuite
[ https://issues.apache.org/jira/browse/SPARK-31543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31543. -- Resolution: Won't Fix > Backport SPARK-26306 More memory to de-flake SorterSuite > -- > > Key: SPARK-31543 > URL: https://issues.apache.org/jira/browse/SPARK-31543 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > SPARK-26306 More memory to de-flake SorterSuite
[jira] [Resolved] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases
[ https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31538. -- Resolution: Won't Fix > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases > -- > > Key: SPARK-31538 > URL: https://issues.apache.org/jira/browse/SPARK-31538 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases
[jira] [Resolved] (SPARK-31541) Backport SPARK-26095 Disable parallelization in make-distibution.sh. (Avoid build hanging)
[ https://issues.apache.org/jira/browse/SPARK-31541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31541. -- Resolution: Won't Fix > Backport SPARK-26095 Disable parallelization in make-distibution.sh. > (Avoid build hanging) > > > Key: SPARK-31541 > URL: https://issues.apache.org/jira/browse/SPARK-31541 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-26095 Disable parallelization in make-distibution.sh. > (Avoid build hanging)
[jira] [Updated] (SPARK-26908) Fix toMillis
[ https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-26908: - Fix Version/s: (was: 2.4.6) 2.4.7 > Fix toMillis > > > Key: SPARK-26908 > URL: https://issues.apache.org/jira/browse/SPARK-26908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0, 2.4.7 > > > The toMillis() method of the DateTimeUtils object can produce inaccurate > result for some negative values. Minor differences can be around 1 ms. For > example: > {code} > input = -9223372036844776001L > {code} > should be converted to -9223372036844777L
[jira] [Commented] (SPARK-26908) Fix toMillis
[ https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101926#comment-17101926 ] Holden Karau commented on SPARK-26908: -- retagged to 2.4.7, will revisit if we end up cutting an RC2 > Fix toMillis > > > Key: SPARK-26908 > URL: https://issues.apache.org/jira/browse/SPARK-26908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0, 2.4.7 > > > The toMillis() method of the DateTimeUtils object can produce inaccurate > result for some negative values. Minor differences can be around 1 ms. For > example: > {code} > input = -9223372036844776001L > {code} > should be converted to -9223372036844777L
[jira] [Updated] (SPARK-30737) Reenable to generate Rd files
[ https://issues.apache.org/jira/browse/SPARK-30737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-30737: - Fix Version/s: (was: 2.4.5) 2.4.6 > Reenable to generate Rd files > - > > Key: SPARK-30737 > URL: https://issues.apache.org/jira/browse/SPARK-30737 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > In SPARK-30733, due to: > {code} > * creating vignettes ... ERROR > Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: > package 'htmltools' was installed by an R version with different > internals; it needs to be reinstalled for use with this R version > {code} > Generating Rd files was disabled. We should install the related packages > correctly and re-enable it.
[jira] [Updated] (SPARK-27262) Add explicit UTF-8 Encoding to DESCRIPTION
[ https://issues.apache.org/jira/browse/SPARK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-27262: - Fix Version/s: (was: 2.4.5) 2.4.6 > Add explicit UTF-8 Encoding to DESCRIPTION > -- > > Key: SPARK-27262 > URL: https://issues.apache.org/jira/browse/SPARK-27262 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Michael Chirico >Priority: Trivial > Fix For: 2.4.6, 3.0.0 > > > This will remove the following warning > {code} > Warning message: > roxygen2 requires Encoding: UTF-8 > {code}
[jira] [Assigned] (SPARK-30823) %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong documentation builds
[ https://issues.apache.org/jira/browse/SPARK-30823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau reassigned SPARK-30823: Assignee: David Toneian > %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong > documentation builds > - > > Key: SPARK-30823 > URL: https://issues.apache.org/jira/browse/SPARK-30823 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark, Windows >Affects Versions: 2.4.5 > Environment: Tested on Windows 10. >Reporter: David Toneian >Assignee: David Toneian >Priority: Minor > Fix For: 2.4.6 > > > When building the PySpark documentation on Windows, by changing directory to > {{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the > majority of the documentation may not be built if {{pyspark}} is not in the > default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly > dependencies) cannot be imported. > If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that > version of {{pyspark}} – as opposed to the version found above the > {{python/docs}} directory – that is considered when building the > documentation, which may result in documentation that does not correspond to > the development version one is trying to build. > {{python/docs/Makefile}} avoids this issue by setting > ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)?? > on line 10, but {{make2.bat}} does no such thing. The fix consists of adding > ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip?? > to {{make2.bat}}. > See [GitHub PR #27569|https://github.com/apache/spark/pull/27569].
[jira] [Updated] (SPARK-30823) %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong documentation builds
[ https://issues.apache.org/jira/browse/SPARK-30823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-30823: - Fix Version/s: 2.4.6 > %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong > documentation builds > - > > Key: SPARK-30823 > URL: https://issues.apache.org/jira/browse/SPARK-30823 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark, Windows >Affects Versions: 2.4.5 > Environment: Tested on Windows 10. >Reporter: David Toneian >Priority: Minor > Fix For: 2.4.6 > > > When building the PySpark documentation on Windows, by changing directory to > {{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the > majority of the documentation may not be built if {{pyspark}} is not in the > default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly > dependencies) cannot be imported. > If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that > version of {{pyspark}} – as opposed to the version found above the > {{python/docs}} directory – that is considered when building the > documentation, which may result in documentation that does not correspond to > the development version one is trying to build. > {{python/docs/Makefile}} avoids this issue by setting > ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)?? > on line 10, but {{make2.bat}} does no such thing. The fix consists of adding > ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip?? > to {{make2.bat}}. > See [GitHub PR #27569|https://github.com/apache/spark/pull/27569].
[jira] [Created] (SPARK-31660) Dataset.joinWith supports JoinType object as input parameter
Rex Xiong created SPARK-31660: - Summary: Dataset.joinWith supports JoinType object as input parameter Key: SPARK-31660 URL: https://issues.apache.org/jira/browse/SPARK-31660 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Reporter: Rex Xiong The current Dataset.joinWith API accepts joinType as a String; it doesn't support a JoinType object. I prefer a JoinType object (enum-like) over a String: there is less chance of a typo and it reads better {code:scala} def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] = {{code} https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala If I pass LeftOuter.sql as joinType, it throws an exception, since there is a white space in LeftOuter.sql {code:scala} case object LeftOuter extends JoinType { override def sql: String = "LEFT OUTER" } {code} while the JoinType constructor only removes underscores and doesn't handle white space: {code:scala} object JoinType { def apply(typ: String): JoinType = typ.toLowerCase(Locale.ROOT).replace("_", "") match { case "inner" => Inner case "outer" | "full" | "fullouter" => FullOuter case "leftouter" | "left" => LeftOuter case "rightouter" | "right" => RightOuter case "leftsemi" | "semi" => LeftSemi case "leftanti" | "anti" => LeftAnti case "cross" => Cross case _ => val supported = Seq( "inner", "outer", "full", "fullouter", "full_outer", "leftouter", "left", "left_outer", "rightouter", "right", "right_outer", "leftsemi", "left_semi", "semi", "leftanti", "left_anti", "anti", "cross") throw new IllegalArgumentException(s"Unsupported join type '$typ'. 
" + "Supported join types include: " + supported.mkString("'", "', '", "'") + ".") } }{code} https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala I suggest we either add another set of APIs which provide JoinType instead of String, or change JoinType.apply to remove white space as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0
[ https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31306: - Fix Version/s: 3.0.0 2.4.6 > rand() function documentation suggests an inclusive upper bound of 1.0 > -- > > Key: SPARK-31306 > URL: https://issues.apache.org/jira/browse/SPARK-31306 > Project: Spark > Issue Type: Documentation > Components: PySpark, R, Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Ben >Assignee: Ben >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > The rand() function in PySpark, Spark, and R is documented as drawing from > U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing > (i.e., for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing > `U[0.0, 1.0]` suggests the value returned could include 1.0). The function > itself uses Rand(), which is documented as having a result in the > range [0, 1).
[jira] [Updated] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging
[ https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31231: - Fix Version/s: 3.1.0 3.0.0 > Support setuptools 46.1.0+ in PySpark packaging > --- > > Key: SPARK-31231 > URL: https://issues.apache.org/jira/browse/SPARK-31231 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 2.4.6, 3.0.0, 3.1.0 > > > PIP packaging test started to fail (see > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/) > as of the setuptools 46.1.0 release. > In https://github.com/pypa/setuptools/issues/1424, they decided not to keep > the modes in {{package_data}}. In the PySpark pip installation, we keep the > executable scripts in {{package_data}} > https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and > expose their symbolic links as executable scripts. > So, the symbolic links (or copied scripts) execute the scripts copied from > {{package_data}}, which didn't keep the modes: > {code} > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > Permission denied > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > cannot execute: Permission denied > {code} > The current issue is being tracked at > https://github.com/pypa/setuptools/issues/2041
[jira] [Updated] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging
[ https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31231: - Fix Version/s: 2.4.6 > Support setuptools 46.1.0+ in PySpark packaging > --- > > Key: SPARK-31231 > URL: https://issues.apache.org/jira/browse/SPARK-31231 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 2.4.6 > > > PIP packaging test started to fail (see > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/) > as of the setuptools 46.1.0 release. > In https://github.com/pypa/setuptools/issues/1424, they decided not to keep > the modes in {{package_data}}. In the PySpark pip installation, we keep the > executable scripts in {{package_data}} > https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and > expose their symbolic links as executable scripts. > So, the symbolic links (or copied scripts) execute the scripts copied from > {{package_data}}, which didn't keep the modes: > {code} > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > Permission denied > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > cannot execute: Permission denied > {code} > The current issue is being tracked at > https://github.com/pypa/setuptools/issues/2041
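One generic workaround pattern for the "Permission denied" failure described above (a sketch only, not the fix Spark actually shipped) is to restore the execute bits on installed scripts after the `package_data` copy drops them:

```python
import os
import stat

def make_executable(path: str) -> None:
    """Re-add owner/group/other execute bits that a mode-stripping
    package_data copy no longer preserves."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Such a step would typically run in a post-install hook over the package's `bin/` directory; the hook location and wiring are assumptions here.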
[jira] [Commented] (SPARK-31659) Add VarianceThresholdSelector examples and doc
[ https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101879#comment-17101879 ] Apache Spark commented on SPARK-31659: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28478 > Add VarianceThresholdSelector examples and doc > -- > > Key: SPARK-31659 > URL: https://issues.apache.org/jira/browse/SPARK-31659 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add VarianceThresholdSelector examples and doc
[jira] [Commented] (SPARK-31659) Add VarianceThresholdSelector examples and doc
[ https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101877#comment-17101877 ] Apache Spark commented on SPARK-31659: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28478 > Add VarianceThresholdSelector examples and doc > -- > > Key: SPARK-31659 > URL: https://issues.apache.org/jira/browse/SPARK-31659 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add VarianceThresholdSelector examples and doc
[jira] [Assigned] (SPARK-31659) Add VarianceThresholdSelector examples and doc
[ https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31659: Assignee: Apache Spark > Add VarianceThresholdSelector examples and doc > -- > > Key: SPARK-31659 > URL: https://issues.apache.org/jira/browse/SPARK-31659 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > Add VarianceThresholdSelector examples and doc
[jira] [Assigned] (SPARK-31659) Add VarianceThresholdSelector examples and doc
[ https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31659: Assignee: (was: Apache Spark) > Add VarianceThresholdSelector examples and doc > -- > > Key: SPARK-31659 > URL: https://issues.apache.org/jira/browse/SPARK-31659 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Add VarianceThresholdSelector examples and doc
[jira] [Created] (SPARK-31659) Add VarianceThresholdSelector examples and doc
Huaxin Gao created SPARK-31659: -- Summary: Add VarianceThresholdSelector examples and doc Key: SPARK-31659 URL: https://issues.apache.org/jira/browse/SPARK-31659 Project: Spark Issue Type: New Feature Components: Documentation, ML Affects Versions: 3.1.0 Reporter: Huaxin Gao Add VarianceThresholdSelector examples and doc
[jira] [Commented] (SPARK-31620) TreeNodeException: Binding attribute, tree: sum#19L
[ https://issues.apache.org/jira/browse/SPARK-31620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101862#comment-17101862 ] angerszhu commented on SPARK-31620: --- This happens because PlanSubqueries leaves a scalar-subquery inside an Expression. Then, in the transformAllExpressions method, since the expression changed, QueryPlan.mapExpressions() makes a copy of the tree. For an aggregate expression, makeCopy produces a new DeclarativeAggregate whose lazy val inputAggBufferAttributes is null. When we reuse this value in *HashAggregateExec.doConsumeWithoutKeys*, inputAggBufferAttributes is re-initialized with a new exprId, and the error happens. In a correct SparkPlan, inputAggBufferAttributes is the same as the child's output. > TreeNodeException: Binding attribute, tree: sum#19L > --- > > Key: SPARK-31620 > URL: https://issues.apache.org/jira/browse/SPARK-31620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > scala> spark.sql("create temporary view t1 as select * from values (1, 2) as > t1(a, b)") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("create temporary view t2 as select * from values (3, 4) as > t2(c, d)") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select sum(if(c > (select a from t1), d, 0)) as csum from > t2").show > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: sum#19L > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:368) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:427) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:427) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at >
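The failure mode described in the comment above, a lazily initialized attribute that is rebuilt with fresh ids after a tree copy, can be illustrated outside Spark. This is only an analogy in Python with hypothetical names, not Spark code:

```python
import itertools

# Global id generator, standing in for exprId allocation.
_next_id = itertools.count()


class AggExpr:
    """Toy stand-in for a DeclarativeAggregate with a lazy attribute."""

    def __init__(self):
        # Lazily initialized, like the lazy val inputAggBufferAttributes.
        self._buf_attrs = None

    @property
    def buf_attrs(self):
        # First access mints fresh ids; later accesses return the cached list.
        if self._buf_attrs is None:
            self._buf_attrs = [next(_next_id)]
        return self._buf_attrs

    def make_copy(self):
        # A structural copy drops the cached lazy value.
        return AggExpr()


expr = AggExpr()
original = expr.buf_attrs        # ids minted here
copied = expr.make_copy().buf_attrs  # fresh ids on the copy
print(original == copied)        # False - binding by id would now fail
```

The point of the sketch: any consumer that binds references by the ids minted on the original object can no longer find them on the copy, which mirrors the exprId mismatch in the stack trace above.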
[jira] [Commented] (SPARK-31588) merge small files may need more common setting
[ https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101775#comment-17101775 ] philipse commented on SPARK-31588: -- For example: suppose the job outputs 3 files sized 10M, 50M, and 200M, and the block size is 128M. We want the file sizes to be close to the average, but we should also keep the target size at least as large as the block size, in case someone sets a wrong parameter. Case 1: we set the target size to 60M. The expected average file size is max(blockSize, 60M) = 128M, and the repartition number is the integer file count [total_file_size / average_file_size] + 1. The final result will be 3 files, sized 128M, 128M, and 4M. If we set the target size to 5120M, it will repartition into 1 file of 260M. Thus, we can set the target size as a global parameter, and it will benefit all tasks. > merge small files may need more common setting > -- > > Key: SPARK-31588 > URL: https://issues.apache.org/jira/browse/SPARK-31588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 > Environment: spark:2.4.5 > hdp:2.7 >Reporter: philipse >Priority: Major > > Hi , > SparkSql now allows us to use repartition or coalesce to manually control > small files, like the following > /*+ REPARTITION(1) */ > /*+ COALESCE(1) */ > But it can only be tuned case by case: we need to decide whether to > use COALESCE or REPARTITION. Can we try a more common way to avoid that > decision by setting a target size, as Hive did? > *Good points:* > 1) we will also get the new partition number > 2) with an ON-OFF parameter provided, users can disable it if needed > 3) the parameter can be set at cluster level instead of on the user side, making it > easier to control small files > 4) greatly reduces the pressure on the namenode > > *Not good points:* > 1) It adds a new task to calculate the target numbers from statistics of the output > files. > > I don't know whether we have planned this in future. 
> > Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
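The sizing rule in the comment above can be written down as a small helper. This is only an illustration of the proposed heuristic with hypothetical names, not Spark code, and it reproduces the comment's formula exactly, including the unconditional +1:

```python
def merge_file_count(file_sizes_mb, block_size_mb, target_size_mb):
    """Number of output files for the merge: total size divided by the
    effective average size, which is never allowed below the block size."""
    total = sum(file_sizes_mb)
    # Guard against a target smaller than the block size.
    avg = max(block_size_mb, target_size_mb)
    # Integer file count, per the comment: [total / avg] + 1.
    return total // avg + 1


sizes = [10, 50, 200]                      # 260M in total
print(merge_file_count(sizes, 128, 60))    # 3 files (~128M, 128M, 4M)
print(merge_file_count(sizes, 128, 5120))  # 1 file of 260M
```

Note that with this exact formula an input that divides evenly still gets one extra (empty-ish) file; a production version would likely use a ceiling division instead.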
[jira] [Updated] (SPARK-26908) Fix toMillis
[ https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-26908: Fix Version/s: 2.4.6 > Fix toMillis > > > Key: SPARK-26908 > URL: https://issues.apache.org/jira/browse/SPARK-26908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > > The toMillis() method of the DateTimeUtils object can produce inaccurate > result for some negative values. Minor differences can be around 1 ms. For > example: > {code} > input = -9223372036844776001L > {code} > should be converted to -9223372036844777L -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
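The off-by-one for negative values in SPARK-26908 above is the classic floor-versus-truncation distinction when scaling a timestamp down. A Python sketch, assuming toMillis scales microseconds to milliseconds by 1000, which is consistent with the example values in the issue:

```python
def to_millis_floor(micros):
    # Floor division rounds toward negative infinity, so the
    # leftover fraction of a millisecond stays in the earlier millisecond.
    return micros // 1000


def to_millis_trunc(micros):
    # Truncating division (what plain integer division does on a JVM Long)
    # rounds toward zero and is 1 ms off for some negative inputs.
    q = abs(micros) // 1000
    return -q if micros < 0 else q


micros = -9223372036844776001
print(to_millis_floor(micros))   # -9223372036844777, the value the issue expects
print(to_millis_trunc(micros))   # -9223372036844776, off by one
```

The two functions agree on all non-negative inputs and on negative inputs that are exact multiples of 1000; the issue's "minor differences around 1 ms" are exactly the remaining negative inputs.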
[jira] [Commented] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
[ https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101744#comment-17101744 ] Apache Spark commented on SPARK-31405: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28477 > fail by default when read/write datetime values and not sure if they need > rebase or not > --- > > Key: SPARK-31405 > URL: https://issues.apache.org/jira/browse/SPARK-31405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
[ https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101743#comment-17101743 ] Apache Spark commented on SPARK-31405: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28477 > fail by default when read/write datetime values and not sure if they need > rebase or not > --- > > Key: SPARK-31405 > URL: https://issues.apache.org/jira/browse/SPARK-31405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
[ https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31405: Assignee: Apache Spark > fail by default when read/write datetime values and not sure if they need > rebase or not > --- > > Key: SPARK-31405 > URL: https://issues.apache.org/jira/browse/SPARK-31405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
[ https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31405: Assignee: (was: Apache Spark) > fail by default when read/write datetime values and not sure if they need > rebase or not > --- > > Key: SPARK-31405 > URL: https://issues.apache.org/jira/browse/SPARK-31405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31337) Support MS Sql Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101705#comment-17101705 ] Gabor Somogyi commented on SPARK-31337: --- Started to work on this. > Support MS Sql Kerberos login in JDBC connector > --- > > Key: SPARK-31337 > URL: https://issues.apache.org/jira/browse/SPARK-31337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31485) Barrier stage can hang if only partial tasks launched
[ https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101690#comment-17101690 ] Apache Spark commented on SPARK-31485: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28476 > Barrier stage can hang if only partial tasks launched > - > > Key: SPARK-31485 > URL: https://issues.apache.org/jira/browse/SPARK-31485 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 2.4.6 > > > The issue can be reproduced by following test: > > {code:java} > initLocalClusterSparkContext(2) > val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2) > val dep = new OneToOneDependency[Int](rdd0) > val rdd = new MyRDD(sc, 2, List(dep), > Seq(Seq("executor_h_0"),Seq("executor_h_0"))) > rdd.barrier().mapPartitions { iter => > BarrierTaskContext.get().barrier() > iter > }.collect() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes
[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101626#comment-17101626 ] Abhishek Dixit edited comment on SPARK-31448 at 5/7/20, 12:34 PM: -- Let me try to explain the problem more. Please look at this code in pyspark/dataframe.py: {code:java} @since(1.3) def cache(self): """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}). .. note:: The default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0. """ self.is_cached = True self._jdf.cache() return self {code} The cache method of a pyspark dataframe directly calls scala's cache method. Hence the storage level used is based on the Scala default, i.e. StorageLevel(true, true, false, true), with deserialized equal to true. But since data from python is already serialized by the Pickle library, we should be using a storage level with deserialized = false for pyspark dataframes. If you look at the cache method in pyspark/rdd.py, it sets the storage level on the pyspark side and then calls the scala method with that parameter. Hence the correct storage level, with deserialized = false, is used in that case. {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self{code} We need to implement the cache method in dataframe.py the same way, to avoid the Scala default of deserialized = true was (Author: abhishekd0907): Let me try to explain the problem more. Please look at this code in pyspark/dataframe.py: {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self {code} The cache method of a pyspark dataframe directly calls scala's cache method. Hence the storage level used is based on the Scala default, i.e. StorageLevel(true, true, false, true), with deserialized equal to true. But since data from python is already serialized by the Pickle library, we should be using a storage level with deserialized = false for pyspark dataframes. If you look at the cache method in pyspark/rdd.py, it sets the storage level on the pyspark side and then calls the scala method with that parameter. Hence the correct storage level, with deserialized = false, is used in that case. {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self{code} We need to implement the cache method in dataframe.py the same way, to avoid the Scala default of deserialized = true > Difference in Storage Levels used in cache() and persist() for pyspark > dataframes > - > > Key: SPARK-31448 > URL: https://issues.apache.org/jira/browse/SPARK-31448 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Abhishek Dixit >Priority: Major > > There is a difference in default storage level *MEMORY_AND_DISK* in pyspark > and scala. > *Scala*: StorageLevel(true, true, false, true) > *Pyspark:* StorageLevel(True, True, False, False) > > *Problem Description:* > Calling *df.cache()* for pyspark dataframe directly invokes Scala method > cache() and Storage Level used is StorageLevel(true, true, false, true). > But calling *df.persist()* for pyspark dataframe sets the > newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and > then invokes Scala function persist(newStorageLevel). > *Possible Fix:* > Invoke pyspark function persist inside pyspark function cache instead of > calling the scala function directly. > I can raise a PR for this fix if someone can confirm that this is a bug and > the possible fix is the correct approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes
[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101626#comment-17101626 ] Abhishek Dixit edited comment on SPARK-31448 at 5/7/20, 12:33 PM: -- Let me try to explain the problem more. Please look at this code in pyspark/dataframe.py: {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self {code} The cache method of a pyspark dataframe directly calls scala's cache method. Hence the storage level used is based on the Scala default, i.e. StorageLevel(true, true, false, true), with deserialized equal to true. But since data from python is already serialized by the Pickle library, we should be using a storage level with deserialized = false for pyspark dataframes. If you look at the cache method in pyspark/rdd.py, it sets the storage level on the pyspark side and then calls the scala method with that parameter. Hence the correct storage level, with deserialized = false, is used in that case. {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self{code} We need to implement the cache method in dataframe.py the same way, to avoid the Scala default of deserialized = true was (Author: abhishekd0907): Let me try to explain the problem more. Please look at this code in pyspark/dataframe.py: {code:java} @since(1.3) def cache(self): """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}). .. note:: The default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0. """ self.is_cached = True self._jdf.cache() return self {code} The cache method of a pyspark dataframe directly calls scala's cache method. Hence the storage level used is based on the Scala default, i.e. StorageLevel(true, true, false, true), with deserialized equal to true. But since data from python is already serialized by the Pickle library, we should be using a storage level with deserialized = false for pyspark dataframes. If you look at the cache method in pyspark/rdd.py, it sets the storage level on the pyspark side and then calls the scala method with that parameter. Hence the correct storage level, with deserialized = false, is used in that case. {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self {code} We need to implement the cache method in dataframe.py the same way, to avoid the Scala default of deserialized = true > Difference in Storage Levels used in cache() and persist() for pyspark > dataframes > - > > Key: SPARK-31448 > URL: https://issues.apache.org/jira/browse/SPARK-31448 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Abhishek Dixit >Priority: Major > > There is a difference in default storage level *MEMORY_AND_DISK* in pyspark > and scala. > *Scala*: StorageLevel(true, true, false, true) > *Pyspark:* StorageLevel(True, True, False, False) > > *Problem Description:* > Calling *df.cache()* for pyspark dataframe directly invokes Scala method > cache() and Storage Level used is StorageLevel(true, true, false, true). > But calling *df.persist()* for pyspark dataframe sets the > newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and > then invokes Scala function persist(newStorageLevel). > *Possible Fix:* > Invoke pyspark function persist inside pyspark function cache instead of > calling the scala function directly. > I can raise a PR for this fix if someone can confirm that this is a bug and > the possible fix is the correct approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes
[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Dixit reopened SPARK-31448: Let me try to explain the problem more. Please look at this code in pyspark/dataframe.py: {code:java} @since(1.3) def cache(self): """Persists the :class:`DataFrame` with the default storage level (C{MEMORY_AND_DISK}). .. note:: The default storage level has changed to C{MEMORY_AND_DISK} to match Scala in 2.0. """ self.is_cached = True self._jdf.cache() return self {code} The cache method of a pyspark dataframe directly calls scala's cache method. Hence the storage level used is based on the Scala default, i.e. StorageLevel(true, true, false, true), with deserialized equal to true. But since data from python is already serialized by the Pickle library, we should be using a storage level with deserialized = false for pyspark dataframes. If you look at the cache method in pyspark/rdd.py, it sets the storage level on the pyspark side and then calls the scala method with that parameter. Hence the correct storage level, with deserialized = false, is used in that case. {code:java} def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). """ self.is_cached = True self.persist(StorageLevel.MEMORY_ONLY) return self {code} We need to implement the cache method in dataframe.py the same way, to avoid the Scala default of deserialized = true > Difference in Storage Levels used in cache() and persist() for pyspark > dataframes > - > > Key: SPARK-31448 > URL: https://issues.apache.org/jira/browse/SPARK-31448 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Abhishek Dixit >Priority: Major > > There is a difference in default storage level *MEMORY_AND_DISK* in pyspark > and scala. 
> *Scala*: StorageLevel(true, true, false, true) > *Pyspark:* StorageLevel(True, True, False, False) > > *Problem Description:* > Calling *df.cache()* for pyspark dataframe directly invokes Scala method > cache() and Storage Level used is StorageLevel(true, true, false, true). > But calling *df.persist()* for pyspark dataframe sets the > newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and > then invokes Scala function persist(newStorageLevel). > *Possible Fix:* > Invoke pyspark function persist inside pyspark function cache instead of > calling the scala function directly. > I can raise a PR for this fix if someone can confirm that this is a bug and > the possible fix is the correct approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
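The flag difference in the SPARK-31448 discussion above can be made concrete with a tiny model. This is a sketch of the proposed fix, not the actual pyspark source; the flag order (useDisk, useMemory, useOffHeap, deserialized) follows the values quoted in the issue, and ToyDataFrame is a hypothetical stand-in:

```python
from collections import namedtuple

StorageLevel = namedtuple(
    "StorageLevel", ["use_disk", "use_memory", "use_off_heap", "deserialized"])

# Scala default for MEMORY_AND_DISK: deserialized JVM objects.
SCALA_MEMORY_AND_DISK = StorageLevel(True, True, False, True)
# PySpark-side default: data arrives pickled, so keep it serialized.
PYSPARK_MEMORY_AND_DISK = StorageLevel(True, True, False, False)


class ToyDataFrame:
    """Sketch of the fix: cache() routes through persist() with the
    PySpark-side level instead of calling the JVM cache() directly."""

    def __init__(self):
        self.is_cached = False
        self.storage_level = None

    def persist(self, level=PYSPARK_MEMORY_AND_DISK):
        self.is_cached = True
        self.storage_level = level
        return self

    def cache(self):
        # Delegate to persist() so cache() and persist() agree.
        return self.persist(PYSPARK_MEMORY_AND_DISK)


df = ToyDataFrame().cache()
print(df.storage_level.deserialized)  # False - matches persist()
```

With the delegation in place, cache() can no longer pick up the Scala default of deserialized = true, which is the whole substance of the proposed fix.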
[jira] [Updated] (SPARK-31631) Fix test flakiness caused by MiniKdc which throws "address in use" BindException
[ https://issues.apache.org/jira/browse/SPARK-31631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31631: - Fix Version/s: (was: 3.1.0) 3.0.0 > Fix test flakiness caused by MiniKdc which throws "address in use" > BindException > > > Key: SPARK-31631 > URL: https://issues.apache.org/jira/browse/SPARK-31631 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > [info] org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED > *** (15 seconds, 426 milliseconds) > [info] java.net.BindException: Address already in use > [info] at sun.nio.ch.Net.bind0(Native Method) > [info] at sun.nio.ch.Net.bind(Net.java:433) > [info] at sun.nio.ch.Net.bind(Net.java:425) > [info] at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > [info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > [info] at > org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198) > [info] at > org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51) > [info] at > org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547) > [info] at > org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68) > [info] at > org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422) > [info] at > org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:748) > {code} > This is an issue fixed in hadoop 2.8.0 > 
https://issues.apache.org/jira/browse/HADOOP-12656 > We may need to apply the approach from HBASE first before we drop Hadoop 2.7.x > https://issues.apache.org/jira/browse/HBASE-14734 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
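The HBASE-14734 workaround referenced above amounts to retrying the MiniKdc setup on a freshly probed port when the bind fails. A generic Python sketch of that retry pattern; the service being started is hypothetical, and this is not the actual test code:

```python
import errno
import socket


def start_with_retry(start_service, attempts=3):
    """Call start_service(port) with a freshly probed free port,
    retrying when the port is taken between the probe and the bind."""
    last_error = None
    for _ in range(attempts):
        # Bind to port 0 so the OS picks a currently free port.
        with socket.socket() as probe:
            probe.bind(("127.0.0.1", 0))
            port = probe.getsockname()[1]
        try:
            return start_service(port)
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise
            last_error = e  # port was grabbed in the meantime; retry
    raise last_error


# Example: a service that fails once with "address in use", then starts.
calls = []


def flaky(port):
    calls.append(port)
    if len(calls) == 1:
        raise OSError(errno.EADDRINUSE, "Address already in use")
    return port


print(start_with_retry(flaky) > 0)  # True - second attempt succeeds
```

The race is inherent: a port probed as free can be taken before the service binds it, which is why a bounded retry loop, rather than a single probe, is the usual fix for this class of test flakiness.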
[jira] [Commented] (SPARK-26908) Fix toMillis
[ https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101581#comment-17101581 ] Apache Spark commented on SPARK-26908: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28475 > Fix toMillis > > > Key: SPARK-26908 > URL: https://issues.apache.org/jira/browse/SPARK-26908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The toMillis() method of the DateTimeUtils object can produce inaccurate > result for some negative values. Minor differences can be around 1 ms. For > example: > {code} > input = -9223372036844776001L > {code} > should be converted to -9223372036844777L -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26908) Fix toMillis
[ https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101580#comment-17101580 ] Apache Spark commented on SPARK-26908: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28475 > Fix toMillis > > > Key: SPARK-26908 > URL: https://issues.apache.org/jira/browse/SPARK-26908 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The toMillis() method of the DateTimeUtils object can produce inaccurate > result for some negative values. Minor differences can be around 1 ms. For > example: > {code} > input = -9223372036844776001L > {code} > should be converted to -9223372036844777L -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31658) SQL UI doesn't show write commands of AQE plan
[ https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31658: Assignee: (was: Apache Spark) > SQL UI doesn't show write commands of AQE plan > -- > > Key: SPARK-31658 > URL: https://issues.apache.org/jira/browse/SPARK-31658 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31658) SQL UI doesn't show write commands of AQE plan
[ https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31658: Assignee: Apache Spark > SQL UI doesn't show write commands of AQE plan > -- > > Key: SPARK-31658 > URL: https://issues.apache.org/jira/browse/SPARK-31658 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31658) SQL UI doesn't show write commands of AQE plan
[ https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101568#comment-17101568 ] Apache Spark commented on SPARK-31658: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/28474 > SQL UI doesn't show write commands of AQE plan > -- > > Key: SPARK-31658 > URL: https://issues.apache.org/jira/browse/SPARK-31658 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31658) SQL UI doesn't show write commands of AQE plan
Manu Zhang created SPARK-31658: -- Summary: SQL UI doesn't show write commands of AQE plan Key: SPARK-31658 URL: https://issues.apache.org/jira/browse/SPARK-31658 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: Manu Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29250) Upgrade to Hadoop 3.2.1
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101553#comment-17101553 ] Steve Loughran commented on SPARK-29250: I feel your pain > Upgrade to Hadoop 3.2.1 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31657) CSV Writer writes no header for empty DataFrames
Furcy Pin created SPARK-31657: - Summary: CSV Writer writes no header for empty DataFrames Key: SPARK-31657 URL: https://issues.apache.org/jira/browse/SPARK-31657 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.1 Environment: Local pyspark 2.4.1 Reporter: Furcy Pin When writing a DataFrame as csv with the Header option set to true, the header is not written when the DataFrame is empty. This causes failures for processes that read the csv back. Example (please notice the limit(0) in the second example): {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.4.1 /_/ Using Python version 2.7.17 (default, Nov 7 2019 10:07:09) SparkSession available as 'spark'. >>> df1 = spark.sql("SELECT 1 as a") >>> df1.limit(1).write.mode("OVERWRITE").option("Header", >>> True).csv("data/test/csv") >>> spark.read.option("Header", True).csv("data/test/csv").show() +---+ | a| +---+ | 1| +---+ >>> >>> df1.limit(0).write.mode("OVERWRITE").option("Header", >>> True).csv("data/test/csv") >>> spark.read.option("Header", True).csv("data/test/csv").show() ++ || ++ ++ {code} Expected behavior: {code:java} >>> df1.limit(0).write.mode("OVERWRITE").option("Header", >>> True).csv("data/test/csv") >>> spark.read.option("Header", True).csv("data/test/csv").show() +---+ | a| +---+ +---+{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
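The expected behaviour in SPARK-31657 above, a header row even when there are no data rows, is what Python's own csv module does; a minimal sketch outside Spark to make the contract explicit:

```python
import csv
import io


def write_csv(rows, header):
    """Serialize rows to CSV text, always emitting the header first."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(header)  # header is written unconditionally,
    w.writerows(rows)   # even when rows is empty
    return buf.getvalue()


print(repr(write_csv([], ["a"])))     # 'a\r\n' - header survives an empty frame
print(repr(write_csv([[1]], ["a"])))  # 'a\r\n1\r\n'
```

A reader can then always recover the schema from the file, which is exactly what breaks in the issue when Spark emits nothing for an empty DataFrame.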
[jira] [Updated] (SPARK-27990) Recursive data loading from file sources
[ https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27990: Summary: Recursive data loading from file sources (was: Provide a way to recursively load data from datasource) > Recursive data loading from file sources > > > Key: SPARK-27990 > URL: https://issues.apache.org/jira/browse/SPARK-27990 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Provide a way to recursively load data from datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org