[jira] [Created] (SPARK-31661) Document usage of blockSize

2020-05-07 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31661:


 Summary: Document usage of blockSize
 Key: SPARK-31661
 URL: https://issues.apache.org/jira/browse/SPARK-31661
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML
Affects Versions: 3.1.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31620) TreeNodeException: Binding attribute, tree: sum#19L

2020-05-07 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102216#comment-17102216
 ] 

angerszhu commented on SPARK-31620:
---

cc [~cloud_fan]

> TreeNodeException: Binding attribute, tree: sum#19L
> ---
>
> Key: SPARK-31620
> URL: https://issues.apache.org/jira/browse/SPARK-31620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> scala> spark.sql("create temporary view t1 as select * from values (1, 2) as 
> t1(a, b)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("create temporary view t2 as select * from values (3, 4) as 
> t2(c, d)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("select sum(if(c > (select a from t1), d, 0)) as csum from 
> t2").show
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: sum#19L
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:427)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:427)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doConsumeWithoutKeys$4(HashAggregateExec.scala:348)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithoutKeys(HashAggregateExec.scala:347)
>   at 
> 

[jira] [Assigned] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark

2020-05-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31652:


Assignee: Huaxin Gao

> Add ANOVASelector and FValueSelector to PySpark
> ---
>
> Key: SPARK-31652
> URL: https://issues.apache.org/jira/browse/SPARK-31652
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add ANOVASelector and FValueSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark

2020-05-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31652.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28464
[https://github.com/apache/spark/pull/28464]

> Add ANOVASelector and FValueSelector to PySpark
> ---
>
> Key: SPARK-31652
> URL: https://issues.apache.org/jira/browse/SPARK-31652
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.1.0
>
>
> Add ANOVASelector and FValueSelector to PySpark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31659:


Assignee: Huaxin Gao

> Add VarianceThresholdSelector examples and doc
> --
>
> Key: SPARK-31659
> URL: https://issues.apache.org/jira/browse/SPARK-31659
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31659.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28478
[https://github.com/apache/spark/pull/28478]

> Add VarianceThresholdSelector examples and doc
> --
>
> Key: SPARK-31659
> URL: https://issues.apache.org/jira/browse/SPARK-31659
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31588) merge small files may need more common setting

2020-05-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102195#comment-17102195
 ] 

Hyukjin Kwon commented on SPARK-31588:
--

Repartition won't set a hard limit on the file size. You should rather control 
the block size in HDFS.
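For reference, a minimal sketch of the two knobs in question (the table names 
below are made up): the repartition hint only controls the number of output 
files written, while the block size itself is a storage-side setting (e.g. 
dfs.blocksize in hdfs-site.xml).
{code:scala}
// Sketch only: the REPARTITION hint fixes the number of output files for this
// write; it does not impose a per-file size limit. Table names are illustrative.
spark.sql(
  """INSERT OVERWRITE TABLE small_files_target
    |SELECT /*+ REPARTITION(1) */ * FROM source_table""".stripMargin)
{code}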

> merge small files may need more common setting
> --
>
> Key: SPARK-31588
> URL: https://issues.apache.org/jira/browse/SPARK-31588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: spark:2.4.5
> hdp:2.7
>Reporter: philipse
>Priority: Major
>
> Hi ,
> Spark SQL now allows us to use repartition or coalesce hints to manually 
> control the small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case; we need to decide whether to use 
> COALESCE or REPARTITION. Can we try a more common way that avoids this 
> decision, by setting the target size as Hive does?
> *Good points:*
> 1) we will also know the new number of partitions
> 2) with an ON-OFF parameter provided, users can disable it if needed
> 3) the parameter can be set at the cluster level instead of on the user side, 
> which makes it easier to control small files
> 4) greatly reduces the pressure on the NameNode
>  
> *Not good points:*
> 1) it will add a new task to calculate the target numbers from statistics of 
> the output files
>  
> I don't know whether we have planned this for the future.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30660) LinearRegression blockify input vectors

2020-05-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30660.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28471
[https://github.com/apache/spark/pull/28471]

> LinearRegression blockify input vectors
> ---
>
> Key: SPARK-30660
> URL: https://issues.apache.org/jira/browse/SPARK-30660
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

2020-05-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102118#comment-17102118
 ] 

Takeshi Yamamuro edited comment on SPARK-31583 at 5/8/20, 12:09 AM:


> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select
  d, a, -- partially selected
  count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and more 
unpredictable. Btw, any other DBMS-like systems following the suggested one? If 
we change the behaviour, we'd better follow them.


was (Author: maropu):
> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select
  d, a, -- partially selected
  count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and more 
unpredictable. Btw, any other DBMS-like systems following your suggestion?

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion: when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually 

[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

2020-05-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102118#comment-17102118
 ] 

Takeshi Yamamuro edited comment on SPARK-31583 at 5/8/20, 12:06 AM:


> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select
  d, a, -- partially selected
  count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and more 
unpredictable. Btw, any other DBMS-like systems following your suggestion?


was (Author: maropu):
> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select
  d, a, -- partially selected
  count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and a bit 
unpredictable. Btw, any other DBMS-like systems following your suggestion?

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion: when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""

[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

2020-05-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102118#comment-17102118
 ] 

Takeshi Yamamuro edited comment on SPARK-31583 at 5/8/20, 12:02 AM:


> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select
  d, a, -- partially selected
  count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and a bit 
unpredictable. Btw, any other DBMS-like systems following your suggestion?


was (Author: maropu):
> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select d, a, count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and a bit 
unpredictable. Btw, any other DBMS-like systems following your suggestion?

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion: when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select 

[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved

2020-05-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102118#comment-17102118
 ] 

Takeshi Yamamuro commented on SPARK-31583:
--

> the order they were first seen in the specified grouping sets.

Ah, I got it. Thanks for the explanation. Yea, as you imagined, Spark currently 
decides the order based on where it sees the columns in the grouping-set clause 
if no columns are selected in the group-by clause: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]

I think the most promising approach to sort them in a predictable order is that 
you define them in a grouping-by clause, e.g.,
{code:java}
select a, b, c, d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by 
  a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
The suggested approach based on ordinal positions in a select clause looks fine 
for simple cases, but how about the case where only some of the columns are 
specified in the select clause? e.g.,
{code:java}
select d, a, count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and a bit 
unpredictable. Btw, any other DBMS-like systems following your suggestion?
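As a further sketch of the first option (assuming the abc temp view from the 
issue quoted below), you can also pass the group-by columns to grouping_id() 
explicitly in the same order; per the suggestion above, the bits should then 
follow that order (a=8, b=4, c=2, d=1):
{code:scala}
// Sketch only: spell out the group-by columns and, optionally, the grouping_id()
// arguments in the desired order instead of relying on the grouping-set order.
spark.sql("""
  select a, b, c, d, count(*),
         grouping_id(a, b, c, d)      as gid,
         bin(grouping_id(a, b, c, d)) as gid_bin
  from abc
  group by a, b, c, d
  grouping sets ((), (a,b,d), (a,c), (a,d))
""").show(false)
{code}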

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping_set rather than inclusion: when performing complex 
> grouping_sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c) 
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  columns forming the grouping_id seem to be created as the grouping sets are 
> identified rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should use a policy of flipping the 
> bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id feature implement it by the 
> ordinal position of the fields recognized in the select clause, rather than 
> allocating them as they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2020-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31646:
-

Assignee: Manu Zhang

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2020-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31646.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28457
[https://github.com/apache/spark/pull/28457]

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31655) Upgrade snappy to version 1.1.7.5

2020-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31655:
--
Component/s: (was: Spark Core)
 Build

> Upgrade snappy to version 1.1.7.5
> -
>
> Key: SPARK-31655
> URL: https://issues.apache.org/jira/browse/SPARK-31655
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.1.0
>
>
> Upgrade snappy to version 1.1.7.5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31655) Upgrade snappy to version 1.1.7.5

2020-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31655.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28472
[https://github.com/apache/spark/pull/28472]

> Upgrade snappy to version 1.1.7.5
> -
>
> Key: SPARK-31655
> URL: https://issues.apache.org/jira/browse/SPARK-31655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.1.0
>
>
> Upgrade snappy to version 1.1.7.5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31655) Upgrade snappy to version 1.1.7.5

2020-05-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31655:
-

Assignee: angerszhu

> Upgrade snappy to version 1.1.7.5
> -
>
> Key: SPARK-31655
> URL: https://issues.apache.org/jira/browse/SPARK-31655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>
> Upgrade snappy to version 1.1.7.5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31654) sequence producing inconsistent intervals for month step

2020-05-07 Thread Roman Yalki (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Yalki updated SPARK-31654:

Description: 
Taking an example from [https://spark.apache.org/docs/latest/api/sql/]
{code:java}
> SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 
> month);{code}
[2018-01-01,2018-02-01,2018-03-01]

If one expands `stop` to the end of the year, some intervals are returned as 
the last day of the month, whereas the first day of the month is expected:
{code:java}
> SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 
> month){code}
[2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 
2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 
2019-01-01]

 

  was:
Taking an example from [https://spark.apache.org/docs/latest/api/sql/]
{code:java}
> SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 
> month);{code}
[2018-01-01,2018-02-01,2018-03-01]

if one is to expand `stop` till the end of the year some intervals are returned 
as the last day of the month whereas fist day of the month is expected
{code:java}
> SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 
> month){code}
[2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 
2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 
2019-01-01]

 


> sequence producing inconsistent intervals for month step
> 
>
> Key: SPARK-31654
> URL: https://issues.apache.org/jira/browse/SPARK-31654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Roman Yalki
>Priority: Major
>
> Taking an example from [https://spark.apache.org/docs/latest/api/sql/]
> {code:java}
> > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 
> > month);{code}
> [2018-01-01,2018-02-01,2018-03-01]
> If one expands `stop` to the end of the year, some intervals are returned as 
> the last day of the month, whereas the first day of the month is expected:
> {code:java}
> > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 
> > month){code}
> [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 
> 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 
> 2019-01-01]
>  
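A possible workaround sketch, untested against the affected versions and using 
only functions available since Spark 2.4: generate the month offsets as 
integers and apply add_months, so every element is derived directly from the 
first-of-month start date.
{code:scala}
// Sketch only: build dates from integer month offsets instead of a date sequence.
spark.sql("""
  SELECT transform(sequence(0, 12),
                   m -> add_months(to_date('2018-01-01'), m)) AS months
""").show(false)
{code}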



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-24266:
-
Target Version/s: 3.0.0, 3.1.0, 2.4.7  (was: 2.4.6, 3.0.0, 3.1.0)

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>   

[jira] [Resolved] (SPARK-31543) Backport SPARK-26306 More memory to de-flake SorterSuite

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31543.
--
Resolution: Won't Fix

> Backport SPARK-26306   More memory to de-flake SorterSuite
> --
>
> Key: SPARK-31543
> URL: https://issues.apache.org/jira/browse/SPARK-31543
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> SPARK-26306       More memory to de-flake SorterSuite



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31538.
--
Resolution: Won't Fix

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31541) Backport SPARK-26095 Disable parallelization in make-distibution.sh. (Avoid build hanging)

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31541.
--
Resolution: Won't Fix

> Backport SPARK-26095   Disable parallelization in make-distibution.sh. 
> (Avoid build hanging)
> 
>
> Key: SPARK-31541
> URL: https://issues.apache.org/jira/browse/SPARK-31541
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-26095       Disable parallelization in make-distibution.sh. 
> (Avoid build hanging)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26908) Fix toMillis

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-26908:
-
Fix Version/s: (was: 2.4.6)
   2.4.7

> Fix toMillis
> 
>
> Key: SPARK-26908
> URL: https://issues.apache.org/jira/browse/SPARK-26908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0, 2.4.7
>
>
> The toMillis() method of the DateTimeUtils object can produce an inaccurate 
> result for some negative values. Minor differences can be around 1 ms. For 
> example:
> {code}
> input = -9223372036844776001L
> {code}
> should be converted to -9223372036844777L
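To make the rounding concrete, a small standalone sketch (plain arithmetic, not 
the DateTimeUtils code): microseconds must be floored, not truncated toward 
zero, when converted to milliseconds.
{code:scala}
// The off-by-one comes from integer division rounding toward zero for negative
// values instead of rounding toward negative infinity.
val micros = -9223372036844776001L
val truncatedMillis = micros / 1000L               // -9223372036844776 (wrong)
val flooredMillis   = Math.floorDiv(micros, 1000L) // -9223372036844777 (expected)
{code}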



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26908) Fix toMillis

2020-05-07 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101926#comment-17101926
 ] 

Holden Karau commented on SPARK-26908:
--

Retagged to 2.4.7; will revisit if we end up cutting an RC2.

> Fix toMillis
> 
>
> Key: SPARK-26908
> URL: https://issues.apache.org/jira/browse/SPARK-26908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0, 2.4.7
>
>
> The toMillis() method of the DateTimeUtils object can produce an inaccurate 
> result for some negative values. Minor differences can be around 1 ms. For 
> example:
> {code}
> input = -9223372036844776001L
> {code}
> should be converted to -9223372036844777L



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30737) Reenable to generate Rd files

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-30737:
-
Fix Version/s: (was: 2.4.5)
   2.4.6

> Reenable to generate Rd files
> -
>
> Key: SPARK-30737
> URL: https://issues.apache.org/jira/browse/SPARK-30737
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> In SPARK-30733, due to:
> {code}
> * creating vignettes ... ERROR
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> package 'htmltools' was installed by an R version with different 
> internals; it needs to be reinstalled for use with this R version
> {code}
> Because of this, generating Rd files was disabled. We should install the 
> related packages correctly and re-enable it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27262) Add explicit UTF-8 Encoding to DESCRIPTION

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-27262:
-
Fix Version/s: (was: 2.4.5)
   2.4.6

> Add explicit UTF-8 Encoding to DESCRIPTION
> --
>
> Key: SPARK-27262
> URL: https://issues.apache.org/jira/browse/SPARK-27262
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Michael Chirico
>Priority: Trivial
> Fix For: 2.4.6, 3.0.0
>
>
> This will remove the following warning
> {code}
> Warning message:
> roxygen2 requires Encoding: UTF-8 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30823) %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong documentation builds

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau reassigned SPARK-30823:


Assignee: David Toneian

> %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong 
> documentation builds
> -
>
> Key: SPARK-30823
> URL: https://issues.apache.org/jira/browse/SPARK-30823
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark, Windows
>Affects Versions: 2.4.5
> Environment: Tested on Windows 10.
>Reporter: David Toneian
>Assignee: David Toneian
>Priority: Minor
> Fix For: 2.4.6
>
>
> When building the PySpark documentation on Windows, by changing directory to 
> {{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the 
> majority of the documentation may not be built if {{pyspark}} is not in the 
> default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly 
> dependencies) cannot be imported.
> If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that 
> version of {{pyspark}} – as opposed to the version found above the 
> {{python/docs}} directory – that is considered when building the 
> documentation, which may result in documentation that does not correspond to 
> the development version one is trying to build.
> {{python/docs/Makefile}} avoids this issue by setting
>  ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)??
>  on line 10, but {{make2.bat}} does no such thing. The fix consists of adding
>  ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip??
>  to {{make2.bat}}.
> See [GitHub PR #27569|https://github.com/apache/spark/pull/27569].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30823) %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong documentation builds

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-30823:
-
Fix Version/s: 2.4.6

> %PYTHONPATH% not set in python/docs/make2.bat, resulting in failed/wrong 
> documentation builds
> -
>
> Key: SPARK-30823
> URL: https://issues.apache.org/jira/browse/SPARK-30823
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark, Windows
>Affects Versions: 2.4.5
> Environment: Tested on Windows 10.
>Reporter: David Toneian
>Priority: Minor
> Fix For: 2.4.6
>
>
> When building the PySpark documentation on Windows, by changing directory to 
> {{python/docs}} and running {{make.bat}} (which runs {{make2.bat}}), the 
> majority of the documentation may not be built if {{pyspark}} is not in the 
> default {{%PYTHONPATH%}}. Sphinx then reports that {{pyspark}} (and possibly 
> dependencies) cannot be imported.
> If {{pyspark}} is in the default {{%PYTHONPATH%}}, I suppose it is that 
> version of {{pyspark}} – as opposed to the version found above the 
> {{python/docs}} directory – that is considered when building the 
> documentation, which may result in documentation that does not correspond to 
> the development version one is trying to build.
> {{python/docs/Makefile}} avoids this issue by setting
>  ??export PYTHONPATH=$(realpath ..):$(realpath ../lib/py4j-0.10.8.1-src.zip)??
>  on line 10, but {{make2.bat}} does no such thing. The fix consists of adding
>  ??set PYTHONPATH=..;..\lib\py4j-0.10.8.1-src.zip??
>  to {{make2.bat}}.
> See [GitHub PR #27569|https://github.com/apache/spark/pull/27569].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31660) Dataset.joinWith supports JoinType object as input parameter

2020-05-07 Thread Rex Xiong (Jira)
Rex Xiong created SPARK-31660:
-

 Summary: Dataset.joinWith supports JoinType object as input 
parameter
 Key: SPARK-31660
 URL: https://issues.apache.org/jira/browse/SPARK-31660
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
Reporter: Rex Xiong


The current Dataset.joinWith API accepts a String joinType; it doesn't support a 
JoinType object.
I prefer a JoinType object (like an enum) over a String: there is less chance of 
a typo and it reads better.
{code:scala}
def joinWith[U](other: Dataset[U], condition: Column, joinType: String): 
Dataset[(T, U)] = {{code}
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala


If I pass LeftOuter.sql as joinType, it will throw an exception, since there is 
a whitespace in LeftOuter.sql:
{code:scala}
case object LeftOuter extends JoinType {
  override def sql: String = "LEFT OUTER"
}
{code}
The constructor of JoinType only removes underscores and doesn't handle 
whitespace:
{code:scala}
object JoinType {
  def apply(typ: String): JoinType = typ.toLowerCase(Locale.ROOT).replace("_", 
"") match {
case "inner" => Inner
case "outer" | "full" | "fullouter" => FullOuter
case "leftouter" | "left" => LeftOuter
case "rightouter" | "right" => RightOuter
case "leftsemi" | "semi" => LeftSemi
case "leftanti" | "anti" => LeftAnti
case "cross" => Cross
case _ =>
  val supported = Seq(
"inner",
"outer", "full", "fullouter", "full_outer",
"leftouter", "left", "left_outer",
"rightouter", "right", "right_outer",
"leftsemi", "left_semi", "semi",
"leftanti", "left_anti", "anti",
"cross")

  throw new IllegalArgumentException(s"Unsupported join type '$typ'. " +
"Supported join types include: " + supported.mkString("'", "', '", "'") 
+ ".")
  }
}{code}
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala

I suggest we either add another set of APIs that accept a JoinType instead of a 
String, or change JoinType.apply to strip whitespace as well.
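For illustration, a minimal sketch of the second option (the helper name 
normalizeJoinType is made up here, not Spark API): strip whitespace as well as 
underscores before matching, so that values such as LeftOuter.sql are accepted.
{code:scala}
import java.util.Locale

// Hypothetical helper sketching the proposed normalization: lower-case the input
// and drop both underscores and whitespace before matching against the known
// join type names, so "LEFT OUTER" (LeftOuter.sql) behaves like "left_outer".
def normalizeJoinType(typ: String): String =
  typ.toLowerCase(Locale.ROOT).replaceAll("[_\\s]+", "")

normalizeJoinType("LEFT OUTER")  // "leftouter"
normalizeJoinType("left_outer")  // "leftouter"
{code}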



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-31306:
-
Fix Version/s: 3.0.0
   2.4.6

> rand() function documentation suggests an inclusive upper bound of 1.0
> --
>
> Key: SPARK-31306
> URL: https://issues.apache.org/jira/browse/SPARK-31306
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Ben
>Assignee: Ben
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
>  The rand() function in PySpark, Spark, and R is documented as drawing from 
> U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing 
> (i.e. for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing 
> `U[0.0, 1.0]` suggests the value returned could include 1.0). The function 
> itself uses Rand(), which is [documented |#L71] as having a result in the 
> range [0, 1).
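For a quick sketch of the half-open range in question (sampling values does not 
by itself prove the bound is exclusive, so this is only an illustration):
{code:scala}
import org.apache.spark.sql.functions.rand

// rand() is documented as drawing uniformly from [0.0, 1.0), so 1.0 should
// never appear among the sampled values.
spark.range(5).select(rand(42L).alias("u")).show()
{code}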



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-31231:
-
Fix Version/s: 3.1.0
   3.0.0

> Support setuptools 46.1.0+ in PySpark packaging
> ---
>
> Key: SPARK-31231
> URL: https://issues.apache.org/jira/browse/SPARK-31231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 2.4.6, 3.0.0, 3.1.0
>
>
> PIP packaging test started to fail (see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)
> as of the setuptools 46.1.0 release.
> In https://github.com/pypa/setuptools/issues/1424, they decided not to keep 
> the modes in {{package_data}}. In PySpark pip installation, we keep the 
> executable scripts in {{package_data}} 
> https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and 
> expose their symbolic links as executable scripts.
> So, the symbolic links (or copied scripts) execute the scripts copied from 
> {{package_data}}, which didn't keep the modes:
> {code}
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> Permission denied
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> cannot execute: Permission denied
> {code}
> The current issue is being tracked at 
> https://github.com/pypa/setuptools/issues/2041



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging

2020-05-07 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-31231:
-
Fix Version/s: 2.4.6

> Support setuptools 46.1.0+ in PySpark packaging
> ---
>
> Key: SPARK-31231
> URL: https://issues.apache.org/jira/browse/SPARK-31231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 2.4.6
>
>
> PIP packaging test started to fail (see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)
> as of the setuptools 46.1.0 release.
> In https://github.com/pypa/setuptools/issues/1424, they decided not to keep 
> the modes in {{package_data}}. In PySpark pip installation, we keep the 
> executable scripts in {{package_data}} 
> https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and 
> expose their symbolic links as executable scripts.
> So, the symbolic links (or copied scripts) execute the scripts copied from 
> {{package_data}}, which didn't keep the modes:
> {code}
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> Permission denied
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> cannot execute: Permission denied
> {code}
> The current issue is being tracked at 
> https://github.com/pypa/setuptools/issues/2041



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101879#comment-17101879
 ] 

Apache Spark commented on SPARK-31659:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28478

> Add VarianceThresholdSelector examples and doc
> --
>
> Key: SPARK-31659
> URL: https://issues.apache.org/jira/browse/SPARK-31659
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101877#comment-17101877
 ] 

Apache Spark commented on SPARK-31659:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28478

> Add VarianceThresholdSelector examples and doc
> --
>
> Key: SPARK-31659
> URL: https://issues.apache.org/jira/browse/SPARK-31659
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31659:


Assignee: Apache Spark

> Add VarianceThresholdSelector examples and doc
> --
>
> Key: SPARK-31659
> URL: https://issues.apache.org/jira/browse/SPARK-31659
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31659:


Assignee: (was: Apache Spark)

> Add VarianceThresholdSelector examples and doc
> --
>
> Key: SPARK-31659
> URL: https://issues.apache.org/jira/browse/SPARK-31659
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31659) Add VarianceThresholdSelector examples and doc

2020-05-07 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31659:
--

 Summary: Add VarianceThresholdSelector examples and doc
 Key: SPARK-31659
 URL: https://issues.apache.org/jira/browse/SPARK-31659
 Project: Spark
  Issue Type: New Feature
  Components: Documentation, ML
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add VarianceThresholdSelector examples and doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31620) TreeNodeException: Binding attribute, tree: sum#19L

2020-05-07 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101862#comment-17101862
 ] 

angerszhu commented on SPARK-31620:
---

This is because, in PlanSubqueries, a scalar-subquery is left inside an expression.

Then, in the transformAllExpressions method, since the expression changed, 
{{QueryPlan.mapExpressions()}} makes a copy of the tree. For an aggregate expression, 
makeCopy returns a new DeclarativeAggregate whose lazy val 
{{inputAggBufferAttributes}} has not been initialized yet. When we later use this 
value in *HashAggregateExec.doConsumeWithoutKeys*, 
{{inputAggBufferAttributes}} is re-initialized with a new exprId, and the binding 
error happens.

In a correct SparkPlan, inputAggBufferAttributes is the same as the child's output.
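
As a small, self-contained illustration of the underlying hazard (my own sketch, not 
Spark code): a lazy val that allocates a fresh id is re-evaluated on every copy, so the 
copied node no longer carries the id that the rest of the plan was bound against.
{code:scala}
// Standalone sketch of the lazy-val-on-copy hazard (illustration only, not Spark code).
import java.util.concurrent.atomic.AtomicLong

object ExprIds {
  private val counter = new AtomicLong(0)
  def next(): Long = counter.incrementAndGet()
}

// Stands in for a DeclarativeAggregate whose inputAggBufferAttributes is a lazy val.
case class FakeAggregate(name: String) {
  lazy val bufferId: Long = ExprIds.next()
}

object LazyValCopyDemo extends App {
  val original = FakeAggregate("sum")
  println(original.bufferId)      // e.g. 1 -- the id other parts of the "plan" were bound to
  val copied = original.copy()    // roughly what a tree copy (makeCopy) does
  println(copied.bufferId)        // e.g. 2 -- the lazy val re-initializes with a new id
}
{code}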

> TreeNodeException: Binding attribute, tree: sum#19L
> ---
>
> Key: SPARK-31620
> URL: https://issues.apache.org/jira/browse/SPARK-31620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> scala> spark.sql("create temporary view t1 as select * from values (1, 2) as 
> t1(a, b)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("create temporary view t2 as select * from values (3, 4) as 
> t2(c, d)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("select sum(if(c > (select a from t1), d, 0)) as csum from 
> t2").show
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: sum#19L
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChild$2(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$4(TreeNode.scala:427)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:427)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> 

[jira] [Commented] (SPARK-31588) merge small files may need more common setting

2020-05-07 Thread philipse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101775#comment-17101775
 ] 

philipse commented on SPARK-31588:
--

For example:

Suppose the output is 3 files of 10 MB, 50 MB and 200 MB, and the block size is 128 MB. 
We want the file sizes to stay close to the average, but we should also keep each file 
at least as large as a block, in case someone sets a wrong parameter.

Case 1: we set the target size to 60 MB. The expected average file size is 
Max(blockSize, 60 MB), and the repartition number is the integer 
[total_file_size / average_file_size] + 1.

The final result will be 3 files, sized roughly 128 MB, 128 MB and 4 MB.

If we instead set the target size to 5120 MB, it will repartition to 1 file of about 
260 MB.

Thus, we can set the target size as a global parameter and it will benefit all tasks.
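
A tiny sketch of that arithmetic with the sizes above (my own illustration; the object 
and value names are made up, and no configuration name is proposed here):
{code:scala}
// Standalone sketch of the proposed sizing rule (illustration only).
object MergeSmallFilesSketch extends App {
  val mb = 1024L * 1024
  val blockSize  = 128 * mb               // assumed HDFS block size
  val targetSize = 60 * mb                // user-configured target size (case 1)
  val totalBytes = (10 + 50 + 200) * mb   // three output files: 10 MB, 50 MB, 200 MB

  val averageSize   = math.max(blockSize, targetSize)       // Max(blockSize, target)
  val numPartitions = (totalBytes / averageSize + 1).toInt  // [total / average] + 1
  println(numPartitions)                  // 3 -> files of roughly 128 MB, 128 MB and 4 MB
}
{code}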

> merge small files may need more common setting
> --
>
> Key: SPARK-31588
> URL: https://issues.apache.org/jira/browse/SPARK-31588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: spark:2.4.5
> hdp:2.7
>Reporter: philipse
>Priority: Major
>
> Hi,
> Spark SQL now allows us to use repartition or coalesce hints to manually control 
> small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use 
> COALESCE or REPARTITION. Can we try a more common way that avoids this decision 
> by setting a target size, as Hive does?
> *Good points:*
> 1) we will also know the new partition number
> 2) with an ON-OFF parameter provided, users can disable it if needed
> 3) the parameter can be set at cluster level instead of on the user side, which 
> makes it easier to control small files
> 4) it greatly reduces the pressure on the namenode
>  
> *Not good points:*
> 1) It will add a new task to calculate the target numbers by collecting 
> statistics on the output files.
>  
> I don't know whether we have planned this in the future.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26908) Fix toMillis

2020-05-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26908:

Fix Version/s: 2.4.6

> Fix toMillis
> 
>
> Key: SPARK-26908
> URL: https://issues.apache.org/jira/browse/SPARK-26908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.6, 3.0.0
>
>
> The toMillis() method of the DateTimeUtils object can produce inaccurate 
> result for some negative values. Minor differences can be around 1 ms. For 
> example:
> {code}
> input = -9223372036844776001L
> {code}
> should be converted to -9223372036844777L



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101744#comment-17101744
 ] 

Apache Spark commented on SPARK-31405:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28477

> fail by default when read/write datetime values and not sure if they need 
> rebase or not
> ---
>
> Key: SPARK-31405
> URL: https://issues.apache.org/jira/browse/SPARK-31405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101743#comment-17101743
 ] 

Apache Spark commented on SPARK-31405:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28477

> fail by default when read/write datetime values and not sure if they need 
> rebase or not
> ---
>
> Key: SPARK-31405
> URL: https://issues.apache.org/jira/browse/SPARK-31405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not

2020-05-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31405:


Assignee: Apache Spark

> fail by default when read/write datetime values and not sure if they need 
> rebase or not
> ---
>
> Key: SPARK-31405
> URL: https://issues.apache.org/jira/browse/SPARK-31405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not

2020-05-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31405:


Assignee: (was: Apache Spark)

> fail by default when read/write datetime values and not sure if they need 
> rebase or not
> ---
>
> Key: SPARK-31405
> URL: https://issues.apache.org/jira/browse/SPARK-31405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31337) Support MS Sql Kerberos login in JDBC connector

2020-05-07 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101705#comment-17101705
 ] 

Gabor Somogyi commented on SPARK-31337:
---

Started to work on this.

> Support MS Sql Kerberos login in JDBC connector
> ---
>
> Key: SPARK-31337
> URL: https://issues.apache.org/jira/browse/SPARK-31337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31485) Barrier stage can hang if only partial tasks launched

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101690#comment-17101690
 ] 

Apache Spark commented on SPARK-31485:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28476

> Barrier stage can hang if only partial tasks launched
> -
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 2.4.6
>
>
> The issue can be reproduced by following test:
>  
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), 
> Seq(Seq("executor_h_0"),Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes

2020-05-07 Thread Abhishek Dixit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101626#comment-17101626
 ] 

Abhishek Dixit edited comment on SPARK-31448 at 5/7/20, 12:34 PM:
--

Let me try to explain the problem more. 

Please look at this code in pyspark/dataframe.py: 
{code:java}
@since(1.3)
def cache(self):
"""Persists the :class:`DataFrame` with the default storage level 
(C{MEMORY_AND_DISK}).
.. note:: The default storage level has changed to C{MEMORY_AND_DISK} 
to match Scala in 2.0.
"""
self.is_cached = True
self._jdf.cache()
return self
{code}
The cache method of a pyspark dataframe directly calls scala's cache method, so the 
storage level used is the Scala default, i.e. StorageLevel(true, true, false, true) 
with deserialized equal to true. But since data from python is already serialized by 
the Pickle library, we should be using a storage level with deserialized = false for 
pyspark dataframes.

But if you look at the cache method in pyspark/rdd.py, it sets the storage level on 
the pyspark side and then calls the scala method with that parameter. Hence the 
correct storage level, with deserialized = false, is used in this case.
{code:java}
def cache(self):
"""
Persist this RDD with the default storage level (C{MEMORY_ONLY}).
"""
self.is_cached = True
self.persist(StorageLevel.MEMORY_ONLY)
return self{code}
We need to implement a similar approach in the cache method in dataframe.py to 
avoid using the scala default of deserialized = true.
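
For reference, a small JVM-side sketch contrasting the two levels discussed above 
(this is only an illustration and assumes spark-core on the classpath; the actual fix 
would live in dataframe.py, as described):
{code:scala}
// The fourth flag of a StorageLevel is "deserialized"; it is the one that differs here.
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch extends App {
  val scalaDefault = StorageLevel.MEMORY_AND_DISK      // StorageLevel(true, true, false, true)
  val serialized   = StorageLevel.MEMORY_AND_DISK_SER  // StorageLevel(true, true, false, false)
  println(scalaDefault.deserialized)  // true  -- objects cached as deserialized JVM objects
  println(serialized.deserialized)    // false -- cached as bytes, matching already-pickled data
}
{code}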

 

 


was (Author: abhishekd0907):
Let me try to explain the problem more. 

Please look at this code in pyspark/dataframe.py: 
{code:java}
def cache(self):
"""
Persist this RDD with the default storage level (C{MEMORY_ONLY}).
"""
self.is_cached = True
self.persist(StorageLevel.MEMORY_ONLY)
return self
{code}
Cache method in pyspark data frame directly calls scala's cache method. Hence 
Storage level used is based on Scala defaults i.e. StorageLevel(true, true, 
false, true)  with deserialized equal to true. But since, data from python is 
already serialized by the Pickle library, we should be using storage level with 
deserialized = false for pyspark dataframes.

But if you look at cache method in pyspark/rdd.py, it sets the storage level in 
pyspark only and then calls the scala method with parameter. Hence correct 
storage level is used in this case with deserialzied = false.
{code:java}
def cache(self):
"""
Persist this RDD with the default storage level (C{MEMORY_ONLY}).
"""
self.is_cached = True
self.persist(StorageLevel.MEMORY_ONLY)
return self{code}
 We need to implement a similar way in cache method in dataframe.py to avoid 
using the scala defaults of deserialized = true

 

 

> Difference in Storage Levels used in cache() and persist() for pyspark 
> dataframes
> -
>
> Key: SPARK-31448
> URL: https://issues.apache.org/jira/browse/SPARK-31448
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Abhishek Dixit
>Priority: Major
>
> There is a difference in default storage level *MEMORY_AND_DISK* in pyspark 
> and scala.
> *Scala*: StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>  
> *Problem Description:* 
> Calling *df.cache()*  for pyspark dataframe directly invokes Scala method 
> cache() and Storage Level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* for pyspark dataframe sets the 
> newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and 
> then invokes Scala function persist(newStorageLevel).
> *Possible Fix:*
> Invoke pyspark function persist inside pyspark function cache instead of 
> calling the scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and 
> the possible fix is the correct approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes

2020-05-07 Thread Abhishek Dixit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101626#comment-17101626
 ] 

Abhishek Dixit edited comment on SPARK-31448 at 5/7/20, 12:33 PM:
--

Let me try to explain the problem more. 

Please look at this code in pyspark/dataframe.py: 
{code:java}
def cache(self):
"""
Persist this RDD with the default storage level (C{MEMORY_ONLY}).
"""
self.is_cached = True
self.persist(StorageLevel.MEMORY_ONLY)
return self
{code}
Cache method in pyspark data frame directly calls scala's cache method. Hence 
Storage level used is based on Scala defaults i.e. StorageLevel(true, true, 
false, true)  with deserialized equal to true. But since, data from python is 
already serialized by the Pickle library, we should be using storage level with 
deserialized = false for pyspark dataframes.

But if you look at cache method in pyspark/rdd.py, it sets the storage level in 
pyspark only and then calls the scala method with parameter. Hence correct 
storage level is used in this case with deserialzied = false.
{code:java}
def cache(self):
"""
Persist this RDD with the default storage level (C{MEMORY_ONLY}).
"""
self.is_cached = True
self.persist(StorageLevel.MEMORY_ONLY)
return self{code}
 We need to implement a similar way in cache method in dataframe.py to avoid 
using the scala defaults of deserialized = true

 

 


was (Author: abhishekd0907):
Let me try to explain the problem more. 

Please look at this code in pyspark/dataframe.py: 
{code:java}
@since(1.3)
def cache(self):
    """Persists the :class:`DataFrame` with the default storage level
    (C{MEMORY_AND_DISK}).

    .. note:: The default storage level has changed to C{MEMORY_AND_DISK}
        to match Scala in 2.0.
    """
    self.is_cached = True
    self._jdf.cache()
    return self
{code}
Cache method in pyspark data frame directly calls scala's cache method. Hence 
Storage level used is based on Scala defaults i.e. StorageLevel(true, true, 
false, true)  with deserialized equal to true. But since, data from python is 
already serialized by the Pickle library, we should be using storage level with 
deserialized = false for pyspark dataframes.

But if you look at cache method in pyspark/rdd.py, it sets the storage level in 
pyspark only and then calls the scala method with parameter. Hence correct 
storage level is used in this case with deserialzied = false.
{code:java}
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY)
    return self
{code}
 We need to implement a similar way in cache method in dataframe.py to avoid 
using the scala defaults of deserialized = true

 

 

> Difference in Storage Levels used in cache() and persist() for pyspark 
> dataframes
> -
>
> Key: SPARK-31448
> URL: https://issues.apache.org/jira/browse/SPARK-31448
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Abhishek Dixit
>Priority: Major
>
> There is a difference in default storage level *MEMORY_AND_DISK* in pyspark 
> and scala.
> *Scala*: StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>  
> *Problem Description:* 
> Calling *df.cache()*  for pyspark dataframe directly invokes Scala method 
> cache() and Storage Level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* for pyspark dataframe sets the 
> newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and 
> then invokes Scala function persist(newStorageLevel).
> *Possible Fix:*
> Invoke pyspark function persist inside pyspark function cache instead of 
> calling the scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and 
> the possible fix is the correct approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes

2020-05-07 Thread Abhishek Dixit (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit reopened SPARK-31448:


Let me try to explain the problem more. 

Please look at this code in pyspark/dataframe.py: 
{code:java}
@since(1.3)
def cache(self):
    """Persists the :class:`DataFrame` with the default storage level
    (C{MEMORY_AND_DISK}).

    .. note:: The default storage level has changed to C{MEMORY_AND_DISK}
        to match Scala in 2.0.
    """
    self.is_cached = True
    self._jdf.cache()
    return self
{code}
Cache method in pyspark data frame directly calls scala's cache method. Hence 
Storage level used is based on Scala defaults i.e. StorageLevel(true, true, 
false, true)  with deserialized equal to true. But since, data from python is 
already serialized by the Pickle library, we should be using storage level with 
deserialized = false for pyspark dataframes.

But if you look at cache method in pyspark/rdd.py, it sets the storage level in 
pyspark only and then calls the scala method with parameter. Hence correct 
storage level is used in this case with deserialzied = false.
{code:java}
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY)
    return self
{code}
 We need to implement a similar way in cache method in dataframe.py to avoid 
using the scala defaults of deserialized = true

 

 

> Difference in Storage Levels used in cache() and persist() for pyspark 
> dataframes
> -
>
> Key: SPARK-31448
> URL: https://issues.apache.org/jira/browse/SPARK-31448
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Abhishek Dixit
>Priority: Major
>
> There is a difference in default storage level *MEMORY_AND_DISK* in pyspark 
> and scala.
> *Scala*: StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>  
> *Problem Description:* 
> Calling *df.cache()*  for pyspark dataframe directly invokes Scala method 
> cache() and Storage Level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* for pyspark dataframe sets the 
> newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and 
> then invokes Scala function persist(newStorageLevel).
> *Possible Fix:*
> Invoke pyspark function persist inside pyspark function cache instead of 
> calling the scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and 
> the possible fix is the correct approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31631) Fix test flakiness caused by MiniKdc which throws "address in use" BindException

2020-05-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31631:
-
Fix Version/s: (was: 3.1.0)
   3.0.0

> Fix test flakiness caused by MiniKdc which throws "address in use" 
> BindException
> 
>
> Key: SPARK-31631
> URL: https://issues.apache.org/jira/browse/SPARK-31631
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> [info] org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED 
> *** (15 seconds, 426 milliseconds)
> [info]   java.net.BindException: Address already in use
> [info]   at sun.nio.ch.Net.bind0(Native Method)
> [info]   at sun.nio.ch.Net.bind(Net.java:433)
> [info]   at sun.nio.ch.Net.bind(Net.java:425)
> [info]   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> [info]   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> [info]   at 
> org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198)
> [info]   at 
> org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51)
> [info]   at 
> org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547)
> [info]   at 
> org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68)
> [info]   at 
> org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422)
> [info]   at 
> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]   at java.lang.Thread.run(Thread.java:748)
> {code}
> This is an issue fixed in hadoop 2.8.0
> https://issues.apache.org/jira/browse/HADOOP-12656
> We may need to apply the approach from HBase first before we drop Hadoop 2.7.x
> https://issues.apache.org/jira/browse/HBASE-14734



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26908) Fix toMillis

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101581#comment-17101581
 ] 

Apache Spark commented on SPARK-26908:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28475

> Fix toMillis
> 
>
> Key: SPARK-26908
> URL: https://issues.apache.org/jira/browse/SPARK-26908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The toMillis() method of the DateTimeUtils object can produce inaccurate 
> result for some negative values. Minor differences can be around 1 ms. For 
> example:
> {code}
> input = -9223372036844776001L
> {code}
> should be converted to -9223372036844777L



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26908) Fix toMillis

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101580#comment-17101580
 ] 

Apache Spark commented on SPARK-26908:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28475

> Fix toMillis
> 
>
> Key: SPARK-26908
> URL: https://issues.apache.org/jira/browse/SPARK-26908
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The toMillis() method of the DateTimeUtils object can produce inaccurate 
> result for some negative values. Minor differences can be around 1 ms. For 
> example:
> {code}
> input = -9223372036844776001L
> {code}
> should be converted to -9223372036844777L



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31658:


Assignee: (was: Apache Spark)

> SQL UI doesn't show write commands of AQE plan
> --
>
> Key: SPARK-31658
> URL: https://issues.apache.org/jira/browse/SPARK-31658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31658:


Assignee: Apache Spark

> SQL UI doesn't show write commands of AQE plan
> --
>
> Key: SPARK-31658
> URL: https://issues.apache.org/jira/browse/SPARK-31658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101568#comment-17101568
 ] 

Apache Spark commented on SPARK-31658:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/28474

> SQL UI doesn't show write commands of AQE plan
> --
>
> Key: SPARK-31658
> URL: https://issues.apache.org/jira/browse/SPARK-31658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-07 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31658:
--

 Summary: SQL UI doesn't show write commands of AQE plan
 Key: SPARK-31658
 URL: https://issues.apache.org/jira/browse/SPARK-31658
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.0.0
Reporter: Manu Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29250) Upgrade to Hadoop 3.2.1

2020-05-07 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101553#comment-17101553
 ] 

Steve Loughran commented on SPARK-29250:


I feel your pain

> Upgrade to Hadoop 3.2.1
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31657) CSV Writer writes no header for empty DataFrames

2020-05-07 Thread Furcy Pin (Jira)
Furcy Pin created SPARK-31657:
-

 Summary: CSV Writer writes no header for empty DataFrames
 Key: SPARK-31657
 URL: https://issues.apache.org/jira/browse/SPARK-31657
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.1
 Environment: Local pyspark 2.4.1
Reporter: Furcy Pin


When writing a DataFrame as csv with the Header option set to true,
the header is not written when the DataFrame is empty.

This creates failures for processes that read the csv back.

Example (please notice the limit(0) in the second example):
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/
Using Python version 2.7.17 (default, Nov 7 2019 10:07:09)
SparkSession available as 'spark'.
>>> df1 = spark.sql("SELECT 1 as a")
>>> df1.limit(1).write.mode("OVERWRITE").option("Header", True).csv("data/test/csv")
>>> spark.read.option("Header", True).csv("data/test/csv").show()
+---+
| a|
+---+
| 1|
+---+
>>> 
>>> df1.limit(0).write.mode("OVERWRITE").option("Header", True).csv("data/test/csv")
>>> spark.read.option("Header", True).csv("data/test/csv").show()
++
||
++
++
{code}
 


Expected behavior:
{code:java}
>>> df1.limit(0).write.mode("OVERWRITE").option("Header", True).csv("data/test/csv")
>>> spark.read.option("Header", True).csv("data/test/csv").show()
+---+
| a|
+---+
+---+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27990) Recursive data loading from file sources

2020-05-07 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27990:

Summary: Recursive data loading from file sources  (was: Provide a way to 
recursively load data from datasource)

> Recursive data loading from file sources
> 
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org