[jira] [Created] (SPARK-30200) Add ExplainMode for Dataset.explain

2019-12-09 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-30200:


 Summary: Add ExplainMode for Dataset.explain
 Key: SPARK-30200
 URL: https://issues.apache.org/jira/browse/SPARK-30200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


This ticket proposes adding an ExplainMode for explaining a Dataset/DataFrame with a 
given format mode. ExplainMode has five types, matching the SQL EXPLAIN command: 
Simple, Extended, Codegen, Cost, and Formatted.
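
A minimal sketch of how the proposed API might look from user code (assuming explain 
accepts a lower-cased mode string mirroring the SQL EXPLAIN variants; illustrative only):

{code:scala}
val df = spark.range(10).groupBy("id").count()

df.explain()            // simple: physical plan only
df.explain("extended")  // parsed, analyzed, optimized and physical plans
df.explain("codegen")   // generated code, where available
df.explain("cost")      // plan statistics, where available
df.explain("formatted") // operator overview plus a details section
{code}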



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint

2019-12-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30199:
--
Issue Type: Improvement  (was: Bug)

> Recover spark.ui.port and spark.blockManager.port from checkpoint
> -
>
> Key: SPARK-30199
> URL: https://issues.apache.org/jira/browse/SPARK-30199
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint

2019-12-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30199:
-

 Summary: Recover spark.ui.port and spark.blockManager.port from 
checkpoint
 Key: SPARK-30199
 URL: https://issues.apache.org/jira/browse/SPARK-30199
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.4.4, 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected

2019-12-09 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-30198:
---

 Summary: BytesToBytesMap does not grow internal long array as 
expected
 Key: SPARK-30198
 URL: https://issues.apache.org/jira/browse/SPARK-30198
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


One Spark job on our cluster hangs forever in BytesToBytesMap.safeLookup. On 
inspection, the long array size is 536870912.

Currently, BytesToBytesMap.append only grows the internal array if the size of the 
array is less than MAX_CAPACITY (536870912). In the case above the array can 
therefore never grow, and safeLookup can never find an empty slot.

This check is wrong because we use two array entries per key, so the array size is 
twice the capacity. We should compare the map's current capacity, not the array's 
size, against MAX_CAPACITY.
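
An illustrative sketch of the difference (not the actual Spark code; names and 
signatures are simplified for the example):

{code:scala}
// The long array holds two entries (address + hash) per key, so the growth
// check should look at the key capacity, not the raw array length.
val MAX_CAPACITY = 1 << 29  // 536870912, the maximum number of keys

def canGrowBuggy(longArraySize: Int): Boolean =
  longArraySize < MAX_CAPACITY      // false once the array itself reaches
                                    // 536870912, i.e. at half the intended
                                    // key capacity, so the map can never grow

def canGrowFixed(longArraySize: Int): Boolean =
  longArraySize / 2 < MAX_CAPACITY  // compare the current key capacity instead
{code}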







--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions

2019-12-09 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992185#comment-16992185
 ] 

Ankit Raj Boudh commented on SPARK-30130:
-

Hi Matt Boegner, can you please help me to reproduce this issue?

> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions
> 
>
> Key: SPARK-30130
> URL: https://issues.apache.org/jira/browse/SPARK-30130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Matt Boegner
>Priority: Minor
>
> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions. 
> {code:java}
> val df = spark.sql("""
>  with a as (select 0 as test, count group by test)
>  select * from a
>  """)
>  df.show(){code}
>  This results in an error message like {color:#e01e5a}GROUP BY position 0 is 
> not in select list (valid range is [1, 2]){color} .
>  
> However, this error does not appear in a traditional subselect format. For 
> example, this query executes correctly:
> {code:java}
> val df = spark.sql("""
>  select * from (select 0 as test, count group by test) a
>  """)
>  df.show(){code}
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions

2019-12-09 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992185#comment-16992185
 ] 

Ankit Raj Boudh edited comment on SPARK-30130 at 12/10/19 4:41 AM:
---

Hi Matt Boegner, can you please help me to reproduce this issue?


was (Author: ankitraj):
hi Matt boegner, can you please me to reproduce this issue

> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions
> 
>
> Key: SPARK-30130
> URL: https://issues.apache.org/jira/browse/SPARK-30130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Matt Boegner
>Priority: Minor
>
> Hardcoded numeric values in common table expressions which utilize GROUP BY 
> are interpreted as ordinal positions. 
> {code:java}
> val df = spark.sql("""
>  with a as (select 0 as test, count group by test)
>  select * from a
>  """)
>  df.show(){code}
>  This results in an error message like {color:#e01e5a}GROUP BY position 0 is 
> not in select list (valid range is [1, 2]){color} .
>  
> However, this error does not appear in a traditional subselect format. For 
> example, this query executes correctly:
> {code:java}
> val df = spark.sql("""
>  select * from (select 0 as test, count group by test) a
>  """)
>  df.show(){code}
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30196) Bump lz4-java version to 1.7.0

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30196:


Assignee: Takeshi Yamamuro

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30196) Bump lz4-java version to 1.7.0

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30196.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26823
[https://github.com/apache/spark/pull/26823]

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30193.
--
Resolution: Not A Problem

> Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 
> 'scala.collection.Seq' 
> -
>
> Key: SPARK-30193
> URL: https://issues.apache.org/jira/browse/SPARK-30193
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
>  
> [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]
> JavaDataFrameSuite.java
>  JavaTokenizerExample.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30179) Improve test in SingleSessionSuite

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30179.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26812
[https://github.com/apache/spark/pull/26812]

> Improve test in SingleSessionSuite
> --
>
> Key: SPARK-30179
> URL: https://issues.apache.org/jira/browse/SPARK-30179
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> https://github.com/apache/spark/blob/58be82ad4b98fc17e821e916e69e77a6aa36209d/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala#L782-L824
> We should also verify the UDF works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30179) Improve test in SingleSessionSuite

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30179:


Assignee: Yuming Wang

> Improve test in SingleSessionSuite
> --
>
> Key: SPARK-30179
> URL: https://issues.apache.org/jira/browse/SPARK-30179
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://github.com/apache/spark/blob/58be82ad4b98fc17e821e916e69e77a6aa36209d/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala#L782-L824
> We should also verify the UDF works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30197) Add `requirements.txt` file to `python` directory

2019-12-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30197:
-

 Summary: Add `requirements.txt` file to `python` directory
 Key: SPARK-30197
 URL: https://issues.apache.org/jira/browse/SPARK-30197
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30196) Bump lz4-java version to 1.7.0

2019-12-09 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-30196:


 Summary: Bump lz4-java version to 1.7.0
 Key: SPARK-30196
 URL: https://issues.apache.org/jira/browse/SPARK-30196
 Project: Spark
  Issue Type: Improvement
  Components: Build, Spark Core
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30195) Some imports, function need more explicit resolution in 2.13

2019-12-09 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-30195:


 Summary: Some imports, function need more explicit resolution in 
2.13
 Key: SPARK-30195
 URL: https://issues.apache.org/jira/browse/SPARK-30195
 Project: Spark
  Issue Type: Sub-task
  Components: ML, Spark Core, SQL, Structured Streaming
Affects Versions: 3.0.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


This is a grouping of related but not identical issues in the 2.13 migration, 
where the compiler is pickier about explicit types and imports.

Some are fairly self-evident, like wanting an explicit generic type. In a few 
cases it looks like the import resolution rules have tightened up a bit and 
imports have to be explicit.

A few more cause problems like:

{code}
[ERROR] [Error] 
/Users/seanowen/Documents/spark_2.13/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala:220:
 missing parameter type for expanded function
The argument types of an anonymous function must be fully known. (SLS 8.5)
Expected type was: ?
{code}

In some cases it's just a matter of adding an explicit type, like {{.map { m: 
Matrix =>}}.

Many seem to concern functions of tuples, or tuples of tuples.

{{.mapGroups { case (g, iter) =>}} needs to be simply {{.mapGroups { (g, iter) 
=>}}

Or more annoyingly:

{code}
}.reduceByKey { case ((wc1, df1), (wc2, df2)) =>
  (wc1 + wc2, df1 + df2)
}
{code}

Apparently the argument types can only be fully known when the tuples aren't nested. This _won't_ work:

{code}
}.reduceByKey { case ((wc1: Long, df1: Int), (wc2: Long, df2: Int)) =>
  (wc1 + wc2, df1 + df2)
}
{code}

This does:

{code}
}.reduceByKey { (wcdf1, wcdf2) =>
  (wcdf1._1 + wcdf2._1, wcdf1._2 + wcdf2._2)
}
{code}

I'm not super clear why most of the problems seem to affect reduceByKey.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30158) Resolve Array + reference type compile problems in 2.13, with sc.parallelize

2019-12-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30158.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26787
[https://github.com/apache/spark/pull/26787]

> Resolve Array + reference type compile problems in 2.13, with sc.parallelize
> 
>
> Key: SPARK-30158
> URL: https://issues.apache.org/jira/browse/SPARK-30158
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Scala 2.13 has some different rules about resolving Arrays as Seqs when the 
> array is of a reference type. This primarily affects calls to 
> {{sc.parallelize(Array(...))}} where elements aren't primitive:
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/mllib/src/main/scala/org/apache/spark/mllib/pmml/PMMLExportable.scala:61:
>  overloaded method value apply with alternatives:
>   (x: Unit,xs: Unit*)Array[Unit] 
>   (x: Double,xs: Double*)Array[Double] 
>   ...
> {code}
> This is easy to resolve by using Seq instead.
> Closely related: WrappedArray is a type def in 2.13, which makes it unusable 
> in Java. One set of tests needs to adapt. 
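
A minimal sketch of the workaround described above (an illustrative snippet, not code 
from the Spark source tree; the case class is made up for the example):

{code:scala}
case class Point(x: Double, y: Double)

// May fail to compile under Scala 2.13 because of the changed overload resolution:
// val rdd = sc.parallelize(Array(Point(1, 2), Point(3, 4)))

// Compiles under both 2.12 and 2.13:
val rdd = sc.parallelize(Seq(Point(1, 2), Point(3, 4)))
{code}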



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30146) add setWeightCol to GBTs in PySpark

2019-12-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30146:


Assignee: Huaxin Gao

> add setWeightCol to GBTs in PySpark
> ---
>
> Key: SPARK-30146
> URL: https://issues.apache.org/jira/browse/SPARK-30146
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> add setWeightCol and setMinWeightFractionPerNode in Python side of 
> GBTClassifier and GBTRegressor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30146) add setWeightCol to GBTs in PySpark

2019-12-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30146.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26774
[https://github.com/apache/spark/pull/26774]

> add setWeightCol to GBTs in PySpark
> ---
>
> Key: SPARK-30146
> URL: https://issues.apache.org/jira/browse/SPARK-30146
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> add setWeightCol and setMinWeightFractionPerNode in Python side of 
> GBTClassifier and GBTRegressor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30180) listJars() and listFiles() function display issue when file path contains spaces

2019-12-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30180.
--
Resolution: Not A Problem

> listJars() and listFiles() function display issue when file path contains 
> spaces
> 
>
> Key: SPARK-30180
> URL: https://issues.apache.org/jira/browse/SPARK-30180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
>  
> {{scala> sc.listJars()
> res2: Seq[String] = Vector(spark://11.242.181.153:50811/jars/c6%20test.jar)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'

2019-12-09 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-30193:

Description: 
 

[https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]

JavaDataFrameSuite.java

 JavaTokenizerExample.java

  was:
 

[https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]

 

JavaDataFrameSuite

JavaDataFrameSuite

 


> Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 
> 'scala.collection.Seq' 
> -
>
> Key: SPARK-30193
> URL: https://issues.apache.org/jira/browse/SPARK-30193
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
>  
> [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]
> JavaDataFrameSuite.java
>  JavaTokenizerExample.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'

2019-12-09 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-30193:

Description: 
 

[https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]

 

JavaDataFrameSuite

JavaDataFrameSuite

 

  was:
unused import file.

 

[https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]


> Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 
> 'scala.collection.Seq' 
> -
>
> Key: SPARK-30193
> URL: https://issues.apache.org/jira/browse/SPARK-30193
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
>  
> [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]
>  
> JavaDataFrameSuite
> JavaDataFrameSuite
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'

2019-12-09 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-30193:

Summary: Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', 
required: 'scala.collection.Seq'   (was: remove 
unused import file)

> Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 
> 'scala.collection.Seq' 
> -
>
> Key: SPARK-30193
> URL: https://issues.apache.org/jira/browse/SPARK-30193
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
> unused import file.
>  
> [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30108) Add robust accumulator for observable metrics

2019-12-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell updated SPARK-30108:
--
Description: 
Spark accumulators reflect the work that has been done, not the data that has been 
processed. There are situations where one tuple can be processed multiple times, 
e.g. task/stage retries, speculation, range determination for a global ordering, 
etc. For observed metrics we need the value of the accumulator to be based on the 
data and not on the processing.

The current aggregating accumulator is already robust to some of these issues 
(like task failure), but we need to add some additional checks to make sure it 
is foolproof.
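
For context, a minimal sketch of the observable-metrics API that this accumulator 
backs (assuming the Dataset.observe API in Spark 3.0; the metric names here are 
illustrative):

{code:scala}
import org.apache.spark.sql.functions._

// Register named metrics on a Dataset; they are collected as the query runs.
val observed = spark.range(100)
  .observe("my_metrics", count(lit(1)).as("rows"), sum(col("id")).as("id_sum"))

observed.collect()
// The reported metric values should reflect the data exactly once, even when
// tasks are retried or executed speculatively.
{code}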

> Add robust accumulator for observable metrics
> -
>
> Key: SPARK-30108
> URL: https://issues.apache.org/jira/browse/SPARK-30108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Spark accumulators reflect the work that has been done, not the data that has 
> been processed. There are situations where one tuple can be processed multiple 
> times, e.g. task/stage retries, speculation, range determination for a global 
> ordering, etc. For observed metrics we need the value of the accumulator to be 
> based on the data and not on the processing.
> The current aggregating accumulator is already robust to some of these issues 
> (like task failure), but we need to add some additional checks to make sure 
> it is foolproof.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30194) Re-enable checkstyle for Java

2019-12-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-30194:
-

 Summary: Re-enable checkstyle for Java
 Key: SPARK-30194
 URL: https://issues.apache.org/jira/browse/SPARK-30194
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30193) remove unused import file

2019-12-09 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991756#comment-16991756
 ] 

Ankit Raj Boudh commented on SPARK-30193:
-

I am working on this JIRA.

> remove unused import file
> -
>
> Key: SPARK-30193
> URL: https://issues.apache.org/jira/browse/SPARK-30193
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
> unused import file.
>  
> [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30193) remove unused import file

2019-12-09 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-30193:

Description: 
unused import file.

 

[https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]

  was:unused import file.


> remove unused import file
> -
>
> Key: SPARK-30193
> URL: https://issues.apache.org/jira/browse/SPARK-30193
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
> unused import file.
>  
> [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30193) remove unused import file

2019-12-09 Thread Ankit Raj Boudh (Jira)
Ankit Raj Boudh created SPARK-30193:
---

 Summary: remove unused import file
 Key: SPARK-30193
 URL: https://issues.apache.org/jira/browse/SPARK-30193
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.4
Reporter: Ankit Raj Boudh


unused import file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2019-12-09 Thread Andrew White (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991752#comment-16991752
 ] 

Andrew White commented on SPARK-22876:
--

This issue is most certainly unresolved, and the documentation is misleading at 
best.

[~choojoyq] - perhaps you could share your code for how you deal with this in 
general? A general solution seems non-obvious to me given all of the intricacies 
of Spark + YARN interactions, such as shutdown hooks, signal handling, and client 
vs. cluster mode. Maybe I'm overthinking the solution.

This feature is critical for supporting long-running applications.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found that in ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there's a shutdown hook that checks the attempt id against maxAppAttempts; if 
> attempt id >= maxAppAttempts, it will try to unregister the application and 
> the application will finish.
> Is this an expected design or a bug?
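
A hypothetical SparkConf equivalent of the setup listed in the quoted report (the 
values are the reporter's; the ResourceManager-side yarn.resourcemanager.am.max-attempts 
still caps the effective limit):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.maxAppAttempts", "20")
  .set("spark.yarn.am.attemptFailuresValidityInterval", "1s")
{code}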



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27189) Add Executor metrics and memory usage instrumentation to the metrics system

2019-12-09 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-27189.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/24132

> Add Executor metrics and memory usage instrumentation to the metrics system
> ---
>
> Key: SPARK-27189
> URL: https://issues.apache.org/jira/browse/SPARK-27189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: Example_dashboard_Spark_Memory_Metrics.PNG
>
>
> This proposes to add instrumentation of memory usage via the Spark 
> Dropwizard/Codahale metrics system. Memory usage metrics are available via 
> the Executor metrics, recently implemented as detailed in 
> https://issues.apache.org/jira/browse/SPARK-23206. 
> Making memory usage metrics available via the Spark Dropwizard metrics system 
> allows improving Spark performance dashboards and studying memory usage, as in 
> the attached example graph.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27189) Add Executor metrics and memory usage instrumentation to the metrics system

2019-12-09 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-27189:


Assignee: Luca Canali

> Add Executor metrics and memory usage instrumentation to the metrics system
> ---
>
> Key: SPARK-27189
> URL: https://issues.apache.org/jira/browse/SPARK-27189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Attachments: Example_dashboard_Spark_Memory_Metrics.PNG
>
>
> This proposes to add instrumentation of memory usage via the Spark 
> Dropwizard/Codahale metrics system. Memory usage metrics are available via 
> the Executor metrics, recently implemented as detailed in 
> https://issues.apache.org/jira/browse/SPARK-23206. 
> Making memory usage metrics available via the Spark Dropwizard metrics system 
> allows improving Spark performance dashboards and studying memory usage, as in 
> the attached example graph.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30192) support column position in DS v2

2019-12-09 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30192:
---

 Summary: support column position in DS v2
 Key: SPARK-30192
 URL: https://issues.apache.org/jira/browse/SPARK-30192
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30191) AM should update pending resource request faster when driver lost executor

2019-12-09 Thread Max Xie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max  Xie updated SPARK-30191:
-
Description: 
I run Spark on YARN. I found that when the driver loses its executors because of a 
machine hardware problem, and every service on that node (including the NodeManager 
and the executors) has been killed, the ResourceManager can't update the container 
info for that node until it tries to remove the node. That always takes 10 minutes 
or longer, and in the meantime the AM doesn't add new resource requests, so the 
driver keeps missing the executors.

So maybe the AM should take the factor `numExecutorsExiting` into account in 
YarnAllocator's method `updateResourceRequests` to optimize this.

 

 

 

  was:
I run spark on yarn.  I found that when driver lost its executors because of 
machine hardware problem and all of service includes nodemanager, executor on 
the node has killed,  it means that Resourcemanager can't update the containers 
info on the node until Resourcemanager try to remove the node,   but it always 
takes 10 mins or longger, and in the meantime, AM don't add the new resource 
request and driver missing the executors.

So maybe AM should add the factor `numExecutorsExiting` in YarnAllocator's 
method `

updateResourceRequests`  to optimize it.

 

 

 


> AM should update pending resource request faster when driver lost executor
> --
>
> Key: SPARK-30191
> URL: https://issues.apache.org/jira/browse/SPARK-30191
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.4
>Reporter: Max  Xie
>Priority: Minor
>
> I run Spark on YARN. I found that when the driver loses its executors because of 
> a machine hardware problem, and every service on that node (including the 
> NodeManager and the executors) has been killed, the ResourceManager can't update 
> the container info for that node until it tries to remove the node. That always 
> takes 10 minutes or longer, and in the meantime the AM doesn't add new resource 
> requests, so the driver keeps missing the executors.
> So maybe the AM should take the factor `numExecutorsExiting` into account in 
> YarnAllocator's method `updateResourceRequests` to optimize this.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30191) AM should update pending resource request faster when driver lost executor

2019-12-09 Thread Max Xie (Jira)
Max  Xie created SPARK-30191:


 Summary: AM should update pending resource request faster when 
driver lost executor
 Key: SPARK-30191
 URL: https://issues.apache.org/jira/browse/SPARK-30191
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.4.4
Reporter: Max  Xie


I run Spark on YARN. I found that when the driver loses its executors because of a 
machine hardware problem, and every service on that node (including the NodeManager 
and the executors) has been killed, the ResourceManager can't update the container 
info for that node until it tries to remove the node. That always takes 10 minutes 
or longer, and in the meantime the AM doesn't add new resource requests, so the 
driver keeps missing the executors.

So maybe the AM should take the factor `numExecutorsExiting` into account in 
YarnAllocator's method `updateResourceRequests` to optimize this.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30190) HistoryServerDiskManager will fail on appStoreDir in s3

2019-12-09 Thread thierry accart (Jira)
thierry accart created SPARK-30190:
--

 Summary: HistoryServerDiskManager will fail on appStoreDir in s3
 Key: SPARK-30190
 URL: https://issues.apache.org/jira/browse/SPARK-30190
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: thierry accart


Hi

While setting spark.eventLog.dir to s3a://... I realized that it *requires the 
destination directory to pre-exist on S3*.

I think this is explained by HistoryServerDiskManager's appStoreDir: it checks 
whether the directory exists or can be created:

{{if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) { throw new 
IllegalArgumentException(s"Failed to create app directory ($appStoreDir).") }}}

But on S3 such a directory does not exist and cannot be created: directories don't 
exist by themselves, they are only materialized by the existence of objects.

Before proposing a patch, I wanted to know what the preferred options are: should 
we have a Spark option to skip the appStoreDir test, skip it only when a particular 
scheme is set, have a custom implementation of HistoryServerDiskManager, ...?

_Note for people facing the {{IllegalArgumentException: Failed to create app 
directory}}: *you just have to put an empty file at the bucket destination 
'path'*._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30159) Fix the method calls of QueryTest.checkAnswer

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30159.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26788
[https://github.com/apache/spark/pull/26788]

> Fix the method calls of QueryTest.checkAnswer
> -
>
> Key: SPARK-30159
> URL: https://issues.apache.org/jira/browse/SPARK-30159
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the method `checkAnswer` in Object `QueryTest` returns an optional 
> string. It doesn't throw exceptions when errors happen.
> The actual exceptions are thrown in the trait `QueryTest`.
> However, there are some test suites that use the no-op method 
> `QueryTest.checkAnswer` and expect it to fail test cases when the input 
> Dataframe doesn't match the expected answer. E.g. StreamSuite, 
> SessionStateSuite, BinaryFileFormatSuite, etc.
> We should fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30162) Filter is not being pushed down for Parquet files

2019-12-09 Thread Nasir Ali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991573#comment-16991573
 ] 

Nasir Ali commented on SPARK-30162:
---

[~aman_omer] added in my question

> Filter is not being pushed down for Parquet files
> -
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Priority: Major
>
> Filters are not pushed down in the Spark 3.0 preview. Also, the output of the 
> "explain" method is different, so it is hard to tell in 3.0 whether filters 
> were pushed down or not. The code below reproduces the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on a much larger dataset. Spark 3.0 tries to load the whole 
> data and then apply the filter, whereas Spark 2.4 pushes down the filter. The 
> output above shows that Spark 2.4 applied the partition filter but the Spark 
> 3.0 preview did not.
>  
> Minor: in Spark 3.0 the "explain()" output is truncated (maybe a fixed 
> length?), which makes it hard to debug.  
> spark.sql.orc.cache.stripe.details.size=1 doesn't work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be 
> overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
> mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.0.0-preview
>   /_/Using 

[jira] [Updated] (SPARK-30162) Filter is not being pushed down for Parquet files

2019-12-09 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-30162:
--
Description: 
Filters are not pushed down in the Spark 3.0 preview. Also, the output of the 
"explain" method is different, so it is hard to tell in 3.0 whether filters were 
pushed down or not. The code below reproduces the bug:

 
{code:java}
// code placeholder
df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
("usr1",13.00, "2018-03-11T12:27:18+00:00"),
("usr1",25.00, "2018-03-12T11:27:18+00:00"),
("usr1",20.00, "2018-03-13T15:27:18+00:00"),
("usr1",17.00, "2018-03-14T12:27:18+00:00"),
("usr2",99.00, "2018-03-15T11:27:18+00:00"),
("usr2",156.00, "2018-03-22T11:27:18+00:00"),
("usr2",17.00, "2018-03-31T11:27:18+00:00"),
("usr2",25.00, "2018-03-15T11:27:18+00:00"),
("usr2",25.00, "2018-03-16T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
{code}
{code:java}
// Spark 2.4 output
== Parsed Logical Plan ==
'Filter ('user = usr2)
+- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
id: double, ts: timestamp, user: string
Filter (user#40 = usr2)
+- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
Filter (isnotnull(user#40) && (user#40 = usr2))
+- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
*(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
ReadSchema: struct{code}
{code:java}
// Spark 3.0.0-preview output
== Parsed Logical Plan ==
'Filter ('user = usr2)
+- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
Logical Plan ==
id: double, ts: timestamp, user: string
Filter (user#2 = usr2)
+- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
Logical Plan ==
Filter (isnotnull(user#2) AND (user#2 = usr2))
+- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical Plan 
==
*(1) Project [id#0, ts#1, user#2]
+- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
   +- *(1) ColumnarToRow
  +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
struct
{code}
I have tested it on a much larger dataset. Spark 3.0 tries to load the whole data 
and then apply the filter, whereas Spark 2.4 pushes down the filter. The output 
above shows that Spark 2.4 applied the partition filter but the Spark 3.0 preview 
did not.

Minor: in Spark 3.0 the "explain()" output is truncated (maybe a fixed length?), 
which makes it hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
work.

 
{code:java}
// pyspark 3 shell output
$ pyspark
Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Warning: Ignoring non-spark config property: 
java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be overridden 
by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-preview
  /_/Using Python version 3.6.8 (default, Aug  7 2019 17:28:10)
SparkSession available as 'spark'.
{code}
{code:java}
// pyspark 2.4.4 shell output
pyspark
Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
2019-12-09 07:09:07 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / 

[jira] [Resolved] (SPARK-30145) sparkContext.addJar fails when file path contains spaces

2019-12-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30145.
--
Resolution: Duplicate

> sparkContext.addJar fails when file path contains spaces
> 
>
> Key: SPARK-30145
> URL: https://issues.apache.org/jira/browse/SPARK-30145
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30072) Create dedicated planner for subqueries

2019-12-09 Thread Xiaoju Wu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991554#comment-16991554
 ] 

Xiaoju Wu commented on SPARK-30072:
---

[~afroozeh] I think the change from checking whether the queryExecution is the same 
to checking an isSubquery boolean is not correct in some scenarios. If a subquery is 
executed under SubqueryExec.executionContext, there's no way to pass the information 
telling the InsertAdaptiveSparkPlan rule that it is a subquery. 

> Create dedicated planner for subqueries
> ---
>
> Key: SPARK-30072
> URL: https://issues.apache.org/jira/browse/SPARK-30072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> This PR changes subquery planning by calling the planner and plan preparation 
> rules on the subquery plan directly. Before we were creating a QueryExecution 
> instance for subqueries to get the executedPlan. This would re-run analysis 
> and optimization on the subqueries plan. Running the analysis again on an 
> optimized query plan can have unwanted consequences, as some rules, for 
> example DecimalPrecision, are not idempotent.
> As an example, consider the expression 1.7 * avg(a) which after applying the 
> DecimalPrecision rule becomes:
> promote_precision(1.7) * promote_precision(avg(a))
> After the optimization, more specifically the constant folding rule, this 
> expression becomes:
> 1.7 * promote_precision(avg(a))
> Now if we run the analyzer on this optimized query again, we will get:
> promote_precision(1.7) * promote_precision(promote_precision(avg(a)))
> Which will later be optimized as:
> 1.7 * promote_precision(promote_precision(avg(a)))
> As can be seen, re-running the analysis and optimization on this expression 
> results in an expression with extra nested promote_precision nodes. Adding 
> unneeded nodes to the plan is problematic because it can eliminate situations 
> where we can reuse the plan.
> We opted to introduce dedicated planners for subqueries, instead of making 
> the DecimalPrecision rule idempotent, because this eliminates this entire 
> category of problems. Another benefit is that planning time for subqueries is 
> reduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30189) Interval from year-month/date-time string handling whitespaces

2019-12-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-30189:


 Summary: Interval from year-month/date-time string handling 
whitespaces
 Key: SPARK-30189
 URL: https://issues.apache.org/jira/browse/SPARK-30189
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


# for pg feature parity
 # for consistency with other types and other interval parsers

 
{code:sql}
postgres=# select interval E'2-2\t' year to month;
interval

 2 years 2 mons
(1 row)

postgres=# select interval E'2-2\t' year to month;
interval

 2 years 2 mons
(1 row)

postgres=# select interval E'2-\t2\t' year to month;
ERROR:  invalid input syntax for type interval: "2- 2   "
LINE 1: select interval E'2-\t2\t' year to month;
^
postgres=# select interval '2  00:00:01' day to second;
interval
-
 2 days 00:00:01
(1 row)

postgres=# select interval '- 2  00:00:01' day to second;
 interval
---
 -2 days +00:00:01
(1 row)

{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30188) Enable Adaptive Query Execution default

2019-12-09 Thread Ke Jia (Jira)
Ke Jia created SPARK-30188:
--

 Summary: Enable Adaptive Query Execution default
 Key: SPARK-30188
 URL: https://issues.apache.org/jira/browse/SPARK-30188
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Enable Adaptive Query Execution by default.
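
A minimal sketch of what this would change (assuming the existing 
spark.sql.adaptive.enabled flag is what would flip to true by default; today it can 
be enabled per session like this):

{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Subsequent queries are planned adaptively, e.g. shuffle partitions can be
// coalesced at runtime based on the observed data sizes.
spark.range(0, 1000000).groupBy("id").count().collect()
{code}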



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30187) NULL handling in PySpark-PandasUDF

2019-12-09 Thread Abdeali Kothari (Jira)
Abdeali Kothari created SPARK-30187:
---

 Summary: NULL handling in PySpark-PandasUDF
 Key: SPARK-30187
 URL: https://issues.apache.org/jira/browse/SPARK-30187
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Abdeali Kothari


The new pandas UDFs have been very helpful in simplifying UDF writing and are more 
performant. But I cannot reliably use them because of their different NULL value 
handling compared to normal Spark.

Here is my understanding ...

In spark, nulls/missing are as follows:
 * Float: null/NaN
 * Integer: null
 * String: null
 * DateTime: null

 

In pandas, null/missing are as follows:
 * Float: NaN
 * Integer: 
 * String: null
 * DateTime: NaT

When I use Spark with a pandas UDF, it looks to me like there is information loss, 
as I am unable to differentiate between null and NaN anymore, which I could do when 
I was using the older Python UDF.

 
{code:java}
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
... return x.sum()
>>> sdf = spark.createDataFrame(
[ [1.0, 2.0],
  [None, 3.0],
  [float('nan'), 4.0]
], ['a', 'b'])
>>> sdf.agg(pd_sum(sdf['a'])).collect()
[Row(pd_sum(a)=1.0)]
>>> sdf.select(F.sum(sdf['a'])).collect()
[Row(sum(a)=nan)]
{code}
 

If I use an integer column with NULL values, the pandas UDF actually gets a float 
type:

 
{code:java}
>>> sdf = spark.createDataFrame([ [1, 2.0], [None, 3.0] ], ['a', 'b'])

>>> sdf.dtypes
[('a', 'bigint'), ('b', 'double')]
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     print(x)
...     return x.sum()
>>> sdf.agg(pd_sum(sdf['a'])).collect()
0 1.0
1 NaN
Name: _0, dtype: float64 float64
[Row(pd_sum(a)=1)]
{code}
 

I'm not sure whether this is something Spark should handle, but I wanted to 
understand whether there is a plan to manage this.

Because from what I understand, if someone wants to use pandas DataFrames as of 
now, they need to make some assumptions like:
 * The entire range of BigInteger will not work, because it gets converted to 
float (if null values are present)
 * The float type should have either NaN or NULL - not both

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution

2019-12-09 Thread Xiaoju Wu (Jira)
Xiaoju Wu created SPARK-30186:
-

 Summary: support Dynamic Partition Pruning in Adaptive Execution
 Key: SPARK-30186
 URL: https://issues.apache.org/jira/browse/SPARK-30186
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiaoju Wu
 Fix For: 3.0.0


Currently Adaptive Execution cannot work if Dynamic Partition Pruning is 
applied.

private def supportAdaptive(plan: SparkPlan): Boolean = {
  // TODO migrate dynamic-partition-pruning onto adaptive execution.
  sanityCheck(plan) &&
    !plan.logicalLink.exists(_.isStreaming) &&
    *!plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined)* &&
    plan.children.forall(supportAdaptive)
}

This means we cannot get the performance benefits of AE and DPP at the same time.

This ticket targets making DPP and AE work together.
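
For reference, a minimal sketch of the two features this ticket wants to combine 
(the DPP conf name is my assumption for Spark 3.0): even with both enabled, plans 
containing a DynamicPruningSubquery currently stay on the non-adaptive path 
because of the check above.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Assumed conf name for dynamic partition pruning in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
{code}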



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30180) listJars() and listFiles() function display issue when file path contains spaces

2019-12-09 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-30180:

Summary: listJars() and listFiles() function display issue when file path 
contains spaces  (was: listJars() function display issue.)

> listJars() and listFiles() function display issue when file path contains 
> spaces
> 
>
> Key: SPARK-30180
> URL: https://issues.apache.org/jira/browse/SPARK-30180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>
>  
> {{scala> sc.listJars()
> res2: Seq[String] = Vector(spark://11.242.181.153:50811/jars/c6%20test.jar)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26433) Tail method for spark DataFrame

2019-12-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991520#comment-16991520
 ] 

Hyukjin Kwon commented on SPARK-26433:
--

Looking back at this JIRA, I realised that I had underestimated it, given the 
multiple requests and other systems. I re-created a JIRA and made a PR.

> Tail method for spark DataFrame
> ---
>
> Key: SPARK-26433
> URL: https://issues.apache.org/jira/browse/SPARK-26433
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jan Gorecki
>Priority: Major
>
> There is a head method for Spark DataFrames, which works fine, but there 
> doesn't seem to be a tail method.
> ```
> >>> ans   
> >>>   
> DataFrame[v1: bigint] 
>   
> >>> ans.head(3)   
> >>>  
> [Row(v1=299443), Row(v1=299493), Row(v1=300751)]
> >>> ans.tail(3)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py
> spark/sql/dataframe.py", line 1300, in __getattr__
> "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
> AttributeError: 'DataFrame' object has no attribute 'tail'
> ```
> I would like to request a tail method for Spark DataFrames.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26433) Tail method for spark DataFrame

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-26433:
--

> Tail method for spark DataFrame
> ---
>
> Key: SPARK-26433
> URL: https://issues.apache.org/jira/browse/SPARK-26433
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jan Gorecki
>Priority: Major
>
> There is a head method for Spark DataFrames, which works fine, but there 
> doesn't seem to be a tail method.
> ```
> >>> ans   
> >>>   
> DataFrame[v1: bigint] 
>   
> >>> ans.head(3)   
> >>>  
> [Row(v1=299443), Row(v1=299493), Row(v1=300751)]
> >>> ans.tail(3)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py
> spark/sql/dataframe.py", line 1300, in __getattr__
> "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
> AttributeError: 'DataFrame' object has no attribute 'tail'
> ```
> I would like to request a tail method for Spark DataFrames.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26433) Tail method for spark DataFrame

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26433.
--
Resolution: Duplicate

> Tail method for spark DataFrame
> ---
>
> Key: SPARK-26433
> URL: https://issues.apache.org/jira/browse/SPARK-26433
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jan Gorecki
>Priority: Major
>
> There is a head method for Spark DataFrames, which works fine, but there 
> doesn't seem to be a tail method.
> ```
> >>> ans   
> >>>   
> DataFrame[v1: bigint] 
>   
> >>> ans.head(3)   
> >>>  
> [Row(v1=299443), Row(v1=299493), Row(v1=300751)]
> >>> ans.tail(3)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py
> spark/sql/dataframe.py", line 1300, in __getattr__
> "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
> AttributeError: 'DataFrame' object has no attribute 'tail'
> ```
> I would like to request a tail method for Spark DataFrames.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30138) Separate configuration key of max iterations for analyzer and optimizer

2019-12-09 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30138.
--
Fix Version/s: 3.0.0
 Assignee: Hu Fuwang
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/26766

> Separate configuration key of max iterations for analyzer and optimizer
> ---
>
> Key: SPARK-30138
> URL: https://issues.apache.org/jira/browse/SPARK-30138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Assignee: Hu Fuwang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, both Analyzer and Optimizer use the conf 
> "spark.sql.optimizer.maxIterations" to set the max iterations to run, which 
> is a little confusing.
> It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for the 
> analyzer's max iterations.
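
For reference, a hedged sketch of the two keys involved; the analyzer key is what 
I understand the linked PR to introduce, so treat the exact name as an assumption.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.maxIterations", "100")  # existing optimizer fixed-point limit
spark.conf.set("spark.sql.analyzer.maxIterations", "100")   # new, analyzer-only limit (assumed key)
{code}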



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30185) Implement Dataset.tail API

2019-12-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30185:
-
Description: 
I would like to propose an API called DataFrame.tail.

*Background & Motivation*

Many other systems support taking data from the end, for instance, pandas[1] and
 Python[2][3]. Scala collections APIs also have head and tail.

On the other hand, in Spark, we only provide a way to take data from the start
 (e.g., DataFrame.head). This has been requested multiple times here and there 
in Spark
 user mailing list[4], StackOverFlow[5][6], JIRA[7] and other third party 
projects such as
 Koalas[8].

It seems we're missing a non-trivial use case in Spark, and this motivated me to 
propose this API.

*Proposal*

I would like to propose an API against DataFrame called tail that collects rows 
from the
 end in contrast with head.

Namely, as below:


{code:java}
 scala> spark.range(10).head(5)
 res1: Array[Long] = Array(0, 1, 2, 3, 4)
 scala> spark.range(10).tail(5)
 res2: Array[Long] = Array(5, 6, 7, 8, 9){code}
Implementation details will be similar to head, but reversed:

1. Run the job against the last partition and collect rows. If this is enough, 
return as is.
2. If this is not enough, calculate the number of partitions to select more, 
based upon ‘spark.sql.limit.scaleUpFactor’.
3. Run more jobs against more partitions (in a reversed order compared to head), 
as many as the number calculated in step 2.
4. Go to step 2.
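
A rough PySpark-level sketch of these steps, purely illustrative and not the 
proposed implementation (the real work would happen inside SQL execution; 
scale_up_factor here only mimics ‘spark.sql.limit.scaleUpFactor’):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def naive_tail(df, n, scale_up_factor=4):
    rdd = df.rdd
    num_parts = rdd.getNumPartitions()
    collected = []        # rows from already-scanned tail partitions, in order
    to_scan = 1           # how many partitions to scan in the next round
    end = num_parts       # partitions [start, end) are scanned next
    while len(collected) < n and end > 0:
        start = max(0, end - to_scan)
        rows = rdd.context.runJob(rdd, lambda it: list(it), list(range(start, end)))
        collected = list(rows) + collected   # earlier partitions go in front
        end = start
        to_scan *= scale_up_factor           # mimics spark.sql.limit.scaleUpFactor
    return collected[-n:]

print(naive_tail(spark.range(10), 5))   # [Row(id=5), ..., Row(id=9)]
{code}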
 [1] 
[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail]
 [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line]
 [3] 
[https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python]
 [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html]
 [5] 
[https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index]
 [6] 
[https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe]
 [7] https://issues.apache.org/jira/browse/SPARK-26433
 [8] [https://github.com/databricks/koalas/issues/343]

  was:
I would like to propose an API called DataFrame.tail.


*Background & Motivation*

Many other systems support the way to take data from the end, for instance, 
pandas[1] and
 Python[2][3]. Scala collections APIs also have head and tail

On the other hand, in Spark, we only provide a way to take data from the start
 (e.g., DataFrame.head). This has been requested multiple times here and there 
in Spark
 user mailing list[4], StackOverFlow[5][6], JIRA[7] and other third party 
projects such as
 Koalas[8].

It seems we're missing non-trivial use case in Spark and this motivated me to 
propose this
 API.

*Proposal*

I would like to propose an API against DataFrame called tail that collects rows 
from the
 end in contrast with head.

Namely, as below:
 scala> spark.range(10).head(5)
 res1: Array[Long] = Array(0, 1, 2, 3, 4)

scala> spark.range(10).tail(5)
 res2: Array[Long] = Array(5, 6, 7, 8, 9)
 Implementation details will be similar with head but it will be reversed:

Run the job against the last partition and collect rows. If this is enough, 
return as is.
 If this is not enough, calculate the number of partitions to select more based 
upon
 ‘spark.sql.limit.scaleUpFactor’
 Run more jobs against more partitions (in a reversed order compared to head)
 as many as the number calculated from 2.
 Go to 2.
 [1] 
[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail]
 [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line]
 [3] 
[https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python]
 [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html]
 [5] 
[https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index]
 [6] 
[https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe]
 [7] https://issues.apache.org/jira/browse/SPARK-26433
 [8] [https://github.com/databricks/koalas/issues/343]


> Implement Dataset.tail API
> --
>
> Key: SPARK-30185
> URL: https://issues.apache.org/jira/browse/SPARK-30185
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> I would like to propose an API called DataFrame.tail.
> *Background & Motivation*
> Many other systems support the way to take data from the end, for instance, 
> pandas[1] and
>  Python[2][3]. Scala collections APIs also have head and tail
> On the other hand, in Spark, we only provide a way to 

[jira] [Created] (SPARK-30185) Implement Dataset.tail API

2019-12-09 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30185:


 Summary: Implement Dataset.tail API
 Key: SPARK-30185
 URL: https://issues.apache.org/jira/browse/SPARK-30185
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


I would like to propose an API called DataFrame.tail.


*Background & Motivation*

Many other systems support taking data from the end, for instance, pandas[1] and
 Python[2][3]. Scala collections APIs also have head and tail.

On the other hand, in Spark, we only provide a way to take data from the start
 (e.g., DataFrame.head). This has been requested multiple times here and there 
in Spark
 user mailing list[4], StackOverFlow[5][6], JIRA[7] and other third party 
projects such as
 Koalas[8].

It seems we're missing a non-trivial use case in Spark, and this motivated me to 
propose this API.

*Proposal*

I would like to propose an API against DataFrame called tail that collects rows 
from the
 end in contrast with head.

Namely, as below:
 scala> spark.range(10).head(5)
 res1: Array[Long] = Array(0, 1, 2, 3, 4)

scala> spark.range(10).tail(5)
 res2: Array[Long] = Array(5, 6, 7, 8, 9)
Implementation details will be similar to head, but reversed:

1. Run the job against the last partition and collect rows. If this is enough, 
return as is.
2. If this is not enough, calculate the number of partitions to select more, 
based upon ‘spark.sql.limit.scaleUpFactor’.
3. Run more jobs against more partitions (in a reversed order compared to head), 
as many as the number calculated in step 2.
4. Go to step 2.
 [1] 
[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail]
 [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line]
 [3] 
[https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python]
 [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html]
 [5] 
[https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index]
 [6] 
[https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe]
 [7] https://issues.apache.org/jira/browse/SPARK-26433
 [8] [https://github.com/databricks/koalas/issues/343]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30184) Implement a helper method for aliasing functions

2019-12-09 Thread Aman Omer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Omer updated SPARK-30184:
--
Summary: Implement a helper method for aliasing functions  (was: Implement 
a helper method for aliasing )

> Implement a helper method for aliasing functions
> 
>
> Key: SPARK-30184
> URL: https://issues.apache.org/jira/browse/SPARK-30184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Aman Omer
>Priority: Major
>
> In PR https://github.com/apache/spark/pull/26712, we introduced the 
> expressionWithAlias function in FunctionRegistry.scala, which forwards the 
> function name submitted at runtime.
> This Jira is to use expressionWithAlias for the remaining functions for which 
> an alias name can be used. The remaining functions are:
> Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30184) Implement a helper method for aliasing

2019-12-09 Thread Aman Omer (Jira)
Aman Omer created SPARK-30184:
-

 Summary: Implement a helper method for aliasing 
 Key: SPARK-30184
 URL: https://issues.apache.org/jira/browse/SPARK-30184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Aman Omer


In PR https://github.com/apache/spark/pull/26712, we introduced the 
expressionWithAlias function in FunctionRegistry.scala, which forwards the 
function name submitted at runtime.

This Jira is to use expressionWithAlias for the remaining functions for which an 
alias name can be used. The remaining functions are:
Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30176) Eliminate warnings: part 6

2019-12-09 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-30176:
---
Comment: was deleted

(was: i will work on this.)

> Eliminate warnings: part 6
> --
>
> Key: SPARK-30176
> URL: https://issues.apache.org/jira/browse/SPARK-30176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
>   
> sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> {code:java}
>  Warning:Warning:line (32)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (91)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (100)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (109)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (118)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala
> {code:java}
> Warning:Warning:line (242)object typed in package scalalang is deprecated 
> (since 3.0.0): please use untyped builtin aggregate functions.
>   df.as[Data].select(typed.sumLong((d: Data) => 
> d.l)).queryExecution.toRdd.foreach(_ => ())
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala
> {code:java}
> Warning:Warning:line (714)method from_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(from_utc_timestamp(col("a"), "PST")),
> Warning:Warning:line (719)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
> df.select(from_utc_timestamp(col("b"), "PST")),
> Warning:Warning:line (725)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
>   df.select(from_utc_timestamp(col("a"), "PST")).collect()
> Warning:Warning:line (737)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
> df.select(from_utc_timestamp(col("a"), col("c"))),
> Warning:Warning:line (742)method from_utc_timestamp in object functions 
> is deprecated (since 3.0.0): This function is deprecated and will be removed 
> in future versions.
> df.select(from_utc_timestamp(col("b"), col("c"))),
> Warning:Warning:line (756)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("a"), "PST")),
> Warning:Warning:line (761)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("b"), "PST")),
> Warning:Warning:line (767)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
>   df.select(to_utc_timestamp(col("a"), "PST")).collect()
> Warning:Warning:line (779)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("a"), col("c"))),
> Warning:Warning:line (784)method to_utc_timestamp in object functions is 
> deprecated (since 3.0.0): This function is deprecated and will be removed in 
> future versions.
> df.select(to_utc_timestamp(col("b"), col("c"))),
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
> {code:java}
> Warning:Warning:line (241)method merge in object Row is deprecated (since 
> 3.0.0): This method is deprecated and will be removed in future versions.
>   testData.rdd.flatMap(row => Seq.fill(16)(Row.merge(row, 
> row))).collect().toSeq)
> {code}
>   sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
> {code:java}
>  Warning:Warning:line (787)method merge in object Row is deprecated (since 
> 3.0.0): This method is deprecated and will be removed in future versions.
> row => Seq.fill(16)(Row.merge(row, row))).collect().toSeq)
> {code}
>   
> 

[jira] [Issue Comment Deleted] (SPARK-30175) Eliminate warnings: part 5

2019-12-09 Thread Sandeep Katta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta updated SPARK-30175:
--
Comment: was deleted

(was: thanks for raising, will raise PR soon)

> Eliminate warnings: part 5
> --
>
> Key: SPARK-30175
> URL: https://issues.apache.org/jira/browse/SPARK-30175
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/WriteToMicroBatchDataSource.scala
> {code:java}
> Warning:Warning:line (36)class WriteToDataSourceV2 in package v2 is 
> deprecated (since 2.4.0): Use specific logical plans like AppendData instead
>   def createPlan(batchId: Long): WriteToDataSourceV2 = {
> Warning:Warning:line (37)class WriteToDataSourceV2 in package v2 is 
> deprecated (since 2.4.0): Use specific logical plans like AppendData instead
> WriteToDataSourceV2(new MicroBatchWrite(batchId, write), query)
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
> {code:java}
>  Warning:Warning:line (703)a pure expression does nothing in statement 
> position; multiline expressions might require enclosing parentheses
>   q1
> {code}
> sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala
> {code:java}
> Warning:Warning:line (285)object typed in package scalalang is deprecated 
> (since 3.0.0): please use untyped builtin aggregate functions.
> val aggregated = 
> inputData.toDS().groupByKey(_._1).agg(typed.sumLong(_._2))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30108) Add robust accumulator for observable metrics

2019-12-09 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991381#comment-16991381
 ] 

Herman van Hövell commented on SPARK-30108:
---

[~Ankitraj] Great to hear! Here are some pointers. The goal of this ticket is to 
make the current AggregatingAccumulator add the result of a partition only once. 
For this we need to do a couple of things:
 - We need to add some tracking mechanism to the AggregatingAccumulator that 
keeps track of the partitions that have been processed. A (compressible) bit 
map should be enough.
 - We need a way to tell the accumulator's merge method which partition it is 
processing. One way of doing this would be to make a small change in the way the 
DAGScheduler interacts with the accumulator and also to pass Task/Partition info 
when calling update. A less intrusive way would be to pass back the partition 
info with the to-be-merged accumulator.
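
A minimal, purely illustrative sketch of that idea (not Spark's 
AggregatingAccumulator API): remember which partitions have already been merged 
so a re-executed partition cannot be added twice.

{code:python}
class OncePerPartitionAccumulator:
    def __init__(self):
        self.value = 0
        self._seen_partitions = set()   # stand-in for a compressible bitmap

    def merge(self, partition_id, partial_value):
        # Ignore duplicate results from retried/speculative partitions.
        if partition_id in self._seen_partitions:
            return
        self._seen_partitions.add(partition_id)
        self.value += partial_value


acc = OncePerPartitionAccumulator()
acc.merge(0, 10)
acc.merge(0, 10)   # retry of partition 0: ignored
acc.merge(1, 5)
assert acc.value == 15
{code}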

> Add robust accumulator for observable metrics
> -
>
> Key: SPARK-30108
> URL: https://issues.apache.org/jira/browse/SPARK-30108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30183) Disallow to specify reserved properties in CREATE NAMESPACE syntax

2019-12-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-30183:


 Summary: Disallow to specify reserved properties in CREATE 
NAMESPACE syntax 
 Key: SPARK-30183
 URL: https://issues.apache.org/jira/browse/SPARK-30183
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


Currently, COMMENT and LOCATION are reserved properties for datasource v2 
namespaces. They should be set through the COMMENT and LOCATION clauses instead.
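
A hedged illustration of the intent (CREATE NAMESPACE syntax as documented for 
Spark 3.0; the exact behaviour after this change is an assumption): reserved keys 
go through their dedicated clauses rather than the properties map.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Preferred: dedicated clauses for the reserved keys.
spark.sql("""
  CREATE NAMESPACE IF NOT EXISTS ns1
  COMMENT 'test namespace'
  LOCATION '/tmp/ns1'
""")

# What this ticket would disallow (assuming the v2 WITH PROPERTIES syntax):
# smuggling the reserved keys into the properties map.
# spark.sql("CREATE NAMESPACE ns2 WITH PROPERTIES ('comment' = 'x', 'location' = '/tmp/ns2')")
{code}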



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30182) Support nested aggregates

2019-12-09 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30182:
--

 Summary: Support nested aggregates
 Key: SPARK-30182
 URL: https://issues.apache.org/jira/browse/SPARK-30182
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


Spark SQL cannot support a SQL statement with a nested aggregate like the one below:
{code:java}
SELECT sum(salary), row_number() OVER (ORDER BY depname), sum(
 sum(salary) FILTER (WHERE enroll_date > '2007-01-01')
) FILTER (WHERE depname <> 'sales') OVER (ORDER BY depname DESC) AS 
"filtered_sum",
 depname
FROM empsalary GROUP BY depname;{code}
And Spark will throw an exception as follows:
{code:java}
org.apache.spark.sql.AnalysisException
It is not allowed to use an aggregate function in the argument of another 
aggregate function. Please use the inner aggregate function in a 
sub-query.{code}
But PostgreSQL supports this syntax.
{code:java}
SELECT sum(salary), row_number() OVER (ORDER BY depname), sum(
 sum(salary) FILTER (WHERE enroll_date > '2007-01-01')
) FILTER (WHERE depname <> 'sales') OVER (ORDER BY depname DESC) AS 
"filtered_sum",
 depname
FROM empsalary GROUP BY depname;
  sum  | row_number | filtered_sum |  depname
-------+------------+--------------+-----------
 25100 |          1 |        22600 | develop
  7400 |          2 |         3500 | personnel
 14600 |          3 |              | sales
(3 rows){code}
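
A hedged workaround sketch (not from the ticket), following the hint in the error 
message: compute the inner aggregate in a sub-query and apply the window 
aggregate on top. CASE WHEN stands in for FILTER so the query stays on syntax 
Spark already accepts; it assumes the empsalary table from the example.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
  SELECT dept_sum,
         ROW_NUMBER() OVER (ORDER BY depname) AS row_number,
         SUM(CASE WHEN depname <> 'sales' THEN filtered_dept_sum END)
           OVER (ORDER BY depname DESC) AS filtered_sum,
         depname
  FROM (
    SELECT depname,
           SUM(salary) AS dept_sum,
           SUM(CASE WHEN enroll_date > '2007-01-01' THEN salary END) AS filtered_dept_sum
    FROM empsalary
    GROUP BY depname
  ) t
""").show()
{code}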
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30181) Throws runtime exception when filter metastore partition key that's not string type or integral types

2019-12-09 Thread Yu-Jhe Li (Jira)
Yu-Jhe Li created SPARK-30181:
-

 Summary: Throws runtime exception when filter metastore partition 
key that's not string type or integral types
 Key: SPARK-30181
 URL: https://issues.apache.org/jira/browse/SPARK-30181
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0
Reporter: Yu-Jhe Li


The SQL below throws a runtime exception since Spark 2.4.0. I think it's a bug 
introduced by SPARK-22384.
{code:scala}
spark.sql("CREATE TABLE timestamp_part (value INT) PARTITIONED BY (dt TIMESTAMP)")
val df = Seq(
    (1, java.sql.Timestamp.valueOf("2019-12-01 00:00:00"), 1),
    (2, java.sql.Timestamp.valueOf("2019-12-01 01:00:00"), 1)
  ).toDF("id", "dt", "value")
df.write.partitionBy("dt").mode("overwrite").saveAsTable("timestamp_part")

spark.sql("select * from timestamp_part where dt >= '2019-12-01 00:00:00'").explain(true)
{code}
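
The exception below itself suggests a workaround; a minimal sketch of applying it 
(at the cost of partition pruning performance) might look like this, assuming the 
timestamp_part table from the repro above:

{code:python}
from pyspark.sql import SparkSession

# Disabling metastore partition management avoids the failing
# getPartitionsByFilter call, as the error message advises.
spark = (SparkSession.builder
         .config("spark.sql.hive.manageFilesourcePartitions", "false")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("select * from timestamp_part where dt >= '2019-12-01 00:00:00'").explain(True)
{code}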
{noformat}
Caught Hive MetaException attempting to get partition metadata by filter from 
Hive. You can set the Spark configuration setting 
spark.sql.hive.manageFilesourcePartitions to false to work around this problem, 
however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
  at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:774)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite.testMetastorePartitionFiltering(HiveClientSuite.scala:310)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$testMetastorePartitionFiltering(HiveClientSuite.scala:282)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply$mcV$sp(HiveClientSuite.scala:105)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
  at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
  at org.scalatest.Suite$class.run(Suite.scala:1147)
  at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
  at