[jira] [Created] (SPARK-30200) Add ExplainMode for Dataset.explain
Takeshi Yamamuro created SPARK-30200: Summary: Add ExplainMode for Dataset.explain Key: SPARK-30200 URL: https://issues.apache.org/jira/browse/SPARK-30200 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro This PR adds ExplainMode for explaining a Dataset/DataFrame with a given format mode (ExplainMode). ExplainMode has five modes, matching the SQL EXPLAIN command: Simple, Extended, Codegen, Cost, and Formatted. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-30199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30199: -- Issue Type: Improvement (was: Bug) > Recover spark.ui.port and spark.blockManager.port from checkpoint > - > > Key: SPARK-30199 > URL: https://issues.apache.org/jira/browse/SPARK-30199 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30199) Recover spark.ui.port and spark.blockManager.port from checkpoint
Dongjoon Hyun created SPARK-30199: - Summary: Recover spark.ui.port and spark.blockManager.port from checkpoint Key: SPARK-30199 URL: https://issues.apache.org/jira/browse/SPARK-30199 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.4.4, 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30198) BytesToBytesMap does not grow internal long array as expected
L. C. Hsieh created SPARK-30198: --- Summary: BytesToBytesMap does not grow internal long array as expected Key: SPARK-30198 URL: https://issues.apache.org/jira/browse/SPARK-30198 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh One Spark job on our cluster hangs forever at BytesToBytesMap.safeLookup. On inspection, the long array size is 536870912. Currently, BytesToBytesMap.append only grows the internal array if the size of the array is less than MAX_CAPACITY, which is 536870912. So in the above case the array can never grow, and safeLookup can never find an empty slot. This check is wrong because the map uses two array entries per key, so the array size is twice the capacity. We should compare MAX_CAPACITY against the current capacity of the array, not the size of the array. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
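The growth-check mismatch described in SPARK-30198 can be illustrated with a small, self-contained sketch. The constant MAX_CAPACITY and the two-entries-per-key layout mirror the report; the helper methods below are hypothetical simplifications for illustration, not the actual BytesToBytesMap code.

```java
// Standalone sketch of the SPARK-30198 growth check (hypothetical helpers,
// not the real BytesToBytesMap implementation).
public class GrowthCheck {
    // Maximum number of keys the map may hold, per the report.
    static final long MAX_CAPACITY = 536870912L;

    // Buggy condition from the report: compares the raw long-array length
    // against MAX_CAPACITY. Since the array holds two entries per key, the
    // array length hits MAX_CAPACITY while the key capacity is only half that.
    static boolean canGrowBuggy(long arraySize) {
        return arraySize < MAX_CAPACITY;
    }

    // Proposed fix: derive the current key capacity (array length / 2) and
    // compare that against MAX_CAPACITY instead.
    static boolean canGrowFixed(long arraySize) {
        return arraySize / 2 < MAX_CAPACITY;
    }

    public static void main(String[] args) {
        long arraySize = 536870912L; // the array size observed in the hung job
        System.out.println(canGrowBuggy(arraySize)); // false: map never grows, safeLookup spins
        System.out.println(canGrowFixed(arraySize)); // true: map can still grow
    }
}
```

With the buggy check, a map whose array has reached 536870912 slots (capacity 268435456 keys) is never grown, so once it fills up, safeLookup loops forever looking for an empty slot.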
[jira] [Commented] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions
[ https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992185#comment-16992185 ] Ankit Raj Boudh commented on SPARK-30130: - Hi Matt Boegner, can you please help me to reproduce this issue? > Hardcoded numeric values in common table expressions which utilize GROUP BY > are interpreted as ordinal positions > > > Key: SPARK-30130 > URL: https://issues.apache.org/jira/browse/SPARK-30130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Matt Boegner >Priority: Minor > > Hardcoded numeric values in common table expressions which utilize GROUP BY > are interpreted as ordinal positions. > {code:java} > val df = spark.sql(""" > with a as (select 0 as test, count group by test) > select * from a > """) > df.show(){code} > This results in an error message like {color:#e01e5a}GROUP BY position 0 is > not in select list (valid range is [1, 2]){color} . > > However, this error does not appear in a traditional subselect format. For > example, this query executes correctly: > {code:java} > val df = spark.sql(""" > select * from (select 0 as test, count group by test) a > """) > df.show(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
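A side note on the behavior reported above: whether integer literals in GROUP BY are interpreted as select-list positions is governed by the spark.sql.groupByOrdinal configuration, which defaults to true. This is not a fix for the CTE-vs-subselect inconsistency itself, but a sketch of a possible workaround when the literal is not meant as an ordinal:

```sql
-- Workaround sketch: with groupByOrdinal disabled, integer literals in
-- GROUP BY are treated as ordinary constant expressions rather than
-- select-list positions, so "GROUP BY position 0" errors no longer occur.
SET spark.sql.groupByOrdinal=false;
```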
[jira] [Comment Edited] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions
[ https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992185#comment-16992185 ] Ankit Raj Boudh edited comment on SPARK-30130 at 12/10/19 4:41 AM: --- hi Matt boegner, can you please help me to reproduce this issue was (Author: ankitraj): hi Matt boegner, can you please me to reproduce this issue > Hardcoded numeric values in common table expressions which utilize GROUP BY > are interpreted as ordinal positions > > > Key: SPARK-30130 > URL: https://issues.apache.org/jira/browse/SPARK-30130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Matt Boegner >Priority: Minor > > Hardcoded numeric values in common table expressions which utilize GROUP BY > are interpreted as ordinal positions. > {code:java} > val df = spark.sql(""" > with a as (select 0 as test, count group by test) > select * from a > """) > df.show(){code} > This results in an error message like {color:#e01e5a}GROUP BY position 0 is > not in select list (valid range is [1, 2]){color} . > > However, this error does not appear in a traditional subselect format. For > example, this query executes correctly: > {code:java} > val df = spark.sql(""" > select * from (select 0 as test, count group by test) a > """) > df.show(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30196: Assignee: Takeshi Yamamuro > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30196. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26823 [https://github.com/apache/spark/pull/26823] > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'
[ https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30193. -- Resolution: Not A Problem > Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: > 'scala.collection.Seq' > - > > Key: SPARK-30193 > URL: https://issues.apache.org/jira/browse/SPARK-30193 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > > [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] > JavaDataFrameSuite.java > JavaTokenizerExample.java -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30179) Improve test in SingleSessionSuite
[ https://issues.apache.org/jira/browse/SPARK-30179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30179. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26812 [https://github.com/apache/spark/pull/26812] > Improve test in SingleSessionSuite > -- > > Key: SPARK-30179 > URL: https://issues.apache.org/jira/browse/SPARK-30179 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > https://github.com/apache/spark/blob/58be82ad4b98fc17e821e916e69e77a6aa36209d/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala#L782-L824 > We should also verify the UDF works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30179) Improve test in SingleSessionSuite
[ https://issues.apache.org/jira/browse/SPARK-30179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30179: Assignee: Yuming Wang > Improve test in SingleSessionSuite > -- > > Key: SPARK-30179 > URL: https://issues.apache.org/jira/browse/SPARK-30179 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://github.com/apache/spark/blob/58be82ad4b98fc17e821e916e69e77a6aa36209d/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala#L782-L824 > We should also verify the UDF works. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30197) Add `requirements.txt` file to `python` directory
Dongjoon Hyun created SPARK-30197: - Summary: Add `requirements.txt` file to `python` directory Key: SPARK-30197 URL: https://issues.apache.org/jira/browse/SPARK-30197 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30196) Bump lz4-java version to 1.7.0
Takeshi Yamamuro created SPARK-30196: Summary: Bump lz4-java version to 1.7.0 Key: SPARK-30196 URL: https://issues.apache.org/jira/browse/SPARK-30196 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30195) Some imports, function need more explicit resolution in 2.13
Sean R. Owen created SPARK-30195: Summary: Some imports, function need more explicit resolution in 2.13 Key: SPARK-30195 URL: https://issues.apache.org/jira/browse/SPARK-30195 Project: Spark Issue Type: Sub-task Components: ML, Spark Core, SQL, Structured Streaming Affects Versions: 3.0.0 Reporter: Sean R. Owen Assignee: Sean R. Owen This is a grouping of related but not identical issues in the 2.13 migration, where the compiler is more picky about explicit types and imports. I'm grouping them as they seem moderately related. Some are fairly self-evident like wanting an explicit generic type. In a few cases it looks like import resolution rules tightened up a bit and have to be explicit. A few more cause problems like: {code} [ERROR] [Error] /Users/seanowen/Documents/spark_2.13/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala:220: missing parameter type for expanded function The argument types of an anonymous function must be fully known. (SLS 8.5) Expected type was: ? {code} In some cases it's just a matter of adding an explicit type, like {{.map { m: Matrix =>}}. Many seem to concern functions of tuples, or tuples of tuples. {{.mapGroups { case (g, iter) =>}} needs to be simply {{.mapGroups { (g, iter) =>}} Or more annoyingly: {code} }.reduceByKey { case ((wc1, df1), (wc2, df2)) => (wc1 + wc2, df1 + df2) } {code} Apparently can only be fully known without nesting tuples. This _won't_ work: {code} }.reduceByKey { case ((wc1: Long, df1: Int), (wc2: Long, df2: Int)) => (wc1 + wc2, df1 + df2) } {code} This does: {code} }.reduceByKey { (wcdf1, wcdf2) => (wcdf1._1 + wcdf2._1, wcdf1._2 + wcdf2._2) } {code} I'm not super clear why most of the problems seem to affect reduceByKey. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30158) Resolve Array + reference type compile problems in 2.13, with sc.parallelize
[ https://issues.apache.org/jira/browse/SPARK-30158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30158. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26787 [https://github.com/apache/spark/pull/26787] > Resolve Array + reference type compile problems in 2.13, with sc.parallelize > > > Key: SPARK-30158 > URL: https://issues.apache.org/jira/browse/SPARK-30158 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.0.0 > > > Scala 2.13 has some different rules about resolving Arrays as Seqs when the > array is of a reference type. This primarily affects calls to > {{sc.parallelize(Array(...))}} where elements aren't primitive: > {code} > [ERROR] [Error] > /Users/seanowen/Documents/spark_2.13/mllib/src/main/scala/org/apache/spark/mllib/pmml/PMMLExportable.scala:61: > overloaded method value apply with alternatives: > (x: Unit,xs: Unit*)Array[Unit] > (x: Double,xs: Double*)Array[Double] > ... > {code} > This is easy to resolve by using Seq instead. > Closely related: WrappedArray is a type def in 2.13, which makes it unusable > in Java. One set of tests needs to adapt. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30146) add setWeightCol to GBTs in PySpark
[ https://issues.apache.org/jira/browse/SPARK-30146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30146: Assignee: Huaxin Gao > add setWeightCol to GBTs in PySpark > --- > > Key: SPARK-30146 > URL: https://issues.apache.org/jira/browse/SPARK-30146 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > add setWeightCol and setMinWeightFractionPerNode in Python side of > GBTClassifier and GBTRegressor -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30146) add setWeightCol to GBTs in PySpark
[ https://issues.apache.org/jira/browse/SPARK-30146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30146. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26774 [https://github.com/apache/spark/pull/26774] > add setWeightCol to GBTs in PySpark > --- > > Key: SPARK-30146 > URL: https://issues.apache.org/jira/browse/SPARK-30146 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > add setWeightCol and setMinWeightFractionPerNode in Python side of > GBTClassifier and GBTRegressor -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30180) listJars() and listFiles() function display issue when file path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-30180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30180. -- Resolution: Not A Problem > listJars() and listFiles() function display issue when file path contains > spaces > > > Key: SPARK-30180 > URL: https://issues.apache.org/jira/browse/SPARK-30180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > > {{scala> sc.listJars() > res2: Seq[String] = Vector(spark://11.242.181.153:50811/jars/c6%20test.jar)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'
[ https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-30193: Description: [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] JavaDataFrameSuite.java JavaTokenizerExample.java was: [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] JavaDataFrameSuite JavaDataFrameSuite > Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: > 'scala.collection.Seq' > - > > Key: SPARK-30193 > URL: https://issues.apache.org/jira/browse/SPARK-30193 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > > [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] > JavaDataFrameSuite.java > JavaTokenizerExample.java -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'
[ https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-30193: Description: [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] JavaDataFrameSuite JavaDataFrameSuite was: unused import file. [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] > Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: > 'scala.collection.Seq' > - > > Key: SPARK-30193 > URL: https://issues.apache.org/jira/browse/SPARK-30193 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > > [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] > > JavaDataFrameSuite > JavaDataFrameSuite > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30193) Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq'
[ https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-30193: Summary: Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: 'scala.collection.Seq' (was: remove unused import file) > Wrong 2nd argument type. Found: 'org.apache.spark.sql.Column', required: > 'scala.collection.Seq' > - > > Key: SPARK-30193 > URL: https://issues.apache.org/jira/browse/SPARK-30193 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > unused import file. > > [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30108) Add robust accumulator for observable metrics
[ https://issues.apache.org/jira/browse/SPARK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-30108: -- Description: Spark accumulators reflect the work that has been done, and not the data that has been processed. There are situations where one tuple can be processed multiple times, e.g.: task/stage retries, speculation, determination of ranges for global ordering, etc... For observed metrics we need the value of the accumulator to be based on the data and not on the processing. The current aggregating accumulator is already robust to some of these issues (like task failure), but we need to add some additional checks to make sure it is foolproof. > Add robust accumulator for observable metrics > - > > Key: SPARK-30108 > URL: https://issues.apache.org/jira/browse/SPARK-30108 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Major > > Spark accumulators reflect the work that has been done, and not the data that > has been processed. There are situations where one tuple can be processed > multiple times, e.g.: task/stage retries, speculation, determination of > ranges for global ordering, etc... For observed metrics we need the value of > the accumulator to be based on the data and not on the processing. > The current aggregating accumulator is already robust to some of these issues > (like task failure), but we need to add some additional checks to make sure > it is foolproof. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30194) Re-enable checkstyle for Java
Dongjoon Hyun created SPARK-30194: - Summary: Re-enable checkstyle for Java Key: SPARK-30194 URL: https://issues.apache.org/jira/browse/SPARK-30194 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30193) remove unused import file
[ https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991756#comment-16991756 ] Ankit Raj Boudh commented on SPARK-30193: - I am working on this Jira. > remove unused import file > - > > Key: SPARK-30193 > URL: https://issues.apache.org/jira/browse/SPARK-30193 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > unused import file. > > [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30193) remove unused import file
[ https://issues.apache.org/jira/browse/SPARK-30193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-30193: Description: unused import file. [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] was:unused import file. > remove unused import file > - > > Key: SPARK-30193 > URL: https://issues.apache.org/jira/browse/SPARK-30193 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > unused import file. > > [https://github.com/apache/spark/pull/26773/checks?check_run_id=340387020] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30193) remove unused import file
Ankit Raj Boudh created SPARK-30193: --- Summary: remove unused import file Key: SPARK-30193 URL: https://issues.apache.org/jira/browse/SPARK-30193 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.4 Reporter: Ankit Raj Boudh unused import file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991752#comment-16991752 ] Andrew White commented on SPARK-22876: -- This issue is most certainly unresolved, and the documentation is misleading at best. [~choojoyq] - perhaps you could share your code showing how you deal with this in general? The general solution seems non-obvious to me given all of the intricacies of Spark + YARN interactions, such as shutdown hooks, handling signals, and client vs. cluster mode. Maybe I'm overthinking the solution. This feature is critical for supporting long-running applications. > spark.yarn.am.attemptFailuresValidityInterval does not work correctly > - > > Key: SPARK-22876 > URL: https://issues.apache.org/jira/browse/SPARK-22876 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 > Environment: hadoop version 2.7.3 >Reporter: Jinhan Zhong >Priority: Minor > Labels: bulk-closed > > I assume we can use spark.yarn.maxAppAttempts together with > spark.yarn.am.attemptFailuresValidityInterval to make a long-running > application avoid stopping after an acceptable number of failures. > But after testing, I found that the application always stops after failing n > times (n is the minimum of spark.yarn.maxAppAttempts and > yarn.resourcemanager.am.max-attempts from the client yarn-site.xml) > For example, the following setup will allow the application master to fail 20 > times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s > * spark.yarn.maxAppAttempts=20 > * yarn client: yarn.resourcemanager.am.max-attempts=20 > * yarn resource manager: yarn.resourcemanager.am.max-attempts=3 > And after checking the source code, I found in the source file > ApplicationMaster.scala > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293 > there's a shutdown hook that checks the attempt id against maxAppAttempts: > if attempt id >= maxAppAttempts, it will try to unregister the application > and the application will finish. > Is this an expected design or a bug? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27189) Add Executor metrics and memory usage instrumentation to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-27189. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed by https://github.com/apache/spark/pull/24132 > Add Executor metrics and memory usage instrumentation to the metrics system > --- > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.0.0 > > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27189) Add Executor metrics and memory usage instrumentation to the metrics system
[ https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-27189: Assignee: Luca Canali > Add Executor metrics and memory usage instrumentation to the metrics system > --- > > Key: SPARK-27189 > URL: https://issues.apache.org/jira/browse/SPARK-27189 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Attachments: Example_dashboard_Spark_Memory_Metrics.PNG > > > This proposes to add instrumentation of memory usage via the Spark > Dropwizard/Codahale metrics system. Memory usage metrics are available via > the Executor metrics, recently implemented as detailed in > https://issues.apache.org/jira/browse/SPARK-23206. > Making memory usage metrics available via the Spark Dropwizard metrics system > allows improving Spark performance dashboards and studying memory usage, as in > the attached example graph. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30192) support column position in DS v2
Wenchen Fan created SPARK-30192: --- Summary: support column position in DS v2 Key: SPARK-30192 URL: https://issues.apache.org/jira/browse/SPARK-30192 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30191) AM should update pending resource request faster when driver lost executor
[ https://issues.apache.org/jira/browse/SPARK-30191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Xie updated SPARK-30191: - Description: I run spark on yarn. I found that when driver lost its executors because of machine hardware problem and all of service includes nodemanager, executor on the node has been killed, it means that Resourcemanager can't update the containers info on the node until Resourcemanager try to remove the node, but it always takes 10 mins or longger, and in the meantime, AM don't add the new resource request and driver missing the executors. So maybe AM should add the factor `numExecutorsExiting` in YarnAllocator's method ` updateResourceRequests` to optimize it. was: I run spark on yarn. I found that when driver lost its executors because of machine hardware problem and all of service includes nodemanager, executor on the node has killed, it means that Resourcemanager can't update the containers info on the node until Resourcemanager try to remove the node, but it always takes 10 mins or longger, and in the meantime, AM don't add the new resource request and driver missing the executors. So maybe AM should add the factor `numExecutorsExiting` in YarnAllocator's method ` updateResourceRequests` to optimize it. > AM should update pending resource request faster when driver lost executor > -- > > Key: SPARK-30191 > URL: https://issues.apache.org/jira/browse/SPARK-30191 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.4 >Reporter: Max Xie >Priority: Minor > > I run spark on yarn. 
I found that when the driver loses its executors because of > a machine hardware problem, and every service on the node (including the > NodeManager and the executors) has been killed, the ResourceManager cannot > update the container info for that node until it tries to remove the node, > which usually takes 10 minutes or longer. In the meantime, the AM does not add > new resource requests, so the driver keeps missing its executors. > So maybe the AM should take the factor `numExecutorsExiting` into account in > YarnAllocator's method `updateResourceRequests` to optimize this. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
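The fix suggested above amounts to counting executors that are known to be exiting when the AM computes how many containers it still needs. A minimal stdlib-only Python sketch of that bookkeeping (the function and parameter names are hypothetical; the real logic lives in YarnAllocator's `updateResourceRequests`):

```python
def missing_executors(target_num, num_running, num_pending_allocate, num_exiting):
    """Containers the AM should still request from the ResourceManager.

    `num_exiting` counts executors on a lost node that YARN has not yet
    reported as released (so they are still included in `num_running`).
    Adding it back lets the AM re-request replacements immediately instead
    of waiting ~10 minutes for the node-removal event.
    """
    return max(0, target_num - num_running - num_pending_allocate + num_exiting)

# Without the exiting term, the AM believes it is fully provisioned:
print(missing_executors(10, 10, 0, 3))  # 3 replacements requested right away
print(missing_executors(10, 10, 0, 0))  # 0 -- the AM waits for node removal
```

This is only a sketch of the accounting change; the actual patch would need to track exiting executors per node inside the allocator.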
[jira] [Created] (SPARK-30191) AM should update pending resource request faster when driver lost executor
Max Xie created SPARK-30191: Summary: AM should update pending resource request faster when driver lost executor Key: SPARK-30191 URL: https://issues.apache.org/jira/browse/SPARK-30191 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.4.4 Reporter: Max Xie I run Spark on YARN. I found that when the driver loses its executors because of a machine hardware problem, and every service on the node (including the NodeManager and the executors) has been killed, the ResourceManager cannot update the container info for that node until it tries to remove the node, which usually takes 10 minutes or longer. In the meantime, the AM does not add new resource requests, so the driver keeps missing its executors. So maybe the AM should take the factor `numExecutorsExiting` into account in YarnAllocator's method `updateResourceRequests` to optimize this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30190) HistoryServerDiskManager will fail on appStoreDir in s3
thierry accart created SPARK-30190: -- Summary: HistoryServerDiskManager will fail on appStoreDir in s3 Key: SPARK-30190 URL: https://issues.apache.org/jira/browse/SPARK-30190 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: thierry accart Hi While setting spark.eventLog.dir to s3a://... I realized that it *requires the destination directory to preexist for S3*. This is explained, I think, by HistoryServerDiskManager's appStoreDir: it tries to check whether the directory exists or can be created {{if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) \{throw new IllegalArgumentException(s"Failed to create app directory ($appStoreDir).")}}} But in S3, a directory does not exist and cannot be created: directories don't exist by themselves; they are only materialized by the existence of objects. Before proposing a patch, I wanted to know what the preferred options are: should we have a Spark option to skip the appStoreDir test, skip it only when a particular scheme is set, have a custom implementation of HistoryServerDiskManager ...? _Note for people facing the {{IllegalArgumentException:}} {{Failed to create app directory}} *you just have to put an empty file in the bucket destination 'path'*._ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
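The S3 semantics described above can be illustrated with a toy object store: a "directory" is visible only while some object key has it as a prefix, which is why dropping an empty marker object makes the existence check pass. A stdlib-only sketch (purely illustrative, unrelated to the real Hadoop S3A code):

```python
store = {}  # key -> bytes; a flat namespace, like S3's

def is_directory(path: str) -> bool:
    # In an object store, a "directory" exists only if some object
    # lives under its prefix -- there is no mkdir().
    prefix = path.rstrip("/") + "/"
    return any(key.startswith(prefix) for key in store)

assert not is_directory("bucket/history")   # nothing under the prefix yet
store["bucket/history/.keep"] = b""         # the empty-marker-file workaround
assert is_directory("bucket/history")       # now the "directory" exists
```

This is the reason `appStoreDir.mkdir()` has nothing meaningful to do on S3, and why the workaround in the report is to upload an empty object.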
[jira] [Resolved] (SPARK-30159) Fix the method calls of QueryTest.checkAnswer
[ https://issues.apache.org/jira/browse/SPARK-30159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30159. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26788 [https://github.com/apache/spark/pull/26788] > Fix the method calls of QueryTest.checkAnswer > - > > Key: SPARK-30159 > URL: https://issues.apache.org/jira/browse/SPARK-30159 > Project: Spark > Issue Type: Task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently, the method `checkAnswer` in Object `QueryTest` returns an optional > string. It doesn't throw exceptions when errors happen. > The actual exceptions are thrown in the trait `QueryTest`. > However, there are some test suites that use the no-op method > `QueryTest.checkAnswer` and expect it to fail test cases when the input > Dataframe doesn't match the expected answer. E.g. StreamSuite, > SessionStateSuite, BinaryFileFormatSuite, etc. > We should fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30162) Filter is not being pushed down for Parquet files
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991573#comment-16991573 ] Nasir Ali commented on SPARK-30162: --- [~aman_omer] added in my question > Filter is not being pushed down for Parquet files > - > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Priority: Major > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, 
Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] > +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on a much larger dataset. Spark 3.0 tries to load the whole > data and then apply the filter, whereas Spark 2.4 pushes the filter down. The > output above shows that Spark 2.4 applied the partition filter but the Spark > 3.0 preview did not. > > Minor: in Spark 3.0 the "explain()" output is truncated (maybe a fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". 
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be > overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in > mesos/standalone/kubernetes and LOCAL_DIRS in YARN). > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 3.0.0-preview > /_/Using
[jira] [Updated] (SPARK-30162) Filter is not being pushed down for Parquet files
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nasir Ali updated SPARK-30162: -- Description: Filters are not pushed down in Spark 3.0 preview. Also the output of "explain" method is different. It is hard to debug in 3.0 whether filters were pushed down or not. Below code could reproduce the bug: {code:java} // code placeholder df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), ("usr1",13.00, "2018-03-11T12:27:18+00:00"), ("usr1",25.00, "2018-03-12T11:27:18+00:00"), ("usr1",20.00, "2018-03-13T15:27:18+00:00"), ("usr1",17.00, "2018-03-14T12:27:18+00:00"), ("usr2",99.00, "2018-03-15T11:27:18+00:00"), ("usr2",156.00, "2018-03-22T11:27:18+00:00"), ("usr2",17.00, "2018-03-31T11:27:18+00:00"), ("usr2",25.00, "2018-03-15T11:27:18+00:00"), ("usr2",25.00, "2018-03-16T11:27:18+00:00") ], ["user","id", "ts"]) df = df.withColumn('ts', df.ts.cast('timestamp')) df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) {code} {code:java} // Spark 2.4 output == Parsed Logical Plan == 'Filter ('user = usr2) +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == id: double, ts: timestamp, user: string Filter (user#40 = usr2) +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == Filter (isnotnull(user#40) && (user#40 = usr2)) +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], ReadSchema: struct{code} {code:java} // Spark 3.0.0-preview output == Parsed Logical Plan == 'Filter ('user = usr2) +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed Logical Plan == id: double, ts: timestamp, user: string Filter (user#2 = usr2) +- 
RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized Logical Plan == Filter (isnotnull(user#2) AND (user#2 = usr2)) +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical Plan == *(1) Project [id#0, ts#1, user#2] +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) +- *(1) ColumnarToRow +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: InMemoryFileIndex[file:/home/cnali/data], ReadSchema: struct {code} I have tested it on much larger dataset. Spark 3.0 tries to load whole data and then apply filter. Whereas Spark 2.4 push down the filter. Above output shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't work. {code:java} // pyspark 3 shell output $ pyspark Python 3.6.8 (default, Aug 7 2019, 17:28:10) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux Type "help", "copyright", "credits" or "license" for more information. Warning: Ignoring non-spark config property: java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.0.0-preview /_/Using Python version 3.6.8 (default, Aug 7 2019 17:28:10) SparkSession available as 'spark'. 
{code} {code:java} // pyspark 2.4.4 shell output pyspark Python 3.6.8 (default, Aug 7 2019, 17:28:10) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux Type "help", "copyright", "credits" or "license" for more information. 2019-12-09 07:09:07 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ /
[jira] [Resolved] (SPARK-30145) sparkContext.addJar fails when file path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-30145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30145. -- Resolution: Duplicate > sparkContext.addJar fails when file path contains spaces > > > Key: SPARK-30145 > URL: https://issues.apache.org/jira/browse/SPARK-30145 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30072) Create dedicated planner for subqueries
[ https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991554#comment-16991554 ] Xiaoju Wu commented on SPARK-30072: --- [~afroozeh] I think the change from checking if queryExecution equals to checking the isSubquery boolean is not correct in some scenarios. If a subquery is executed under SubqueryExec.executionContext, there's no way to pass the information to tell the InsertAdaptiveSparkPlan rule that it is a subquery. > Create dedicated planner for subqueries > --- > > Key: SPARK-30072 > URL: https://issues.apache.org/jira/browse/SPARK-30072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ali Afroozeh >Assignee: Ali Afroozeh >Priority: Minor > Fix For: 3.0.0 > > > This PR changes subquery planning by calling the planner and plan preparation > rules on the subquery plan directly. Before, we were creating a QueryExecution > instance for subqueries to get the executedPlan. This would re-run analysis > and optimization on the subquery's plan. Running the analysis again on an > optimized query plan can have unwanted consequences, as some rules, for > example DecimalPrecision, are not idempotent. > As an example, consider the expression 1.7 * avg(a) which after applying the > DecimalPrecision rule becomes: > promote_precision(1.7) * promote_precision(avg(a)) > After the optimization, more specifically the constant folding rule, this > expression becomes: > 1.7 * promote_precision(avg(a)) > Now if we run the analyzer on this optimized query again, we will get: > promote_precision(1.7) * promote_precision(promote_precision(avg(a))) > Which will later be optimized as: > 1.7 * promote_precision(promote_precision(avg(a))) > As can be seen, re-running the analysis and optimization on this expression > results in an expression with extra nested promote_precision nodes. 
Adding > unneeded nodes to the plan is problematic because it can eliminate situations > where we can reuse the plan. > We opted to introduce dedicated planners for subqueries, instead of making > the DecimalPrecision rule idempotent, because this eliminates this entire > category of problems. Another benefit is that planning time for subqueries is > reduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
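The non-idempotence described above is easy to reproduce in miniature: a rewrite rule that unconditionally wraps an expression gains an extra wrapper on every pass. A toy, string-based Python sketch (purely illustrative, not Spark's actual rule):

```python
def promote_precision_rule(expr: str) -> str:
    # Non-idempotent toy rule: wraps on every application, even if the
    # expression is already wrapped -- just as re-running analysis does
    # after constant folding has stripped the outer wrapper elsewhere.
    return f"promote_precision({expr})"

once = promote_precision_rule("avg(a)")
twice = promote_precision_rule(once)
print(once)   # promote_precision(avg(a))
print(twice)  # promote_precision(promote_precision(avg(a)))
assert twice != once  # each re-analysis adds a nested node
```

An idempotent version would first check whether the expression is already wrapped and return it unchanged; the ticket instead avoids re-running analysis on subqueries at all, which removes the whole class of such problems.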
[jira] [Created] (SPARK-30189) Interval from year-month/date-time string handling whitespaces
Kent Yao created SPARK-30189: Summary: Interval from year-month/date-time string handling whitespaces Key: SPARK-30189 URL: https://issues.apache.org/jira/browse/SPARK-30189 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao # for pg feature parity # for consistency with other types and other interval parser {code:sql} postgres=# select interval E'2-2\t' year to month; interval 2 years 2 mons (1 row) postgres=# select interval E'2-2\t' year to month; interval 2 years 2 mons (1 row) postgres=# select interval E'2-\t2\t' year to month; ERROR: invalid input syntax for type interval: "2- 2 " LINE 1: select interval E'2-\t2\t' year to month; ^ postgres=# select interval '2 00:00:01' day to second; interval - 2 days 00:00:01 (1 row) postgres=# select interval '- 2 00:00:01' day to second; interval --- -2 days +00:00:01 (1 row) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
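The PostgreSQL behaviour quoted above (leading/trailing whitespace tolerated, whitespace inside the value rejected) can be sketched with a strict regex over a trimmed input. A hypothetical helper, not Spark's actual interval parser:

```python
import re

# Year-month interval like "2-2": accept surrounding whitespace only;
# whitespace inside the value (e.g. "2-\t2") stays a parse error,
# matching the pg session quoted above.
_YEAR_MONTH = re.compile(r"^([+-]?\d+)-(\d+)$")

def parse_year_month(s: str):
    m = _YEAR_MONTH.match(s.strip())
    if not m:
        raise ValueError(f"invalid input syntax for type interval: {s!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_year_month("2-2\t"))  # (2, 2) -- trailing tab is fine
# parse_year_month("2-\t2\t") raises ValueError, like pg's error above
```

The same pattern (trim first, then match strictly) generalizes to the day-time form; the point is only that trimming happens once at the boundary rather than between fields.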
[jira] [Created] (SPARK-30188) Enable Adaptive Query Execution default
Ke Jia created SPARK-30188: -- Summary: Enable Adaptive Query Execution default Key: SPARK-30188 URL: https://issues.apache.org/jira/browse/SPARK-30188 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia Enable Adaptive Query Execution by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30187) NULL handling in PySpark-PandasUDF
Abdeali Kothari created SPARK-30187: --- Summary: NULL handling in PySpark-PandasUDF Key: SPARK-30187 URL: https://issues.apache.org/jira/browse/SPARK-30187 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Abdeali Kothari The new Pandas UDF has been very helpful in simplifying the writing of UDFs and making them more performant. But I cannot reliably use it because of its different NULL-value handling compared to normal Spark. Here is my understanding ... In Spark, nulls/missing are as follows: * Float: null/NaN * Integer: null * String: null * DateTime: null In pandas, null/missing are as follows: * Float: NaN * Integer: (no missing-value representation; columns with nulls are upcast to float) * String: null * DateTime: NaT When I use Spark with a Pandas UDF, it looks to me like there is information loss, as I am unable to differentiate between null and NaN anymore, which I could do when I was using the older Python UDF {code:java} >>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG) ... def pd_sum(x): ... return x.sum() >>> sdf = spark.createDataFrame( [ [1.0, 2.0], [None, 3.0], [float('nan'), 4.0] ], ['a', 'b']) >>> sdf.agg(pd_sum(sdf['a'])).collect() [Row(pd_sum(a)=1.0)] >>> sdf.select(F.sum(sdf['a'])).collect() [Row(sum(a)=nan)] {code} If I use an integer with NULL values -> the Pandas UDF actually gets a float type: {code:java} >>> sdf = spark.createDataFrame([ [1, 2.0], [None, 3.0] ], ['a', 'b']) >>> sdf.dtypes [('a', 'bigint'), ('b', 'double')] >>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG) ... def pd_sum(x): ... print(x) ... return x.sum() >>> sdf.agg(pd_sum(sdf['a'])).collect() 0 1.0 1 NaN Name: _0, dtype: float64 float64 [Row(pd_sum(a)=1)] {code} I'm not sure whether this is something Spark should handle, but I wanted to understand whether there is a plan to manage this? 
Because from what I understand, if someone wants to use pandas DataFrames as of now, they need to make some assumptions like: * The entire range of BigInteger will not work, because it gets converted to float (if null values are present) * The float type should have either NaN or NULL - not both -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
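The information loss described above comes down to float storage: once values are coerced into a float array (as pandas does for a float column), both null and NaN become NaN. A stdlib-only Python sketch of that coercion:

```python
import math

# What Spark can represent in a float column: null and NaN are distinct.
spark_column = [1.0, None, float("nan")]

# What a float-typed pandas Series effectively stores: None becomes NaN.
as_floats = [float("nan") if v is None else v for v in spark_column]

nulls_before = sum(v is None for v in spark_column)   # 1 null (plus 1 NaN)
nans_after = sum(math.isnan(v) for v in as_floats)    # both are NaN now
print(nulls_before, nans_after)  # 1 2 -- null vs NaN is no longer recoverable
```

So by the time the Pandas UDF sees the data, the null/NaN distinction is already gone; any fix would have to carry the null mask alongside the values.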
[jira] [Created] (SPARK-30186) support Dynamic Partition Pruning in Adaptive Execution
Xiaoju Wu created SPARK-30186: - Summary: support Dynamic Partition Pruning in Adaptive Execution Key: SPARK-30186 URL: https://issues.apache.org/jira/browse/SPARK-30186 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xiaoju Wu Fix For: 3.0.0 Currently Adaptive Execution cannot work if Dynamic Partition Pruning is applied:
{code:java}
private def supportAdaptive(plan: SparkPlan): Boolean = {
  // TODO migrate dynamic-partition-pruning onto adaptive execution.
  sanityCheck(plan) &&
    !plan.logicalLink.exists(_.isStreaming) &&
    !plan.expressions.exists(_.find(_.isInstanceOf[DynamicPruningSubquery]).isDefined) &&
    plan.children.forall(supportAdaptive)
}
{code}
It means we cannot benefit from the performance gains of both AE and DPP. This ticket targets making DPP + AE work together. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30180) listJars() and listFiles() function display issue when file path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-30180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh updated SPARK-30180: Summary: listJars() and listFiles() function display issue when file path contains spaces (was: listJars() function display issue.) > listJars() and listFiles() function display issue when file path contains > spaces > > > Key: SPARK-30180 > URL: https://issues.apache.org/jira/browse/SPARK-30180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor > > > {{scala> sc.listJars() > res2: Seq[String] = Vector(spark://11.242.181.153:50811/jars/c6%20test.jar)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991520#comment-16991520 ] Hyukjin Kwon commented on SPARK-26433: -- Looking back this JIRA, I realised that I underestimated it given multiple requests and other systems. I re-created a JIRA and made a PR. > Tail method for spark DataFrame > --- > > Key: SPARK-26433 > URL: https://issues.apache.org/jira/browse/SPARK-26433 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jan Gorecki >Priority: Major > > There is a head method for spark dataframes which work fine but there doesn't > seems to be tail method. > ``` > >>> ans > >>> > DataFrame[v1: bigint] > > >>> ans.head(3) > >>> > [Row(v1=299443), Row(v1=299493), Row(v1=300751)] > >>> ans.tail(3) > Traceback (most recent call last): > File "", line 1, in > File > "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py > spark/sql/dataframe.py", line 1300, in __getattr__ > "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) > AttributeError: 'DataFrame' object has no attribute 'tail' > ``` > I would like to feature request Tail method for spark dataframe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-26433: -- > Tail method for spark DataFrame > --- > > Key: SPARK-26433 > URL: https://issues.apache.org/jira/browse/SPARK-26433 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jan Gorecki >Priority: Major > > There is a head method for spark dataframes which work fine but there doesn't > seems to be tail method. > ``` > >>> ans > >>> > DataFrame[v1: bigint] > > >>> ans.head(3) > >>> > [Row(v1=299443), Row(v1=299493), Row(v1=300751)] > >>> ans.tail(3) > Traceback (most recent call last): > File "", line 1, in > File > "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py > spark/sql/dataframe.py", line 1300, in __getattr__ > "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) > AttributeError: 'DataFrame' object has no attribute 'tail' > ``` > I would like to feature request Tail method for spark dataframe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26433) Tail method for spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26433. -- Resolution: Duplicate > Tail method for spark DataFrame > --- > > Key: SPARK-26433 > URL: https://issues.apache.org/jira/browse/SPARK-26433 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jan Gorecki >Priority: Major > > There is a head method for spark dataframes which work fine but there doesn't > seems to be tail method. > ``` > >>> ans > >>> > DataFrame[v1: bigint] > > >>> ans.head(3) > >>> > [Row(v1=299443), Row(v1=299493), Row(v1=300751)] > >>> ans.tail(3) > Traceback (most recent call last): > File "", line 1, in > File > "/home/jan/git/db-benchmark/spark/py-spark/lib/python3.6/site-packages/py > spark/sql/dataframe.py", line 1300, in __getattr__ > "'%s' object has no attribute '%s'" % (self.__class__.__name__, name)) > AttributeError: 'DataFrame' object has no attribute 'tail' > ``` > I would like to feature request Tail method for spark dataframe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30138) Separate configuration key of max iterations for analyzer and optimizer
[ https://issues.apache.org/jira/browse/SPARK-30138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30138. -- Fix Version/s: 3.0.0 Assignee: Hu Fuwang Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26766 > Separate configuration key of max iterations for analyzer and optimizer > --- > > Key: SPARK-30138 > URL: https://issues.apache.org/jira/browse/SPARK-30138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Assignee: Hu Fuwang >Priority: Major > Fix For: 3.0.0 > > > Currently, both Analyzer and Optimizer use the conf > "spark.sql.optimizer.maxIterations" to set the max iterations to run, which > is a little confusing. > It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for the > analyzer's max iterations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30185) Implement Dataset.tail API
[ https://issues.apache.org/jira/browse/SPARK-30185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30185: - Description: I would like to propose an API called DataFrame.tail. *Background & Motivation* Many other systems support the way to take data from the end, for instance, pandas[1] and Python[2][3]. Scala collections APIs also have head and tail On the other hand, in Spark, we only provide a way to take data from the start (e.g., DataFrame.head). This has been requested multiple times here and there in Spark user mailing list[4], StackOverFlow[5][6], JIRA[7] and other third party projects such as Koalas[8]. It seems we're missing non-trivial use case in Spark and this motivated me to propose this API. *Proposal* I would like to propose an API against DataFrame called tail that collects rows from the end in contrast with head. Namely, as below: {code:java} scala> spark.range(10).head(5) res1: Array[Long] = Array(0, 1, 2, 3, 4) scala> spark.range(10).tail(5) res2: Array[Long] = Array(5, 6, 7, 8, 9){code} Implementation details will be similar with head but it will be reversed: Run the job against the last partition and collect rows. If this is enough, return as is. If this is not enough, calculate the number of partitions to select more based upon ‘spark.sql.limit.scaleUpFactor’ Run more jobs against more partitions (in a reversed order compared to head) as many as the number calculated from 2. Go to 2. 
[1] [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail] [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line] [3] [https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python] [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html] [5] [https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index] [6] [https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe] [7] https://issues.apache.org/jira/browse/SPARK-26433 [8] [https://github.com/databricks/koalas/issues/343] was: I would like to propose an API called DataFrame.tail. *Background & Motivation* Many other systems support the way to take data from the end, for instance, pandas[1] and Python[2][3]. Scala collections APIs also have head and tail On the other hand, in Spark, we only provide a way to take data from the start (e.g., DataFrame.head). This has been requested multiple times here and there in Spark user mailing list[4], StackOverFlow[5][6], JIRA[7] and other third party projects such as Koalas[8]. It seems we're missing non-trivial use case in Spark and this motivated me to propose this API. *Proposal* I would like to propose an API against DataFrame called tail that collects rows from the end in contrast with head. Namely, as below: scala> spark.range(10).head(5) res1: Array[Long] = Array(0, 1, 2, 3, 4) scala> spark.range(10).tail(5) res2: Array[Long] = Array(5, 6, 7, 8, 9) Implementation details will be similar with head but it will be reversed: Run the job against the last partition and collect rows. If this is enough, return as is. If this is not enough, calculate the number of partitions to select more based upon ‘spark.sql.limit.scaleUpFactor’ Run more jobs against more partitions (in a reversed order compared to head) as many as the number calculated from 2. Go to 2. 
[1] [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail] [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line] [3] [https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python] [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html] [5] [https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index] [6] [https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe] [7] https://issues.apache.org/jira/browse/SPARK-26433 [8] [https://github.com/databricks/koalas/issues/343] > Implement Dataset.tail API > -- > > Key: SPARK-30185 > URL: https://issues.apache.org/jira/browse/SPARK-30185 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > I would like to propose an API called DataFrame.tail. > *Background & Motivation* > Many other systems support the way to take data from the end, for instance, > pandas[1] and > Python[2][3]. Scala collections APIs also have head and tail > On the other hand, in Spark, we only provide a way to
[jira] [Created] (SPARK-30185) Implement Dataset.tail API
Hyukjin Kwon created SPARK-30185: Summary: Implement Dataset.tail API Key: SPARK-30185 URL: https://issues.apache.org/jira/browse/SPARK-30185 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon

I would like to propose an API called DataFrame.tail.

*Background & Motivation*

Many other systems support taking data from the end, for instance, pandas[1] and Python[2][3]. Scala collections APIs also have head and tail. In Spark, on the other hand, we only provide a way to take data from the start (e.g., DataFrame.head). This has been requested multiple times in the Spark user mailing list[4], Stack Overflow[5][6], JIRA[7] and third-party projects such as Koalas[8]. It seems we're missing a non-trivial use case in Spark, and this motivated me to propose this API.

*Proposal*

I would like to propose an API on DataFrame called tail that collects rows from the end, in contrast with head. Namely, as below:

scala> spark.range(10).head(5)
res1: Array[Long] = Array(0, 1, 2, 3, 4)

scala> spark.range(10).tail(5)
res2: Array[Long] = Array(5, 6, 7, 8, 9)

The implementation will be similar to head's, but reversed:

1. Run the job against the last partition and collect rows. If this is enough, return as-is.
2. If this is not enough, calculate the number of additional partitions to select, based upon 'spark.sql.limit.scaleUpFactor'.
3. Run more jobs against more partitions (in reversed order compared to head), as many as the number calculated in step 2.
4. Go to step 2.
[1] [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail] [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line] [3] [https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python] [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html] [5] [https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index] [6] [https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe] [7] https://issues.apache.org/jira/browse/SPARK-26433 [8] [https://github.com/databricks/koalas/issues/343] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
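The scale-up loop described in the proposal can be simulated in a few lines. The following is a hedged Python sketch over in-memory lists standing in for partitions — not Spark's actual implementation — where `scale_up_factor` plays the role of 'spark.sql.limit.scaleUpFactor' and an initial batch of one partition is assumed:

```python
def tail(partitions, n, scale_up_factor=4):
    """Collect the last n rows by scanning partitions from the end,
    growing the number of partitions scanned each round (a simulation
    of the proposed Dataset.tail, mirroring how head/limit scales up)."""
    collected = []       # rows gathered so far, kept in original order
    num_to_scan = 1      # round 1: just the last partition
    scanned = 0
    while scanned < len(partitions) and len(collected) < n:
        # take the next batch of partitions, working backwards
        start = max(0, len(partitions) - scanned - num_to_scan)
        stop = len(partitions) - scanned
        batch = [row for part in partitions[start:stop] for row in part]
        collected = batch + collected
        scanned += stop - start
        num_to_scan *= scale_up_factor  # scale up for the next round
    return collected[-n:]

# like spark.range(10).tail(5) in the proposal:
rows = tail([[0, 1, 2], [3, 4], [5, 6, 7], [8, 9]], 5)  # [5, 6, 7, 8, 9]
```

The key property, as in head, is that the first round touches only one partition; a full scan happens only when the requested row count forces it.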
[jira] [Updated] (SPARK-30184) Implement a helper method for aliasing functions
[ https://issues.apache.org/jira/browse/SPARK-30184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-30184: -- Summary: Implement a helper method for aliasing functions (was: Implement a helper method for aliasing ) > Implement a helper method for aliasing functions > > > Key: SPARK-30184 > URL: https://issues.apache.org/jira/browse/SPARK-30184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aman Omer >Priority: Major > > In PR https://github.com/apache/spark/pull/26712 , we introduced an > expressionWithAlias function in FunctionRegistry.scala that forwards the > function name submitted at runtime. > This Jira is to use expressionWithAlias for the remaining functions for > which an alias name can be used. The remaining functions are: > Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp
[jira] [Created] (SPARK-30184) Implement a helper method for aliasing
Aman Omer created SPARK-30184: - Summary: Implement a helper method for aliasing Key: SPARK-30184 URL: https://issues.apache.org/jira/browse/SPARK-30184 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Aman Omer In PR https://github.com/apache/spark/pull/26712 , we introduced an expressionWithAlias function in FunctionRegistry.scala that forwards the function name submitted at runtime. This Jira is to use expressionWithAlias for the remaining functions for which an alias name can be used. The remaining functions are: Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp
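The idea behind such a helper can be illustrated with a rough Python analogue (hypothetical names throughout — the real expressionWithAlias is Scala code in FunctionRegistry.scala): one implementation is registered under several names, and the expression it builds remembers which name it was invoked with, so the output column can be aliased accordingly (e.g. avg(x) vs mean(x)).

```python
# Hypothetical sketch: a registry where one builder serves several
# function names, and each built "expression" records the name that
# was actually used at call time (for aliasing the result column).
registry = {}

def register_with_alias(builder, *names):
    for name in names:
        # bind `name` per iteration so each entry forwards its own name
        registry[name] = (lambda n: (lambda *args: builder(n, *args)))(name)

def make_avg(invoked_name, column):
    # the invoked name, not the canonical one, becomes the alias
    return {"op": "avg", "arg": column, "alias": f"{invoked_name}({column})"}

register_with_alias(make_avg, "avg", "mean")

expr = registry["mean"]("salary")  # alias is "mean(salary)"
```

The design point is that without forwarding the invoked name, every alias of a function would display under its canonical name.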
[jira] [Issue Comment Deleted] (SPARK-30176) Eliminate warnings: part 6
[ https://issues.apache.org/jira/browse/SPARK-30176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-30176: --- Comment: was deleted (was: i will work on this.) > Eliminate warnings: part 6 > -- > > Key: SPARK-30176 > URL: https://issues.apache.org/jira/browse/SPARK-30176 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > > sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala > {code:java} > Warning:Warning:line (32)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (91)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (100)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (109)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (118)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > {code} > sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala > {code:java} > Warning:Warning:line (242)object typed in package scalalang is deprecated > (since 3.0.0): please use untyped builtin aggregate functions. > df.as[Data].select(typed.sumLong((d: Data) => > d.l)).queryExecution.toRdd.foreach(_ => ()) > {code} > sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala > {code:java} > Warning:Warning:line (714)method from_utc_timestamp in object functions is > deprecated (since 3.0.0): This function is deprecated and will be removed in > future versions. 
> df.select(from_utc_timestamp(col("a"), "PST")), > Warning:Warning:line (719)method from_utc_timestamp in object functions > is deprecated (since 3.0.0): This function is deprecated and will be removed > in future versions. > df.select(from_utc_timestamp(col("b"), "PST")), > Warning:Warning:line (725)method from_utc_timestamp in object functions > is deprecated (since 3.0.0): This function is deprecated and will be removed > in future versions. > df.select(from_utc_timestamp(col("a"), "PST")).collect() > Warning:Warning:line (737)method from_utc_timestamp in object functions > is deprecated (since 3.0.0): This function is deprecated and will be removed > in future versions. > df.select(from_utc_timestamp(col("a"), col("c"))), > Warning:Warning:line (742)method from_utc_timestamp in object functions > is deprecated (since 3.0.0): This function is deprecated and will be removed > in future versions. > df.select(from_utc_timestamp(col("b"), col("c"))), > Warning:Warning:line (756)method to_utc_timestamp in object functions is > deprecated (since 3.0.0): This function is deprecated and will be removed in > future versions. > df.select(to_utc_timestamp(col("a"), "PST")), > Warning:Warning:line (761)method to_utc_timestamp in object functions is > deprecated (since 3.0.0): This function is deprecated and will be removed in > future versions. > df.select(to_utc_timestamp(col("b"), "PST")), > Warning:Warning:line (767)method to_utc_timestamp in object functions is > deprecated (since 3.0.0): This function is deprecated and will be removed in > future versions. > df.select(to_utc_timestamp(col("a"), "PST")).collect() > Warning:Warning:line (779)method to_utc_timestamp in object functions is > deprecated (since 3.0.0): This function is deprecated and will be removed in > future versions. 
> df.select(to_utc_timestamp(col("a"), col("c"))), > Warning:Warning:line (784)method to_utc_timestamp in object functions is > deprecated (since 3.0.0): This function is deprecated and will be removed in > future versions. > df.select(to_utc_timestamp(col("b"), col("c"))), > {code} > sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala > {code:java} > Warning:Warning:line (241)method merge in object Row is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. > testData.rdd.flatMap(row => Seq.fill(16)(Row.merge(row, > row))).collect().toSeq) > {code} > sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala > {code:java} > Warning:Warning:line (787)method merge in object Row is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. > row => Seq.fill(16)(Row.merge(row, row))).collect().toSeq) > {code} > >
[jira] [Issue Comment Deleted] (SPARK-30175) Eliminate warnings: part 5
[ https://issues.apache.org/jira/browse/SPARK-30175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-30175: -- Comment: was deleted (was: thanks for raising, will raise PR soon) > Eliminate warnings: part 5 > -- > > Key: SPARK-30175 > URL: https://issues.apache.org/jira/browse/SPARK-30175 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/sources/WriteToMicroBatchDataSource.scala > {code:java} > Warning:Warning:line (36)class WriteToDataSourceV2 in package v2 is > deprecated (since 2.4.0): Use specific logical plans like AppendData instead > def createPlan(batchId: Long): WriteToDataSourceV2 = { > Warning:Warning:line (37)class WriteToDataSourceV2 in package v2 is > deprecated (since 2.4.0): Use specific logical plans like AppendData instead > WriteToDataSourceV2(new MicroBatchWrite(batchId, write), query) > {code} > sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala > {code:java} > Warning:Warning:line (703)a pure expression does nothing in statement > position; multiline expressions might require enclosing parentheses > q1 > {code} > sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala > {code:java} > Warning:Warning:line (285)object typed in package scalalang is deprecated > (since 3.0.0): please use untyped builtin aggregate functions. > val aggregated = > inputData.toDS().groupByKey(_._1).agg(typed.sumLong(_._2)) > {code}
[jira] [Commented] (SPARK-30108) Add robust accumulator for observable metrics
[ https://issues.apache.org/jira/browse/SPARK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991381#comment-16991381 ] Herman van Hövell commented on SPARK-30108: --- [~Ankitraj] Great to hear! Here are some pointers. The goal of this ticket is to make the current AggregatingAccumulator add the result of a partition only once. For this we need to do a couple of things: - We need to add a tracking mechanism to the AggregatingAccumulator that keeps track of the partitions that have been processed. A (compressible) bit map should be enough. - We need a way to tell the accumulator's merge method which partition it is processing. One way of doing this would be to make a small change in the way the DAGScheduler interacts with the accumulator, and also pass Task/Partition info when calling update. A less intrusive way would be to pass back partition info with the to-be-merged accumulator. > Add robust accumulator for observable metrics > - > > Key: SPARK-30108 > URL: https://issues.apache.org/jira/browse/SPARK-30108 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Herman van Hövell >Priority: Major >
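The core of the pointers above — merge each partition's partial result at most once by tracking processed partition ids — can be sketched as follows. This is a hedged Python simulation with a set standing in for the (compressible) bit map, not the actual AggregatingAccumulator or DAGScheduler API:

```python
class RobustSumAccumulator:
    """Toy accumulator that adds each partition's partial result at
    most once, by remembering which partition ids it has merged."""

    def __init__(self):
        self.value = 0
        self.seen = set()   # partition ids already merged (the "bit map")

    def merge(self, partition_id, partial):
        # duplicate delivery (e.g. a speculative or retried task): skip
        if partition_id in self.seen:
            return
        self.seen.add(partition_id)
        self.value += partial

acc = RobustSumAccumulator()
acc.merge(0, 10)
acc.merge(1, 5)
acc.merge(0, 10)   # partition 0 retried: ignored, value stays 15
```

This corresponds to the "less intrusive" option: the partition id travels with the to-be-merged partial result rather than through a scheduler API change.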
[jira] [Created] (SPARK-30183) Disallow to specify reserved properties in CREATE NAMESPACE syntax
Kent Yao created SPARK-30183: Summary: Disallow to specify reserved properties in CREATE NAMESPACE syntax Key: SPARK-30183 URL: https://issues.apache.org/jira/browse/SPARK-30183 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Currently, COMMENT and LOCATION are reserved properties for datasource v2 namespaces. They should be set via the dedicated COMMENT/LOCATION clauses instead of the property list.
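The proposed check amounts to rejecting reserved keys in the CREATE NAMESPACE property list. A hypothetical Python sketch (the constant name and error text are illustrative; only the two reserved keys come from the ticket):

```python
# Reserved keys from the ticket; the validation logic is a sketch.
RESERVED_NAMESPACE_PROPERTIES = {"comment", "location"}

def validate_namespace_properties(properties):
    """Reject reserved keys so they must be set via the dedicated
    COMMENT / LOCATION clauses rather than the property list."""
    reserved = RESERVED_NAMESPACE_PROPERTIES & {k.lower() for k in properties}
    if reserved:
        raise ValueError(
            f"Reserved properties {sorted(reserved)} must be set via "
            "their dedicated clauses, not in the property list")
    return properties
```

Matching case-insensitively mirrors the usual treatment of SQL keywords; whether Spark normalizes property keys this way is an assumption here.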
[jira] [Created] (SPARK-30182) Support nested aggregates
jiaan.geng created SPARK-30182: -- Summary: Support nested aggregates Key: SPARK-30182 URL: https://issues.apache.org/jira/browse/SPARK-30182 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jiaan.geng

Spark SQL does not support SQL with a nested aggregate like the one below:
{code:java}
SELECT sum(salary),
       row_number() OVER (ORDER BY depname),
       sum(sum(salary) FILTER (WHERE enroll_date > '2007-01-01'))
         FILTER (WHERE depname <> 'sales')
         OVER (ORDER BY depname DESC) AS "filtered_sum",
       depname
FROM empsalary
GROUP BY depname;{code}
Spark throws an exception as follows:
{code:java}
org.apache.spark.sql.AnalysisException
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.{code}
But PostgreSQL supports this syntax:
{code:java}
SELECT sum(salary),
       row_number() OVER (ORDER BY depname),
       sum(sum(salary) FILTER (WHERE enroll_date > '2007-01-01'))
         FILTER (WHERE depname <> 'sales')
         OVER (ORDER BY depname DESC) AS "filtered_sum",
       depname
FROM empsalary
GROUP BY depname;

  sum  | row_number | filtered_sum |  depname
-------+------------+--------------+-----------
 25100 |          1 |        22600 | develop
  7400 |          2 |         3500 | personnel
 14600 |          3 |              | sales
(3 rows){code}
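The semantics behind such a query decompose into two passes, which is also what Spark's suggested sub-query workaround expresses: first compute the per-group inner aggregates, then apply the window function over the resulting group rows. A Python simulation with toy data (the rows below are illustrative, not the empsalary table, and the first window frame yielding 0 rather than SQL NULL is a simplification):

```python
from itertools import groupby

rows = [  # toy empsalary-like data: (depname, salary, enrolled_after_cutoff)
    ("develop", 100, True), ("develop", 50, False),
    ("personnel", 30, True), ("sales", 70, False),
]

# Pass 1: inner aggregates per group (the part Spark asks to move
# into a sub-query).
groups = []
for dep, grp in groupby(sorted(rows), key=lambda r: r[0]):
    grp = list(grp)
    total = sum(r[1] for r in grp)
    inner = sum(r[1] for r in grp if r[2])  # FILTER (WHERE enroll_date > ...)
    groups.append((dep, total, inner))

# Pass 2: the outer window over group rows:
# sum(<inner sum>) FILTER (WHERE depname <> 'sales')
#   OVER (ORDER BY depname DESC)
filtered_sum = {}
running = 0
for dep, total, inner in sorted(groups, key=lambda g: g[0], reverse=True):
    if dep != "sales":        # outer FILTER excludes sales rows
        running += inner
    filtered_sum[dep] = running
```

Nothing here requires evaluating an aggregate inside an aggregate in one pass, which is why the two-level rewrite works.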
[jira] [Created] (SPARK-30181) Throws runtime exception when filter metastore partition key that's not string type or integral types
Yu-Jhe Li created SPARK-30181: - Summary: Throws runtime exception when filter metastore partition key that's not string type or integral types Key: SPARK-30181 URL: https://issues.apache.org/jira/browse/SPARK-30181 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0 Reporter: Yu-Jhe Li The SQL below throws a runtime exception since Spark 2.4.0. I think it's a bug introduced by SPARK-22384:
{code:scala}
spark.sql("CREATE TABLE timestamp_part (value INT) PARTITIONED BY (dt TIMESTAMP)")
val df = Seq(
  (1, java.sql.Timestamp.valueOf("2019-12-01 00:00:00"), 1),
  (2, java.sql.Timestamp.valueOf("2019-12-01 01:00:00"), 1)
).toDF("id", "dt", "value")
df.write.partitionBy("dt").mode("overwrite").saveAsTable("timestamp_part")
spark.sql("select * from timestamp_part where dt >= '2019-12-01 00:00:00'").explain(true)
{code}
{noformat}
Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance.
Please report a bug: https://issues.apache.org/jira/browse/SPARK at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:774) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677) at org.apache.spark.sql.hive.client.HiveClientSuite.testMetastorePartitionFiltering(HiveClientSuite.scala:310) at org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$testMetastorePartitionFiltering(HiveClientSuite.scala:282) at org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply$mcV$sp(HiveClientSuite.scala:105) at org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105) at org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite$class.run(Suite.scala:1147) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) at