[jira] [Commented] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests
[ https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434668#comment-17434668 ] Apache Spark commented on SPARK-37031: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/34399 > Unify v1 and v2 DESCRIBE NAMESPACE tests > > > Key: SPARK-37031 > URL: https://issues.apache.org/jira/browse/SPARK-37031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.3.0 > > > Extract DESCRIBE NAMESPACE tests to the common place to run them for V1 and > v2 datasources. Some tests can be places to V1 and V2 specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests
[ https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434667#comment-17434667 ] Apache Spark commented on SPARK-37031: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/34399 > Unify v1 and v2 DESCRIBE NAMESPACE tests > > > Key: SPARK-37031 > URL: https://issues.apache.org/jira/browse/SPARK-37031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.3.0 > > > Extract DESCRIBE NAMESPACE tests to the common place to run them for V1 and > v2 datasources. Some tests can be places to V1 and V2 specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests
[ https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37031. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34305 [https://github.com/apache/spark/pull/34305] > Unify v1 and v2 DESCRIBE NAMESPACE tests > > > Key: SPARK-37031 > URL: https://issues.apache.org/jira/browse/SPARK-37031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.3.0 > > > Extract DESCRIBE NAMESPACE tests to a common place so they run for both v1 and > v2 data sources. Some tests can be placed in v1- and v2-specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37031) Unify v1 and v2 DESCRIBE NAMESPACE tests
[ https://issues.apache.org/jira/browse/SPARK-37031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37031: --- Assignee: Terry Kim > Unify v1 and v2 DESCRIBE NAMESPACE tests > > > Key: SPARK-37031 > URL: https://issues.apache.org/jira/browse/SPARK-37031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Extract DESCRIBE NAMESPACE tests to the common place to run them for V1 and > v2 datasources. Some tests can be places to V1 and V2 specific test suites. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
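For reference, a minimal PySpark sketch of the command whose v1/v2 test coverage is being unified. The namespace name is hypothetical; the unified suite exercises equivalent statements against both the session (v1) catalog and v2 catalogs.

{code:python}
# Illustrative only: DESCRIBE NAMESPACE against the default (v1) session catalog.
# The unified test suite runs the same kind of checks for v2 catalogs as well.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo_ns COMMENT 'demo namespace'")
spark.sql("DESCRIBE NAMESPACE EXTENDED demo_ns").show(truncate=False)
spark.sql("DROP NAMESPACE demo_ns")
{code}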
[jira] [Created] (SPARK-37127) Support non-literal frame bound value for window functions
Kernel Force created SPARK-37127: Summary: Support non-literal frame bound value for window functions Key: SPARK-37127 URL: https://issues.apache.org/jira/browse/SPARK-37127 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.3 Environment: Spark-3.0.3 Reporter: Kernel Force {code:sql} sql(""" with va as ( select 15 a, 100 b union all select 15, 120 union all select 15, 130 union all select 15, 150 ) select t.*, min(t.b) over(partition by t.a order by t.b range between 0.15*t.b preceding and current row) c from va t """).show {code} throws {code:java} org.apache.spark.sql.catalyst.parser.ParseException: Frame bound value must be a literal.(line 12, pos 65) {code} The non-literal expression *0.15*t.b* is what triggers this exception. However, non-literal frame bound values are already supported by Oracle and would be very useful here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
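Until non-literal frame bounds are supported, one possible workaround is to express the sliding range as a non-equi self-join plus an aggregate. The sketch below reuses the names from the report; the rewrite itself is an assumption, not something this ticket provides.

{code:python}
# Workaround sketch (assumed, not part of this ticket): emulate
#   min(b) over (partition by a order by b
#                range between 0.15*b preceding and current row)
# with a non-equi self-join and GROUP BY.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    WITH va AS (
      SELECT 15 a, 100 b UNION ALL
      SELECT 15, 120 UNION ALL
      SELECT 15, 130 UNION ALL
      SELECT 15, 150
    )
    SELECT t.a, t.b, MIN(s.b) AS c
    FROM va t
    JOIN va s
      ON s.a = t.a
     AND s.b BETWEEN t.b - 0.15 * t.b AND t.b
    GROUP BY t.a, t.b
""").show()
{code}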
[jira] [Commented] (SPARK-37125) Support AnsiInterval radix sort
[ https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434654#comment-17434654 ] Apache Spark commented on SPARK-37125: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34398 > Support AnsiInterval radix sort > --- > > Key: SPARK-37125 > URL: https://issues.apache.org/jira/browse/SPARK-37125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The radix sort is more faster than timsort, the benchmark result can see in > `SortBenchmark`. > Since the `AnsiInterval` data type is comparable: > - `YearMonthIntervalType` -> int ordering > - `DayTimeIntervalType` -> long ordering > And we aslo support radix sort when the ordering column date type is int or > long. > So `AnsiInterval` radix sort can be supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37125) Support AnsiInterval radix sort
[ https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434653#comment-17434653 ] Apache Spark commented on SPARK-37125: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34398 > Support AnsiInterval radix sort > --- > > Key: SPARK-37125 > URL: https://issues.apache.org/jira/browse/SPARK-37125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The radix sort is more faster than timsort, the benchmark result can see in > `SortBenchmark`. > Since the `AnsiInterval` data type is comparable: > - `YearMonthIntervalType` -> int ordering > - `DayTimeIntervalType` -> long ordering > And we aslo support radix sort when the ordering column date type is int or > long. > So `AnsiInterval` radix sort can be supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37125) Support AnsiInterval radix sort
[ https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37125: Assignee: Apache Spark > Support AnsiInterval radix sort > --- > > Key: SPARK-37125 > URL: https://issues.apache.org/jira/browse/SPARK-37125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > The radix sort is more faster than timsort, the benchmark result can see in > `SortBenchmark`. > Since the `AnsiInterval` data type is comparable: > - `YearMonthIntervalType` -> int ordering > - `DayTimeIntervalType` -> long ordering > And we aslo support radix sort when the ordering column date type is int or > long. > So `AnsiInterval` radix sort can be supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37125) Support AnsiInterval radix sort
[ https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37125: Assignee: (was: Apache Spark) > Support AnsiInterval radix sort > --- > > Key: SPARK-37125 > URL: https://issues.apache.org/jira/browse/SPARK-37125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The radix sort is more faster than timsort, the benchmark result can see in > `SortBenchmark`. > Since the `AnsiInterval` data type is comparable: > - `YearMonthIntervalType` -> int ordering > - `DayTimeIntervalType` -> long ordering > And we aslo support radix sort when the ordering column date type is int or > long. > So `AnsiInterval` radix sort can be supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37126) Support TimestampNTZ in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37126. -- Fix Version/s: 3.3.0 Resolution: Done > Support TimestampNTZ in PySpark > --- > > Key: SPARK-37126 > URL: https://issues.apache.org/jira/browse/SPARK-37126 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > This tickets aims TimestampNTZ support in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37126) Support TimestampNTZ in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37126: - Issue Type: Improvement (was: Epic) > Support TimestampNTZ in PySpark > --- > > Key: SPARK-37126 > URL: https://issues.apache.org/jira/browse/SPARK-37126 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > This tickets aims TimestampNTZ support in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36661) Support TimestampNTZ in Py4J
[ https://issues.apache.org/jira/browse/SPARK-36661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36661: Assignee: Hyukjin Kwon (was: Hyukjin Kwon) > Support TimestampNTZ in Py4J > > > Key: SPARK-36661 > URL: https://issues.apache.org/jira/browse/SPARK-36661 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37125) Support AnsiInterval radix sort
[ https://issues.apache.org/jira/browse/SPARK-37125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-37125: -- Parent: SPARK-27790 Issue Type: Sub-task (was: Improvement) > Support AnsiInterval radix sort > --- > > Key: SPARK-37125 > URL: https://issues.apache.org/jira/browse/SPARK-37125 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > The radix sort is more faster than timsort, the benchmark result can see in > `SortBenchmark`. > Since the `AnsiInterval` data type is comparable: > - `YearMonthIntervalType` -> int ordering > - `DayTimeIntervalType` -> long ordering > And we aslo support radix sort when the ordering column date type is int or > long. > So `AnsiInterval` radix sort can be supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37126) Support TimestampNTZ in PySpark
Hyukjin Kwon created SPARK-37126: Summary: Support TimestampNTZ in PySpark Key: SPARK-37126 URL: https://issues.apache.org/jira/browse/SPARK-37126 Project: Spark Issue Type: Epic Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon This tickets aims TimestampNTZ support in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
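A minimal sketch of what the PySpark support looks like, assuming the Spark 3.3 API surface (`TimestampNTZType` exposed in `pyspark.sql.types`); naive datetimes round-trip without a session-time-zone shift.

{code:python}
# Minimal sketch, assuming the Spark 3.3 PySpark API: TimestampNTZType is the
# timezone-naive timestamp type.
import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampNTZType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("ts", TimestampNTZType(), True)])
df = spark.createDataFrame([(datetime.datetime(2021, 10, 27, 9, 30),)], schema)

df.printSchema()         # ts: timestamp_ntz
df.show(truncate=False)  # 2021-10-27 09:30:00, independent of the session time zone
{code}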
[jira] [Created] (SPARK-37125) Support AnsiInterval radix sort
XiDuo You created SPARK-37125: - Summary: Support AnsiInterval radix sort Key: SPARK-37125 URL: https://issues.apache.org/jira/browse/SPARK-37125 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You Radix sort is faster than Timsort; the benchmark results can be seen in `SortBenchmark`. Since the `AnsiInterval` data types are comparable: - `YearMonthIntervalType` -> int ordering - `DayTimeIntervalType` -> long ordering and radix sort is already supported when the ordering column data type is int or long, `AnsiInterval` radix sort can be supported as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
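For context, a small example of the ordering in question. The interval literals below are ordinary ANSI interval syntax; whether the radix path is actually taken is exactly what this ticket proposes.

{code:python}
# Sketch: sorting on ANSI interval columns. YearMonthIntervalType is backed by
# an int (months) and DayTimeIntervalType by a long (microseconds), which is
# what makes a radix sort on these ordering keys feasible.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("""
    SELECT * FROM VALUES
      (INTERVAL '1-2' YEAR TO MONTH, INTERVAL '1 10:30:00' DAY TO SECOND),
      (INTERVAL '0-7' YEAR TO MONTH, INTERVAL '0 05:15:00' DAY TO SECOND)
      AS t(ym, dt)
""")

df.orderBy("ym").show(truncate=False)
df.orderBy("dt").show(truncate=False)
{code}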
[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job
[ https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434651#comment-17434651 ] Yang Jie commented on SPARK-37120: -- Thank [~dongjoon] > Add Java17 GitHub Action build and test job > --- > > Key: SPARK-37120 > URL: https://issues.apache.org/jira/browse/SPARK-37120 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Now run > {code:java} > build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > {code} > to build and test whole project(Head is > 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the > UTs have passed. > > {code:java} > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 1.971 > s] > [INFO] Spark Project Tags . SUCCESS [ 2.170 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 14.008 > s] > [INFO] Spark Project Local DB . SUCCESS [ 2.466 > s] > [INFO] Spark Project Networking ... SUCCESS [ 49.650 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 7.095 > s] > [INFO] Spark Project Unsafe ... SUCCESS [ 1.826 > s] > [INFO] Spark Project Launcher . SUCCESS [ 1.851 > s] > [INFO] Spark Project Core . SUCCESS [24:40 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 > s] > [INFO] Spark Project GraphX ... SUCCESS [01:27 > min] > [INFO] Spark Project Streaming SUCCESS [04:57 > min] > [INFO] Spark Project Catalyst . SUCCESS [07:56 > min] > [INFO] Spark Project SQL .. SUCCESS [ 01:01 > h] > [INFO] Spark Project ML Library ... SUCCESS [16:46 > min] > [INFO] Spark Project Tools SUCCESS [ 0.748 > s] > [INFO] Spark Project Hive . SUCCESS [ 01:11 > h] > [INFO] Spark Project REPL . SUCCESS [01:26 > min] > [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 0.967 > s] > [INFO] Spark Project YARN . SUCCESS [06:54 > min] > [INFO] Spark Project Mesos SUCCESS [ 46.913 > s] > [INFO] Spark Project Kubernetes ... SUCCESS [01:08 > min] > [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 > min] > [INFO] Spark Ganglia Integration .. SUCCESS [ 4.610 > s] > [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 > s] > [INFO] Spark Project Assembly . SUCCESS [ 2.496 > s] > [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 > s] > [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 > min] > [INFO] Kafka 0.10+ Source for Structured Streaming SUCCESS [35:06 > min] > [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 > s] > [INFO] Spark Project Examples . SUCCESS [ 32.189 > s] > [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [ 0.949 > s] > [INFO] Spark Avro . SUCCESS [01:55 > min] > [INFO] Spark Project Kinesis Assembly . SUCCESS [ 1.104 > s] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 04:19 h > [INFO] Finished at: 2021-10-26T20:02:56+08:00 > [INFO] > > {code} > So should we add a Jenkins build and test job for Java 17? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434644#comment-17434644 ] Apache Spark commented on SPARK-36348: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/34397 > unexpected Index loaded: pd.Index([10, 20, None], name="x") > --- > > Key: SPARK-36348 > URL: https://issues.apache.org/jira/browse/SPARK-36348 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > {code:python} > pidx = pd.Index([10, 20, 15, 30, 45, None], name="x") > psidx = ps.Index(pidx) > self.assert_eq(psidx.astype(str), pidx.astype(str)) > {code} > [left pandas on spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', > 'nan'], dtype='object', name='x') > [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', > name='x') > The index is loaded as float64, so the follow step like astype would be diff > with pandas > [1] > https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35437: Assignee: (was: Apache Spark) > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35437: Assignee: Apache Spark > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: Apache Spark >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434643#comment-17434643 ] Apache Spark commented on SPARK-36348: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/34397 > unexpected Index loaded: pd.Index([10, 20, None], name="x") > --- > > Key: SPARK-36348 > URL: https://issues.apache.org/jira/browse/SPARK-36348 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > > {code:python} > pidx = pd.Index([10, 20, 15, 30, 45, None], name="x") > psidx = ps.Index(pidx) > self.assert_eq(psidx.astype(str), pidx.astype(str)) > {code} > [left pandas on spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', > 'nan'], dtype='object', name='x') > [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', > name='x') > The index is loaded as float64, so follow-up steps like astype differ > from pandas > [1] > https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
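A standalone version of the snippet quoted in the report above (the outputs in the comments are the ones quoted in the report, not re-verified here):

{code:python}
# Standalone reproduction sketch of the report above.
import pandas as pd
import pyspark.pandas as ps

pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
psidx = ps.Index(pidx)

print(pidx.astype(str))   # per the report: ['10', '20', '15', '30', '45', 'None']
print(psidx.astype(str))  # per the report, before the fix: ['10.0', '20.0', ..., 'nan']
{code}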
[jira] [Reopened] (SPARK-35437) Use expressions to filter Hive partitions at client side
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-35437: -- Assignee: (was: dzcxzl) Reverted at https://github.com/apache/spark/commit/fb9d6aeb788d2e869e09f18014c966b51aa3af20 > Use expressions to filter Hive partitions at client side > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Priority: Minor > Fix For: 3.3.0 > > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
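An illustrative sketch of the scenario being described (table and predicate are hypothetical): a partition predicate that the metastore cannot evaluate server-side, which currently forces fetching all partition metadata to the client.

{code:python}
# Hypothetical scenario sketch: the predicate on the partition column uses an
# expression that metastore filter pushdown cannot represent, so partition
# pruning falls back to listing every partition on the client side.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS logs (msg STRING) PARTITIONED BY (dt STRING)")
spark.sql("SELECT count(*) FROM logs WHERE substr(dt, 1, 7) = '2021-10'").show()
{code}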
[jira] [Assigned] (SPARK-37124) Support Writable ArrowColumnarVector
[ https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37124: Assignee: Apache Spark > Support Writable ArrowColumnarVector > > > Key: SPARK-37124 > URL: https://issues.apache.org/jira/browse/SPARK-37124 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chendi.Xue >Assignee: Apache Spark >Priority: Major > > This Jira is aim to add Arrow format as an alternative for ColumnVector > solution. > Current ArrowColumnVector is not fully equivalent to > OnHeap/OffHeapColumnVector in spark, and since Arrow API is now being more > stable, and using pandas udf will perform much better than python udf. > I am proposing to fully support arrow format as an alternative to > ColumnVector just like the other two. > What I did in this PR is to create a new class in the same package with > OnHeap/OffHeapColumnVector and extend from WritableColumnVector to support > all put APIs. > UTs are covering all Data Format with testing on writing to columnVector and > reading from columnVector. I also added 3 UTs for testing on loading from > ArrowRecordBatch and allocateColumns . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37124) Support Writable ArrowColumnarVector
[ https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37124: Assignee: (was: Apache Spark) > Support Writable ArrowColumnarVector > > > Key: SPARK-37124 > URL: https://issues.apache.org/jira/browse/SPARK-37124 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chendi.Xue >Priority: Major > > This Jira is aim to add Arrow format as an alternative for ColumnVector > solution. > Current ArrowColumnVector is not fully equivalent to > OnHeap/OffHeapColumnVector in spark, and since Arrow API is now being more > stable, and using pandas udf will perform much better than python udf. > I am proposing to fully support arrow format as an alternative to > ColumnVector just like the other two. > What I did in this PR is to create a new class in the same package with > OnHeap/OffHeapColumnVector and extend from WritableColumnVector to support > all put APIs. > UTs are covering all Data Format with testing on writing to columnVector and > reading from columnVector. I also added 3 UTs for testing on loading from > ArrowRecordBatch and allocateColumns . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37124) Support Writable ArrowColumnarVector
[ https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434624#comment-17434624 ] Apache Spark commented on SPARK-37124: -- User 'xuechendi' has created a pull request for this issue: https://github.com/apache/spark/pull/34396 > Support Writable ArrowColumnarVector > > > Key: SPARK-37124 > URL: https://issues.apache.org/jira/browse/SPARK-37124 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chendi.Xue >Priority: Major > > This Jira is aim to add Arrow format as an alternative for ColumnVector > solution. > Current ArrowColumnVector is not fully equivalent to > OnHeap/OffHeapColumnVector in spark, and since Arrow API is now being more > stable, and using pandas udf will perform much better than python udf. > I am proposing to fully support arrow format as an alternative to > ColumnVector just like the other two. > What I did in this PR is to create a new class in the same package with > OnHeap/OffHeapColumnVector and extend from WritableColumnVector to support > all put APIs. > UTs are covering all Data Format with testing on writing to columnVector and > reading from columnVector. I also added 3 UTs for testing on loading from > ArrowRecordBatch and allocateColumns . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37124) Support Writable ArrowColumnarVector
Chendi.Xue created SPARK-37124: -- Summary: Support Writable ArrowColumnarVector Key: SPARK-37124 URL: https://issues.apache.org/jira/browse/SPARK-37124 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Chendi.Xue This Jira aims to add the Arrow format as an alternative ColumnVector implementation. The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in Spark; since the Arrow API is now more stable and pandas UDFs perform much better than Python UDFs, I am proposing to fully support the Arrow format as an alternative to ColumnVector, just like the other two. What I did in this PR is create a new class in the same package as OnHeap/OffHeapColumnVector that extends WritableColumnVector to support all put APIs. UTs cover all data formats, testing both writing to and reading from the column vector. I also added 3 UTs for testing loading from ArrowRecordBatch and allocateColumns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
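On the pandas-UDF-versus-Python-UDF point in the description, a minimal contrast of the two forms (the performance claim is the description's, not measured here):

{code:python}
# Row-at-a-time Python UDF vs. vectorized (Arrow-backed) pandas UDF.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

@udf(LongType())
def plus_one_py(x):                                # evaluated row by row in Python
    return x + 1

@pandas_udf(LongType())
def plus_one_pandas(x: pd.Series) -> pd.Series:    # evaluated on Arrow batches
    return x + 1

df.select(plus_one_py("id").alias("py"), plus_one_pandas("id").alias("pandas")).show()
{code}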
[jira] [Created] (SPARK-37123) Support Writable ArrowColumnarVector
Chendi.Xue created SPARK-37123: -- Summary: Support Writable ArrowColumnarVector Key: SPARK-37123 URL: https://issues.apache.org/jira/browse/SPARK-37123 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Chendi.Xue This Jira aims to add the Arrow format as an alternative ColumnVector implementation. The current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in Spark; since the Arrow API is now more stable and pandas UDFs perform much better than Python UDFs, I am proposing to fully support the Arrow format as an alternative to ColumnVector, just like the other two. What I did in this PR is create a new class in the same package as OnHeap/OffHeapColumnVector that extends WritableColumnVector to support all put APIs. UTs cover all data formats, testing both writing to and reading from the column vector. I also added 3 UTs for testing loading from ArrowRecordBatch and allocateColumns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus
[ https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biswa Singh updated SPARK-37122: Affects Version/s: (was: 3.0.2) > java.lang.IllegalArgumentException Related to Prometheus > > > Key: SPARK-37122 > URL: https://issues.apache.org/jira/browse/SPARK-37122 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Biswa Singh >Priority: Critical > > This issue is similar to > https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. > We receive the Following warning continuously: > > 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler - > Exception in connection from > /10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: > 5135603447297303916 at > org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) > at > org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) > at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Unknown Source) > > Below are other details related to prometheus and my findings. Please SCROLL > DOWN to see the details: > > {noformat} > Prometheus Scrape Configuration > === > - job_name: 'kubernetes-pods' > kubernetes_sd_configs: > - role: pod > relabel_configs: > - action: labelmap > regex: __meta_kubernetes_pod_label_(.+) > - source_labels: [__meta_kubernetes_namespace] > action: replace > target_label: kubernetes_namespace > - source_labels: [__meta_kubernetes_pod_name] > action: replace > target_label: kubernetes_pod_name > - source_labels: > [__meta_kubernetes_pod_annotation_prometheus_io_scrape] > action: keep > regex: true > - source_labels: > [__meta_kubernetes_pod_annotation_prometheus_io_scheme] > action: replace > target_label: __scheme__ > regex: (https?) 
> - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] > action: replace > target_label: __metrics_path__ > regex: (.+) > - source_labels: [__address__, > __meta_kubernetes_pod_prometheus_io_port] > action: replace > target_label: __address__ > regex: ([^:]+)(?::\d+)?;(\d+) > replacement: $1:$2 > tcptrack command output in spark3 pod > == > 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s > 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s > 10.198.22.240:50354 10.198.40.143:7079 CLOSED 40s 0 B/s > 10.198.22.240:33152 10.198.40.143:4040 ESTABLISHED 2s 0 B/s > 10.198.22.240:47726 10.198.40.143:8090 ESTABLISHED 9s 0 B/s > 10.198.22.240 = prometheus pod > ip10.198.40.143 = testpod ip > Issue > == > Though the scrape config is expected to scrape on port 8090. I see prometheus > tries to initiate scrape on ports like 7079, 7078, 4040, etc on > the spark3 pod and hence the exception in spark3 pod. But is this really a >
[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus
[ https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biswa Singh updated SPARK-37122: Description: This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning continuously: 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler - Exception in connection from /10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 5135603447297303916 at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Unknown Source) Below are other details related to prometheus and my findings. Please SCROLL DOWN to see the details: {noformat} Prometheus Scrape Configuration === - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) 
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 tcptrack command output in spark3 pod == 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:50354 10.198.40.143:7079 CLOSED 40s 0 B/s 10.198.22.240:33152 10.198.40.143:4040 ESTABLISHED 2s 0 B/s 10.198.22.240:47726 10.198.40.143:8090 ESTABLISHED 9s 0 B/s 10.198.22.240 = prometheus pod ip10.198.40.143 = testpod ip Issue == Though the scrape config is expected to scrape on port 8090. I see prometheus tries to initiate scrape on ports like 7079, 7078, 4040, etc on the spark3 pod and hence the exception in spark3 pod. But is this really a prometheus issue or something at spark side? We don't see any such exception in any of the other pods. All our pods including spark3 are annotated with: annotations: prometheus.io/port: "8090" prometheus.io/scrape: "true" We get the metrics and everything fine just extra warning for this exception.{noformat} was: This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning continuously: 21:00:26.277 [rpc-server-4-2] WARN
[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus
[ https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biswa Singh updated SPARK-37122: Description: This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning continuously: 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler - Exception in connection from /10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 5135603447297303916 at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Unknown Source) Below are other details related to prometheus. Please scroll down to find out details of the issue: {noformat} Prometheus Scrape Configuration === - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) 
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 tcptrack command output in spark3 pod == 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:50354 10.198.40.143:7079 CLOSED 40s 0 B/s 10.198.22.240:33152 10.198.40.143:4040 ESTABLISHED 2s 0 B/s 10.198.22.240:47726 10.198.40.143:8090 ESTABLISHED 9s 0 B/s 10.198.22.240 = prometheus pod ip10.198.40.143 = testpod ip Issue == Though the scrape config is expected to scrape on port 8090. I see prometheus tries to initiate scrape on ports like 7079, 7078, 4040, etc on the spark3 pod and hence the exception in spark3 pod. But is this really a prometheus issue or something at spark side? We don't see any such exception in any of the other pods. All our pods including spark3 are annotated with: annotations: prometheus.io/port: "8090" prometheus.io/scrape: "true" We get the metrics and everything fine just extra warning for this exception.{noformat} was: This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning: 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler
[jira] [Updated] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus
[ https://issues.apache.org/jira/browse/SPARK-37122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biswa Singh updated SPARK-37122: Description: This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning: 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler - Exception in connection from /10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 5135603447297303916 at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Unknown Source) Below are other details related to prometheus. Please scroll down to find out details of the issue: {noformat} Prometheus Scrape Configuration === - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) 
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 tcptrack command output in spark3 pod == 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:50354 10.198.40.143:7079 CLOSED 40s 0 B/s 10.198.22.240:33152 10.198.40.143:4040 ESTABLISHED 2s 0 B/s 10.198.22.240:47726 10.198.40.143:8090 ESTABLISHED 9s 0 B/s 10.198.22.240 = prometheus pod ip10.198.40.143 = testpod ip Issue == Though the scrape config is expected to scrape on port 8090. I see prometheus tries to initiate scrape on ports like 7079, 7078, 4040, etc on the spark3 pod and hence the exception in spark3 pod. But is this really a prometheus issue or something at spark side? We don't see any such exception in any of the other pods. All our pods including spark3 are annotated with: annotations: prometheus.io/port: "8090" prometheus.io/scrape: "true" We get the metrics and everything fine just extra warning for this exception.{noformat} was: This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning: 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler -
[jira] [Commented] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434581#comment-17434581 ] Shane Knapp commented on SPARK-37109: - yep, jenkins is going away at the end of this year... all support is currently 'best effort'. > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37122) java.lang.IllegalArgumentException Related to Prometheus
Biswa Singh created SPARK-37122: --- Summary: java.lang.IllegalArgumentException Related to Prometheus Key: SPARK-37122 URL: https://issues.apache.org/jira/browse/SPARK-37122 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.1, 3.0.2 Reporter: Biswa Singh This issue is similar to https://issues.apache.org/jira/browse/SPARK-35237?focusedCommentId=17340723=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17340723. We receive the Following warning: 21:00:26.277 [rpc-server-4-2] WARN o.a.s.n.s.TransportChannelHandler - Exception in connection from /10.198.3.179:51184java.lang.IllegalArgumentException: Too large frame: 5135603447297303916 at org.sparkproject.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:148) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:98) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Unknown Source) Below are other details related to prometheus. {noformat} Prometheus Scrape Configuration === - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) 
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 tcptrack command output in spark3 pod == 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:51258 10.198.40.143:7079 CLOSED 10s 0 B/s 10.198.22.240:50354 10.198.40.143:7079 CLOSED 40s 0 B/s 10.198.22.240:33152 10.198.40.143:4040 ESTABLISHED 2s 0 B/s 10.198.22.240:47726 10.198.40.143:8090 ESTABLISHED 9s 0 B/s 10.198.22.240 = prometheus pod ip10.198.40.143 = testpod ip Issue == Though the scrape config is expected to scrape on port 8090. I see prometheus tries to initiate scrape on ports like 7079, 7078, 4040, etc on the spark3 pod and hence the exception in spark3 pod. But is this really a prometheus issue or something at spark side? We don't see any such exception in any of the other pods. All our pods including spark3 are annotated with: annotations: prometheus.io/port: "8090" prometheus.io/scrape: "true" We get the metrics and everything fine just extra warning for this exception.{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job
[ https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434558#comment-17434558 ] Dongjoon Hyun commented on SPARK-37120: --- cc [~hyukjin.kwon] > Add Java17 GitHub Action build and test job > --- > > Key: SPARK-37120 > URL: https://issues.apache.org/jira/browse/SPARK-37120 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Now run > {code:java} > build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > {code} > to build and test whole project(Head is > 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the > UTs have passed. > > {code:java} > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 1.971 > s] > [INFO] Spark Project Tags . SUCCESS [ 2.170 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 14.008 > s] > [INFO] Spark Project Local DB . SUCCESS [ 2.466 > s] > [INFO] Spark Project Networking ... SUCCESS [ 49.650 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 7.095 > s] > [INFO] Spark Project Unsafe ... SUCCESS [ 1.826 > s] > [INFO] Spark Project Launcher . SUCCESS [ 1.851 > s] > [INFO] Spark Project Core . SUCCESS [24:40 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 > s] > [INFO] Spark Project GraphX ... SUCCESS [01:27 > min] > [INFO] Spark Project Streaming SUCCESS [04:57 > min] > [INFO] Spark Project Catalyst . SUCCESS [07:56 > min] > [INFO] Spark Project SQL .. SUCCESS [ 01:01 > h] > [INFO] Spark Project ML Library ... SUCCESS [16:46 > min] > [INFO] Spark Project Tools SUCCESS [ 0.748 > s] > [INFO] Spark Project Hive . SUCCESS [ 01:11 > h] > [INFO] Spark Project REPL . SUCCESS [01:26 > min] > [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 0.967 > s] > [INFO] Spark Project YARN . SUCCESS [06:54 > min] > [INFO] Spark Project Mesos SUCCESS [ 46.913 > s] > [INFO] Spark Project Kubernetes ... SUCCESS [01:08 > min] > [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 > min] > [INFO] Spark Ganglia Integration .. SUCCESS [ 4.610 > s] > [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 > s] > [INFO] Spark Project Assembly . SUCCESS [ 2.496 > s] > [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 > s] > [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 > min] > [INFO] Kafka 0.10+ Source for Structured Streaming SUCCESS [35:06 > min] > [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 > s] > [INFO] Spark Project Examples . SUCCESS [ 32.189 > s] > [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [ 0.949 > s] > [INFO] Spark Avro . SUCCESS [01:55 > min] > [INFO] Spark Project Kinesis Assembly . SUCCESS [ 1.104 > s] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 04:19 h > [INFO] Finished at: 2021-10-26T20:02:56+08:00 > [INFO] > > {code} > So should we add a Jenkins build and test job for Java 17? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37120) Add Java17 GitHub Action build and test job
[ https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434557#comment-17434557 ] Dongjoon Hyun commented on SPARK-37120: --- I update the JIRA title to target GitHub Action job. > Add Java17 GitHub Action build and test job > --- > > Key: SPARK-37120 > URL: https://issues.apache.org/jira/browse/SPARK-37120 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Now run > {code:java} > build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > {code} > to build and test whole project(Head is > 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the > UTs have passed. > > {code:java} > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 1.971 > s] > [INFO] Spark Project Tags . SUCCESS [ 2.170 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 14.008 > s] > [INFO] Spark Project Local DB . SUCCESS [ 2.466 > s] > [INFO] Spark Project Networking ... SUCCESS [ 49.650 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 7.095 > s] > [INFO] Spark Project Unsafe ... SUCCESS [ 1.826 > s] > [INFO] Spark Project Launcher . SUCCESS [ 1.851 > s] > [INFO] Spark Project Core . SUCCESS [24:40 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 > s] > [INFO] Spark Project GraphX ... SUCCESS [01:27 > min] > [INFO] Spark Project Streaming SUCCESS [04:57 > min] > [INFO] Spark Project Catalyst . SUCCESS [07:56 > min] > [INFO] Spark Project SQL .. SUCCESS [ 01:01 > h] > [INFO] Spark Project ML Library ... SUCCESS [16:46 > min] > [INFO] Spark Project Tools SUCCESS [ 0.748 > s] > [INFO] Spark Project Hive . SUCCESS [ 01:11 > h] > [INFO] Spark Project REPL . SUCCESS [01:26 > min] > [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 0.967 > s] > [INFO] Spark Project YARN . SUCCESS [06:54 > min] > [INFO] Spark Project Mesos SUCCESS [ 46.913 > s] > [INFO] Spark Project Kubernetes ... SUCCESS [01:08 > min] > [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 > min] > [INFO] Spark Ganglia Integration .. SUCCESS [ 4.610 > s] > [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 > s] > [INFO] Spark Project Assembly . SUCCESS [ 2.496 > s] > [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 > s] > [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 > min] > [INFO] Kafka 0.10+ Source for Structured Streaming SUCCESS [35:06 > min] > [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 > s] > [INFO] Spark Project Examples . SUCCESS [ 32.189 > s] > [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [ 0.949 > s] > [INFO] Spark Avro . SUCCESS [01:55 > min] > [INFO] Spark Project Kinesis Assembly . SUCCESS [ 1.104 > s] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 04:19 h > [INFO] Finished at: 2021-10-26T20:02:56+08:00 > [INFO] > > {code} > So should we add a Jenkins build and test job for Java 17? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37120) Add Java17 GitHub Action build and test job
[ https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37120: -- Summary: Add Java17 GitHub Action build and test job (was: Add a Jenkins build and test job for Java 17) > Add Java17 GitHub Action build and test job > --- > > Key: SPARK-37120 > URL: https://issues.apache.org/jira/browse/SPARK-37120 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Now run > {code:java} > build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > {code} > to build and test whole project(Head is > 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the > UTs have passed. > > {code:java} > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 1.971 > s] > [INFO] Spark Project Tags . SUCCESS [ 2.170 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 14.008 > s] > [INFO] Spark Project Local DB . SUCCESS [ 2.466 > s] > [INFO] Spark Project Networking ... SUCCESS [ 49.650 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 7.095 > s] > [INFO] Spark Project Unsafe ... SUCCESS [ 1.826 > s] > [INFO] Spark Project Launcher . SUCCESS [ 1.851 > s] > [INFO] Spark Project Core . SUCCESS [24:40 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 > s] > [INFO] Spark Project GraphX ... SUCCESS [01:27 > min] > [INFO] Spark Project Streaming SUCCESS [04:57 > min] > [INFO] Spark Project Catalyst . SUCCESS [07:56 > min] > [INFO] Spark Project SQL .. SUCCESS [ 01:01 > h] > [INFO] Spark Project ML Library ... SUCCESS [16:46 > min] > [INFO] Spark Project Tools SUCCESS [ 0.748 > s] > [INFO] Spark Project Hive . SUCCESS [ 01:11 > h] > [INFO] Spark Project REPL . SUCCESS [01:26 > min] > [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 0.967 > s] > [INFO] Spark Project YARN . SUCCESS [06:54 > min] > [INFO] Spark Project Mesos SUCCESS [ 46.913 > s] > [INFO] Spark Project Kubernetes ... SUCCESS [01:08 > min] > [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 > min] > [INFO] Spark Ganglia Integration .. SUCCESS [ 4.610 > s] > [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 > s] > [INFO] Spark Project Assembly . SUCCESS [ 2.496 > s] > [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 > s] > [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 > min] > [INFO] Kafka 0.10+ Source for Structured Streaming SUCCESS [35:06 > min] > [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 > s] > [INFO] Spark Project Examples . SUCCESS [ 32.189 > s] > [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [ 0.949 > s] > [INFO] Spark Avro . SUCCESS [01:55 > min] > [INFO] Spark Project Kinesis Assembly . SUCCESS [ 1.104 > s] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 04:19 h > [INFO] Finished at: 2021-10-26T20:02:56+08:00 > [INFO] > > {code} > So should we add a Jenkins build and test job for Java 17? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
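For illustration only, a hedged sketch of what a Java 17 entry could look like in a GitHub Actions workflow; the job layout and names below are assumptions, not Spark's actual build_and_test.yml:

{code:yaml}
jobs:
  java-17-maven-build:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v2
        with:
          distribution: zulu
          java-version: 17
      # Subset of the profiles from the mvn command quoted in this ticket
      - name: Build and test with Maven on Java 17
        run: build/mvn -Phadoop-3.2 -Phive-2.3 -Pyarn -Pkubernetes -Phive clean install
{code}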
[jira] [Updated] (SPARK-37098) Alter table properties should invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37098: -- Fix Version/s: 3.0.4 > Alter table properties should invalidate cache > -- > > Key: SPARK-37098 > URL: https://issues.apache.org/jira/browse/SPARK-37098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.1.3, 3.0.4, 3.2.1, 3.3.0 > > > The table properties can change the behavior of writing, e.g. a parquet > table with `parquet.compression`. > If you execute the following SQL, you will get the file with snappy > compression rather than zstd. > {code:java} > CREATE TABLE t (c int) STORED AS PARQUET; > // cache table metadata > SELECT * FROM t; > ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd'); > INSERT INTO TABLE t values(1); > {code} > So we should invalidate the table cache after altering table properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
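For users on versions without this fix, a hedged sketch of a possible workaround: explicitly refresh the table after changing its properties so the cached metadata is dropped before the write. REFRESH TABLE is a standard statement, but whether it fully avoids the bug on a given version is not verified here.

{code:scala}
// Assumes an active SparkSession `spark`; illustrative workaround, not the actual fix.
spark.sql("CREATE TABLE t (c INT) STORED AS PARQUET")
spark.sql("SELECT * FROM t").collect()   // caches the table metadata
spark.sql("ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd')")
spark.sql("REFRESH TABLE t")             // drop the stale cached relation
spark.sql("INSERT INTO TABLE t VALUES (1)")
{code}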
[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers Managed Spark Clusters
[ https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naga Vijayapuram updated SPARK-37114: - Priority: Minor (was: Trivial) > Support Submitting Jobs to Cloud Providers Managed Spark Clusters > - > > Key: SPARK-37114 > URL: https://issues.apache.org/jira/browse/SPARK-37114 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Naga Vijayapuram >Priority: Minor > > To be able to submit jobs to prominent cloud providers managed spark > clusters, "spark-submit" can be enhanced. For example, to submit job to > "google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud > dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is > used. Once this feature is accepted and prioritized, then it can be rolled > out in current and future versions of spark and also back ported to a few > previous versions. I can raise the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
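For context, a sketch of what such an integration would wrap. The first command uses the standard Dataproc CLI flags; the second shows the hypothetical spark-submit form proposed in this ticket (cluster name, region, and jar path are placeholders):

{code:java}
# Today: submit directly to a Dataproc cluster with the gcloud CLI
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000

# Proposed (hypothetical): spark-submit would translate a gcd:// master into the call above
spark-submit --master gcd://my-cluster --class org.apache.spark.examples.SparkPi ...
{code}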
[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns
[ https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434515#comment-17434515 ] Shardul Mahadik commented on SPARK-36877: - Was able to get around this by re-using the RDD for further DF operations {code:scala} val df = /* some expensive multi-table/multi-stage join */ val rdd = df.rdd val numPartitions = rdd.getNumPartitions val dfFromRdd = spark.createDataset(rdd)(df.encoder) dfFromRdd.repartition(x).write. {code} > Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing > reruns > -- > > Key: SPARK-36877 > URL: https://issues.apache.org/jira/browse/SPARK-36877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: Screen Shot 2021-09-28 at 09.32.20.png > > > In one of our jobs we perform the following operation: > {code:scala} > val df = /* some expensive multi-table/multi-stage join */ > val numPartitions = df.rdd.getNumPartitions > df.repartition(x).write. > {code} > With AQE enabled, we found that the expensive stages were being run twice > causing significant performance regression after enabling AQE; once when > calling {{df.rdd}} and again when calling {{df.write}}. > A more concrete example: > {code:scala} > scala> sql("SET spark.sql.adaptive.enabled=true") > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > res1: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> val df1 = spark.range(10).withColumn("id2", $"id") > df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), > "id").join(spark.range(10), "id") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df3 = df2.groupBy("id2").count() > df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint] > scala> df3.rdd.getNumPartitions > res2: Int = 10(0 + 16) / > 16] > scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1") > {code} > In the screenshot below, you can see that the first 3 stages (0 to 4) were > rerun again (5 to 9). > I have two questions: > 1) Should calling df.rdd trigger actual job execution when AQE is enabled? > 2) Should calling df.write later cause rerun of the stages? If df.rdd has > already partially executed the stages, shouldn't it reuse the result from > previous stages? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns
[ https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik resolved SPARK-36877. - Resolution: Not A Problem > Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing > reruns > -- > > Key: SPARK-36877 > URL: https://issues.apache.org/jira/browse/SPARK-36877 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: Screen Shot 2021-09-28 at 09.32.20.png > > > In one of our jobs we perform the following operation: > {code:scala} > val df = /* some expensive multi-table/multi-stage join */ > val numPartitions = df.rdd.getNumPartitions > df.repartition(x).write. > {code} > With AQE enabled, we found that the expensive stages were being run twice > causing significant performance regression after enabling AQE; once when > calling {{df.rdd}} and again when calling {{df.write}}. > A more concrete example: > {code:scala} > scala> sql("SET spark.sql.adaptive.enabled=true") > res0: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1") > res1: org.apache.spark.sql.DataFrame = [key: string, value: string] > scala> val df1 = spark.range(10).withColumn("id2", $"id") > df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), > "id").join(spark.range(10), "id") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df3 = df2.groupBy("id2").count() > df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint] > scala> df3.rdd.getNumPartitions > res2: Int = 10(0 + 16) / > 16] > scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1") > {code} > In the screenshot below, you can see that the first 3 stages (0 to 4) were > rerun again (5 to 9). > I have two questions: > 1) Should calling df.rdd trigger actual job execution when AQE is enabled? > 2) Should calling df.write later cause rerun of the stages? If df.rdd has > already partially executed the stages, shouldn't it reuse the result from > previous stages? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36895) Add Create Index syntax support
[ https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-36895. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34148 [https://github.com/apache/spark/pull/34148] > Add Create Index syntax support > --- > > Key: SPARK-36895 > URL: https://issues.apache.org/jira/browse/SPARK-36895 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36895) Add Create Index syntax support
[ https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-36895: --- Assignee: Huaxin Gao > Add Create Index syntax support > --- > > Key: SPARK-36895 > URL: https://issues.apache.org/jira/browse/SPARK-36895 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-37121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37121: Assignee: Apache Spark > TestUtils.isPythonVersionAtLeast38 returns incorrect results > > > Key: SPARK-37121 > URL: https://issues.apache.org/jira/browse/SPARK-37121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.2.0 >Reporter: Erik Krogen >Assignee: Apache Spark >Priority: Major > > I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed > that it was never running against the Spark 2.x release lines, only the 3.x > ones. The problem was coming from here, specifically the Python 3.8+ version > check: > {code} > versions > .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38()) > .filter(v => v.startsWith("3") || > !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) > {code} > I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning > true, even when my system installation of Python3 was 3.7. Thinking it was an > environment issue, I pulled up a debugger to check which version of Python > the test JVM was seeing, and it was in fact Python 3.7. > Turns out the issue is with the {{isPythonVersionAtLeast38}} method: > {code} > def isPythonVersionAtLeast38(): Boolean = { > val attempt = if (Utils.isWindows) { > Try(Process(Seq("cmd.exe", "/C", "python3 --version")) > .run(ProcessLogger(s => s.startsWith("Python 3.8") || > s.startsWith("Python 3.9"))) > .exitValue()) > } else { > Try(Process(Seq("sh", "-c", "python3 --version")) > .run(ProcessLogger(s => s.startsWith("Python 3.8") || > s.startsWith("Python 3.9"))) > .exitValue()) > } > attempt.isSuccess && attempt.get == 0 > } > {code} > It's trying to evaluate the version of Python using a {{ProcessLogger}}, but > the logger accepts a {{String => Unit}} function, i.e., it does not make use > of the return value in any way (since it's meant for logging). So the result > of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && > attempt.get == 0}} will always be true as long as your system has a > {{python3}} binary of any version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-37121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37121: Assignee: (was: Apache Spark) > TestUtils.isPythonVersionAtLeast38 returns incorrect results > > > Key: SPARK-37121 > URL: https://issues.apache.org/jira/browse/SPARK-37121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.2.0 >Reporter: Erik Krogen >Priority: Major > > I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed > that it was never running against the Spark 2.x release lines, only the 3.x > ones. The problem was coming from here, specifically the Python 3.8+ version > check: > {code} > versions > .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38()) > .filter(v => v.startsWith("3") || > !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) > {code} > I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning > true, even when my system installation of Python3 was 3.7. Thinking it was an > environment issue, I pulled up a debugger to check which version of Python > the test JVM was seeing, and it was in fact Python 3.7. > Turns out the issue is with the {{isPythonVersionAtLeast38}} method: > {code} > def isPythonVersionAtLeast38(): Boolean = { > val attempt = if (Utils.isWindows) { > Try(Process(Seq("cmd.exe", "/C", "python3 --version")) > .run(ProcessLogger(s => s.startsWith("Python 3.8") || > s.startsWith("Python 3.9"))) > .exitValue()) > } else { > Try(Process(Seq("sh", "-c", "python3 --version")) > .run(ProcessLogger(s => s.startsWith("Python 3.8") || > s.startsWith("Python 3.9"))) > .exitValue()) > } > attempt.isSuccess && attempt.get == 0 > } > {code} > It's trying to evaluate the version of Python using a {{ProcessLogger}}, but > the logger accepts a {{String => Unit}} function, i.e., it does not make use > of the return value in any way (since it's meant for logging). So the result > of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && > attempt.get == 0}} will always be true as long as your system has a > {{python3}} binary of any version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-37121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434503#comment-17434503 ] Apache Spark commented on SPARK-37121: -- User 'xkrogen' has created a pull request for this issue: https://github.com/apache/spark/pull/34395 > TestUtils.isPythonVersionAtLeast38 returns incorrect results > > > Key: SPARK-37121 > URL: https://issues.apache.org/jira/browse/SPARK-37121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.2.0 >Reporter: Erik Krogen >Priority: Major > > I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed > that it was never running against the Spark 2.x release lines, only the 3.x > ones. The problem was coming from here, specifically the Python 3.8+ version > check: > {code} > versions > .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38()) > .filter(v => v.startsWith("3") || > !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) > {code} > I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning > true, even when my system installation of Python3 was 3.7. Thinking it was an > environment issue, I pulled up a debugger to check which version of Python > the test JVM was seeing, and it was in fact Python 3.7. > Turns out the issue is with the {{isPythonVersionAtLeast38}} method: > {code} > def isPythonVersionAtLeast38(): Boolean = { > val attempt = if (Utils.isWindows) { > Try(Process(Seq("cmd.exe", "/C", "python3 --version")) > .run(ProcessLogger(s => s.startsWith("Python 3.8") || > s.startsWith("Python 3.9"))) > .exitValue()) > } else { > Try(Process(Seq("sh", "-c", "python3 --version")) > .run(ProcessLogger(s => s.startsWith("Python 3.8") || > s.startsWith("Python 3.9"))) > .exitValue()) > } > attempt.isSuccess && attempt.get == 0 > } > {code} > It's trying to evaluate the version of Python using a {{ProcessLogger}}, but > the logger accepts a {{String => Unit}} function, i.e., it does not make use > of the return value in any way (since it's meant for logging). So the result > of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && > attempt.get == 0}} will always be true as long as your system has a > {{python3}} binary of any version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37121) TestUtils.isPythonVersionAtLeast38 returns incorrect results
Erik Krogen created SPARK-37121: --- Summary: TestUtils.isPythonVersionAtLeast38 returns incorrect results Key: SPARK-37121 URL: https://issues.apache.org/jira/browse/SPARK-37121 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.2.0 Reporter: Erik Krogen I was working on {{HiveExternalCatalogVersionsSuite}} recently and noticed that it was never running against the Spark 2.x release lines, only the 3.x ones. The problem was coming from here, specifically the Python 3.8+ version check: {code} versions .filter(v => v.startsWith("3") || !TestUtils.isPythonVersionAtLeast38()) .filter(v => v.startsWith("3") || !SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) {code} I found that {{TestUtils.isPythonVersionAtLeast38()}} was always returning true, even when my system installation of Python3 was 3.7. Thinking it was an environment issue, I pulled up a debugger to check which version of Python the test JVM was seeing, and it was in fact Python 3.7. Turns out the issue is with the {{isPythonVersionAtLeast38}} method: {code} def isPythonVersionAtLeast38(): Boolean = { val attempt = if (Utils.isWindows) { Try(Process(Seq("cmd.exe", "/C", "python3 --version")) .run(ProcessLogger(s => s.startsWith("Python 3.8") || s.startsWith("Python 3.9"))) .exitValue()) } else { Try(Process(Seq("sh", "-c", "python3 --version")) .run(ProcessLogger(s => s.startsWith("Python 3.8") || s.startsWith("Python 3.9"))) .exitValue()) } attempt.isSuccess && attempt.get == 0 } {code} It's trying to evaluate the version of Python using a {{ProcessLogger}}, but the logger accepts a {{String => Unit}} function, i.e., it does not make use of the return value in any way (since it's meant for logging). So the result of the {{startsWith}} checks are thrown away, and {{attempt.isSuccess && attempt.get == 0}} will always be true as long as your system has a {{python3}} binary of any version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
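For illustration, a hedged sketch of a check that actually inspects the interpreter output instead of discarding it inside a ProcessLogger; this is not necessarily how the linked PR fixes it, and the Windows branch is omitted:

{code:scala}
import scala.sys.process._
import scala.util.Try

def isPythonVersionAtLeast38(): Boolean = Try {
  // Capture stdout of `python3 --version`, e.g. "Python 3.7.9"
  val out = Seq("python3", "--version").!!.trim
  out.split("\\s+").last.split("\\.").take(2).map(_.toInt) match {
    case Array(major, minor) => major > 3 || (major == 3 && minor >= 8)
    case _                   => false
  }
}.getOrElse(false)
{code}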
[jira] [Commented] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434380#comment-17434380 ] Apache Spark commented on SPARK-37118: -- User 'remykarem' has created a pull request for this issue: https://github.com/apache/spark/pull/34394 > Add KMeans distanceMeasure param to PythonMLLibAPI > -- > > Key: SPARK-37118 > URL: https://issues.apache.org/jira/browse/SPARK-37118 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.2.1 >Reporter: Raimi bin Karim >Priority: Trivial > Fix For: 3.2.1 > > > SPARK-22119 added KMeans {{distanceMeasure}} to the Python API. > We should include this parameter too in the > {{PythonMLLibAPI.t}}{{rainKMeansModel}} method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37119) parse_url can not handle `{` and `}` correctly
[ https://issues.apache.org/jira/browse/SPARK-37119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434326#comment-17434326 ] Liu Shuo commented on SPARK-37119: -- As discussed in `https://github.com/apache/spark/pull/30333` and `https://github.com/apache/spark/pull/30399`, close this JIRA. > parse_url can not handle `{` and `}` correctly > -- > > Key: SPARK-37119 > URL: https://issues.apache.org/jira/browse/SPARK-37119 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.2.0, 3.3.0 >Reporter: Liu Shuo >Priority: Critical > > when we execute the follow sql command > {code:java} > select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY') > {code} > the expected result: > query=\{aa} > the actual result: > null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
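For illustration, a hedged sketch of the behaviour discussed in those pull requests: parse_url is backed by java.net.URI, which rejects raw `{` and `}`, so the function returns NULL unless the braces are percent-encoded. It assumes a running SparkSession `spark`, and the second output is what the raw query extraction is expected to return:

{code:scala}
spark.sql("SELECT parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY')").show()
// -> NULL, because '{' and '}' are not legal characters in a java.net.URI

spark.sql("SELECT parse_url('http://facebook.com/path/p1.php?query=%7Baa%7D', 'QUERY')").show()
// -> query=%7Baa%7D (the raw, still percent-encoded query string)
{code}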
[jira] [Commented] (SPARK-37120) Add a Jenkins build and test job for Java 17
[ https://issues.apache.org/jira/browse/SPARK-37120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434319#comment-17434319 ] Yang Jie commented on SPARK-37120: -- ping [~sowen] [~dongjoon] ,do we need to do this now? Who should we ask to help finish it ? > Add a Jenkins build and test job for Java 17 > > > Key: SPARK-37120 > URL: https://issues.apache.org/jira/browse/SPARK-37120 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Now run > {code:java} > build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn > -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive > {code} > to build and test whole project(Head is > 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the > UTs have passed. > > {code:java} > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 1.971 > s] > [INFO] Spark Project Tags . SUCCESS [ 2.170 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 14.008 > s] > [INFO] Spark Project Local DB . SUCCESS [ 2.466 > s] > [INFO] Spark Project Networking ... SUCCESS [ 49.650 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 7.095 > s] > [INFO] Spark Project Unsafe ... SUCCESS [ 1.826 > s] > [INFO] Spark Project Launcher . SUCCESS [ 1.851 > s] > [INFO] Spark Project Core . SUCCESS [24:40 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 > s] > [INFO] Spark Project GraphX ... SUCCESS [01:27 > min] > [INFO] Spark Project Streaming SUCCESS [04:57 > min] > [INFO] Spark Project Catalyst . SUCCESS [07:56 > min] > [INFO] Spark Project SQL .. SUCCESS [ 01:01 > h] > [INFO] Spark Project ML Library ... SUCCESS [16:46 > min] > [INFO] Spark Project Tools SUCCESS [ 0.748 > s] > [INFO] Spark Project Hive . SUCCESS [ 01:11 > h] > [INFO] Spark Project REPL . SUCCESS [01:26 > min] > [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 0.967 > s] > [INFO] Spark Project YARN . SUCCESS [06:54 > min] > [INFO] Spark Project Mesos SUCCESS [ 46.913 > s] > [INFO] Spark Project Kubernetes ... SUCCESS [01:08 > min] > [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 > min] > [INFO] Spark Ganglia Integration .. SUCCESS [ 4.610 > s] > [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 > s] > [INFO] Spark Project Assembly . SUCCESS [ 2.496 > s] > [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 > s] > [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 > min] > [INFO] Kafka 0.10+ Source for Structured Streaming SUCCESS [35:06 > min] > [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 > s] > [INFO] Spark Project Examples . SUCCESS [ 32.189 > s] > [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [ 0.949 > s] > [INFO] Spark Avro . SUCCESS [01:55 > min] > [INFO] Spark Project Kinesis Assembly . SUCCESS [ 1.104 > s] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 04:19 h > [INFO] Finished at: 2021-10-26T20:02:56+08:00 > [INFO] > > {code} > So should we add a Jenkins build and test job for Java 17? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37120) Add a Jenkins build and test job for Java 17
Yang Jie created SPARK-37120: Summary: Add a Jenkins build and test job for Java 17 Key: SPARK-37120 URL: https://issues.apache.org/jira/browse/SPARK-37120 Project: Spark Issue Type: Sub-task Components: jenkins Affects Versions: 3.3.0 Reporter: Yang Jie Now run {code:java} build/mvn clean install -Phadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive {code} to build and test whole project(Head is 87591c9b22157cfd241af6ea2533359c3fba1bb2) with Java 17. It seems that all the UTs have passed. {code:java} [INFO] [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 1.971 s] [INFO] Spark Project Tags . SUCCESS [ 2.170 s] [INFO] Spark Project Sketch ... SUCCESS [ 14.008 s] [INFO] Spark Project Local DB . SUCCESS [ 2.466 s] [INFO] Spark Project Networking ... SUCCESS [ 49.650 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 7.095 s] [INFO] Spark Project Unsafe ... SUCCESS [ 1.826 s] [INFO] Spark Project Launcher . SUCCESS [ 1.851 s] [INFO] Spark Project Core . SUCCESS [24:40 min] [INFO] Spark Project ML Local Library . SUCCESS [ 17.816 s] [INFO] Spark Project GraphX ... SUCCESS [01:27 min] [INFO] Spark Project Streaming SUCCESS [04:57 min] [INFO] Spark Project Catalyst . SUCCESS [07:56 min] [INFO] Spark Project SQL .. SUCCESS [ 01:01 h] [INFO] Spark Project ML Library ... SUCCESS [16:46 min] [INFO] Spark Project Tools SUCCESS [ 0.748 s] [INFO] Spark Project Hive . SUCCESS [ 01:11 h] [INFO] Spark Project REPL . SUCCESS [01:26 min] [INFO] Spark Project YARN Shuffle Service . SUCCESS [ 0.967 s] [INFO] Spark Project YARN . SUCCESS [06:54 min] [INFO] Spark Project Mesos SUCCESS [ 46.913 s] [INFO] Spark Project Kubernetes ... SUCCESS [01:08 min] [INFO] Spark Project Hive Thrift Server ... SUCCESS [19:12 min] [INFO] Spark Ganglia Integration .. SUCCESS [ 4.610 s] [INFO] Spark Project Hadoop Cloud Integration . SUCCESS [ 11.400 s] [INFO] Spark Project Assembly . SUCCESS [ 2.496 s] [INFO] Kafka 0.10+ Token Provider for Streaming ... SUCCESS [ 19.870 s] [INFO] Spark Integration for Kafka 0.10 ... SUCCESS [01:20 min] [INFO] Kafka 0.10+ Source for Structured Streaming SUCCESS [35:06 min] [INFO] Spark Kinesis Integration .. SUCCESS [ 29.667 s] [INFO] Spark Project Examples . SUCCESS [ 32.189 s] [INFO] Spark Integration for Kafka 0.10 Assembly .. SUCCESS [ 0.949 s] [INFO] Spark Avro . SUCCESS [01:55 min] [INFO] Spark Project Kinesis Assembly . SUCCESS [ 1.104 s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 04:19 h [INFO] Finished at: 2021-10-26T20:02:56+08:00 [INFO] {code} So should we add a Jenkins build and test job for Java 17? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37110) Add Java 17 support for spark pull request builds
[ https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434307#comment-17434307 ] Hyukjin Kwon commented on SPARK-37110: -- [~yumwang], when you find some time, feel free to set up JDK 17 and JDK 11. We will need some changes like https://github.com/apache/spark/pull/34091 and https://github.com/apache/spark/pull/34217. I was planning to do it but I am currently stuck in some internal works ... > Add Java 17 support for spark pull request builds > - > > Key: SPARK-37110 > URL: https://issues.apache.org/jira/browse/SPARK-37110 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37119) parse_url can not handle `{` and `}` correctly
[ https://issues.apache.org/jira/browse/SPARK-37119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-37119: - Description: when we execute the follow sql command {code:java} select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY') {code} the expected result: query=\{aa} the actual result: null was: when we execute the follow sql command {code:java} select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY') {code} the expected result: query=\{aa} the actual result: null > parse_url can not handle `{` and `}` correctly > -- > > Key: SPARK-37119 > URL: https://issues.apache.org/jira/browse/SPARK-37119 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.2.0, 3.3.0 >Reporter: Liu Shuo >Priority: Critical > > when we execute the follow sql command > {code:java} > select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY') > {code} > the expected result: > query=\{aa} > the actual result: > null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37119) parse_url can not handle `{` and `}` correctly
Liu Shuo created SPARK-37119: Summary: parse_url can not handle `{` and `}` correctly Key: SPARK-37119 URL: https://issues.apache.org/jira/browse/SPARK-37119 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 2.4.8, 3.3.0 Reporter: Liu Shuo When we execute the following SQL command {code:java} select parse_url('http://facebook.com/path/p1.php?query={aa}', 'QUERY') {code} the expected result: query=\{aa}; the actual result: null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37118: Assignee: (was: Apache Spark) > Add KMeans distanceMeasure param to PythonMLLibAPI > -- > > Key: SPARK-37118 > URL: https://issues.apache.org/jira/browse/SPARK-37118 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.2.1 >Reporter: Raimi bin Karim >Priority: Trivial > Fix For: 3.2.1 > > > SPARK-22119 added KMeans {{distanceMeasure}} to the Python API. > We should include this parameter too in the > {{PythonMLLibAPI.t}}{{rainKMeansModel}} method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434289#comment-17434289 ] Apache Spark commented on SPARK-37118: -- User 'remykarem' has created a pull request for this issue: https://github.com/apache/spark/pull/34393 > Add KMeans distanceMeasure param to PythonMLLibAPI > -- > > Key: SPARK-37118 > URL: https://issues.apache.org/jira/browse/SPARK-37118 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.2.1 >Reporter: Raimi bin Karim >Priority: Trivial > Fix For: 3.2.1 > > > SPARK-22119 added KMeans {{distanceMeasure}} to the Python API. > We should include this parameter too in the > {{PythonMLLibAPI.t}}{{rainKMeansModel}} method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-37118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37118: Assignee: Apache Spark > Add KMeans distanceMeasure param to PythonMLLibAPI > -- > > Key: SPARK-37118 > URL: https://issues.apache.org/jira/browse/SPARK-37118 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 3.2.1 >Reporter: Raimi bin Karim >Assignee: Apache Spark >Priority: Trivial > Fix For: 3.2.1 > > > SPARK-22119 added KMeans {{distanceMeasure}} to the Python API. > We should include this parameter too in the > {{PythonMLLibAPI.t}}{{rainKMeansModel}} method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37118) Add KMeans distanceMeasure param to PythonMLLibAPI
Raimi bin Karim created SPARK-37118: --- Summary: Add KMeans distanceMeasure param to PythonMLLibAPI Key: SPARK-37118 URL: https://issues.apache.org/jira/browse/SPARK-37118 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 3.2.1 Reporter: Raimi bin Karim Fix For: 3.2.1 SPARK-22119 added KMeans {{distanceMeasure}} to the Python API. We should include this parameter too in the {{PythonMLLibAPI.trainKMeansModel}} method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
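For reference, a hedged Scala sketch of the parameter on the RDD-based API that the Python wrapper does not yet forward. It assumes a SparkContext `sc` (e.g. in spark-shell); the setter exists on org.apache.spark.mllib.clustering.KMeans since the SPARK-22119 work:

{code:scala}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0), Vectors.dense(0.9, 0.1),
  Vectors.dense(0.0, 1.0), Vectors.dense(0.1, 0.9)))

val model = new KMeans()
  .setK(2)
  .setDistanceMeasure("cosine")   // the value pyspark.mllib cannot pass through yet
  .run(data)
{code}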
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434269#comment-17434269 ] Dongjoon Hyun commented on SPARK-35181: --- If you want to get some help, please use the official *Apache Spark 3.2.0* instead of your production Spark and give us a reproducible example. > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434263#comment-17434263 ] Dongjoon Hyun commented on SPARK-35181: --- I have no clue for those issues, but since it's JVM Runtime Error, why don't you try to use the latest Java 11 or Java 8? 1.8.0_232 looks like 2019 version. > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434258#comment-17434258 ] angerszhu commented on SPARK-35181: --- [~dongjoon] Yea, this error happens only when `spark.io.compression.codec=zstd` is used. It does not happen when writing/reading parquet. > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434257#comment-17434257 ] Dongjoon Hyun edited comment on SPARK-35181 at 10/26/21, 10:46 AM: --- BTW, according to the log, if you are trying to use Parquet with ZSTD, it's irrelevant with `spark.io.compression.codec`. You had better file an Apache Parquet JIRA, not Apache Spark JIRA. {code} CodecPool:184 - Got brand-new decompressor [.zst] {code} was (Author: dongjoon): BTW, according to the log, if you are trying to use Parquet with ZSTD, it's irrelevant with `spark.io.compression.codec`. You had better file file an Apache Parquet JIRA, not Apache Spark JIRA. {code} CodecPool:184 - Got brand-new decompressor [.zst] {code} > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434257#comment-17434257 ] Dongjoon Hyun commented on SPARK-35181: --- BTW, according to the log, if you are trying to use Parquet with ZSTD, it's irrelevant with `spark.io.compression.codec`. You had better file file an Apache Parquet JIRA, not Apache Spark JIRA. {code} CodecPool:184 - Got brand-new decompressor [.zst] {code} > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
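To make the distinction above concrete, a short sketch of the two independent settings; both configuration keys exist in stock Spark, and the values shown are assumptions about what the reporter intends to use:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("zstd-codecs")
  // Shuffle / broadcast / spill block compression -- what spark.io.compression.codec controls
  .config("spark.io.compression.codec", "zstd")
  // Compression of the Parquet data files themselves -- what the pasted decompressor log refers to
  .config("spark.sql.parquet.compression.codec", "zstd")
  .getOrCreate()
{code}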
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434252#comment-17434252 ] Dongjoon Hyun commented on SPARK-35181: --- I'm not sure how you build and configure your environment and what you are hitting there. `spark.io.compression.codec=zstd` is not unstable, [~angerszhuuu]. Are you sure that the errors are relevant to `spark.io.compression.codec=zstd`? > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35181) Use zstd for spark.io.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-35181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434241#comment-17434241 ] angerszhu commented on SPARK-35181: --- [~dongjoon] This issue resolve when we upgrade the zstd version, but another issue happened. {code:java} 2021-10-25 13:42:01 WARN SparkConf:69 - The configuration key 'spark.blacklist.application.fetchFailure.enabled' has been deprecated as of Spark 3.1.0 and may be removed in the future. Please use spark.excludeOnFailure.application.fetchFailure.enabled 2021-10-25 13:42:01 WARN SparkConf:69 - The configuration key 'spark.blacklist.enabled' has been deprecated as of Spark 3.1.0 and may be removed in the future. Please use spark.excludeOnFailure.enabled 2021-10-25 13:42:01 WARN SparkConf:69 - The configuration key 'spark.blacklist.killBlacklistedExecutors' has been deprecated as of Spark 3.1.0 and may be removed in the future. Please use spark.excludeOnFailure.killExcludedExecutors 2021-10-25 13:42:02 INFO EventMetricSparkPlugin:20 - Start to register event process metric plugin. 2021-10-25 13:42:10 INFO deprecation:1398 - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS 2021-10-25 13:42:10 INFO deprecation:1398 - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS 2021-10-25 13:42:23 INFO CodecPool:184 - Got brand-new decompressor [.zst] 2021-10-25 13:42:23 INFO CodecPool:184 - Got brand-new decompressor [.zst] # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f4017bb3112, pid=58809, tid=0x7f402cffe700 # # JRE version: OpenJDK Runtime Environment (8.0_232-b09) (build 1.8.0_232-b09) # Java VM: OpenJDK 64-Bit Server VM (25.232-b09 mixed mode linux-amd64 compressed oops) # Problematic frame: # C [libzstd-jni-1.5.0-28889732549921047792.so+0xc6112] # # Core dump written. Default location: /mnt/ssd/0/yarn/nm-local-dir/usercache/staging_data_trafficmart/appcache/application_1632999515383_3679724/container_e238_1632999515383_3679724_02_02/core or core.58809 # # An error report file with more information is saved as: # /mnt/ssd/0/yarn/nm-local-dir/usercache/staging_data_trafficmart/appcache/application_1632999515383_3679724/container_e238_1632999515383_3679724_02_02/hs_err_pid58809.log # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. {code} Seems zstd so unstable? Or it's related to our zstd env problem? > Use zstd for spark.io.compression.codec by default > -- > > Key: SPARK-35181 > URL: https://issues.apache.org/jira/browse/SPARK-35181 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-37109. - Resolution: Won't Do OK. I was not aware of this plan. > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37110) Add Java 17 support for spark pull request builds
[ https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-37110. - Resolution: Won't Do OK. I was not aware of this plan. > Add Java 17 support for spark pull request builds > - > > Key: SPARK-37110 > URL: https://issues.apache.org/jira/browse/SPARK-37110 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434214#comment-17434214 ] Dongjoon Hyun commented on SPARK-37051: --- Got it. If that happens on Parquet, we had better drop `ORC` from the JIRA title. I removed it first. > This scenario also occur on Parquet. > The filter operator gets wrong results in char type > --- > > Key: SPARK-37051 > URL: https://issues.apache.org/jira/browse/SPARK-37051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.3.0 > Environment: Spark 3.1.2 > Scala 2.12 / Java 1.8 >Reporter: frankli >Priority: Critical > > When I try the following sample SQL on the TPCDS data, the filter operator > returns an empty row set (shown in web ui). > _select * from item where i_category = 'Music' limit 100;_ > The table is in ORC format, and i_category is char(50) type. > Data is inserted by hive, and queried by Spark. > I guest that the char(50) type will remains redundant blanks after the actual > word. > It will affect the boolean value of "x.equals(Y)", and results in wrong > results. > Luckily, the varchar type is OK. > > This bug can be reproduced by a few steps. > >>> desc t2_orc; > ++---+++ > |col_name|data_type|comment| > ++---+++ > |a|string |NULL| > |b|char(50) |NULL| > |c|int |NULL| > ++---++--–+ > >>> select * from t2_orc where a='a'; > +-+---++--+ > |a|b|c| > +-+---++--+ > |a|b|1| > |a|b|2| > |a|b|3| > |a|b|4| > |a|b|5| > +-+---++–+ > >>> select * from t2_orc where b='b'; > +-+---++--+ > |a|b|c| > +-+---++--+ > +-+---++--+ > > By the way, Spark's tests should add more cases on the char type. > > == Physical Plan == > CollectLimit (3) > +- Filter (2) > +- Scan orc tpcds_bin_partitioned_orc_2.item (1) > (1) Scan orc tpcds_bin_partitioned_orc_2.item > Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, > i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, > i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, > i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, > i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, > i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Batched: false > Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item] > PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music > )] > ReadSchema: > struct > (2) Filter > Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, > i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, > i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, > i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, > i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, > i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music ))+ > (3) CollectLimit > Input [22]: [i_item_sk#0L, 
i_item_id#1, i_rec_start_date#2, > i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, > i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, > i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, > i_color#17, i_units#18, i_container#19, i_manager_id#20, > i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, > i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, > i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, > i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, > i_units#18, i_container#19, i_manager_id#20, i_product_name#21] > Arguments: 100 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37051: -- Summary: The filter operator gets wrong results in char type (was: The filter operator gets wrong results in ORC's char type)
> The filter operator gets wrong results in char type
> ---
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> Data is inserted by Hive, and queried by Spark.
> I guess that the char(50) type retains redundant blanks after the actual word.
> It affects the boolean value of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>
> By the way, Spark's tests should add more cases on the char type.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
>
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
> ReadSchema: struct
>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
>
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
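[Editorial note] A minimal PySpark sketch of the failure mode reported in SPARK-37051 above, assuming a Hive-written ORC table like the reporter's item table with a CHAR(50) i_category column; the rtrim comparison is only an illustration of the padding mismatch, not the actual fix.
{code:python}
# Sketch only: assumes an existing Hive-written ORC table `item` whose
# i_category column is CHAR(50), as described in the report above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# CHAR(50) values are stored padded with trailing blanks ('Music' + 45 spaces).
# If the reader and writer disagree on padding, this equality may match nothing.
spark.sql("SELECT * FROM item WHERE i_category = 'Music' LIMIT 100").show()

# Illustrative workaround only: trim before comparing to sidestep the padding issue.
spark.sql("SELECT * FROM item WHERE rtrim(i_category) = 'Music' LIMIT 100").show()
{code}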
[jira] [Created] (SPARK-37117) Can't read files in one of Parquet encryption modes (external keymaterial)
Gidon Gershinsky created SPARK-37117:
Summary: Can't read files in one of Parquet encryption modes (external keymaterial)
Key: SPARK-37117
URL: https://issues.apache.org/jira/browse/SPARK-37117
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.2.0
Reporter: Gidon Gershinsky
Parquet encryption has a number of modes. One of them is "external keymaterial", which keeps encrypted data keys in a separate file (as opposed to inside the Parquet file). Upon reading, the Spark Parquet connector does not pass the file path, which causes an NPE.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
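[Editorial note] For context, a hedged sketch of how the external key material mode is typically switched on from Spark. The property names follow the parquet-mr key tools as commonly documented and should be verified against the versions in use; the KMS client class and key IDs are placeholders.
{code:python}
# Sketch only: com.example.MyKmsClient and the key IDs below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.parquet.crypto.factory.class",
            "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    .config("spark.hadoop.parquet.encryption.kms.client.class",
            "com.example.MyKmsClient")  # placeholder KMS client implementation
    # External key material: keep encrypted data keys in separate key-material
    # files next to the Parquet files instead of inside the Parquet footers.
    .config("spark.hadoop.parquet.encryption.key.material.store.internally", "false")
    .getOrCreate()
)

df = spark.range(10).withColumnRenamed("id", "secret")
(df.write
   .option("parquet.encryption.footer.key", "footerKeyId")
   .option("parquet.encryption.column.keys", "columnKeyId:secret")
   .parquet("/tmp/encrypted_table"))

# Reading back is where the reported NPE shows up: with external key material,
# the reader needs the file path in order to locate the key-material file.
spark.read.parquet("/tmp/encrypted_table").show()
{code}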
[jira] [Comment Edited] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434201#comment-17434201 ] frankli edited comment on SPARK-37051 at 10/26/21, 8:48 AM: This scenario also occurs on Parquet. [~dongjoon] Spark 3.1 does padding on both the writer and reader side. So, Spark 3.1 cannot read Hive data that was written without padding, while Spark 2.4 works well. was (Author: frankli): This scenario also occurs on Parquet. Spark 3.1 does padding on both the writer and reader side. So, Spark 3.1 cannot read Hive data that was written without padding, while Spark 2.4 works well.
> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> Data is inserted by Hive, and queried by Spark.
> I guess that the char(50) type retains redundant blanks after the actual word.
> It affects the boolean value of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>
> By the way, Spark's tests should add more cases on the char type.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
>
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
> ReadSchema: struct
>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
>
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434201#comment-17434201 ] frankli commented on SPARK-37051: - This scenario also occurs on Parquet. Spark 3.1 does padding on both the writer and reader side. So, Spark 3.1 cannot read Hive data that was written without padding, while Spark 2.4 works well.
> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> Data is inserted by Hive, and queried by Spark.
> I guess that the char(50) type retains redundant blanks after the actual word.
> It affects the boolean value of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>
> By the way, Spark's tests should add more cases on the char type.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
>
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
> ReadSchema: struct
>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
>
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434196#comment-17434196 ] Dongjoon Hyun commented on SPARK-37051: --- Does Parquet work in those scenarios, [~frankli] and [~wangzhun]?
> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> Data is inserted by Hive, and queried by Spark.
> I guess that the char(50) type retains redundant blanks after the actual word.
> It affects the boolean value of "x.equals(Y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +----------+-----------+---------+
> | col_name | data_type | comment |
> +----------+-----------+---------+
> | a        | string    | NULL    |
> | b        | char(50)  | NULL    |
> | c        | int       | NULL    |
> +----------+-----------+---------+
> >>> select * from t2_orc where a='a';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> | a | b | 1 |
> | a | b | 2 |
> | a | b | 3 |
> | a | b | 4 |
> | a | b | 5 |
> +---+---+---+
> >>> select * from t2_orc where b='b';
> +---+---+---+
> | a | b | c |
> +---+---+---+
> +---+---+---+
>
> By the way, Spark's tests should add more cases on the char type.
>
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
>
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music )]
> ReadSchema: struct
>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music ))
>
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-36989. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34296 [https://github.com/apache/spark/pull/34296]
> Migrate type hint data tests
>
>
> Key: SPARK-36989
> URL: https://issues.apache.org/jira/browse/SPARK-36989
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Tests
> Affects Versions: 3.3.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Major
> Fix For: 3.3.0
>
> Before the migration, {{pyspark-stubs}} contained a set of [data tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], modeled after, and using the internal test utilities of, mypy.
> These were omitted during the migration for a few reasons:
> * Simplicity.
> * Relative slowness.
> * Dependence on a non-public API.
>
> Data tests are useful for a number of reasons:
> * Improving test coverage for type hints.
> * Checking if type checkers infer expected types.
> * Checking if type checkers reject incorrect code.
> * Detecting unusual errors with code that otherwise type checks.
>
> In particular, the last two functions are not fulfilled by simple validation of the existing codebase.
>
> Data tests are not required for all annotations and can be restricted to code that has a high possibility of failure:
> * Complex overloaded signatures.
> * Complex generics.
> * Generic {{self}} annotations.
> * Code containing {{type: ignore}}.
> The biggest risk is that output matchers have to be updated when signatures and/or mypy output change.
> An example of a problem detected with data tests can be found in the SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]).
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
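[Editorial note] To make the idea of a "data test" concrete, here is a hypothetical example of the kind of snippet such a test feeds to mypy, with the expected verdicts written as comments. The expectations shown are illustrative only and are not taken from the actual test suite; the real tests match mypy's output rather than executing the code.
{code:python}
# Hypothetical data-test payload: mypy is run over this snippet and its output
# is compared against expectations. Nothing here is executed at runtime.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

df.select(col("id") + 1)   # should type check: Column arithmetic is allowed
df.select(1.5)             # should be rejected: a bare float is not a valid column argument
{code}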
[jira] [Assigned] (SPARK-36989) Migrate type hint data tests
[ https://issues.apache.org/jira/browse/SPARK-36989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-36989: -- Assignee: Maciej Szymkiewicz
> Migrate type hint data tests
>
>
> Key: SPARK-36989
> URL: https://issues.apache.org/jira/browse/SPARK-36989
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Tests
> Affects Versions: 3.3.0
> Reporter: Maciej Szymkiewicz
> Assignee: Maciej Szymkiewicz
> Priority: Major
>
> Before the migration, {{pyspark-stubs}} contained a set of [data tests|https://github.com/zero323/pyspark-stubs/tree/branch-3.0/test-data/unit], modeled after, and using the internal test utilities of, mypy.
> These were omitted during the migration for a few reasons:
> * Simplicity.
> * Relative slowness.
> * Dependence on a non-public API.
>
> Data tests are useful for a number of reasons:
> * Improving test coverage for type hints.
> * Checking if type checkers infer expected types.
> * Checking if type checkers reject incorrect code.
> * Detecting unusual errors with code that otherwise type checks.
>
> In particular, the last two functions are not fulfilled by simple validation of the existing codebase.
>
> Data tests are not required for all annotations and can be restricted to code that has a high possibility of failure:
> * Complex overloaded signatures.
> * Complex generics.
> * Generic {{self}} annotations.
> * Code containing {{type: ignore}}.
> The biggest risk is that output matchers have to be updated when signatures and/or mypy output change.
> An example of a problem detected with data tests can be found in the SPARK-36894 PR ([https://github.com/apache/spark/pull/34146]).
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37110) Add Java 17 support for spark pull request builds
[ https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434190#comment-17434190 ] Dongjoon Hyun commented on SPARK-37110: --- +1 for [~hyukjin.kwon]'s comment to save the community resources. > Add Java 17 support for spark pull request builds > - > > Key: SPARK-37110 > URL: https://issues.apache.org/jira/browse/SPARK-37110 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434189#comment-17434189 ] Dongjoon Hyun commented on SPARK-37109: --- +1 for [~hyukjin.kwon]'s comment. > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37116) Allow sequences (tuples and lists) as pivot values argument in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434183#comment-17434183 ] Maciej Szymkiewicz commented on SPARK-37116: Sadly, this is not going to work. For example, the following will typecheck, although it is incorrect. {{Tuple | List}} might work, but this is probably a more general problem in how we interact with the JVM, not limited to typing issues.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.read
    .csv("foo.csv")
    .groupBy("foo")
    .pivot("bar", "baz")
    .sum())
{code}
> Allow sequences (tuples and lists) as pivot values argument in PySpark
> --
>
> Key: SPARK-37116
> URL: https://issues.apache.org/jira/browse/SPARK-37116
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: dch nguyen
> Priority: Minor
>
> Both tuples and lists are accepted by PySpark at runtime.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
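[Editorial note] A small sketch of the trade-off discussed above: lists and tuples both work at runtime today, and a union of the two concrete types (rather than a generic Sequence) is one way to keep a bare str from slipping through. The PivotValues alias is hypothetical and is not PySpark's actual annotation.
{code:python}
# Sketch of the typing trade-off: a Sequence-based hint would also (wrongly)
# accept a bare str such as "baz", which is the incorrect call shown above.
from typing import List, Tuple, Union

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "x", 1), ("a", "y", 2), ("b", "x", 3)], ["foo", "bar", "val"]
)

df.groupBy("foo").pivot("bar", ["x", "y"]).sum("val").show()   # list: accepted at runtime
df.groupBy("foo").pivot("bar", ("x", "y")).sum("val").show()   # tuple: accepted at runtime

# Hypothetical hint that keeps plain str out, unlike Sequence[str].
PivotValues = Union[List[str], Tuple[str, ...]]
{code}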
[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434176#comment-17434176 ] Dongjoon Hyun commented on SPARK-37049: --- It's fixed now.
> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1.0
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
> SPARK-33099 added support to respect "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator.
> However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, and this causes the function "isExecutorIdleTimedOut()" to always return true for pending pods.
> As a result, pending pods are deleted immediately when a stage is finished, and several new pods get recreated again in the next stage.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
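[Editorial note] A simplified sketch, in Python rather than the actual Scala in ExecutorPodsAllocator, of why an empty start time makes the idle-timeout check fire immediately for pending pods, and the general shape of a fix. The real patch may differ in detail.
{code:python}
# Illustration only: not Spark's code. Shows why treating a missing startTime
# as 0 makes every pending pod look idle-timed-out immediately.
import time

IDLE_TIMEOUT_S = 60  # stand-in for spark.dynamicAllocation.executorIdleTimeout

def is_idle_timed_out_buggy(start_time_s, now_s):
    # Pending pods have no startTime; falling back to 0 makes this always true.
    return (now_s - (start_time_s or 0)) > IDLE_TIMEOUT_S

def is_idle_timed_out_fixed(start_time_s, creation_time_s, now_s):
    # General shape of a fix: fall back to a timestamp that is always present,
    # such as the pod creation time, when startTime is not set yet.
    reference = start_time_s if start_time_s is not None else creation_time_s
    return (now_s - reference) > IDLE_TIMEOUT_S

now = time.time()
print(is_idle_timed_out_buggy(None, now))            # True right away (the bug)
print(is_idle_timed_out_fixed(None, now - 10, now))  # False until the timeout elapses
{code}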
[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37049: - Assignee: Weiwei Yang
> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1.0
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
> SPARK-33099 added support to respect "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator.
> However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, and this causes the function "isExecutorIdleTimedOut()" to always return true for pending pods.
> As a result, pending pods are deleted immediately when a stage is finished, and several new pods get recreated again in the next stage.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37049: - Assignee: (was: wwei)
> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1.0
> Reporter: Weiwei Yang
> Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
> SPARK-33099 added support to respect "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator.
> However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, and this causes the function "isExecutorIdleTimedOut()" to always return true for pending pods.
> As a result, pending pods are deleted immediately when a stage is finished, and several new pods get recreated again in the next stage.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434173#comment-17434173 ] Dongjoon Hyun commented on SPARK-37049: --- Oh, sure. Sorry, [~wwei].
> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1.0
> Reporter: Weiwei Yang
> Assignee: wwei
> Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
> SPARK-33099 added support to respect "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator.
> However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, and this causes the function "isExecutorIdleTimedOut()" to always return true for pending pods.
> As a result, pending pods are deleted immediately when a stage is finished, and several new pods get recreated again in the next stage.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers Managed Spark Clusters
[ https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naga Vijayapuram updated SPARK-37114: - Description: To be able to submit jobs to prominent cloud providers' managed Spark clusters, "spark-submit" can be enhanced. For example, to submit a job to "google cloud dataproc", "spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." when the "–master gcd://cluster-name" arg is used. Once this feature is accepted and prioritized, it can be rolled out in current and future versions of Spark and also backported to a few previous versions. I can raise the pull request. (was: To be able to submit jobs to prominent cloud provider managed spark clusters, "spark-submit" can be enhanced. For example, to submit job to "google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. Once this feature is accepted and prioritized, then it can be rolled out in current and future versions of spark and also back ported to a few previous versions. I can raise the pull request.)
> Support Submitting Jobs to Cloud Providers Managed Spark Clusters
> -
>
> Key: SPARK-37114
> URL: https://issues.apache.org/jira/browse/SPARK-37114
> Project: Spark
> Issue Type: New Feature
> Components: Deploy
> Affects Versions: 3.2.0
> Reporter: Naga Vijayapuram
> Priority: Trivial
>
> To be able to submit jobs to prominent cloud providers' managed Spark clusters, "spark-submit" can be enhanced. For example, to submit a job to "google cloud dataproc", "spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." when the "–master gcd://cluster-name" arg is used. Once this feature is accepted and prioritized, it can be rolled out in current and future versions of Spark and also backported to a few previous versions. I can raise the pull request.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
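[Editorial note] Purely to illustrate the proposal above: the gcd:// scheme is the reporter's suggestion and none of this exists in Spark today, and the gcloud flags shown are assumptions that should be checked against the gcloud documentation.
{code:python}
# Illustrative only: this feature does not exist in Spark. The argument mapping
# and the gcloud flags below are assumptions made for the sake of the example.
import shlex
import subprocess

def submit_via_dataproc(cluster, region, main_class, jar, app_args):
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
        "--",
        *app_args,
    ]
    print("would run:", shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually invoke gcloud

# e.g. what `spark-submit --master gcd://my-cluster ...` might translate into
submit_via_dataproc("my-cluster", "us-central1", "com.example.Main",
                    "gs://my-bucket/app.jar", ["arg1"])
{code}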
[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers Managed Spark Clusters
[ https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naga Vijayapuram updated SPARK-37114: - Summary: Support Submitting Jobs to Cloud Providers Managed Spark Clusters (was: Support Submitting Jobs to Cloud Providers) > Support Submitting Jobs to Cloud Providers Managed Spark Clusters > - > > Key: SPARK-37114 > URL: https://issues.apache.org/jira/browse/SPARK-37114 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Naga Vijayapuram >Priority: Trivial > > To be able to submit jobs to prominent cloud provider managed spark clusters, > "spark-submit" can be enhanced. For example, to submit job to "google cloud > dataproc", the "spark-submit" can be enhanced to issue "gcloud dataproc jobs > submit spark ..." when "–master gcd://cluster-name" arg is used. Once this > feature is accepted and prioritized, then it can be rolled out in current and > future versions of spark and also back ported to a few previous versions. I > can raise the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers
[ https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naga Vijayapuram updated SPARK-37114: - Description: To be able to submit jobs to prominent cloud provider managed spark clusters, "spark-submit" can be enhanced. For example, to submit job to "google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. Once this feature is accepted and prioritized, then it can be rolled out in current and future versions of spark and also back ported to a few previous versions. I can raise the pull request. (was: To be able to submit jobs to cloud providers, "spark-submit" can be enhanced. For example, to submit job to "google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. Once this feature is accepted and prioritized, then it can be rolled out in current and future versions of spark and also back ported to a few previous versions. I can raise the pull request.) > Support Submitting Jobs to Cloud Providers > -- > > Key: SPARK-37114 > URL: https://issues.apache.org/jira/browse/SPARK-37114 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Naga Vijayapuram >Priority: Trivial > > To be able to submit jobs to prominent cloud provider managed spark clusters, > "spark-submit" can be enhanced. For example, to submit job to "google cloud > dataproc", the "spark-submit" can be enhanced to issue "gcloud dataproc jobs > submit spark ..." when "–master gcd://cluster-name" arg is used. Once this > feature is accepted and prioritized, then it can be rolled out in current and > future versions of spark and also back ported to a few previous versions. I > can raise the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37098) Alter table properties should invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37098: -- Fix Version/s: 3.1.3
> Alter table properties should invalidate cache
> --
>
> Key: SPARK-37098
> URL: https://issues.apache.org/jira/browse/SPARK-37098
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
> Reporter: XiDuo You
> Assignee: XiDuo You
> Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
> The table properties can change the behavior of writing, e.g. a Parquet table with `parquet.compression`.
> If you execute the following SQL, we will get a file with snappy compression rather than zstd.
> {code:java}
> CREATE TABLE t (c int) STORED AS PARQUET;
> // cache table metadata
> SELECT * FROM t;
> ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd');
> INSERT INTO TABLE t values(1);
> {code}
> So we should invalidate the table cache after altering table properties.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
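[Editorial note] The same reproduction driven from PySpark. The REFRESH TABLE line is only a manual workaround sketch for releases without this fix; with the fix, the cache is invalidated automatically.
{code:python}
# Sketch of the repro above from PySpark, assuming Hive support is enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE t (c INT) STORED AS PARQUET")
spark.sql("SELECT * FROM t").collect()        # caches the table metadata
spark.sql("ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd')")
spark.sql("REFRESH TABLE t")                  # workaround: drop the stale cached relation
spark.sql("INSERT INTO TABLE t VALUES (1)")   # should now be written with zstd, not snappy
{code}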
[jira] [Updated] (SPARK-37114) Support Submitting Jobs to Cloud Providers
[ https://issues.apache.org/jira/browse/SPARK-37114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naga Vijayapuram updated SPARK-37114: - Description: To be able to submit jobs to cloud providers, "spark-submit" can be enhanced. For example, to submit job to "google cloud dataproc", the "spark-submit" can be enhanced to issue "gcloud dataproc jobs submit spark ..." when "–master gcd://cluster-name" arg is used. Once this feature is accepted and prioritized, then it can be rolled out in current and future versions of spark and also back ported to a few previous versions. I can raise the pull request. (was: To be able to submit jobs to cloud providers, `spark-submit` can be enhanced. For example, to submit job to `google dataproc`, the `spark-submit` can be enhanced to do this ... `gcloud dataproc jobs submit spark ...` when `–master google-cloud-dataproc` arg is used. Once this feature is accepted and prioritized, then it can be rolled out in current and future versions of spark and also back ported to previous versions. I can raise the pull request.) > Support Submitting Jobs to Cloud Providers > -- > > Key: SPARK-37114 > URL: https://issues.apache.org/jira/browse/SPARK-37114 > Project: Spark > Issue Type: New Feature > Components: Deploy >Affects Versions: 3.2.0 >Reporter: Naga Vijayapuram >Priority: Trivial > > To be able to submit jobs to cloud providers, "spark-submit" can be enhanced. > For example, to submit job to "google cloud dataproc", the "spark-submit" can > be enhanced to issue "gcloud dataproc jobs submit spark ..." when "–master > gcd://cluster-name" arg is used. Once this feature is accepted and > prioritized, then it can be rolled out in current and future versions of > spark and also back ported to a few previous versions. I can raise the pull > request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36890) Use default WebsocketPingInterval for Kubernetes watches
[ https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36890: -- Affects Version/s: (was: 3.0.3) (was: 3.1.2) (was: 3.1.1) (was: 3.0.2) (was: 2.4.8) (was: 2.4.7) (was: 3.0.1) (was: 2.4.6) (was: 3.1.0) (was: 2.4.5) (was: 2.4.4) (was: 2.4.3) (was: 2.4.2) (was: 2.3.4) (was: 2.4.1) (was: 2.3.3) (was: 2.3.2) (was: 2.3.1) (was: 2.4.0) (was: 2.3.0) (was: 3.0.0) 3.3.0 > Use default WebsocketPingInterval for Kubernetes watches > > > Key: SPARK-36890 > URL: https://issues.apache.org/jira/browse/SPARK-36890 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Philipp Dallig >Assignee: Philipp Dallig >Priority: Major > Fix For: 3.3.0 > > > If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have > set a tunnel timeout, the following error message is thrown exactly after > each timeout. > {code} > >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure > java.io.EOFException > at okio.RealBufferedSource.require(RealBufferedSource.java:61) > at okio.RealBufferedSource.readByte(RealBufferedSource.java:74) > at > okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101) > at > okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at > okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > This exception is quite annoying when working interactively with a paused > pySpark shell where the driver component runs locally but the executors run > in Kubernetes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36890) Use default WebsocketPingInterval for Kubernetes watches
[ https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36890: -- Summary: Use default WebsocketPingInterval for Kubernetes watches (was: Websocket timeouts to K8s-API) > Use default WebsocketPingInterval for Kubernetes watches > > > Key: SPARK-36890 > URL: https://issues.apache.org/jira/browse/SPARK-36890 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, > 3.1.1, 3.1.2 >Reporter: Philipp Dallig >Assignee: Philipp Dallig >Priority: Major > Fix For: 3.3.0 > > > If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have > set a tunnel timeout, the following error message is thrown exactly after > each timeout. > {code} > >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure > java.io.EOFException > at okio.RealBufferedSource.require(RealBufferedSource.java:61) > at okio.RealBufferedSource.readByte(RealBufferedSource.java:74) > at > okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101) > at > okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at > okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > This exception is quite annoying when working interactively with a paused > pySpark shell where the driver component runs locally but the executors run > in Kubernetes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36890) Websocket timeouts to K8s-API
[ https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36890. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34143 [https://github.com/apache/spark/pull/34143] > Websocket timeouts to K8s-API > - > > Key: SPARK-36890 > URL: https://issues.apache.org/jira/browse/SPARK-36890 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, > 3.1.1, 3.1.2 >Reporter: Philipp Dallig >Assignee: Philipp Dallig >Priority: Major > Fix For: 3.3.0 > > > If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have > set a tunnel timeout, the following error message is thrown exactly after > each timeout. > {code} > >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure > java.io.EOFException > at okio.RealBufferedSource.require(RealBufferedSource.java:61) > at okio.RealBufferedSource.readByte(RealBufferedSource.java:74) > at > okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101) > at > okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at > okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > This exception is quite annoying when working interactively with a paused > pySpark shell where the driver component runs locally but the executors run > in Kubernetes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36890) Websocket timeouts to K8s-API
[ https://issues.apache.org/jira/browse/SPARK-36890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36890: - Assignee: Philipp Dallig > Websocket timeouts to K8s-API > - > > Key: SPARK-36890 > URL: https://issues.apache.org/jira/browse/SPARK-36890 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, > 3.1.1, 3.1.2 >Reporter: Philipp Dallig >Assignee: Philipp Dallig >Priority: Major > > If you access the Kubernetes API via a load balancer (e.g. HAProxy) and have > set a tunnel timeout, the following error message is thrown exactly after > each timeout. > {code} > >>> 21/09/27 15:35:19 WARN WatchConnectionManager: Exec Failure > java.io.EOFException > at okio.RealBufferedSource.require(RealBufferedSource.java:61) > at okio.RealBufferedSource.readByte(RealBufferedSource.java:74) > at > okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101) > at > okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at > okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > This exception is quite annoying when working interactively with a paused > pySpark shell where the driver component runs locally but the executors run > in Kubernetes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37110) Add Java 17 support for spark pull request builds
[ https://issues.apache.org/jira/browse/SPARK-37110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-37110: Summary: Add Java 17 support for spark pull request builds (was: Add java17 support for spark pull request builds) > Add Java 17 support for spark pull request builds > - > > Key: SPARK-37110 > URL: https://issues.apache.org/jira/browse/SPARK-37110 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37087) merge three relation resolutions into one
[ https://issues.apache.org/jira/browse/SPARK-37087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37087: --- Assignee: Wenchen Fan > merge three relation resolutions into one > - > > Key: SPARK-37087 > URL: https://issues.apache.org/jira/browse/SPARK-37087 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37087) merge three relation resolutions into one
[ https://issues.apache.org/jira/browse/SPARK-37087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37087. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34358 [https://github.com/apache/spark/pull/34358] > merge three relation resolutions into one > - > > Key: SPARK-37087 > URL: https://issues.apache.org/jira/browse/SPARK-37087 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org