[jira] [Created] (SPARK-31832) Add tool tip for Structured streaming page tables
jobit mathew created SPARK-31832:
------------------------------------

             Summary: Add tool tip for Structured streaming page tables
                 Key: SPARK-31832
                 URL: https://issues.apache.org/jira/browse/SPARK-31832
             Project: Spark
          Issue Type: Sub-task
          Components: SQL, Web UI
    Affects Versions: 3.1.0
            Reporter: jobit mathew


It would be better to add tool tips for the Structured Streaming page tables.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26646) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117400#comment-17117400 ]

Jungtaek Lim commented on SPARK-26646:
--------------------------------------

Still happening.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123143/testReport/

Would we need to disable the test for now?

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26646
>                 URL: https://issues.apache.org/jira/browse/SPARK-26646
>             Project: Spark
>          Issue Type: Test
>          Components: MLlib, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.0.0
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101356/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101358/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101254/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100941/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100327/console
> {code}
> ======================================================================
> FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction
>     self._eventually(condition, timeout=60.0)
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually
>     lastValue = condition()
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 362, in condition
>     self.assertGreater(errors[1] - errors[-1], 0.3)
> AssertionError: -0.070062 not greater than 0.3
> ----------------------------------------------------------------------
> Ran 13 tests in 198.327s
> FAILED (failures=1, skipped=1)
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with python3.4; see logs.
> {code}
> It apparently became less flaky after the timeout was increased in SPARK-26275, but it now looks flaky again due to unexpected results.
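The `_eventually` helper visible in the traceback retries a condition until a timeout, which is exactly what makes the assertion flaky when convergence is slow. Below is a minimal pure-Python sketch of that retry pattern, inferred from the stack trace; the name, signature, and failure message are assumptions, not Spark's exact implementation.

```python
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Retry `condition` until it returns True or the timeout expires.

    Sketch of an `_eventually`-style helper (hypothetical, based on the
    traceback above): the last value returned by the condition is kept so
    a flaky assertion can report *how far off* it was, not just that it
    timed out.
    """
    deadline = time.time() + timeout
    last_value = None
    while time.time() < deadline:
        last_value = condition()
        if last_value is True:
            return
        time.sleep(interval)
    raise AssertionError(
        "condition not met within %ss; last value: %r" % (timeout, last_value))
```

The flakiness discussed in the ticket comes from the condition itself (`errors[1] - errors[-1] > 0.3`) never becoming true on some runs, so no retry loop or timeout increase can fully fix it.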
[jira] [Commented] (SPARK-29137) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction
[ https://issues.apache.org/jira/browse/SPARK-29137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117399#comment-17117399 ]

Jungtaek Lim commented on SPARK-29137:
--------------------------------------

Still valid on latest master.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123144/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123146/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123141/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123142/testReport/

Would we need to disable the test for now?

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29137
>                 URL: https://issues.apache.org/jira/browse/SPARK-29137
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Tests
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/
> {code:java}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 503, in test_train_prediction
>     self._eventually(condition)
>   File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually
>     lastValue = condition()
>   File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 498, in condition
>     self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2
> {code}
[jira] [Created] (SPARK-31831) Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)
Jungtaek Lim created SPARK-31831:
------------------------------------

             Summary: Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)
                 Key: SPARK-31831
                 URL: https://issues.apache.org/jira/browse/SPARK-31831
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Jungtaek Lim


I've seen the failure twice (not in a row, but close together), which seems to warrant investigation.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123147/testReport
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123150/testReport

{noformat}
org.mockito.exceptions.base.MockitoException:
ClassCastException occurred while creating the mockito mock :
  class to mock : 'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400'
  created class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by classloader : 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
  proxy instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by classloader : 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
  instance creation by : ObjenesisInstantiator

You might experience classloading issues, please ask the mockito mailing-list.

Stack Trace:
sbt.ForkMain$ForkError: org.mockito.exceptions.base.MockitoException:
ClassCastException occurred while creating the mockito mock :
  class to mock : 'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400'
  created class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by classloader : 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
  proxy instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by classloader : 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
  instance creation by : ObjenesisInstantiator

You might experience classloading issues, please ask the mockito mailing-list.
	at org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.beforeAll(HiveSessionImplSuite.scala:44)
	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59)
	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
	at sbt.ForkMain$Run$2.call(ForkMain.java:296)
	at sbt.ForkMain$Run$2.call(ForkMain.java:286)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: sbt.ForkMain$ForkError: java.lang.ClassCastException: org.mockito.codegen.SessionManager$MockitoMock$1696557705 cannot be cast to org.mockito.internal.creation.bytebuddy.MockAccess
	at org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:48)
	at org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25)
	at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35)
	at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63)
	at org.mockito.Mockito.mock(Mockito.java:1908)
	at org.mockito.Mockito.mock(Mockito.java:1817)
	... 13 more
{noformat}
[jira] [Commented] (SPARK-31696) Support spark.kubernetes.driver.service.annotation
[ https://issues.apache.org/jira/browse/SPARK-31696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117360#comment-17117360 ]

Dongjoon Hyun commented on SPARK-31696:
---------------------------------------

I'll give a talk at Spark Summit next month. :)
- https://databricks.com/session_na20/native-support-of-prometheus-monitoring-in-apache-spark-3-0

> Support spark.kubernetes.driver.service.annotation
> --------------------------------------------------
>
>                 Key: SPARK-31696
>                 URL: https://issues.apache.org/jira/browse/SPARK-31696
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>             Fix For: 3.0.0
>
[jira] [Commented] (SPARK-31830) Consistent error handling for datetime formatting functions
[ https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117334#comment-17117334 ]

Apache Spark commented on SPARK-31830:
--------------------------------------

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28650

> Consistent error handling for datetime formatting functions
> -----------------------------------------------------------
>
>                 Key: SPARK-31830
>                 URL: https://issues.apache.org/jira/browse/SPARK-31830
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Kent Yao
>            Priority: Major
>
> date_format and from_unixtime have different error handling behavior for formatting datetime values.
[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions
[ https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31830:
------------------------------------

    Assignee: Apache Spark

> Consistent error handling for datetime formatting functions
> -----------------------------------------------------------
>
>                 Key: SPARK-31830
>                 URL: https://issues.apache.org/jira/browse/SPARK-31830
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Kent Yao
>            Assignee: Apache Spark
>            Priority: Major
>
> date_format and from_unixtime have different error handling behavior for formatting datetime values.
[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions
[ https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31830:
------------------------------------

    Assignee: (was: Apache Spark)

> Consistent error handling for datetime formatting functions
> -----------------------------------------------------------
>
>                 Key: SPARK-31830
>                 URL: https://issues.apache.org/jira/browse/SPARK-31830
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Kent Yao
>            Priority: Major
>
> date_format and from_unixtime have different error handling behavior for formatting datetime values.
[jira] [Created] (SPARK-31830) Consistent error handling for datetime formatting functions
Kent Yao created SPARK-31830:
--------------------------------

             Summary: Consistent error handling for datetime formatting functions
                 Key: SPARK-31830
                 URL: https://issues.apache.org/jira/browse/SPARK-31830
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Kent Yao


date_format and from_unixtime have different error handling behavior for formatting datetime values.
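The inconsistency the ticket describes, one formatting path failing fast while the other silently degrades, can be illustrated with a toy sketch. This is pure Python with hypothetical names, not Spark's implementation; it only shows the shape of the problem and of a single shared error policy.

```python
def strict_format(fields, pattern):
    """Fail-fast formatter: an unsupported pattern letter raises.
    (Toy stand-in for a date_format-like code path.)"""
    supported = {"y": "year", "M": "month", "d": "day"}  # toy subset
    out = []
    for ch in pattern:
        if ch not in supported:
            raise ValueError("Illegal pattern character: %r" % ch)
        out.append(str(fields[supported[ch]]))
    return "-".join(out)

def lenient_format(fields, pattern):
    """Lenient formatter: swallows the same error and returns None.
    (Toy stand-in for a from_unixtime-like code path.)"""
    try:
        return strict_format(fields, pattern)
    except ValueError:
        return None

def consistent_format(fields, pattern, fail_fast=True):
    """One shared policy for both call sites: either always raise or
    always return None, instead of each function choosing for itself."""
    if fail_fast:
        return strict_format(fields, pattern)
    return lenient_format(fields, pattern)
```

The point of the sketch is that "consistent error handling" is a policy decision hoisted above the individual functions, so a bad pattern produces the same observable behavior no matter which entry point is used.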
[jira] [Assigned] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
[ https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31829:
------------------------------------

    Assignee: Apache Spark

> Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31829
>                 URL: https://issues.apache.org/jira/browse/SPARK-31829
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>            Reporter: Aniket Namadeo Mokashi
>            Assignee: Apache Spark
>            Priority: Major
>
> If T is a Hive table, the query INSERT OVERWRITE TABLE T PARTITION(p='existing') IF NOT EXISTS SELECT ... ; executes the job/computation on Spark and only then skips loading the partitions. It should avoid the wasteful computation and exit early.
> For a Datasource table, the computation is avoided and the query exits early (due to work done in SPARK-20831).
[jira] [Assigned] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
[ https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31829:
------------------------------------

    Assignee: (was: Apache Spark)

> Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31829
>                 URL: https://issues.apache.org/jira/browse/SPARK-31829
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>            Reporter: Aniket Namadeo Mokashi
>            Priority: Major
>
> If T is a Hive table, the query INSERT OVERWRITE TABLE T PARTITION(p='existing') IF NOT EXISTS SELECT ... ; executes the job/computation on Spark and only then skips loading the partitions. It should avoid the wasteful computation and exit early.
> For a Datasource table, the computation is avoided and the query exits early (due to work done in SPARK-20831).
[jira] [Commented] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
[ https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117262#comment-17117262 ]

Apache Spark commented on SPARK-31829:
--------------------------------------

User 'aniket486' has created a pull request for this issue:
https://github.com/apache/spark/pull/28649

> Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31829
>                 URL: https://issues.apache.org/jira/browse/SPARK-31829
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>            Reporter: Aniket Namadeo Mokashi
>            Priority: Major
>
> If T is a Hive table, the query INSERT OVERWRITE TABLE T PARTITION(p='existing') IF NOT EXISTS SELECT ... ; executes the job/computation on Spark and only then skips loading the partitions. It should avoid the wasteful computation and exit early.
> For a Datasource table, the computation is avoided and the query exits early (due to work done in SPARK-20831).
[jira] [Updated] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
[ https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Namadeo Mokashi updated SPARK-31829:
-------------------------------------------

    Summary: Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation  (was: Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table)

> Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31829
>                 URL: https://issues.apache.org/jira/browse/SPARK-31829
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>            Reporter: Aniket Namadeo Mokashi
>            Priority: Major
>
> If T is a Hive table, the query INSERT OVERWRITE TABLE T PARTITION(p='existing') IF NOT EXISTS SELECT ... ; executes the job/computation on Spark and only then skips loading the partitions. It should avoid the wasteful computation and exit early.
> For a Datasource table, the computation is avoided and the query exits early (due to work done in SPARK-20831).
[jira] [Created] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table
Aniket Namadeo Mokashi created SPARK-31829:
----------------------------------------------

             Summary: Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table
                 Key: SPARK-31829
                 URL: https://issues.apache.org/jira/browse/SPARK-31829
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.5, 2.3.4, 2.2.3, 2.1.3
            Reporter: Aniket Namadeo Mokashi


If T is a Hive table, the query INSERT OVERWRITE TABLE T PARTITION(p='existing') IF NOT EXISTS SELECT ... ; executes the job/computation on Spark and only then skips loading the partitions. It should avoid the wasteful computation and exit early.

For a Datasource table, the computation is avoided and the query exits early (due to work done in SPARK-20831).
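The requested fix is an ordering change: check whether the target partition already exists before launching the job, rather than after. A minimal sketch of that control flow follows; the function and the dict-based "table" are hypothetical illustrations, not Spark's code.

```python
def insert_overwrite_if_not_exists(table, partition, compute_job):
    """INSERT OVERWRITE ... IF NOT EXISTS with the existence check hoisted
    above the computation.

    When the partition is already present, the job is never run -- the
    early exit the ticket asks for on Hive SerDe tables (Datasource
    tables already behave this way via SPARK-20831). `table` is a toy
    catalog: {"partitions": {partition_spec: rows}}.
    """
    if partition in table["partitions"]:
        return False  # partition exists: skip the wasteful computation
    table["partitions"][partition] = compute_job()  # run the job only now
    return True
```

The design point is simply that the metadata lookup is cheap while the job is expensive, so the cheap check must gate the expensive work, not follow it.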
[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117253#comment-17117253 ]

Apache Spark commented on SPARK-31788:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28648

> Error when creating UnionRDD of PairRDDs
> ----------------------------------------
>
>                 Key: SPARK-31788
>                 URL: https://issues.apache.org/jira/browse/SPARK-31788
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, PySpark, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Sanket Reddy
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>
> Creating a union RDD of pair RDDs appears broken.
> SparkSession available as 'spark'.
> {code}
> rdd1 = sc.parallelize([1,2,3,4,5])
> rdd2 = sc.parallelize([6,7,8,9,10])
> pairRDD1 = rdd1.zip(rdd2)
> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
>     jrdds[i] = rdds[i]._jrdd
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
> py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
> 	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
> 	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
> 	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:238)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> {code}
> rdd3 = sc.parallelize([11,12,13,14,15])
> pairRDD2 = rdd3.zip(rdd3)
> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> {code}
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
>     jrdds[i] = rdds[i]._jrdd
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
> py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
> 	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
> 	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
> 	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:238)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> Spark 2.4.5 does not have this regression, as shown below:
> {code}
> rdd4 = sc.parallelize(range(5))
> pairRDD3 = rdd4.zip(rdd4)
> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> unionRDD3.collect()
> {code}
> {code}
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
> {code}
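The Py4JException in the report comes from py4j's ArrayCommand refusing to store a JavaPairRDD into an array slot declared as JavaRDD. The failure mode, and the normalize-before-union shape of a fix, can be mimicked with a pure-Python analogy; the classes and functions below are illustrative stand-ins only, not py4j's or Spark's actual code.

```python
class JavaRDD:
    """Stand-in for a plain RDD handle on the JVM side."""

class JavaPairRDD:
    """Stand-in for a pair-RDD handle. Deliberately NOT a JavaRDD
    subclass, mirroring why the element-wise conversion fails."""

def set_array_slots(slots, rdds):
    """Mimics the typed-array assignment in the trace: every element must
    be of the array's declared type (JavaRDD here) or the call fails."""
    for i, r in enumerate(rdds):
        if not isinstance(r, JavaRDD):
            raise TypeError("Cannot convert %s to JavaRDD" % type(r).__name__)
        slots[i] = r
    return slots

def normalize(rdd):
    """Hypothetical fix-side step: map a pair RDD to a plain RDD of
    tuples before building the array, so union sees one element type."""
    return rdd if isinstance(rdd, JavaRDD) else JavaRDD()
```

Normalizing every element first means the array assignment only ever sees one type, which is the general shape of making `sc.union` tolerate mixed plain and pair RDD inputs.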
[jira] [Assigned] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31788:
------------------------------------

    Assignee: Apache Spark  (was: Hyukjin Kwon)

> Error when creating UnionRDD of PairRDDs
> ----------------------------------------
>
>                 Key: SPARK-31788
>                 URL: https://issues.apache.org/jira/browse/SPARK-31788
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, PySpark, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Sanket Reddy
>            Assignee: Apache Spark
>            Priority: Blocker
>
[jira] [Assigned] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31788:
------------------------------------

    Assignee: Hyukjin Kwon  (was: Apache Spark)

> Error when creating UnionRDD of PairRDDs
> ----------------------------------------
>
>                 Key: SPARK-31788
>                 URL: https://issues.apache.org/jira/browse/SPARK-31788
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams, PySpark, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Sanket Reddy
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>
[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31788: - Description: Union RDD of Pair RDD's seems to have issues SparkSession available as 'spark'. {code} rdd1 = sc.parallelize([1,2,3,4,5]) rdd2 = sc.parallelize([6,7,8,9,10]) pairRDD1 = rdd1.zip(rdd2) unionRDD1 = sc.union([pairRDD1, pairRDD1]) {code} {code} Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) {code} {code} rdd3 = sc.parallelize([11,12,13,14,15]) pairRDD2 = rdd3.zip(rdd3) unionRDD2 = sc.union([pairRDD1, pairRDD2]) {code} {code} Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value 
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) {code} 2.4.5 does not have this regression as below: {code} rdd4 = sc.parallelize(range(5)) pairRDD3 = rdd4.zip(rdd4) unionRDD3 = sc.union([pairRDD1, pairRDD3]) unionRDD3.collect() {code} {code} [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)] {code} was: Union RDD of Pair RDD's seems to have issues SparkSession available as 'spark'. {code} rdd1 = sc.parallelize([1,2,3,4,5]) rdd2 = sc.parallelize([6,7,8,9,10]) pairRDD1 = rdd1.zip(rdd2) unionRDD1 = sc.union([pairRDD1, pairRDD1]) {code} {code} Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. 
Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) {code} {code} rdd3 = sc.parallelize([11,12,13,14,15]) pairRDD2 = rdd3.zip(rdd3) unionRDD2 = sc.union([pairRDD1, pairRDD2]) {code} {code} Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at
[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31788: - Target Version/s: 3.0.0 Affects Version/s: (was: 3.0.1) Description: Union RDD of Pair RDD's seems to have issues SparkSession available as 'spark'. {code} rdd1 = sc.parallelize([1,2,3,4,5]) rdd2 = sc.parallelize([6,7,8,9,10]) pairRDD1 = rdd1.zip(rdd2) unionRDD1 = sc.union([pairRDD1, pairRDD1]) {code} {code} Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. 
Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) {code} {code} rdd3 = sc.parallelize([11,12,13,14,15]) pairRDD2 = rdd3.zip(rdd3) unionRDD2 = sc.union([pairRDD1, pairRDD2]) {code} {code} Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) {code} {code} rdd4 = sc.parallelize(range(5)) pairRDD3 = rdd4.zip(rdd4) unionRDD3 = sc.union([pairRDD1, pairRDD3]) unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)] {code} 2.4.5 does not have this regression was: Union RDD of Pair RDD's seems to have issues SparkSession available as 'spark'. 
>>> rdd1 = sc.parallelize([1,2,3,4,5]) >>> rdd2 = sc.parallelize([6,7,8,9,10]) >>> pairRDD1 = rdd1.zip(rdd2) >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) >>> rdd3 = sc.parallelize([11,12,13,14,15]) >>> pairRDD2 = rdd3.zip(rdd3) >>> unionRDD2 = sc.union([pairRDD1, pairRDD2]) Traceback (most recent call last): File "", line 1, in File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] = rdds[i]._jrdd File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in _setitem_ File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value py4j.protocol.Py4JError: An error occurred while calling None.None. Trace: py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at
[jira] [Reopened] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-31788: -- Assignee: Hyukjin Kwon (was: Sanket Reddy) > Error when creating UnionRDD of PairRDDs > > > Key: SPARK-31788 > URL: https://issues.apache.org/jira/browse/SPARK-31788 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sanket Reddy >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Union RDD of Pair RDD's seems to have issues > SparkSession available as 'spark'. > >>> rdd1 = sc.parallelize([1,2,3,4,5]) > >>> rdd2 = sc.parallelize([6,7,8,9,10]) > >>> pairRDD1 = rdd1.zip(rdd2) > >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, > in union jrdds[i] = rdds[i]._jrdd > File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, > in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. 
Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd3 = sc.parallelize([11,12,13,14,15]) > >>> pairRDD2 = rdd3.zip(rdd3) > >>> unionRDD2 = sc.union([pairRDD1, pairRDD2]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union > jrdds[i] = rdds[i]._jrdd File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd4 = sc.parallelize(range(5)) > >>> pairRDD3 = rdd4.zip(rdd4) > >>> unionRDD3 = sc.union([pairRDD1, pairRDD3]) > >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, > >>> 1), (2, 2), (3, 3), (4, 4)] > > 2.4.5 does not have this regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
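As a sanity check on the 2.4.5 session quoted above, the expected `zip`/`union` semantics can be modeled in plain Python with no Spark installation at all; plain lists stand in for RDDs here, `zip` models `RDD.zip`, and `itertools.chain` models `SparkContext.union`, purely to confirm what `unionRDD3.collect()` should return once SPARK-31788 is resolved:

```python
from itertools import chain

# Plain lists stand in for RDDs. This only checks the expected
# collect() output; it does not exercise the Py4J conversion path
# that fails in 3.0.0.
rdd1 = [1, 2, 3, 4, 5]
rdd2 = [6, 7, 8, 9, 10]
pairRDD1 = list(zip(rdd1, rdd2))   # [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]

rdd4 = list(range(5))
pairRDD3 = list(zip(rdd4, rdd4))   # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

# chain models sc.union: concatenation, preserving per-input order.
unionRDD3 = list(chain(pairRDD1, pairRDD3))
# [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10),
#  (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```

This matches the output reported from the working 2.4.5 shell; the 3.0.0 failure happens before any of this logic runs, while Py4J tries to place a `JavaPairRDD` into a `JavaRDD[]` array.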
[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31788: - Priority: Blocker (was: Major) > Error when creating UnionRDD of PairRDDs > > > Key: SPARK-31788 > URL: https://issues.apache.org/jira/browse/SPARK-31788 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sanket Reddy >Assignee: Hyukjin Kwon >Priority: Blocker > > Union RDD of Pair RDD's seems to have issues > SparkSession available as 'spark'. > >>> rdd1 = sc.parallelize([1,2,3,4,5]) > >>> rdd2 = sc.parallelize([6,7,8,9,10]) > >>> pairRDD1 = rdd1.zip(rdd2) > >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, > in union jrdds[i] = rdds[i]._jrdd > File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, > in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. 
Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd3 = sc.parallelize([11,12,13,14,15]) > >>> pairRDD2 = rdd3.zip(rdd3) > >>> unionRDD2 = sc.union([pairRDD1, pairRDD2]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union > jrdds[i] = rdds[i]._jrdd File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd4 = sc.parallelize(range(5)) > >>> pairRDD3 = rdd4.zip(rdd4) > >>> unionRDD3 = sc.union([pairRDD1, pairRDD3]) > >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, > >>> 1), (2, 2), (3, 3), (4, 4)] > > 2.4.5 does not have this regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31788: - Fix Version/s: (was: 3.0.0) > Error when creating UnionRDD of PairRDDs > > > Key: SPARK-31788 > URL: https://issues.apache.org/jira/browse/SPARK-31788 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sanket Reddy >Assignee: Hyukjin Kwon >Priority: Major > > Union RDD of Pair RDD's seems to have issues > SparkSession available as 'spark'. > >>> rdd1 = sc.parallelize([1,2,3,4,5]) > >>> rdd2 = sc.parallelize([6,7,8,9,10]) > >>> pairRDD1 = rdd1.zip(rdd2) > >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, > in union jrdds[i] = rdds[i]._jrdd > File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, > in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. 
Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd3 = sc.parallelize([11,12,13,14,15]) > >>> pairRDD2 = rdd3.zip(rdd3) > >>> unionRDD2 = sc.union([pairRDD1, pairRDD2]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union > jrdds[i] = rdds[i]._jrdd File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd4 = sc.parallelize(range(5)) > >>> pairRDD3 = rdd4.zip(rdd4) > >>> unionRDD3 = sc.union([pairRDD1, pairRDD3]) > >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, > >>> 1), (2, 2), (3, 3), (4, 4)] > > 2.4.5 does not have this regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31788: - Component/s: DStreams > Error when creating UnionRDD of PairRDDs > > > Key: SPARK-31788 > URL: https://issues.apache.org/jira/browse/SPARK-31788 > Project: Spark > Issue Type: Bug > Components: DStreams, PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sanket Reddy >Assignee: Hyukjin Kwon >Priority: Blocker > > Union RDD of Pair RDD's seems to have issues > SparkSession available as 'spark'. > >>> rdd1 = sc.parallelize([1,2,3,4,5]) > >>> rdd2 = sc.parallelize([6,7,8,9,10]) > >>> pairRDD1 = rdd1.zip(rdd2) > >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, > in union jrdds[i] = rdds[i]._jrdd > File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, > in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. 
Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd3 = sc.parallelize([11,12,13,14,15]) > >>> pairRDD2 = rdd3.zip(rdd3) > >>> unionRDD2 = sc.union([pairRDD1, pairRDD2]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union > jrdds[i] = rdds[i]._jrdd File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd4 = sc.parallelize(range(5)) > >>> pairRDD3 = rdd4.zip(rdd4) > >>> unionRDD3 = sc.union([pairRDD1, pairRDD3]) > >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, > >>> 1), (2, 2), (3, 3), (4, 4)] > > 2.4.5 does not have this regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31813) Cannot write snappy-compressed text files
[ https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117231#comment-17117231 ] ZhangShuai commented on SPARK-31813: In my environment, it works fine. > Cannot write snappy-compressed text files > - > > Key: SPARK-31813 > URL: https://issues.apache.org/jira/browse/SPARK-31813 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.5 >Reporter: Ondrej Kokes >Priority: Minor > > After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a > clean Docker image with default-jre), Spark fails to write text-based files > (CSV and JSON) with snappy compression. It can snappy compress parquet and > orc, gzipping CSVs also works. > This is a clean PySpark installation, snappy jars are in place > {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}} > {{snappy-0.2.jar > }}{{snappy-java-1.1.7.3.jar}} > Repro 1 (Scala): > $ spark-shell > {{spark.sql("select 1").write.option("compression", > "snappy").mode("overwrite").parquet("tmp/foo")}} > spark.sql("select 1").write.option("compression", > "snappy").mode("overwrite").csv("tmp/foo") > The first (parquet) will work, the second one won't. 
> Repro 2 (PySpark): > {{from pyspark.sql import SparkSession}} > {{if __name__ == '__main__':}}{{spark}} > {{ SparkSession.builder.appName('snappy_testing').getOrCreate()}} > {{ spark.sql('select 1').write.option('compression', > 'snappy').mode('overwrite').parquet('tmp/works_fine')}} > {{ spark.sql('select 1').write.option('compression', > 'gzip').mode('overwrite').csv('tmp/also_works')}} > {{ spark.sql('select 1').write.option('compression', > 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}} > > In either case I get the following traceback > java.lang.RuntimeException: native snappy library not available: this version > of libhadoop was built without snappy support.java.lang.RuntimeException: > native snappy library not available: this version of libhadoop was built > without snappy support. at > org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65) > at > org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134) > at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150) > at > org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131) > at > org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100) > at > org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84) > at > org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84) > at scala.Option.map(Option.scala:146) at > org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84) > at > org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92) > at > org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CSVFileFormat.scala:177) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85) > at > 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:123) at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
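The pattern in the SPARK-31813 report (gzip CSV works, snappy CSV does not) is consistent with the error text: Spark's gzip codec for text sources is pure-Java, while the Hadoop `SnappyCodec` used for text output requires a native libhadoop built with snappy support, which a plain pip-installed PySpark does not ship. The code path that does work can be sketched with the Python standard library alone; this is an illustrative model of gzip-compressed CSV round-tripping, not Spark's actual writer:

```python
import csv
import gzip
import io

# Stdlib model of the working path from the report: writing "select 1"
# as a gzip-compressed CSV, then reading it back. gzip needs no native
# Hadoop library, which is why the gzip repro succeeds while the snappy
# repro fails with "native snappy library not available".
buf = io.BytesIO()
with gzip.open(buf, mode="wt", newline="") as f:
    csv.writer(f).writerow([1])

buf.seek(0)
with gzip.open(buf, mode="rt", newline="") as f:
    rows = list(csv.reader(f))
# rows holds the single CSV record that was written: [["1"]]
```

For Parquet and ORC the snappy implementation comes from Java libraries bundled with Spark (the snappy-java jar listed in the report), so those formats compress fine even without native libhadoop.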
[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs
[ https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117224#comment-17117224 ] Hyukjin Kwon commented on SPARK-31788: -- Reverted at https://github.com/apache/spark/commit/7fb2275f009c8744560c3247decdc106a8bca86f > Error when creating UnionRDD of PairRDDs > > > Key: SPARK-31788 > URL: https://issues.apache.org/jira/browse/SPARK-31788 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sanket Reddy >Assignee: Sanket Reddy >Priority: Major > Fix For: 3.0.0 > > > Union RDD of Pair RDD's seems to have issues > SparkSession available as 'spark'. > >>> rdd1 = sc.parallelize([1,2,3,4,5]) > >>> rdd2 = sc.parallelize([6,7,8,9,10]) > >>> pairRDD1 = rdd1.zip(rdd2) > >>> unionRDD1 = sc.union([pairRDD1, pairRDD1]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, > in union jrdds[i] = rdds[i]._jrdd > File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, > in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. 
Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd3 = sc.parallelize([11,12,13,14,15]) > >>> pairRDD2 = rdd3.zip(rdd3) > >>> unionRDD2 = sc.union([pairRDD1, pairRDD2]) > Traceback (most recent call last): File "", line 1, in File > "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union > jrdds[i] = rdds[i]._jrdd File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 238, in _setitem_ File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", > line 221, in __set_item File > "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line > 332, in get_return_value py4j.protocol.Py4JError: An error occurred while > calling None.None. Trace: py4j.Py4JException: Cannot convert > org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at > py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at > py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at > py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.lang.Thread.run(Thread.java:748) > >>> rdd4 = sc.parallelize(range(5)) > >>> pairRDD3 = rdd4.zip(rdd4) > >>> unionRDD3 = sc.union([pairRDD1, pairRDD3]) > >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, > >>> 1), (2, 2), (3, 3), (4, 4)] > > 2.4.5 does not have this regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31828) Retain table properties at CreateTableLikeCommand
[ https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117218#comment-17117218 ] Apache Spark commented on SPARK-31828: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28647 > Retain table properties at CreateTableLikeCommand > - > > Key: SPARK-31828 > URL: https://issues.apache.org/jira/browse/SPARK-31828 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31828) Retain table properties at CreateTableLikeCommand
[ https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31828: Assignee: Apache Spark > Retain table properties at CreateTableLikeCommand > - > > Key: SPARK-31828 > URL: https://issues.apache.org/jira/browse/SPARK-31828 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31828) Retain table properties at CreateTableLikeCommand
[ https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117217#comment-17117217 ] Apache Spark commented on SPARK-31828: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28647 > Retain table properties at CreateTableLikeCommand > - > > Key: SPARK-31828 > URL: https://issues.apache.org/jira/browse/SPARK-31828 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31828) Retain table properties at CreateTableLikeCommand
[ https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31828: Assignee: (was: Apache Spark) > Retain table properties at CreateTableLikeCommand > - > > Key: SPARK-31828 > URL: https://issues.apache.org/jira/browse/SPARK-31828 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31821) Remove mssql-jdbc dependencies
[ https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31821: - Assignee: Gabor Somogyi > Remove mssql-jdbc dependencies > -- > > Key: SPARK-31821 > URL: https://issues.apache.org/jira/browse/SPARK-31821 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31821) Remove mssql-jdbc dependencies
[ https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31821. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28640 [https://github.com/apache/spark/pull/28640] > Remove mssql-jdbc dependencies > -- > > Key: SPARK-31821 > URL: https://issues.apache.org/jira/browse/SPARK-31821 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31828) Retain table properties at CreateTableLikeCommand
ulysses you created SPARK-31828: --- Summary: Retain table properties at CreateTableLikeCommand Key: SPARK-31828 URL: https://issues.apache.org/jira/browse/SPARK-31828 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117202#comment-17117202 ] Devaraj Kavali commented on SPARK-31800: If we look at the log, the *krb5.conf* message is just an info log, not the actual cause of the failure. The actual failure is about *spark.kubernetes.file.upload.path*; you can provide any DFS path (S3, HDFS, or any other distributed file system) for that config. > Unable to disable Kerberos when submitting jobs to Kubernetes > - > > Key: SPARK-31800 > URL: https://issues.apache.org/jira/browse/SPARK-31800 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: James Boylan >Priority: Major > > When you attempt to submit a process to Kubernetes using spark-submit through > --master, it returns the exception: > {code:java} > 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. > Exception in thread "main" org.apache.spark.SparkException: Please specify > spark.kubernetes.file.upload.path property. 
> at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:89) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215) 
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called > 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory > /private/var/folders/p1/y24myg413wx1l1l52bsdn2hrgq/T/spark-c94db9c5-b8a8-414d-b01d-f6369d31c9b8 > {code} > No changes in settings appear to be able to disable Kerberos. This is when > running a simple execution of the SparkPi on our lab cluster. The command > being used is > {code:java} > ./bin/spark-submit --master k8s://https://{api_hostname} --deploy-mode >
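The stack trace above bottoms out in `KubernetesUtils.uploadFileUri`, which refuses to stage local files when `spark.kubernetes.file.upload.path` is unset. A rough plain-Python sketch of that guard follows; the function name and behavior are illustrative only, not Spark's actual code:

```python
# Hypothetical sketch of the guard behind the reported error: staging a
# local file onto the cluster requires an upload path to be configured.

class SparkException(Exception):
    pass

def upload_file_uri(uri, conf):
    upload_path = conf.get("spark.kubernetes.file.upload.path")
    if upload_path is None:
        # this mirrors the message seen in the report
        raise SparkException(
            "Please specify spark.kubernetes.file.upload.path property.")
    # a real implementation would copy `uri` under `upload_path` here;
    # we only compute the illustrative destination path
    return f"{upload_path}/{uri.rsplit('/', 1)[-1]}"
```

Setting the config to any DFS location (as the comment above suggests) avoids the exception.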
[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign keys
[ https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117169#comment-17117169 ] Sunitha Kambhampati commented on SPARK-21784: - [~krish_the_coder], [~Tagar], thank you for your interest in this feature. We have PRs waiting on review, and it would help if you could share your use case and level of interest here. We have seen significant improvements with it, as demonstrated at Spark Summit. [https://databricks.com/session/informational-referential-integrity-constraints-support-in-apache-spark] We would be interested in moving this forward if there is more interest from the community and from committers to review and get this in. > Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign > keys > -- > > Key: SPARK-21784 > URL: https://issues.apache.org/jira/browse/SPARK-21784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Suresh Thalamati >Priority: Major > > Currently Spark SQL does not have DDL support to define primary key and > foreign key constraints. This Jira is to add DDL support for defining primary > key and foreign key informational constraints using ALTER TABLE syntax. 
These > constraints will be used in query optimization, and you can find more details > about this in the spec in SPARK-19842. > *Syntax:* > {code} > ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName] > (PRIMARY KEY (col_names) | > FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)]) > [VALIDATE | NOVALIDATE] [RELY | NORELY] > {code} > Examples: > {code:sql} > ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY > ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES > employee(empno) NOVALIDATE NORELY > {code} > *Constraint name generated by the system:* > {code:sql} > ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY > ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) > VALIDATE RELY; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
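To illustrate why informational constraints can help the optimizer, here is a hedged plain-Python sketch of one classic use: if the selected columns include a primary key the planner is told to RELY on, a DISTINCT over those columns is redundant. The predicate below is hypothetical, not Spark code:

```python
# Hypothetical eligibility check for removing a redundant DISTINCT: a
# RELY primary key contained in the projected columns guarantees the
# projected rows are already unique.

def can_drop_distinct(selected_cols, pk_cols, rely):
    # only RELY constraints may be used for optimization; NORELY means
    # the constraint is informational but must not drive rewrites
    return rely and set(pk_cols).issubset(set(selected_cols))
```

For example, `SELECT DISTINCT empno, name FROM employee` could drop the DISTINCT given the RELY primary key on `empno` above.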
[jira] [Assigned] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31819: - Assignee: Dongjoon Hyun > Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases > --- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117053#comment-17117053 ] Dongjoon Hyun commented on SPARK-31819: --- Yes. I fixed master/branch-3.0 via SPARK-31786 . > Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases > --- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31819. --- Fix Version/s: 2.4.6 Resolution: Fixed Issue resolved by pull request 28638 [https://github.com/apache/spark/pull/28638] > Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases > --- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.4.6 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27997) kubernetes client token expired
[ https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116925#comment-17116925 ] rameshkrishnan muthusamy commented on SPARK-27997: -- I am currently working on this request and will share the PR and design doc link soon. > kubernetes client token expired > > > Key: SPARK-27997 > URL: https://issues.apache.org/jira/browse/SPARK-27997 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Henry Yu >Priority: Major > > Hi, > when I try to submit Spark to K8s in cluster mode, I need an auth token to > talk with K8s. > Unfortunately, many cloud providers issue tokens that expire within 10-15 minutes, > so we need to refresh this token. > Client mode is even worse, because the scheduler is created in the submit process. > Should I also make a PR for this? I fixed it by adding a > RotatingOAuthTokenProvider and some configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116845#comment-17116845 ] Xiao Li commented on SPARK-31819: - [~dongjoon] Is this 2.4-only? > Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases > --- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31827) better error message for the JDK bug of stand-alone form
[ https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116825#comment-17116825 ] Apache Spark commented on SPARK-31827: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28646 > better error message for the JDK bug of stand-alone form > > > Key: SPARK-31827 > URL: https://issues.apache.org/jira/browse/SPARK-31827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31827) better error message for the JDK bug of stand-alone form
[ https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31827: Assignee: Apache Spark (was: Wenchen Fan) > better error message for the JDK bug of stand-alone form > > > Key: SPARK-31827 > URL: https://issues.apache.org/jira/browse/SPARK-31827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31827) better error message for the JDK bug of stand-alone form
[ https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31827: Assignee: Wenchen Fan (was: Apache Spark) > better error message for the JDK bug of stand-alone form > > > Key: SPARK-31827 > URL: https://issues.apache.org/jira/browse/SPARK-31827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31827) better error message for the JDK bug of stand-alone form
Wenchen Fan created SPARK-31827: --- Summary: better error message for the JDK bug of stand-alone form Key: SPARK-31827 URL: https://issues.apache.org/jira/browse/SPARK-31827 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116803#comment-17116803 ] Itamar Turner-Trauring commented on SPARK-23206: It seems like one of the subtasks has a PR that is basically done and just needs someone to review or approve it: [https://github.com/apache/spark/pull/23340] Any chance someone could look at it? > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edward Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). 
Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. 
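The peak-tracking scheme the proposal describes (record only per-stage, per-executor peaks from heartbeat samples, to keep the history log small) can be sketched in a few lines of illustrative Python; the data structure is hypothetical, not the proposed implementation:

```python
# Illustrative sketch of recording only peak metric values per
# (stage, executor) pair, updated from periodic heartbeat samples.

peaks = {}  # (stage_id, executor_id) -> {metric_name: peak_value}

def record_heartbeat(stage_id, executor_id, metrics):
    # keep the running maximum for each metric; nothing else is stored,
    # which is what keeps the extra history-log volume small
    slot = peaks.setdefault((stage_id, executor_id), {})
    for name, value in metrics.items():
        slot[name] = max(slot.get(name, 0), value)

record_heartbeat(1, "exec-1", {"jvm_used": 512, "execution": 100})
record_heartbeat(1, "exec-1", {"jvm_used": 768, "execution": 50})
```

After the two samples, only the peaks (768 and 100) survive, matching the "only interesting values" behavior described above.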
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31822) Cost too much resources when read orc hive table for infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116778#comment-17116778 ] lithiumlee-_- commented on SPARK-31822: --- And I notice that: {quote}Spark 2.1.1 introduced a new configuration key: spark.sql.hive.caseSensitiveInferenceMode. It had a default setting of NEVER_INFER, which kept behavior identical to 2.1.0. However, Spark 2.2.0 changes this setting’s default value to INFER_AND_SAVE to restore compatibility with reading Hive metastore tables whose underlying file schema have mixed-case column names. With the INFER_AND_SAVE configuration value, on first access Spark will perform schema inference on any Hive metastore table for which it has not already saved an inferred schema. Note that schema inference can be a very time consuming operation for tables with thousands of partitions. If compatibility with mixed-case column names is not a concern, you can safely set spark.sql.hive.caseSensitiveInferenceMode to NEVER_INFER to avoid the initial overhead of schema inference. Note that with the new default INFER_AND_SAVE setting, the results of the schema inference are saved as a metastore key for future use. Therefore, the initial schema inference occurs only at a table’s first access.{quote} This situation can easily be resolved by setting "spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER"... But I don't think that is the best way. 
[https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.2-docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-21-to-22] > Cost too much resources when read orc hive table for infer schema > - > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output, SQL >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > Labels: HiveMetastoreCatalog, orc > > When reading a Hive ORC partitioned table without Spark schema properties, > Spark reads all partitions and all files to infer the schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can be improved by passing *_partitionFilters_* to > *_fileIndex.listFiles_*. > {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
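The suggested improvement (pass partition filters into `fileIndex.listFiles` so that only matching partitions are listed for schema inference, instead of every partition) can be sketched in plain Python; names here are illustrative, not Spark's API:

```python
# Illustrative sketch: listing files for schema inference with and
# without a partition filter. With no filter (today's behavior at the
# quoted call site), every partition's files are listed.

def list_files(partitions, partition_filter=None):
    selected = partitions if partition_filter is None else [
        p for p in partitions if partition_filter(p)]
    return [f for p in selected for f in p["files"]]

parts = [
    {"dt": "2020-05-01", "files": ["a.orc", "b.orc"]},
    {"dt": "2020-05-02", "files": ["c.orc"]},
]
all_files = list_files(parts)                              # current behavior
one_day = list_files(parts, lambda p: p["dt"] == "2020-05-02")  # proposed
```

With thousands of partitions, listing only the filtered subset avoids most of the inference cost the reporter describes.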
[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available
[ https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116777#comment-17116777 ] Hyukjin Kwon commented on SPARK-31763: -- Please go ahead > DataFrame.inputFiles() not Available > > > Key: SPARK-31763 > URL: https://issues.apache.org/jira/browse/SPARK-31763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have been trying to list the input files that compose my DataSet by using > *PySpark* > spark_session.read > .format(sourceFileFormat) > .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix) > *.inputFiles();* > but I get an exception saying the inputFiles attribute is not present. But I was > able to get this functionality with Spark Java. > *So is this something missing in PySpark?* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
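For context, the missing PySpark method would be a thin wrapper delegating to the JVM DataFrame's `inputFiles()`. The sketch below mocks the JVM side so it runs without Spark; it shows the shape of such a wrapper, not the actual PySpark implementation:

```python
# Hedged sketch of a DataFrame.inputFiles() wrapper. FakeJavaDataFrame
# stands in for the py4j-backed JVM DataFrame object (`_jdf` in PySpark).

class FakeJavaDataFrame:
    def __init__(self, files):
        self._files = files

    def inputFiles(self):
        # the JVM side returns the files backing the DataFrame
        return list(self._files)

class DataFrame:
    def __init__(self, jdf):
        self._jdf = jdf

    def inputFiles(self):
        # a real wrapper would convert the returned Java array to a
        # Python list; here the mock already returns a list
        return list(self._jdf.inputFiles())

df = DataFrame(FakeJavaDataFrame(["s3a://bucket/part-0000.parquet"]))
files = df.inputFiles()
```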
[jira] [Assigned] (SPARK-31826) Support composed type of case class for typed Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-31826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31826: Assignee: (was: Apache Spark) > Support composed type of case class for typed Scala UDF > --- > > Key: SPARK-31826 > URL: https://issues.apache.org/jira/browse/SPARK-31826 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > After SPARK-30127, typed Scala UDF now supports to accept case class as input > parameter. However, it still does not support types like Seq[T], Array[T], > assuming T is a case class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31826) Support composed type of case class for typed Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-31826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31826: Assignee: Apache Spark > Support composed type of case class for typed Scala UDF > --- > > Key: SPARK-31826 > URL: https://issues.apache.org/jira/browse/SPARK-31826 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > After SPARK-30127, typed Scala UDF now supports to accept case class as input > parameter. However, it still does not support types like Seq[T], Array[T], > assuming T is a case class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31826) Support composed type of case class for typed Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-31826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116755#comment-17116755 ] Apache Spark commented on SPARK-31826: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28645 > Support composed type of case class for typed Scala UDF > --- > > Key: SPARK-31826 > URL: https://issues.apache.org/jira/browse/SPARK-31826 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > After SPARK-30127, typed Scala UDF now supports to accept case class as input > parameter. However, it still does not support types like Seq[T], Array[T], > assuming T is a case class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available
[ https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116738#comment-17116738 ] Rakesh Raushan commented on SPARK-31763: Shall I open a PR for this? > DataFrame.inputFiles() not Available > > > Key: SPARK-31763 > URL: https://issues.apache.org/jira/browse/SPARK-31763 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have been trying to list inputFiles that compose my DataSet by using > *PySpark* > spark_session.read > .format(sourceFileFormat) > .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix) > *.inputFiles();* > but I get an exception saying inputFiles attribute not present. But I was > able to get this functionality with Spark Java. > *So is this something missing in PySpark?* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31826) Support composed type of case class for typed Scala UDF
wuyi created SPARK-31826: Summary: Support composed type of case class for typed Scala UDF Key: SPARK-31826 URL: https://issues.apache.org/jira/browse/SPARK-31826 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi After SPARK-30127, typed Scala UDFs now support accepting a case class as an input parameter. However, they still do not support types like Seq[T], Array[T], assuming T is a case class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
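The gap can be illustrated with a plain-Python analogue, with dataclasses standing in for case classes: encoding a single "case class" works, but composed types such as a list of them need the encoder applied recursively. This is illustrative only, not Spark's encoder machinery:

```python
# Illustrative analogue of SPARK-31826: a recursive encoder that handles
# both a bare dataclass and Seq/Array-like containers of dataclasses.

from dataclasses import dataclass, asdict, is_dataclass

@dataclass
class Point:   # stand-in for a Scala case class
    x: int
    y: int

def encode(value):
    if is_dataclass(value):
        return asdict(value)            # the already-supported case
    if isinstance(value, (list, tuple)):
        return [encode(v) for v in value]  # recurse into Seq[T]/Array[T]
    return value                         # primitives pass through

encoded = encode([Point(1, 2), Point(3, 4)])
```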
[jira] [Resolved] (SPARK-31820) Flaky JavaBeanDeserializationSuite
[ https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31820. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28639 [https://github.com/apache/spark/pull/28639] > Flaky JavaBeanDeserializationSuite > -- > > Key: SPARK-31820 > URL: https://issues.apache.org/jira/browse/SPARK-31820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > The test suite JavaBeanDeserializationSuite sometimes fails with: > {code} > sbt.ForkMain$ForkError: java.lang.AssertionError: > expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=]]> but > was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=]]> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {code} > See https://github.com/apache/spark/pull/28630#issuecomment-633695723 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31820) Flaky JavaBeanDeserializationSuite
[ https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31820: --- Assignee: Maxim Gekk > Flaky JavaBeanDeserializationSuite > -- > > Key: SPARK-31820 > URL: https://issues.apache.org/jira/browse/SPARK-31820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > The test suite JavaBeanDeserializationSuite sometimes fails with: > {code} > sbt.ForkMain$ForkError: java.lang.AssertionError: > expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=]]> but > was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 
12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=]]> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {code} > See https://github.com/apache/spark/pull/28630#issuecomment-633695723 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
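The expected/actual diff in the failure above differs only in the fractional part of the timestamps (`12:39:17.0` versus `12:39:17`). As an illustrative sketch (not the suite's actual code): `java.sql.Timestamp.toString()` always emits at least one fractional digit, so a formatter that drops a zero fraction produces exactly this kind of mismatch:

```java
import java.sql.Timestamp;

public class TimestampFractionDemo {
    public static void main(String[] args) {
        // Timestamp.toString() keeps at least one fractional digit,
        // so a whole-second value renders as "...:17.0", not "...:17".
        Timestamp whole = Timestamp.valueOf("2020-05-25 12:39:17");
        System.out.println(whole);

        // A non-zero sub-second part is kept with trailing zeros trimmed.
        Timestamp millis = Timestamp.valueOf("2020-05-25 12:39:16.999");
        System.out.println(millis);
    }
}
```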
[jira] [Comment Edited] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116657#comment-17116657 ] James Boylan edited comment on SPARK-31800 at 5/26/20, 11:40 AM: - There are a few problems with that: # I'm running a local standalone spark in the environment this is being tested for. there is no HDFS to interact with. It leverages S3 for the storage medium. # I don't have Kerberos configured at all. We don't leverage it in our existing system and I would prefer not to have to leverage it just to support Spark 3.0 on Kubernetes as none of our processes require it. # It is not honoring the spark.authenticate false configuration property, or any other property to try and disable Kerberos. was (Author: drahkar): There are a couple problems with that: # I'm running a local standalone spark in the environment this is being tested for. there is no HDFS to interact with. It leverages S3 for the storage medium. # I don't have Kerberos configured at all. We don't leverage it in our existing system and I would prefer not to have to leverage it just to support Spark 3.0 on Kubernetes as none of our processes require it. # It is not honoring the spark.authenticate false configuration property, or any other property to try and disable Kerberos. > Unable to disable Kerberos when submitting jobs to Kubernetes > - > > Key: SPARK-31800 > URL: https://issues.apache.org/jira/browse/SPARK-31800 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: James Boylan >Priority: Major > > When you attempt to submit a process to Kubernetes using spark-submit through > --master, it returns the exception: > {code:java} > 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. 
> Exception in thread "main" org.apache.spark.SparkException: Please specify > spark.kubernetes.file.upload.path property. > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:89) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221) > at > 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) >
[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116657#comment-17116657 ] James Boylan commented on SPARK-31800: -- There are a couple problems with that: # I'm running a local standalone spark in the environment this is being tested for. there is no HDFS to interact with. It leverages S3 for the storage medium. # I don't have Kerberos configured at all. We don't leverage it in our existing system and I would prefer not to have to leverage it just to support Spark 3.0 on Kubernetes as none of our processes require it. # It is not honoring the spark.authenticate false configuration property, or any other property to try and disable Kerberos. > Unable to disable Kerberos when submitting jobs to Kubernetes > - > > Key: SPARK-31800 > URL: https://issues.apache.org/jira/browse/SPARK-31800 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: James Boylan >Priority: Major > > When you attempt to submit a process to Kubernetes using spark-submit through > --master, it returns the exception: > {code:java} > 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified > a krb5.conf file locally or via a ConfigMap. Make sure that you have the > krb5.conf locally on the driver image. > Exception in thread "main" org.apache.spark.SparkException: Please specify > spark.kubernetes.file.upload.path property. 
> at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60) > at > scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) > at > scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) > at scala.collection.immutable.List.foldLeft(List.scala:89) > at > org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215) 
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called > 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory >
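For reference, the SparkException in the trace asks for `spark.kubernetes.file.upload.path`, the staging location Spark uses when local files must be uploaded for cluster-mode submission. A submit-time sketch with placeholder values (the bucket name and paths below are hypothetical, not from the report, and setting this addresses only the exception, not the Kerberos log message):

```
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
  ...
```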
[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116653#comment-17116653 ] Jungtaek Lim commented on SPARK-23539: -- You can ignore the Affects Version field in most cases if the issue type is a new feature/improvement. > Add support for Kafka headers in Structured Streaming > - > > Key: SPARK-23539 > URL: https://issues.apache.org/jira/browse/SPARK-23539 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Dongjin Lee >Priority: Major > Fix For: 3.0.0 > > > Kafka headers were added in 0.11. We should expose them through our Kafka > data source in both batch and streaming queries. > This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 to 1.0+ > (SPARK-18057)
[jira] [Resolved] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
[ https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31771. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28637 [https://github.com/apache/spark/pull/28637] > Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' > - > > Key: SPARK-31771 > URL: https://issues.apache.org/jira/browse/SPARK-31771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Five consecutive pattern letters of 'G/M/L/E/u/Q/q' mean Narrow-Text > Style in java.time.DateTimeFormatterBuilder, which outputs only the leading single > letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they > mean Full-Text Style.
[jira] [Assigned] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
[ https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31771: --- Assignee: Kent Yao > Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' > - > > Key: SPARK-31771 > URL: https://issues.apache.org/jira/browse/SPARK-31771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Five consecutive pattern letters of 'G/M/L/E/u/Q/q' mean Narrow-Text > Style in java.time.DateTimeFormatterBuilder, which outputs only the leading single > letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they > mean Full-Text Style.
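The Narrow versus Full text-style behavior described in SPARK-31771 can be reproduced directly with java.time (a minimal illustrative sketch, not Spark code, using the month letter `M` and the US locale):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class NarrowStyleDemo {
    public static void main(String[] args) {
        LocalDate dec = LocalDate.of(2019, 12, 1);
        // Four pattern letters: Full-Text style -> "December".
        String full = DateTimeFormatter.ofPattern("MMMM", Locale.US).format(dec);
        // Five pattern letters: Narrow-Text style -> leading letter only, "D".
        // (Spark 2.4's SimpleDateFormat-based parser treated 4+ letters as full text.)
        String narrow = DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(dec);
        System.out.println(full + " -> " + narrow); // December -> D
    }
}
```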
[jira] [Commented] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString
[ https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116629#comment-17116629 ] Apache Spark commented on SPARK-31762: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28643 > Fix perf regression of date/timestamp formatting in toHiveString > > > Key: SPARK-31762 > URL: https://issues.apache.org/jira/browse/SPARK-31762 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > HiveResult.toHiveString has to convert incoming Java date/timestamp types to > days/microseconds because the existing APIs of DateFormatter/TimestampFormatter > don't accept java.sql.Timestamp/java.util.Date or > java.time.Instant/java.time.LocalDate. Internally, the formatters perform > conversions back to Java types again. This badly impacts performance. The > ticket aims to add new APIs to DateFormatter and TimestampFormatter that > accept Java types.
[jira] [Commented] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString
[ https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116628#comment-17116628 ] Apache Spark commented on SPARK-31762: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28643 > Fix perf regression of date/timestamp formatting in toHiveString > > > Key: SPARK-31762 > URL: https://issues.apache.org/jira/browse/SPARK-31762 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > HiveResult.toHiveString has to convert incoming Java date/timestamp types to > days/microseconds because the existing APIs of DateFormatter/TimestampFormatter > don't accept java.sql.Timestamp/java.util.Date or > java.time.Instant/java.time.LocalDate. Internally, the formatters perform > conversions back to Java types again. This badly impacts performance. The > ticket aims to add new APIs to DateFormatter and TimestampFormatter that > accept Java types.
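The redundant round trip described in SPARK-31762 can be sketched with plain java.time (a simplified illustration, not Spark's actual code paths): the caller first collapses a java.time value to primitive days/microseconds because the old formatter API accepts only those, and the formatter then immediately rebuilds a java.time value in order to format it:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class RoundTripDemo {
    public static void main(String[] args) {
        // Step 1 (caller): collapse java.time values to primitive days/micros,
        // mimicking the old formatter API that accepts only primitives.
        LocalDate date = LocalDate.of(2020, 5, 26);
        long days = date.toEpochDay();

        Instant ts = Instant.parse("2020-05-25T12:39:17Z");
        long micros = ChronoUnit.MICROS.between(Instant.EPOCH, ts);

        // Step 2 (formatter): immediately rebuild the java.time values to
        // format them -- the redundant hop that Java-type-accepting APIs avoid.
        LocalDate dateAgain = LocalDate.ofEpochDay(days);
        Instant tsAgain = Instant.EPOCH.plus(micros, ChronoUnit.MICROS);

        System.out.println(dateAgain + " " + tsAgain);
    }
}
```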
[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116607#comment-17116607 ] Martin Andersson commented on SPARK-23539: -- Why does it say {{Affects Version/s: 2.3.0}} when it was only included in 3.0.0? > Add support for Kafka headers in Structured Streaming > - > > Key: SPARK-23539 > URL: https://issues.apache.org/jira/browse/SPARK-23539 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Dongjin Lee >Priority: Major > Fix For: 3.0.0 > > > Kafka headers were added in 0.11. We should expose them through our Kafka > data source in both batch and streaming queries. > This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 to 1.0+ > (SPARK-18057)
[jira] [Updated] (SPARK-31825) Spark History Server UI does not come up when hosted on a custom path
[ https://issues.apache.org/jira/browse/SPARK-31825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Rao updated SPARK-31825: - Attachment: Faulty Spark History UI.PNG > Spark History Server UI does not come up when hosted on a custom path > - > > Key: SPARK-31825 > URL: https://issues.apache.org/jira/browse/SPARK-31825 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 > Environment: Bring up the Spark History Server on any Linux machine using the > start-history-server.sh script. >Reporter: Abhishek Rao >Priority: Major > Attachments: Faulty Spark History UI.PNG > > > I tried to bring up the Spark History Server using the start-history-server.sh > script. The UI works perfectly fine when there is no path specified, > i.e. http://:18080 > But if I bring up the History Server on a custom path, the UI does not > work properly. > Following is my configuration > spark.history.fs.logDirectory= > spark.ui.proxyBase=/test > When I hit the URL http://:18080/test, I do not > see the History Server UI working properly. Attaching the screenshot of the > faulty UI. > Wanted to know if I'm missing any configuration > > !image-2020-05-26-15-26-21-616.png! > >
[jira] [Created] (SPARK-31825) Spark History Server UI does not come up when hosted on a custom path
Abhishek Rao created SPARK-31825: Summary: Spark History Server UI does not come up when hosted on a custom path Key: SPARK-31825 URL: https://issues.apache.org/jira/browse/SPARK-31825 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5 Environment: Bring up the Spark History Server on any Linux machine using the start-history-server.sh script. Reporter: Abhishek Rao I tried to bring up the Spark History Server using the start-history-server.sh script. The UI works perfectly fine when there is no path specified, i.e. http://:18080 But if I bring up the History Server on a custom path, the UI does not work properly. Following is my configuration spark.history.fs.logDirectory= spark.ui.proxyBase=/test When I hit the URL http://:18080/test, I do not see the History Server UI working properly. Attaching the screenshot of the faulty UI. Wanted to know if I'm missing any configuration !image-2020-05-26-15-26-21-616.png!
[jira] [Commented] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions
[ https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116577#comment-17116577 ] Apache Spark commented on SPARK-31809: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/28642 > Infer IsNotNull for all children of NullIntolerant expressions > -- > > Key: SPARK-31809 > URL: https://issues.apache.org/jira/browse/SPARK-31809 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: default.png, infer.png > > > We should infer {{IsNotNull}} for all children of {{NullIntolerant}} > expressions. For example: > {code:sql} > CREATE TABLE t1(c1 string, c2 string); > CREATE TABLE t2(c1 string, c2 string); > EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; > {code} > {noformat} > == Physical Plan == > *(4) Project [c1#5, c2#6] > +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner >:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] >: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, > c2#6], Statistics(sizeInBytes=8.0 EiB) >+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] > +- *(2) Filter isnotnull(c1#7) > +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], > Statistics(sizeInBytes=8.0 EiB) > {noformat} > We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query > performance: > {noformat} > == Physical Plan == > *(5) Project [c1#23, c2#24] > +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner >:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#23, 
c2#24), 200), true, > [id=#95] >: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24)) >:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, > c2#24], Statistics(sizeInBytes=8.0 EiB) >+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#25, 200), true, [id=#103] > +- *(3) Filter isnotnull(c1#25) > +- Scan hive default.t2 [c1#25], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, > c2#26], Statistics(sizeInBytes=8.0 EiB) > {noformat} > Real performance test case: > !default.png! !infer.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions
[ https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116576#comment-17116576 ] Apache Spark commented on SPARK-31809: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/28642 > Infer IsNotNull for all children of NullIntolerant expressions > -- > > Key: SPARK-31809 > URL: https://issues.apache.org/jira/browse/SPARK-31809 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: default.png, infer.png > > > We should infer {{IsNotNull}} for all children of {{NullIntolerant}} > expressions. For example: > {code:sql} > CREATE TABLE t1(c1 string, c2 string); > CREATE TABLE t2(c1 string, c2 string); > EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; > {code} > {noformat} > == Physical Plan == > *(4) Project [c1#5, c2#6] > +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner >:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] >: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, > c2#6], Statistics(sizeInBytes=8.0 EiB) >+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] > +- *(2) Filter isnotnull(c1#7) > +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], > Statistics(sizeInBytes=8.0 EiB) > {noformat} > We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query > performance: > {noformat} > == Physical Plan == > *(5) Project [c1#23, c2#24] > +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner >:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#23, 
c2#24), 200), true, > [id=#95] >: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24)) >:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, > c2#24], Statistics(sizeInBytes=8.0 EiB) >+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#25, 200), true, [id=#103] > +- *(3) Filter isnotnull(c1#25) > +- Scan hive default.t2 [c1#25], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, > c2#26], Statistics(sizeInBytes=8.0 EiB) > {noformat} > Real performance test case: > !default.png! !infer.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions
[ https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31809: Assignee: Yuming Wang (was: Apache Spark) > Infer IsNotNull for all children of NullIntolerant expressions > -- > > Key: SPARK-31809 > URL: https://issues.apache.org/jira/browse/SPARK-31809 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Attachments: default.png, infer.png > > > We should infer {{IsNotNull}} for all children of {{NullIntolerant}} > expressions. For example: > {code:sql} > CREATE TABLE t1(c1 string, c2 string); > CREATE TABLE t2(c1 string, c2 string); > EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; > {code} > {noformat} > == Physical Plan == > *(4) Project [c1#5, c2#6] > +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner >:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] >: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, > c2#6], Statistics(sizeInBytes=8.0 EiB) >+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] > +- *(2) Filter isnotnull(c1#7) > +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], > Statistics(sizeInBytes=8.0 EiB) > {noformat} > We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query > performance: > {noformat} > == Physical Plan == > *(5) Project [c1#23, c2#24] > +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner >:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, > [id=#95] >: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24)) >:+- Scan 
hive default.t1 [c1#23, c2#24], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, > c2#24], Statistics(sizeInBytes=8.0 EiB) >+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#25, 200), true, [id=#103] > +- *(3) Filter isnotnull(c1#25) > +- Scan hive default.t2 [c1#25], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, > c2#26], Statistics(sizeInBytes=8.0 EiB) > {noformat} > Real performance test case: > !default.png! !infer.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions
[ https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31809: Assignee: Apache Spark (was: Yuming Wang) > Infer IsNotNull for all children of NullIntolerant expressions > -- > > Key: SPARK-31809 > URL: https://issues.apache.org/jira/browse/SPARK-31809 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Attachments: default.png, infer.png > > > We should infer {{IsNotNull}} for all children of {{NullIntolerant}} > expressions. For example: > {code:sql} > CREATE TABLE t1(c1 string, c2 string); > CREATE TABLE t2(c1 string, c2 string); > EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; > {code} > {noformat} > == Physical Plan == > *(4) Project [c1#5, c2#6] > +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner >:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] >: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, > c2#6], Statistics(sizeInBytes=8.0 EiB) >+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] > +- *(2) Filter isnotnull(c1#7) > +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], > Statistics(sizeInBytes=8.0 EiB) > {noformat} > We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query > performance: > {noformat} > == Physical Plan == > *(5) Project [c1#23, c2#24] > +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner >:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, > [id=#95] >: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24)) >:+- Scan 
hive default.t1 [c1#23, c2#24], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, > c2#24], Statistics(sizeInBytes=8.0 EiB) >+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#25, 200), true, [id=#103] > +- *(3) Filter isnotnull(c1#25) > +- Scan hive default.t2 [c1#25], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, > c2#26], Statistics(sizeInBytes=8.0 EiB) > {noformat} > Real performance test case: > !default.png! !infer.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
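The rule proposed in SPARK-31809 boils down to: a null-intolerant expression is null whenever any child is null, so a non-null result implies every child is non-null, and IsNotNull can therefore be inferred for each child, including non-trivial ones like coalesce(t1.c1, t1.c2). A minimal sketch over a toy expression tree (the classes below are illustrative stand-ins only, not Spark's actual Catalyst expressions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for Catalyst expression nodes (illustrative only).
abstract class Expr {
    abstract List<Expr> children();
    // A null-intolerant expression is null whenever any child is null.
    boolean nullIntolerant() { return false; }
}

class Attr extends Expr {
    final String name;
    Attr(String name) { this.name = name; }
    List<Expr> children() { return List.of(); }
    public String toString() { return name; }
}

class Coalesce extends Expr {               // null-tolerant
    final List<Expr> kids;
    Coalesce(Expr... kids) { this.kids = Arrays.asList(kids); }
    List<Expr> children() { return kids; }
    public String toString() { return "coalesce" + kids; }
}

class EqualTo extends Expr {                // null-intolerant
    final Expr left, right;
    EqualTo(Expr l, Expr r) { left = l; right = r; }
    List<Expr> children() { return Arrays.asList(left, right); }
    boolean nullIntolerant() { return true; }
}

public class InferNotNull {
    // From a predicate known to evaluate to true, collect every
    // sub-expression that must be non-null: recurse through null-intolerant
    // nodes and emit each child (the idea behind SPARK-31809), not just
    // top-level attributes.
    static List<Expr> infer(Expr pred) {
        List<Expr> out = new ArrayList<>();
        if (pred.nullIntolerant()) {
            for (Expr c : pred.children()) {
                out.add(c);
                out.addAll(infer(c));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Expr join = new EqualTo(
            new Coalesce(new Attr("t1.c1"), new Attr("t1.c2")),
            new Attr("t2.c1"));
        // Both join keys must be non-null, including the coalesce itself.
        System.out.println(infer(join));  // [coalesce[t1.c1, t1.c2], t2.c1]
    }
}
```

On the join condition above this yields both isnotnull(coalesce(t1.c1, t1.c2)) and isnotnull(t2.c1), matching the extra Filter nodes in the improved physical plan.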
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lithiumlee-_- updated SPARK-31822: -- Labels: HiveMetastoreCatalog orc (was: ) > Cost too much resources when read orc hive table for infer schema > - > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output, SQL >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > Labels: HiveMetastoreCatalog, orc > > When read a hive orc partitioned table without spark schema properties , > spark read all partitions and all files for infer schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can improved by pass *_partitionFilters_* to > *_fileIndex.listFiles_*. > {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lithiumlee-_- updated SPARK-31822: -- Description: When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files for infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improved by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} was: When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files for infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improve by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} > Cost too much resources when read orc hive table for infer schema > - > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output, SQL >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > > When read a hive orc partitioned table without spark schema properties , > spark read all partitions and all files for infer schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can improved by pass *_partitionFilters_* to > *_fileIndex.listFiles_*. 
> {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code}
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lithiumlee-_- updated SPARK-31822: -- Summary: Cost too much resources when read orc hive table for infer schema (was: Cost too much resources when read orc hive table to infer schema) > Cost too much resources when read orc hive table for infer schema > - > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output, SQL >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > > When read a hive orc partitioned table without spark schema properties , > spark read all partitions and all files for infer schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can improve by pass *_partitionFilters_* to > *_fileIndex.listFiles_*. > {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
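The improvement suggested in SPARK-31822 — passing the query's partition filters into fileIndex.listFiles instead of listFiles(Nil, Nil) — can be sketched with simplified stand-in types (ToyFileIndex and Partition below are hypothetical, not Spark's real FileIndex API):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PartitionPruning {
    // Simplified stand-in for a partition: its spec values and its files.
    record Partition(Map<String, String> spec, List<String> files) {}

    static class ToyFileIndex {
        final List<Partition> partitions;
        ToyFileIndex(List<Partition> partitions) { this.partitions = partitions; }

        // Mirrors the shape of FileIndex.listFiles(partitionFilters, ...):
        // only partitions matching every filter contribute files.
        List<String> listFiles(List<Predicate<Map<String, String>>> partitionFilters) {
            return partitions.stream()
                .filter(p -> partitionFilters.stream().allMatch(f -> f.test(p.spec())))
                .flatMap(p -> p.files().stream())
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        ToyFileIndex index = new ToyFileIndex(List.of(
            new Partition(Map.of("dt", "2020-05-24"), List.of("/t/dt=2020-05-24/f0.orc")),
            new Partition(Map.of("dt", "2020-05-25"), List.of("/t/dt=2020-05-25/f0.orc"))));

        // Current behaviour: no filters, every partition's files are listed
        // just to infer the schema.
        System.out.println(index.listFiles(List.of()).size());  // 2

        // Proposed: pass the query's partition filters so only matching
        // partitions are scanned during schema inference.
        System.out.println(
            index.listFiles(List.of(spec -> "2020-05-25".equals(spec.get("dt")))));
    }
}
```

For a table with many partitions, pruning before listing avoids touching every file just to read one schema.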
[jira] [Commented] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116567#comment-17116567 ] Apache Spark commented on SPARK-31824: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/28641 > DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully > > > Key: SPARK-31824 > URL: https://issues.apache.org/jira/browse/SPARK-31824 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to make > ShuffleMapStage successfully. > But many test case uses complete directly as follows: > complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1 > We need to improve completeShuffleMapStageSuccessfully and reuse it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31824: Assignee: (was: Apache Spark) > DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully > > > Key: SPARK-31824 > URL: https://issues.apache.org/jira/browse/SPARK-31824 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to make > ShuffleMapStage successfully. > But many test case uses complete directly as follows: > complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1 > We need to improve completeShuffleMapStageSuccessfully and reuse it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116565#comment-17116565 ] Apache Spark commented on SPARK-31824: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/28641 > DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully > > > Key: SPARK-31824 > URL: https://issues.apache.org/jira/browse/SPARK-31824 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to make > ShuffleMapStage successfully. > But many test case uses complete directly as follows: > complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1 > We need to improve completeShuffleMapStageSuccessfully and reuse it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31824: Assignee: Apache Spark > DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully > > > Key: SPARK-31824 > URL: https://issues.apache.org/jira/browse/SPARK-31824 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to make > ShuffleMapStage successfully. > But many test case uses complete directly as follows: > complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1 > We need to improve completeShuffleMapStageSuccessfully and reuse it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
jiaan.geng created SPARK-31824: -- Summary: DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully Key: SPARK-31824 URL: https://issues.apache.org/jira/browse/SPARK-31824 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to mark a ShuffleMapStage as completed successfully, but many test cases still call complete directly, as follows: complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1 We need to improve completeShuffleMapStageSuccessfully and reuse it.
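A hedged sketch of what the reusable helper might look like: one method that marks every task of a shuffle-map stage as successful, parameterised by hosts and reducer count, replacing the repeated complete(taskSets(i), Seq((Success, makeMapStatus(...)))) boilerplate. MapStatus, TaskSet, and the completion log below are toy stand-ins for DAGSchedulerSuite's internals, not Spark's classes:

```java
import java.util.ArrayList;
import java.util.List;

public class CompleteStageHelper {
    // Toy stand-ins (illustrative only) for the suite's internal types.
    record MapStatus(String host, int reduces) {}
    record TaskSet(int numTasks) {}

    static final List<List<MapStatus>> completedStages = new ArrayList<>();

    static MapStatus makeMapStatus(String host, int reduces) {
        return new MapStatus(host, reduces);
    }

    // One helper completes every task of a shuffle-map stage successfully,
    // so individual tests no longer spell out the (Success, MapStatus)
    // tuples by hand and can be extended (more hosts, more reducers) in
    // one place.
    static void completeShuffleMapStageSuccessfully(
            TaskSet taskSet, List<String> hosts, int reduces) {
        List<MapStatus> statuses = new ArrayList<>();
        for (int i = 0; i < taskSet.numTasks(); i++) {
            statuses.add(makeMapStatus(hosts.get(i % hosts.size()), reduces));
        }
        completedStages.add(statuses);
    }

    public static void main(String[] args) {
        completeShuffleMapStageSuccessfully(new TaskSet(2), List.of("hostA", "hostB"), 1);
        System.out.println(completedStages);
    }
}
```

Centralising the completion logic this way is what lets the suite reuse one code path instead of hand-rolled complete calls scattered across test cases.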
[jira] [Created] (SPARK-31823) Improve the current Spark Scheduler test framework
jiaan.geng created SPARK-31823: -- Summary: Improve the current Spark Scheduler test framework Key: SPARK-31823 URL: https://issues.apache.org/jira/browse/SPARK-31823 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng The major sources of Spark Scheduler unit test cases are DAGSchedulerSuite, TaskSchedulerImplSuite, and TaskSetManagerSuite. These test suites have played an important role in ensuring the Spark Scheduler behaves as expected; however, we should significantly improve them into a better organized and more extendable test framework, to further support the evolution of the Spark Scheduler. The major limitations of the current Spark Scheduler test framework: * The test framework was designed at a very early stage of Spark, so it doesn't integrate well with features introduced later, e.g. barrier execution, indeterminate stages, zombie task sets, and resource profiles. * Many test cases are added in a hacky way and don't fully utilize or extend the original test framework (though they could have), which leads to a heavy maintenance burden. * The test cases are not well organized: many are appended case by case, and each test file consists of thousands of LOCs. * Flaky test cases are frequently introduced because there is no standard way to generate test data and verify the result.
[jira] [Commented] (SPARK-31821) Remove mssql-jdbc dependencies
[ https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116561#comment-17116561 ] Apache Spark commented on SPARK-31821: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/28640 > Remove mssql-jdbc dependencies > -- > > Key: SPARK-31821 > URL: https://issues.apache.org/jira/browse/SPARK-31821 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Priority: Minor >
[jira] [Assigned] (SPARK-31821) Remove mssql-jdbc dependencies
[ https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31821: Assignee: (was: Apache Spark) > Remove mssql-jdbc dependencies > -- > > Key: SPARK-31821 > URL: https://issues.apache.org/jira/browse/SPARK-31821 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Priority: Minor >
[jira] [Assigned] (SPARK-31821) Remove mssql-jdbc dependencies
[ https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31821: Assignee: Apache Spark > Remove mssql-jdbc dependencies > -- > > Key: SPARK-31821 > URL: https://issues.apache.org/jira/browse/SPARK-31821 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Minor >
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table to infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lithiumlee-_- updated SPARK-31822: -- Component/s: SQL > Cost too much resources when read orc hive table to infer schema > > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output, SQL >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > > When read a hive orc partitioned table without spark schema properties , > spark read all partitions and all files for infer schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can improve by pass *_partitionFilters_* to > *_fileIndex.listFiles_*. > {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table to infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lithiumlee-_- updated SPARK-31822: -- Description: When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files to infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improve by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} was: When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files to infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improve by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} I think > Cost too much resources when read orc hive table to infer schema > > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > > When read a hive orc partitioned table without spark schema properties , > spark read all partitions and all files to infer schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can improve by pass *_partitionFilters_* to > *_fileIndex.listFiles_*. 
> {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code}
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table to infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lithiumlee-_- updated SPARK-31822: -- Description: When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files for infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improve by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} was: When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files to infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improve by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} > Cost too much resources when read orc hive table to infer schema > > > Key: SPARK-31822 > URL: https://issues.apache.org/jira/browse/SPARK-31822 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.3 >Reporter: lithiumlee-_- >Priority: Major > > When read a hive orc partitioned table without spark schema properties , > spark read all partitions and all files for infer schema. > Other settings: native orc mode ; _convertMetastoreOrc = true._ > > And I think it can improve by pass *_partitionFilters_* to > *_fileIndex.listFiles_*. 
> {code:java} > // code placeholder > // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 > val inferredSchema = fileFormat > .inferSchema( > sparkSession, > options, > fileIndex.listFiles(Nil, Nil).flatMap(_.files)) > .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) > {code}
[jira] [Created] (SPARK-31822) Cost too much resources when read orc hive table to infer schema
lithiumlee-_- created SPARK-31822: - Summary: Cost too much resources when read orc hive table to infer schema Key: SPARK-31822 URL: https://issues.apache.org/jira/browse/SPARK-31822 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.4.3 Reporter: lithiumlee-_- When read a hive orc partitioned table without spark schema properties , spark read all partitions and all files to infer schema. Other settings: native orc mode ; _convertMetastoreOrc = true._ And I think it can improve by pass *_partitionFilters_* to *_fileIndex.listFiles_*. {code:java} // code placeholder // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238 val inferredSchema = fileFormat .inferSchema( sparkSession, options, fileIndex.listFiles(Nil, Nil).flatMap(_.files)) .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _)) {code} I think
[jira] [Created] (SPARK-31821) Remove mssql-jdbc dependencies
Gabor Somogyi created SPARK-31821: - Summary: Remove mssql-jdbc dependencies Key: SPARK-31821 URL: https://issues.apache.org/jira/browse/SPARK-31821 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0, 3.1.0 Reporter: Gabor Somogyi
[jira] [Assigned] (SPARK-31820) Flaky JavaBeanDeserializationSuite
[ https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31820: Assignee: Apache Spark > Flaky JavaBeanDeserializationSuite > -- > > Key: SPARK-31820 > URL: https://issues.apache.org/jira/browse/SPARK-31820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The test suite JavaBeanDeserializationSuite sometimes fails with: > {code} > sbt.ForkMain$ForkError: java.lang.AssertionError: > expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=]]> but > was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 
12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=]]> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {code} > See https://github.com/apache/spark/pull/28630#issuecomment-633695723 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31820) Flaky JavaBeanDeserializationSuite
[ https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116494#comment-17116494 ] Apache Spark commented on SPARK-31820: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28639 > Flaky JavaBeanDeserializationSuite > -- > > Key: SPARK-31820 > URL: https://issues.apache.org/jira/browse/SPARK-31820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The test suite JavaBeanDeserializationSuite sometimes fails with: > {code} > sbt.ForkMain$ForkError: java.lang.AssertionError: > expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=]]> but > was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=]]> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {code} > See https://github.com/apache/spark/pull/28630#issuecomment-633695723 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
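The assertion diff above differs only in how whole-second timestamps render their fraction: "12:39:17.0" versus "12:39:17". A plausible source of such a mismatch (a sketch of the rendering difference, not necessarily the suite's actual code path) is that java.sql.Timestamp.toString always emits at least one fractional digit, while java.time renderings drop a zero fraction entirely:

```java
import java.sql.Timestamp;

public class TimestampRendering {
    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2020-05-25 12:39:17");

        // Timestamp.toString pads a zero nanosecond field to ".0" ...
        System.out.println(ts.toString());                    // 2020-05-25 12:39:17.0

        // ... while LocalDateTime.toString omits a zero fraction, so
        // string-level comparisons of the two disagree for whole seconds.
        System.out.println(ts.toLocalDateTime().toString());  // 2020-05-25T12:39:17
    }
}
```

Asserting on formatted strings rather than the underlying epoch values makes a test sensitive to which rendering path produced each side, which is consistent with the intermittent failures reported here.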
[jira] [Assigned] (SPARK-31820) Flaky JavaBeanDeserializationSuite
[ https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31820: Assignee: (was: Apache Spark) > Flaky JavaBeanDeserializationSuite > -- > > Key: SPARK-31820 > URL: https://issues.apache.org/jira/browse/SPARK-31820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The test suite JavaBeanDeserializationSuite sometimes fails with: > {code} > sbt.ForkMain$ForkError: java.lang.AssertionError: > expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17.0,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17.0,nullIntField=]]> but > was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 > 12:39:16.999,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 > 12:39:17,nullIntField=], > JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 > 12:39:17,nullIntField=]]> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:144) > at > test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {code} > See https://github.com/apache/spark/pull/28630#issuecomment-633695723
[jira] [Updated] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31819: -- Summary: Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases (was: Add a workaround for Java 8u251+ and update integration test cases) > Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases > --- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Blocker >
[jira] [Updated] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31819: -- Priority: Blocker (was: Major) > Add a workaround for Java 8u251+ and update integration test cases > -- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Blocker >
[jira] [Created] (SPARK-31820) Flaky JavaBeanDeserializationSuite
Maxim Gekk created SPARK-31820: -- Summary: Flaky JavaBeanDeserializationSuite Key: SPARK-31820 URL: https://issues.apache.org/jira/browse/SPARK-31820 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The test suite JavaBeanDeserializationSuite sometimes fails with: {code} sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 12:39:16.999,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 12:39:17.0,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 12:39:17.0,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 12:39:17.0,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 12:39:17.0,nullIntField=]]> but was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25 12:39:16.999,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25 12:39:17,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25 12:39:17,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25 12:39:17,nullIntField=], JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25 12:39:17,nullIntField=]]> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:834) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) {code} See https://github.com/apache/spark/pull/28630#issuecomment-633695723
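The expected/actual diff above differs only in how a whole-second timestamp is printed: `12:39:17.0` versus `12:39:17`. That mismatch can only occur when the generated timestamp happens to land exactly on a second boundary, which depends on when the test runs - hence the flakiness. A minimal Python sketch of the same effect (illustrative only; the suite itself compares `java.sql.Timestamp` string forms, and `Timestamp.toString` always prints at least one fractional digit):

```python
from datetime import datetime

# A timestamp that lands exactly on a whole second vs. one that does not.
on_second = datetime(2020, 5, 25, 12, 39, 17)
off_second = datetime(2020, 5, 25, 12, 39, 16, 999000)

# Python's str() silently drops the fractional part when it is zero ...
print(str(on_second))    # 2020-05-25 12:39:17
print(str(off_second))   # 2020-05-25 12:39:16.999000

# ... so a renderer that always emits at least one fractional digit
# (as java.sql.Timestamp.toString does, e.g. "12:39:17.0") disagrees
# with it only for whole-second values - a time-dependent mismatch.
always_fraction = on_second.strftime("%Y-%m-%d %H:%M:%S") + ".0"
print(always_fraction)   # 2020-05-25 12:39:17.0
```

A robust assertion would compare normalized values (e.g. epoch millis) rather than rendered strings.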
[jira] [Commented] (SPARK-31794) Incorrect distribution with repartitionByRange and repartition column expression
[ https://issues.apache.org/jira/browse/SPARK-31794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116470#comment-17116470 ] Jungtaek Lim commented on SPARK-31794: -- http://spark.apache.org/docs/3.0.0-preview2/api/scala/org/apache/spark/sql/Dataset.html (The detailed explanation seems to have been added only in the 3.0.0 docs - I hadn't realized it isn't present for Spark 2.4.x. My bad. That's just a doc issue, though, and the behavior is the same for all Spark 2.x.) Please check the description of the "repartition*" methods - click on a method name to expand its description. Given that Spark documents this limitation of the repartition methods, it is not a bug. Anyone is welcome to propose a better solution, but any new solution should also take the existing considerations into account. If you fully understand your data distribution, you may want to get your hands dirty with a custom partitioner - though that seems to be available only for RDDs. > Incorrect distribution with repartitionByRange and repartition column > expression > > > Key: SPARK-31794 > URL: https://issues.apache.org/jira/browse/SPARK-31794 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.5, 3.0.1 > Environment: Sample code for obtaining the above test results.
> import java.io.File > import java.io.PrintWriter > val logfile="/tmp/sparkdftest.log" > val writer = new PrintWriter(logfile) > writer.println("Spark Version " + sc.version) > val df= Range(1, 1002).toDF("val") > writer.println("Default Partition Length:" + df.rdd.partitions.length) > writer.println("Default Partition getNumPartitions:" + > df.rdd.getNumPartitions) > writer.println("Default Partition groupBy spark_partition_id:" + > df.groupBy(spark_partition_id).count().rdd.partitions.length) > val dfcount=df.mapPartitions{part => Iterator(part.size)} > writer.println("Default Partition:" + dfcount.collect().toList) > val numparts=24 > val dfparts_range=df.withColumn("partid", $"val" % > numparts).repartitionByRange(numparts, $"partid") > writer.println("repartitionByRange Length:" + > dfparts_range.rdd.partitions.length) > writer.println("repartitionByRange getNumPartitions:" + > dfparts_range.rdd.getNumPartitions) > writer.println("repartitionByRange groupBy spark_partition_id:" + > dfparts_range.groupBy(spark_partition_id).count().rdd.partitions.length) > val dfpartscount=dfparts_range.mapPartitions{part => Iterator(part.size)} > writer.println("repartitionByRange: " + dfpartscount.collect().toList) > val dfparts_expr=df.withColumn("partid", $"val" % > numparts).repartition(numparts, $"partid") > writer.println("repartition by column expr Length:" + > dfparts_expr.rdd.partitions.length) > writer.println("repartition by column expr getNumPartitions:" + > dfparts_expr.rdd.getNumPartitions) > writer.println("repartition by column expr groupBy spark_partition_id:" + > dfparts_expr.groupBy(spark_partition_id).count().rdd.partitions.length) > val dfpartscount=dfparts_expr.mapPartitions{part => Iterator(part.size)} > writer.println("repartition by column expr:" + dfpartscount.collect().toList) > writer.close() >Reporter: Ramesha Bhatta >Priority: Major > Labels: performance > > Both repartitionByRange and repartition(, ) result in a skewed > distribution within
the resulting partitions. > > With range partitioning, one of the partitions has 2x the volume and the last one has > zero. With repartition this is more problematic, with some partitions at 4x or > 2x the average and many partitions with zero volume. > > This distribution imbalance can cause performance problems in a concurrent > environment. > Details from testing in 3 different versions. > |Version 2.3.2|Version 2.4.5|Version 3.0 Preview2| > |Spark Version 2.3.2.3.1.4.0-315|Spark Version 2.4.5|Spark Version > 3.0.0-preview2| > |Default Partition Length:2|Default Partition Length:2|Default Partition > Length:80| > |Default Partition getNumPartitions:2|Default Partition > getNumPartitions:2|Default Partition getNumPartitions:80| > |Default Partition groupBy spark_partition_id:200|Default Partition groupBy > spark_partition_id:200|Default Partition groupBy spark_partition_id:200| > |repartitionByRange Length:24|repartitionByRange Length:24|repartitionByRange > Length:24| > |repartitionByRange getNumPartitions:24|repartitionByRange > getNumPartitions:24|repartitionByRange getNumPartitions:24| > |repartitionByRange groupBy spark_partition_id:200|repartitionByRange groupBy > spark_partition_id:200|repartitionByRange groupBy spark_partition_id:200| > |repartitionByRange: List(83, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, > 42, 42, 42, 42, 41, 41, 41, 41, 41, 41, 0)|repartitionByRange: List(83, 42, > 42, 42, 42,
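The empty and oversized partitions reported above are consistent with how hash-based `repartition(numparts, $"partid")` places rows: each distinct key lands in bucket `hash(key) % numParts`, and with only 24 distinct `partid` values feeding 24 buckets, birthday-style collisions leave some buckets empty and double up others. The sketch below simulates this in plain Python under an assumed stand-in hash (Spark actually uses Murmur3, so the exact bucket counts differ, but the collision effect is the same):

```python
import hashlib
from collections import Counter

def bucket(key: int, num_parts: int) -> int:
    # Stand-in hash (md5 prefix); Spark really hashes the column value
    # with Murmur3, but any well-mixed hash shows the same collisions.
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:4], "big")
    return h % num_parts

num_parts = 24
# 1001 rows with partid = val % 24, mirroring the repro code above.
rows = [val % num_parts for val in range(1, 1002)]

counts = Counter(bucket(partid, num_parts) for partid in rows)
sizes = [counts.get(p, 0) for p in range(num_parts)]
print(sizes)  # empty buckets and ~2x-average buckets are typical
```

With only 24 distinct keys, some of the 24 buckets receive two or more keys while others receive none, which is exactly the `List(..., 0, ..., 83, ...)` shape in the reported output; `repartitionByRange` instead assigns contiguous key ranges, so it avoids empties except at the boundaries.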
[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116462#comment-17116462 ] Apache Spark commented on SPARK-31819: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28638 > Add a workaround for Java 8u251+ and update integration test cases > -- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31819: Assignee: Apache Spark > Add a workaround for Java 8u251+ and update integration test cases > -- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31819: Assignee: (was: Apache Spark) > Add a workaround for Java 8u251+ and update integration test cases > -- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases
[ https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116461#comment-17116461 ] Apache Spark commented on SPARK-31819: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28638 > Add a workaround for Java 8u251+ and update integration test cases > -- > > Key: SPARK-31819 > URL: https://issues.apache.org/jira/browse/SPARK-31819 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Tests >Affects Versions: 2.4.6 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-31786) Exception on submitting Spark-Pi to Kubernetes 1.17.3
[ https://issues.apache.org/jira/browse/SPARK-31786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116460#comment-17116460 ] Dongjoon Hyun commented on SPARK-31786: --- Okay. I'll create a PR for that, [~maver1ck]. > Exception on submitting Spark-Pi to Kubernetes 1.17.3 > - > > Key: SPARK-31786 > URL: https://issues.apache.org/jira/browse/SPARK-31786 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.5, 3.0.0 >Reporter: Maciej Bryński >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 3.0.0 > > > Hi, > I'm getting exception when submitting Spark-Pi app to Kubernetes cluster. > Kubernetes version: 1.17.3 > JDK version: openjdk version "1.8.0_252" > Exception: > {code} > ./bin/spark-submit --master k8s://https://172.31.23.60:8443 --deploy-mode > cluster --name spark-pi --conf > spark.kubernetes.container.image=spark-py:2.4.5 --conf > spark.kubernetes.executor.request.cores=0.1 --conf > spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf > spark.executor.instances=1 local:///opt/spark/examples/src/main/python/pi.py > log4j:WARN No appenders could be found for logger > (io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] > for kind: [Pod] with name: [null] in namespace: [default] failed. 
> at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330) > at > org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141) > at > org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) > at > org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.SocketException: Broken pipe (Write failed) > at 
java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431) > at sun.security.ssl.OutputRecord.write(OutputRecord.java:417) > at > sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894) > at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865) > at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123) > at okio.Okio$1.write(Okio.java:79) > at okio.AsyncTimeout$1.write(AsyncTimeout.java:180) > at okio.RealBufferedSink.flush(RealBufferedSink.java:224) > at okhttp3.internal.http2.Http2Writer.settings(Http2Writer.java:203) > at > okhttp3.internal.http2.Http2Connection.start(Http2Connection.java:515) > at >
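The `Broken pipe (Write failed)` inside the okhttp HTTP/2 writer above is the known failure mode when Java 8u252+ negotiates HTTP/2 over TLSv1.3 against some Kubernetes API servers. As a hedged sketch of a commonly cited mitigation (an assumption here, not necessarily the fix merged for this ticket): the fabric8 kubernetes-client consults an `HTTP2_DISABLE` environment variable, so forcing HTTP/1.1 before submitting can avoid the failing handshake path:

```shell
# Assumed mitigation: the fabric8 kubernetes-client bundled with
# spark-submit honors HTTP2_DISABLE; setting it keeps the API-server
# connection on HTTP/1.1, sidestepping the TLSv1.3 + HTTP/2 path.
export HTTP2_DISABLE=true

# Then re-run the same submission (cluster address and image are the
# reporter's values from the command above):
# ./bin/spark-submit --master k8s://https://172.31.23.60:8443 \
#   --deploy-mode cluster --name spark-pi \
#   --conf spark.kubernetes.container.image=spark-py:2.4.5 \
#   local:///opt/spark/examples/src/main/python/pi.py
echo "HTTP2_DISABLE=${HTTP2_DISABLE}"
```

If the environment variable is not honored by the client version in use, upgrading to a fabric8 kubernetes-client release with the TLSv1.3/HTTP2 fix is the durable alternative.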