[jira] [Created] (SPARK-31832) Add tool tip for Structured streaming page tables

2020-05-26 Thread jobit mathew (Jira)
jobit mathew created SPARK-31832:


 Summary: Add tool tip for Structured streaming page tables
 Key: SPARK-31832
 URL: https://issues.apache.org/jira/browse/SPARK-31832
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Web UI
Affects Versions: 3.1.0
Reporter: jobit mathew


It would be better to add tooltips to the Structured Streaming page tables.






[jira] [Commented] (SPARK-26646) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2020-05-26 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117400#comment-17117400
 ] 

Jungtaek Lim commented on SPARK-26646:
--

Still happening.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123143/testReport/

Should we disable the test for now?

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26646
> URL: https://issues.apache.org/jira/browse/SPARK-26646
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101356/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101358/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/101254/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100941/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/100327/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition, timeout=60.0)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 69, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 362, in condition
> self.assertGreater(errors[1] - errors[-1], 0.3)
> AssertionError: -0.070062 not greater than 0.3
> --
> Ran 13 tests in 198.327s
> FAILED (failures=1, skipped=1)
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with 
> python3.4; see logs.
> {code}
> It apparently became less flaky after increasing the timeout in SPARK-26275, but 
> it now looks flaky again due to unexpected results.
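For context, a minimal sketch of the retry-until-condition pattern visible in the traceback above. The helper name, polling interval, and the example error values are illustrative assumptions, not Spark's actual implementation; only the final assertion mirrors the one that fails.

{code:python}
import time


def eventually(condition, timeout=60.0):
    """Poll `condition` until it passes or the timeout expires.

    Simplified stand-in for the _eventually helper seen in the traceback.
    """
    start = time.time()
    last_error = None
    while time.time() - start < timeout:
        try:
            if condition():
                return
        except AssertionError as error:
            last_error = error
        time.sleep(0.1)
    raise last_error or AssertionError("condition not met within %.1f seconds" % timeout)


# Hypothetical per-batch error rates; the real test derives these from a
# streaming logistic regression model trained on generated batches.
errors = [0.90, 0.80, 0.30]


def condition():
    # Mirrors the failing assertion: the error should drop by more than 0.3
    # between the second recorded batch and the last one.
    assert errors[1] - errors[-1] > 0.3, "model did not improve enough"
    return True


eventually(condition, timeout=5.0)
{code}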






[jira] [Commented] (SPARK-29137) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction

2020-05-26 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117399#comment-17117399
 ] 

Jungtaek Lim commented on SPARK-29137:
--

Still reproducible on the latest master.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123144/consoleFull

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123146/testReport/

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123141/testReport/

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123142/testReport/

Should we disable the test for now?

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_train_prediction
> --
>
> Key: SPARK-29137
> URL: https://issues.apache.org/jira/browse/SPARK-29137
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110686/testReport/]
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 503, in test_train_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 69, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 498, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.672640157855923 not greater than 2 {code}






[jira] [Created] (SPARK-31831) Flaky test: org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test it is a sbt.testing.SuiteSelector)

2020-05-26 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-31831:


 Summary: Flaky test: 
org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.(It is not a test 
it is a sbt.testing.SuiteSelector)
 Key: SPARK-31831
 URL: https://issues.apache.org/jira/browse/SPARK-31831
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Jungtaek Lim


I've seen this failure twice recently (not in consecutive builds, but close 
together), which seems to warrant investigation.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123147/testReport
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123150/testReport

{noformat}
org.mockito.exceptions.base.MockitoException:  ClassCastException occurred 
while creating the mockito mock :   class to mock : 
'org.apache.hive.service.cli.session.SessionManager', loaded by classloader : 
'sun.misc.Launcher$AppClassLoader@483bf400'   created class : 
'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
classloader : 
'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'   proxy 
instance class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', 
loaded by classloader : 
'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'   instance 
creation by : ObjenesisInstantiator  You might experience classloading issues, 
please ask the mockito mailing-list. 
 Stack Trace
sbt.ForkMain$ForkError: org.mockito.exceptions.base.MockitoException: 
ClassCastException occurred while creating the mockito mock :
  class to mock : 'org.apache.hive.service.cli.session.SessionManager', loaded 
by classloader : 'sun.misc.Launcher$AppClassLoader@483bf400'
  created class : 'org.mockito.codegen.SessionManager$MockitoMock$1696557705', 
loaded by classloader : 
'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
  proxy instance class : 
'org.mockito.codegen.SessionManager$MockitoMock$1696557705', loaded by 
classloader : 'net.bytebuddy.dynamic.loading.MultipleParentClassLoader@47ecf2c6'
  instance creation by : ObjenesisInstantiator

You might experience classloading issues, please ask the mockito mailing-list.

at 
org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite.beforeAll(HiveSessionImplSuite.scala:44)
at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:59)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: sbt.ForkMain$ForkError: java.lang.ClassCastException: 
org.mockito.codegen.SessionManager$MockitoMock$1696557705 cannot be cast to 
org.mockito.internal.creation.bytebuddy.MockAccess
at 
org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:48)
at 
org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25)
at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35)
at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63)
at org.mockito.Mockito.mock(Mockito.java:1908)
at org.mockito.Mockito.mock(Mockito.java:1817)
... 13 more
{noformat}






[jira] [Commented] (SPARK-31696) Support spark.kubernetes.driver.service.annotation

2020-05-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117360#comment-17117360
 ] 

Dongjoon Hyun commented on SPARK-31696:
---

I'll give a talk at Spark Summit next month. :)
- 
https://databricks.com/session_na20/native-support-of-prometheus-monitoring-in-apache-spark-3-0

> Support spark.kubernetes.driver.service.annotation
> --
>
> Key: SPARK-31696
> URL: https://issues.apache.org/jira/browse/SPARK-31696
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-31830) Consistent error handling for datetime formatting functions

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117334#comment-17117334
 ] 

Apache Spark commented on SPARK-31830:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28650

> Consistent error handling for datetime formatting functions
> ---
>
> Key: SPARK-31830
> URL: https://issues.apache.org/jira/browse/SPARK-31830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> date_format and from_unixtime have different error handling behavior for 
> formatting datetime values.






[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31830:


Assignee: Apache Spark

> Consistent error handling for datetime formatting functions
> ---
>
> Key: SPARK-31830
> URL: https://issues.apache.org/jira/browse/SPARK-31830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> date_format and from_unixtime have different error handling behavior for 
> formatting datetime values.






[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31830:


Assignee: (was: Apache Spark)

> Consistent error handling for datetime formatting functions
> ---
>
> Key: SPARK-31830
> URL: https://issues.apache.org/jira/browse/SPARK-31830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> date_format and from_unixtime have different error handling behavior for 
> formatting datetime values.






[jira] [Assigned] (SPARK-31830) Consistent error handling for datetime formatting functions

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31830:


Assignee: Apache Spark

> Consistent error handling for datetime formatting functions
> ---
>
> Key: SPARK-31830
> URL: https://issues.apache.org/jira/browse/SPARK-31830
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> date_format and from_unixtime have different error handling behavior for 
> formatting datetime values.






[jira] [Created] (SPARK-31830) Consistent error handling for datetime formatting functions

2020-05-26 Thread Kent Yao (Jira)
Kent Yao created SPARK-31830:


 Summary: Consistent error handling for datetime formatting 
functions
 Key: SPARK-31830
 URL: https://issues.apache.org/jira/browse/SPARK-31830
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


date_format and from_unixtime have different error handling behavior for 
formatting datetime values.
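To make the inconsistency concrete, here is a small PySpark sketch that calls both functions with the same pattern. The values and the SparkSession settings are illustrative only; how each function reacts to a pattern or value it cannot format is exactly what this ticket is about and may differ by version.

{code:python}
import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[1]")
         .appName("datetime-formatting-example")
         .getOrCreate())

# One timestamp value and its unix-seconds equivalent (illustrative values).
df = spark.createDataFrame(
    [(datetime.datetime(2020, 5, 26, 0, 0, 0), 1590451200)],
    ["dt", "ts"])

# Both expressions format a datetime with a user-supplied pattern; the point of
# this ticket is that, when the pattern or value cannot be handled, the two
# functions do not fail or degrade in the same way.
df.select(
    F.date_format("dt", "yyyy-MM-dd HH:mm:ss").alias("via_date_format"),
    F.from_unixtime("ts", "yyyy-MM-dd HH:mm:ss").alias("via_from_unixtime"),
).show(truncate=False)
{code}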






[jira] [Assigned] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31829:


Assignee: Apache Spark

> Check for partition existence for Insert overwrite if not exists queries on 
> Hive Serde Table before computation
> ---
>
> Key: SPARK-31829
> URL: https://issues.apache.org/jira/browse/SPARK-31829
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Aniket Namadeo Mokashi
>Assignee: Apache Spark
>Priority: Major
>
> If T is a Hive table, Query: INSERT OVERWRITE table T partition(p='existing') 
> IF NOT EXISTS select ... ; executes job/computation on Spark and then avoids 
> loading partitions. It should avoid doing the wasteful computation and exit 
> early.
> For Datasource table, it does avoid the computation and exits early (due to 
> work done in SPARK-20831).






[jira] [Assigned] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31829:


Assignee: (was: Apache Spark)

> Check for partition existence for Insert overwrite if not exists queries on 
> Hive Serde Table before computation
> ---
>
> Key: SPARK-31829
> URL: https://issues.apache.org/jira/browse/SPARK-31829
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Aniket Namadeo Mokashi
>Priority: Major
>
> If T is a Hive table, Query: INSERT OVERWRITE table T partition(p='existing') 
> IF NOT EXISTS select ... ; executes job/computation on Spark and then avoids 
> loading partitions. It should avoid doing the wasteful computation and exit 
> early.
> For Datasource table, it does avoid the computation and exits early (due to 
> work done in SPARK-20831).






[jira] [Commented] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117262#comment-17117262
 ] 

Apache Spark commented on SPARK-31829:
--

User 'aniket486' has created a pull request for this issue:
https://github.com/apache/spark/pull/28649

> Check for partition existence for Insert overwrite if not exists queries on 
> Hive Serde Table before computation
> ---
>
> Key: SPARK-31829
> URL: https://issues.apache.org/jira/browse/SPARK-31829
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Aniket Namadeo Mokashi
>Priority: Major
>
> If T is a Hive table, Query: INSERT OVERWRITE table T partition(p='existing') 
> IF NOT EXISTS select ... ; executes job/computation on Spark and then avoids 
> loading partitions. It should avoid doing the wasteful computation and exit 
> early.
> For Datasource table, it does avoid the computation and exits early (due to 
> work done in SPARK-20831).






[jira] [Updated] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table before computation

2020-05-26 Thread Aniket Namadeo Mokashi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Namadeo Mokashi updated SPARK-31829:
---
Summary: Check for partition existence for Insert overwrite if not exists 
queries on Hive Serde Table before computation  (was: Check for partition 
existence for Insert overwrite if not exists queries on Hive Serde Table)

> Check for partition existence for Insert overwrite if not exists queries on 
> Hive Serde Table before computation
> ---
>
> Key: SPARK-31829
> URL: https://issues.apache.org/jira/browse/SPARK-31829
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.5
>Reporter: Aniket Namadeo Mokashi
>Priority: Major
>
> If T is a Hive table, Query: INSERT OVERWRITE table T partition(p='existing') 
> IF NOT EXISTS select ... ; executes job/computation on Spark and then avoids 
> loading partitions. It should avoid doing the wasteful computation and exit 
> early.
> For Datasource table, it does avoid the computation and exits early (due to 
> work done in SPARK-20831).






[jira] [Created] (SPARK-31829) Check for partition existence for Insert overwrite if not exists queries on Hive Serde Table

2020-05-26 Thread Aniket Namadeo Mokashi (Jira)
Aniket Namadeo Mokashi created SPARK-31829:
--

 Summary: Check for partition existence for Insert overwrite if not 
exists queries on Hive Serde Table
 Key: SPARK-31829
 URL: https://issues.apache.org/jira/browse/SPARK-31829
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5, 2.3.4, 2.2.3, 2.1.3
Reporter: Aniket Namadeo Mokashi


If T is a Hive table, the query INSERT OVERWRITE TABLE T PARTITION (p='existing') 
IF NOT EXISTS SELECT ... still runs the job/computation on Spark and only then 
skips loading the partition. It should skip the wasteful computation and exit early.

For a Datasource table, it already avoids the computation and exits early (due to 
the work done in SPARK-20831).
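A minimal PySpark sketch of the scenario described above. The table name, column names, and seed data are made up for illustration, and it assumes a Hive-enabled SparkSession backed by a working metastore.

{code:python}
from pyspark.sql import SparkSession

# Hive support is required for a Hive serde table; enableHiveSupport() assumes
# a Hive metastore is reachable from this session.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("insert-overwrite-if-not-exists")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS t (value STRING) "
          "PARTITIONED BY (p STRING) STORED AS TEXTFILE")
spark.sql("INSERT OVERWRITE TABLE t PARTITION (p='existing') SELECT 'seed'")

# With IF NOT EXISTS and a partition that already exists, the SELECT below is
# still computed before Spark decides to skip loading the partition -- the
# wasteful work this ticket proposes to short-circuit.
spark.sql("""
    INSERT OVERWRITE TABLE t PARTITION (p='existing') IF NOT EXISTS
    SELECT 'expensive result'
""")
{code}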






[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117253#comment-17117253
 ] 

Apache Spark commented on SPARK-31788:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28648

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> {code}
> rdd1 = sc.parallelize([1,2,3,4,5])
> rdd2 = sc.parallelize([6,7,8,9,10])
> pairRDD1 = rdd1.zip(rdd2)
> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> {code}
> rdd3 = sc.parallelize([11,12,13,14,15])
> pairRDD2 = rdd3.zip(rdd3)
> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> 2.4.5 does not have this regression as below:
> {code}
> rdd4 = sc.parallelize(range(5))
> pairRDD3 = rdd4.zip(rdd4)
> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> unionRDD3.collect()
> {code}
> {code}
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 
> 4)]
> {code}






[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117252#comment-17117252
 ] 

Apache Spark commented on SPARK-31788:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28648

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> {code}
> rdd1 = sc.parallelize([1,2,3,4,5])
> rdd2 = sc.parallelize([6,7,8,9,10])
> pairRDD1 = rdd1.zip(rdd2)
> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> {code}
> rdd3 = sc.parallelize([11,12,13,14,15])
> pairRDD2 = rdd3.zip(rdd3)
> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> 2.4.5 does not have this regression as below:
> {code}
> rdd4 = sc.parallelize(range(5))
> pairRDD3 = rdd4.zip(rdd4)
> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> unionRDD3.collect()
> {code}
> {code}
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 
> 4)]
> {code}






[jira] [Assigned] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31788:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sanket Reddy
>Assignee: Apache Spark
>Priority: Blocker
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> {code}
> rdd1 = sc.parallelize([1,2,3,4,5])
> rdd2 = sc.parallelize([6,7,8,9,10])
> pairRDD1 = rdd1.zip(rdd2)
> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> {code}
> rdd3 = sc.parallelize([11,12,13,14,15])
> pairRDD2 = rdd3.zip(rdd3)
> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> 2.4.5 does not have this regression as below:
> {code}
> rdd4 = sc.parallelize(range(5))
> pairRDD3 = rdd4.zip(rdd4)
> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> unionRDD3.collect()
> {code}
> {code}
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 
> 4)]
> {code}






[jira] [Assigned] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31788:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> {code}
> rdd1 = sc.parallelize([1,2,3,4,5])
> rdd2 = sc.parallelize([6,7,8,9,10])
> pairRDD1 = rdd1.zip(rdd2)
> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> {code}
> rdd3 = sc.parallelize([11,12,13,14,15])
> pairRDD2 = rdd3.zip(rdd3)
> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> {code}
> {code}
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> {code}
> 2.4.5 does not have this regression as below:
> {code}
> rdd4 = sc.parallelize(range(5))
> pairRDD3 = rdd4.zip(rdd4)
> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> unionRDD3.collect()
> {code}
> {code}
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 
> 4)]
> {code}






[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31788:
-
Description: 
Union RDD of Pair RDD's seems to have issues

SparkSession available as 'spark'.

{code}
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([6,7,8,9,10])
pairRDD1 = rdd1.zip(rdd2)
unionRDD1 = sc.union([pairRDD1, pairRDD1])
{code}

{code}
Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870,
in union jrdds[i] = rdds[i]._jrdd
File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221,
in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
{code}


{code}
rdd3 = sc.parallelize([11,12,13,14,15])
pairRDD2 = rdd3.zip(rdd3)
unionRDD2 = sc.union([pairRDD1, pairRDD2])
{code}

{code}
Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] 
= rdds[i]._jrdd File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221, in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
{code}

2.4.5 does not have this regression as below:

{code}
rdd4 = sc.parallelize(range(5))
pairRDD3 = rdd4.zip(rdd4)
unionRDD3 = sc.union([pairRDD1, pairRDD3])
unionRDD3.collect()
{code}

{code}
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 
4)]
{code}



  was:
Union RDD of Pair RDD's seems to have issues

SparkSession available as 'spark'.

{code}
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([6,7,8,9,10])
pairRDD1 = rdd1.zip(rdd2)
unionRDD1 = sc.union([pairRDD1, pairRDD1])
{code}

{code}
Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870,
in union jrdds[i] = rdds[i]._jrdd
File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221,
in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
{code}


{code}
rdd3 = sc.parallelize([11,12,13,14,15])
pairRDD2 = rdd3.zip(rdd3)
unionRDD2 = sc.union([pairRDD1, pairRDD2])
{code}

{code}
Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] 
= rdds[i]._jrdd File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221, in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 

[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31788:
-
 Target Version/s: 3.0.0
Affects Version/s: (was: 3.0.1)
  Description: 
Union RDD of Pair RDD's seems to have issues

SparkSession available as 'spark'.

{code}
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([6,7,8,9,10])
pairRDD1 = rdd1.zip(rdd2)
unionRDD1 = sc.union([pairRDD1, pairRDD1])
{code}

{code}
Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870,
in union jrdds[i] = rdds[i]._jrdd
File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221,
in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
{code}


{code}
rdd3 = sc.parallelize([11,12,13,14,15])
pairRDD2 = rdd3.zip(rdd3)
unionRDD2 = sc.union([pairRDD1, pairRDD2])
{code}

{code}
Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] 
= rdds[i]._jrdd File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221, in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)
{code}

{code}
rdd4 = sc.parallelize(range(5))
pairRDD3 = rdd4.zip(rdd4)
unionRDD3 = sc.union([pairRDD1, pairRDD3])
unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), 
(2, 2), (3, 3), (4, 4)]
{code}

2.4.5 does not have this regression

  was:
Union RDD of Pair RDD's seems to have issues

SparkSession available as 'spark'.

>>> rdd1 = sc.parallelize([1,2,3,4,5])

>>> rdd2 = sc.parallelize([6,7,8,9,10])

>>> pairRDD1 = rdd1.zip(rdd2)

>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])

Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870,

in union jrdds[i] = rdds[i]._jrdd

File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221,

in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)

>>> rdd3 = sc.parallelize([11,12,13,14,15])

>>> pairRDD2 = rdd3.zip(rdd3)

>>> unionRDD2 = sc.union([pairRDD1, pairRDD2])

Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] 
= rdds[i]._jrdd File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221, in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 

[jira] [Reopened] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-31788:
--
  Assignee: Hyukjin Kwon  (was: Sanket Reddy)

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression






[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31788:
-
Priority: Blocker  (was: Major)

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression






[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31788:
-
Fix Version/s: (was: 3.0.0)

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression






[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31788:
-
Component/s: DStreams

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
>     jrdds[i] = rdds[i]._jrdd
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
> py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
> at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
> at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
> at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
>     jrdds[i] = rdds[i]._jrdd
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
> py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
> at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
> at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
> at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect()
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31813) Cannot write snappy-compressed text files

2020-05-26 Thread ZhangShuai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117231#comment-17117231
 ] 

ZhangShuai commented on SPARK-31813:


In my environment, it works fine.
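One way to see whether this is environment-specific is to check whether the Hadoop native library is loaded at all; the text codecs go through libhadoop, unlike Parquet/ORC, which ship their own snappy binding. A minimal diagnostic sketch (it relies on PySpark's internal _jvm gateway, so treat it as a debugging aid rather than a stable API):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-check").getOrCreate()
jvm = spark.sparkContext._jvm

# False here means libhadoop (and therefore its snappy text codec) is not
# available, which matches the "native snappy library not available" error below.
print(jvm.org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded())
{code}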

> Cannot write snappy-compressed text files
> -
>
> Key: SPARK-31813
> URL: https://issues.apache.org/jira/browse/SPARK-31813
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.5
>Reporter: Ondrej Kokes
>Priority: Minor
>
> After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a 
> clean Docker image with default-jre), Spark fails to write text-based files 
> (CSV and JSON) with snappy compression. It can snappy-compress Parquet and 
> ORC, and gzipping CSVs also works.
> This is a clean PySpark installation; the snappy jars are in place:
> {{$ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy}}
> {{snappy-0.2.jar}}
> {{snappy-java-1.1.7.3.jar}}
> Repro 1 (Scala):
> $ spark-shell
> {{spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").parquet("tmp/foo")}}
> spark.sql("select 1").write.option("compression", 
> "snappy").mode("overwrite").csv("tmp/foo")
> The first (parquet) will work, the second one won't.
> Repro 2 (PySpark):
>  {{from pyspark.sql import SparkSession}}
>  {{if __name__ == '__main__':}}
>  {{  spark = SparkSession.builder.appName('snappy_testing').getOrCreate()}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').parquet('tmp/works_fine')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'gzip').mode('overwrite').csv('tmp/also_works')}}
>  {{  spark.sql('select 1').write.option('compression', 
> 'snappy').mode('overwrite').csv('tmp/snappy_not_found')}}
>   
>  In either case I get the following traceback
> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.
> at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
> at org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
> at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
> at org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
> at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
> at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
> at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
> at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
> at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:177)
> at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
> at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
> at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117224#comment-17117224
 ] 

Hyukjin Kwon commented on SPARK-31788:
--

Reverted at 
https://github.com/apache/spark/commit/7fb2275f009c8744560c3247decdc106a8bca86f
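Until a proper fix lands, one possible workaround on affected builds is to union the pair RDDs pairwise through RDD.union, which stays on the JavaPairRDD path instead of going through SparkContext.union. A sketch, not verified against every affected build:

{code:python}
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pair-rdd-union-workaround").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize([6, 7, 8, 9, 10])
pair_rdd = rdd1.zip(rdd2)

# RDD.union works on the zipped (pair) RDDs directly, so it never builds the
# JavaRDD[] array that sc.union([...]) fails to fill with a JavaPairRDD.
unioned = reduce(lambda a, b: a.union(b), [pair_rdd, pair_rdd])
print(unioned.collect())
{code}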

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Major
> Fix For: 3.0.0
>
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
>     jrdds[i] = rdds[i]._jrdd
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
> py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
> at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
> at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
> at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
>     jrdds[i] = rdds[i]._jrdd
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
>   File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
> py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
> at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
> at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
> at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect()
> [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31828) Retain table properties at CreateTableLikeCommand

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117218#comment-17117218
 ] 

Apache Spark commented on SPARK-31828:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28647
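For readers landing here without context, the title suggests that user-defined TBLPROPERTIES of the source table are currently not carried over by CREATE TABLE ... LIKE. A minimal sketch of how one might observe that (table names and the property are hypothetical):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("create-table-like-props")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical tables used only to illustrate the behavior in question.
spark.sql("CREATE TABLE src (id INT) USING parquet TBLPROPERTIES ('owner'='team-a')")
spark.sql("CREATE TABLE dst LIKE src")

# Compare whether the user-defined 'owner' property shows up on the copy.
spark.sql("SHOW TBLPROPERTIES src").show(truncate=False)
spark.sql("SHOW TBLPROPERTIES dst").show(truncate=False)
{code}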

> Retain table properties at CreateTableLikeCommand
> -
>
> Key: SPARK-31828
> URL: https://issues.apache.org/jira/browse/SPARK-31828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31828) Retain table properties at CreateTableLikeCommand

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31828:


Assignee: Apache Spark

> Retain table properties at CreateTableLikeCommand
> -
>
> Key: SPARK-31828
> URL: https://issues.apache.org/jira/browse/SPARK-31828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31828) Retain table properties at CreateTableLikeCommand

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117217#comment-17117217
 ] 

Apache Spark commented on SPARK-31828:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28647

> Retain table properties at CreateTableLikeCommand
> -
>
> Key: SPARK-31828
> URL: https://issues.apache.org/jira/browse/SPARK-31828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31828) Retain table properties at CreateTableLikeCommand

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31828:


Assignee: (was: Apache Spark)

> Retain table properties at CreateTableLikeCommand
> -
>
> Key: SPARK-31828
> URL: https://issues.apache.org/jira/browse/SPARK-31828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31821) Remove mssql-jdbc dependencies

2020-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31821:
-

Assignee: Gabor Somogyi

> Remove mssql-jdbc dependencies
> --
>
> Key: SPARK-31821
> URL: https://issues.apache.org/jira/browse/SPARK-31821
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31821) Remove mssql-jdbc dependencies

2020-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31821.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28640
[https://github.com/apache/spark/pull/28640]

> Remove mssql-jdbc dependencies
> --
>
> Key: SPARK-31821
> URL: https://issues.apache.org/jira/browse/SPARK-31821
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31828) Retain table properties at CreateTableLikeCommand

2020-05-26 Thread ulysses you (Jira)
ulysses you created SPARK-31828:
---

 Summary: Retain table properties at CreateTableLikeCommand
 Key: SPARK-31828
 URL: https://issues.apache.org/jira/browse/SPARK-31828
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes

2020-05-26 Thread Devaraj Kavali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117202#comment-17117202
 ] 

Devaraj Kavali commented on SPARK-31800:


If we look at the log, the message about *krb5.conf* is just an info log, not the 
actual cause of the failure. The actual failure is about 
*spark.kubernetes.file.upload.path*; you can provide any DFS path (S3, HDFS, or any 
other distributed file system) for that config, for example as shown in the sketch below.
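A minimal sketch of that setting (the bucket and credentials are hypothetical); in cluster mode it would normally be passed to spark-submit as --conf spark.kubernetes.file.upload.path=...:

{code:python}
from pyspark import SparkConf

# Hypothetical values: the upload path must point at storage (S3/HDFS/...) that
# both the submitting client and the driver pod can reach.
conf = (SparkConf()
        .set("spark.kubernetes.file.upload.path", "s3a://my-spark-bucket/uploads")
        .set("spark.hadoop.fs.s3a.access.key", "REDACTED")   # placeholder credentials
        .set("spark.hadoop.fs.s3a.secret.key", "REDACTED"))

print(conf.toDebugString())
{code}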

> Unable to disable Kerberos when submitting jobs to Kubernetes
> -
>
> Key: SPARK-31800
> URL: https://issues.apache.org/jira/browse/SPARK-31800
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: James Boylan
>Priority: Major
>
> When you attempt to submit a process to Kubernetes using spark-submit through 
> --master, it returns the exception:
> {code:java}
> 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> Exception in thread "main" org.apache.spark.SparkException: Please specify 
> spark.kubernetes.file.upload.path property.
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246)
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> at scala.collection.immutable.List.foldLeft(List.scala:89)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called
> 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/p1/y24myg413wx1l1l52bsdn2hrgq/T/spark-c94db9c5-b8a8-414d-b01d-f6369d31c9b8
>  {code}
> No changes in settings appear to be able to disable Kerberos. This is when 
> running a simple execution of the SparkPi on our lab cluster. The command 
> being used is
> {code:java}
> ./bin/spark-submit --master k8s://https://{api_hostname} --deploy-mode 
> 

[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys

2020-05-26 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117169#comment-17117169
 ] 

Sunitha Kambhampati commented on SPARK-21784:
-

[~krish_the_coder], [~Tagar], thank you for your interest in this feature. We 
have PRs that are waiting on review, and it would help if you could share your 
use case and level of interest here. We have seen significant improvements with 
it, and it was demonstrated at Spark Summit: 
[https://databricks.com/session/informational-referential-integrity-constraints-support-in-apache-spark]

We would be interested in moving this forward if there is more interest from the 
community and from committers to review and get this in.

> Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign 
> keys
> --
>
> Key: SPARK-21784
> URL: https://issues.apache.org/jira/browse/SPARK-21784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Currently Spark SQL does not have DDL support to define primary key and 
> foreign key constraints. This Jira is to add DDL support for defining primary 
> key and foreign key informational constraints using ALTER TABLE syntax. These 
> constraints will be used in query optimization; you can find more details 
> about this in the spec in SPARK-19842.
> *Syntax:*
> {code}
> ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
>   (PRIMARY KEY (col_names) |
>   FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
>   [VALIDATE | NOVALIDATE] [RELY | NORELY]
> {code}
> Examples:
> {code:sql}
> ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
> ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES 
> employee(empno) NOVALIDATE NORELY
> {code}
> *Constraint name generated by the system:*
> {code:sql}
> ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
> ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) 
> VALIDATE RELY;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases

2020-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31819:
-

Assignee: Dongjoon Hyun

> Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
> ---
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases

2020-05-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117053#comment-17117053
 ] 

Dongjoon Hyun commented on SPARK-31819:
---

Yes. I fixed master/branch-3.0 via SPARK-31786 .

> Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
> ---
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases

2020-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31819.
---
Fix Version/s: 2.4.6
   Resolution: Fixed

Issue resolved by pull request 28638
[https://github.com/apache/spark/pull/28638]

> Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
> ---
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 2.4.6
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27997) kubernetes client token expired

2020-05-26 Thread rameshkrishnan muthusamy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116925#comment-17116925
 ] 

rameshkrishnan muthusamy commented on SPARK-27997:
--

I am currently working on this request. Will be sharing the details of the PR 
and design link soon. 

> kubernetes client token expired 
> 
>
> Key: SPARK-27997
> URL: https://issues.apache.org/jira/browse/SPARK-27997
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi,
> when I try to submit Spark to K8s in cluster mode, I need an auth token to 
> talk with K8s.
> Unfortunately, many cloud providers issue tokens that expire within 10-15 
> minutes, so we need to refresh this token.
> Client mode is even worse, because the scheduler is created in the submit 
> process.
> Should I also make a PR on this? I fixed it by adding a 
> RotatingOAuthTokenProvider and some configuration.
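Not Spark's or the Kubernetes client's actual API, just a generic sketch of the rotating-token idea described above, assuming the cloud provider writes a fresh token to a local file and tokens expire after roughly 10-15 minutes:

{code:python}
import time


class RotatingFileToken:
    """Generic sketch: re-read a provider-issued token file once the cached
    value is older than max_age_s (kept well under the 10-15 minute expiry)."""

    def __init__(self, path, max_age_s=300):
        self.path = path              # hypothetical token file written by the cloud CLI
        self.max_age_s = max_age_s
        self._token = None
        self._fetched_at = 0.0

    def get(self):
        if self._token is None or time.time() - self._fetched_at > self.max_age_s:
            with open(self.path) as f:
                self._token = f.read().strip()
            self._fetched_at = time.time()
        return self._token
{code}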



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases

2020-05-26 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116845#comment-17116845
 ] 

Xiao Li commented on SPARK-31819:
-

[~dongjoon] Is this 2.4 only?

> Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
> ---
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31827) better error message for the JDK bug of stand-alone form

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116825#comment-17116825
 ] 

Apache Spark commented on SPARK-31827:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28646

> better error message for the JDK bug of stand-alone form
> 
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31827) better error message for the JDK bug of stand-alone form

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31827:


Assignee: Apache Spark  (was: Wenchen Fan)

> better error message for the JDK bug of stand-alone form
> 
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31827) better error message for the JDK bug of stand-alone form

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31827:


Assignee: Wenchen Fan  (was: Apache Spark)

> better error message for the JDK bug of stand-alone form
> 
>
> Key: SPARK-31827
> URL: https://issues.apache.org/jira/browse/SPARK-31827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31827) better error message for the JDK bug of stand-alone form

2020-05-26 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31827:
---

 Summary: better error message for the JDK bug of stand-alone form
 Key: SPARK-31827
 URL: https://issues.apache.org/jira/browse/SPARK-31827
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2020-05-26 Thread Itamar Turner-Trauring (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116803#comment-17116803
 ] 

Itamar Turner-Trauring commented on SPARK-23206:


It seems like one of the subtasks has a PR that is basically done and just needs 
someone to review or even just approve it: 
[https://github.com/apache/spark/pull/23340]. Any chance someone could look at 
it?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edward Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
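A small worked example of the unused-memory arithmetic described in the issue, using made-up numbers close to the quoted cluster average:

{code:python}
# Hypothetical application: 100 executors, 16 GB spark.executor.memory, and a
# peak JVM used memory of 35% of that (the cluster-wide average quoted above).
num_executors = 100
executor_memory_gb = 16.0
max_jvm_used_gb = 0.35 * executor_memory_gb      # 5.6 GB actually used at peak

unused_total_gb = num_executors * (executor_memory_gb - max_jvm_used_gb)
print(unused_total_gb)   # 1040.0 GB requested but never used by this application
{code}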



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31822) Cost too much resources when read orc hive table for infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116778#comment-17116778
 ] 

lithiumlee-_- commented on SPARK-31822:
---

I also notice that:
{quote}Spark 2.1.1 introduced a new configuration key: 
spark.sql.hive.caseSensitiveInferenceMode. It had a default setting of 
NEVER_INFER, which kept behavior identical to 2.1.0. However, Spark 2.2.0 
changes this setting’s default value to INFER_AND_SAVE to restore compatibility 
with reading Hive metastore tables whose underlying file schema have mixed-case 
column names. With the INFER_AND_SAVE configuration value, on first access 
Spark will perform schema inference on any Hive metastore table for which it 
has not already saved an inferred schema. Note that schema inference can be a 
very time consuming operation for tables with thousands of partitions. If 
compatibility with mixed-case column names is not a concern, you can safely set 
spark.sql.hive.caseSensitiveInferenceMode to NEVER_INFER to avoid the initial 
overhead of schema inference. Note that with the new default INFER_AND_SAVE 
setting, the results of the schema inference are saved as a metastore key for 
future use. Therefore, the initial schema inference occurs only at a table’s 
first access."
{quote}
This situation can easily be resolved by running "set 
spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER"...

But I do not think that is the best way.

 

 

[https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.2-docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-21-to-22]
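A minimal sketch of the NEVER_INFER workaround mentioned above (the table name is hypothetical):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("skip-orc-schema-inference")
         .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
         .enableHiveSupport()
         .getOrCreate())

# With NEVER_INFER, Spark trusts the metastore schema and skips listing every
# file of the table just to infer a case-sensitive schema.
spark.sql("SELECT * FROM my_orc_partitioned_table LIMIT 10").show()
{code}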

> Cost too much resources when read orc hive table for infer schema
> -
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>  Labels: HiveMetastoreCatalog, orc
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true_.
>  
> I think this can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116777#comment-17116777
 ] 

Hyukjin Kwon commented on SPARK-31763:
--

Please go ahead 

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying inputFiles attribute not present. But I was 
> able to get this functionality with Spark Java. 
> *So is this something missing in PySpark?*
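Until inputFiles() is exposed in the PySpark DataFrame API, one workaround is to go through the internal _jdf handle to the JVM Dataset, which does expose it. A sketch; the S3 path is hypothetical and _jdf is an internal attribute that may change between versions:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("input-files-workaround").getOrCreate()

# Hypothetical source path; any file-based reader works the same way.
df = spark.read.format("parquet").load("s3a://my-bucket/some/prefix")

# _jdf is the internal handle to the JVM Dataset, which does expose inputFiles().
files = list(df._jdf.inputFiles())
print(files)
{code}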



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31826) Support composed type of case class for typed Scala UDF

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31826:


Assignee: (was: Apache Spark)

> Support composed type of case class for typed Scala UDF
> ---
>
> Key: SPARK-31826
> URL: https://issues.apache.org/jira/browse/SPARK-31826
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> After SPARK-30127, typed Scala UDF supports accepting a case class as an input 
> parameter. However, it still does not support types like Seq[T] or Array[T] 
> where T is a case class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31826) Support composed type of case class for typed Scala UDF

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31826:


Assignee: Apache Spark

> Support composed type of case class for typed Scala UDF
> ---
>
> Key: SPARK-31826
> URL: https://issues.apache.org/jira/browse/SPARK-31826
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> After SPARK-30127, typed Scala UDF supports accepting a case class as an input 
> parameter. However, it still does not support types like Seq[T] or Array[T] 
> where T is a case class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31826) Support composed type of case class for typed Scala UDF

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116755#comment-17116755
 ] 

Apache Spark commented on SPARK-31826:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28645

> Support composed type of case class for typed Scala UDF
> ---
>
> Key: SPARK-31826
> URL: https://issues.apache.org/jira/browse/SPARK-31826
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> After SPARK-30127, typed Scala UDF supports accepting a case class as an input 
> parameter. However, it still does not support types like Seq[T] or Array[T] 
> where T is a case class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-26 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116738#comment-17116738
 ] 

Rakesh Raushan commented on SPARK-31763:


Shall I open a PR for this?

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have been trying to list inputFiles that compose my DataSet by using 
> *PySpark* 
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying inputFiles attribute not present. But I was 
> able to get this functionality with Spark Java. 
> *So is this something missing in PySpark?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31826) Support composed type of case class for typed Scala UDF

2020-05-26 Thread wuyi (Jira)
wuyi created SPARK-31826:


 Summary: Support composed type of case class for typed Scala UDF
 Key: SPARK-31826
 URL: https://issues.apache.org/jira/browse/SPARK-31826
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


After SPARK-30127, typed Scala UDF supports accepting a case class as an input 
parameter. However, it still does not support types like Seq[T] or Array[T] 
where T is a case class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31820) Flaky JavaBeanDeserializationSuite

2020-05-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31820.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28639
[https://github.com/apache/spark/pull/28639]

> Flaky JavaBeanDeserializationSuite
> --
>
> Key: SPARK-31820
> URL: https://issues.apache.org/jira/browse/SPARK-31820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The test suite JavaBeanDeserializationSuite sometimes fails with:
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: 
> expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=]]> but 
> was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=]]>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> See https://github.com/apache/spark/pull/28630#issuecomment-633695723



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31820) Flaky JavaBeanDeserializationSuite

2020-05-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31820:
---

Assignee: Maxim Gekk

> Flaky JavaBeanDeserializationSuite
> --
>
> Key: SPARK-31820
> URL: https://issues.apache.org/jira/browse/SPARK-31820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The test suite JavaBeanDeserializationSuite sometimes fails with:
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: 
> expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=]]> but 
> was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=]]>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> See https://github.com/apache/spark/pull/28630#issuecomment-633695723



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes

2020-05-26 Thread James Boylan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116657#comment-17116657
 ] 

James Boylan edited comment on SPARK-31800 at 5/26/20, 11:40 AM:
-

There are a few problems with that:
 # I'm running a local standalone Spark in the environment this is being tested 
for. There is no HDFS to interact with; it leverages S3 for the storage medium.
 # I don't have Kerberos configured at all. We don't leverage it in our 
existing system and I would prefer not to have to leverage it just to support 
Spark 3.0 on Kubernetes as none of our processes require it.
 # It is not honoring the spark.authenticate false configuration property, or 
any other property to try and disable Kerberos. 

 


was (Author: drahkar):
There are a couple problems with that:
 # I'm running a local standalone spark in the environment this is being tested 
for. there is no HDFS to interact with. It leverages S3 for the storage medium.
 # I don't have Kerberos configured at all. We don't leverage it in our 
existing system and I would prefer not to have to leverage it just to support 
Spark 3.0 on Kubernetes as none of our processes require it.
 # It is not honoring the spark.authenticate false configuration property, or 
any other property to try and disable Kerberos. 

 

> Unable to disable Kerberos when submitting jobs to Kubernetes
> -
>
> Key: SPARK-31800
> URL: https://issues.apache.org/jira/browse/SPARK-31800
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: James Boylan
>Priority: Major
>
> When you attempt to submit a process to Kubernetes using spark-submit through 
> --master, it returns the exception:
> {code:java}
> 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> Exception in thread "main" org.apache.spark.SparkException: Please specify 
> spark.kubernetes.file.upload.path property.
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246)
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> at scala.collection.immutable.List.foldLeft(List.scala:89)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>  

[jira] [Commented] (SPARK-31800) Unable to disable Kerberos when submitting jobs to Kubernetes

2020-05-26 Thread James Boylan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116657#comment-17116657
 ] 

James Boylan commented on SPARK-31800:
--

There are a couple of problems with that:
 # I'm running a local standalone Spark in the environment this is being tested 
for. There is no HDFS to interact with; it leverages S3 as the storage medium.
 # I don't have Kerberos configured at all. We don't use it in our existing 
system, and I would prefer not to have to adopt it just to support Spark 3.0 on 
Kubernetes, as none of our processes require it.
 # It is not honoring the spark.authenticate=false configuration property, or 
any other property, to try to disable Kerberos. 

 

> Unable to disable Kerberos when submitting jobs to Kubernetes
> -
>
> Key: SPARK-31800
> URL: https://issues.apache.org/jira/browse/SPARK-31800
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: James Boylan
>Priority: Major
>
> When you attempt to submit an application to Kubernetes using spark-submit 
> through --master, it fails with the following exception:
> {code:java}
> 20/05/22 20:25:54 INFO KerberosConfDriverFeatureStep: You have not specified 
> a krb5.conf file locally or via a ConfigMap. Make sure that you have the 
> krb5.conf locally on the driver image.
> Exception in thread "main" org.apache.spark.SparkException: Please specify 
> spark.kubernetes.file.upload.path property.
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:290)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:246)
> at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
> at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:245)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:165)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:163)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
> at scala.collection.immutable.List.foldLeft(List.scala:89)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:98)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:221)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:215)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2539)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:215)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:188)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 20/05/22 20:25:54 INFO ShutdownHookManager: Shutdown hook called
> 20/05/22 20:25:54 INFO ShutdownHookManager: Deleting directory 
> 

[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2020-05-26 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116653#comment-17116653
 ] 

Jungtaek Lim commented on SPARK-23539:
--

You can ignore the Affects Version field in most cases if the issue type is a 
new feature or improvement.

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Dongjin Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka headers were added in 0.11. We should expose them through our Kafka 
> data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).
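
For reference, a minimal sketch of reading the headers with the 3.0 Kafka 
source (the broker and topic names are placeholders, and it assumes the 
spark-sql-kafka-0-10 package is on the classpath):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-headers-sketch")
  .getOrCreate()

// includeHeaders exposes a headers column of array<struct<key: string, value: binary>>
// next to the usual key/value columns.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("includeHeaders", "true")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
{code}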



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'

2020-05-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31771.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28637
[https://github.com/apache/spark/pull/28637]

> Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
> -
>
> Key: SPARK-31771
> URL: https://issues.apache.org/jira/browse/SPARK-31771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Five consecutive pattern characters of 'G/M/L/E/u/Q/q' mean the Narrow text 
> style in java.time.DateTimeFormatterBuilder, which outputs only the leading 
> single letter of the value, e.g. `December` becomes `D`, while in Spark 2.4 
> they meant the Full text style.
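
A minimal sketch of the difference described above, using the JDK formatter 
directly (the locale and date are arbitrary choices):

{code:scala}
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val date = LocalDate.of(2019, 12, 31)

// Five pattern letters select the narrow text form in java.time,
// so the month renders as its leading letter only.
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(date)  // "D"

// Four letters select the full text form, which is what five letters
// produced in Spark 2.4's legacy formatter.
DateTimeFormatter.ofPattern("MMMM", Locale.US).format(date)   // "December"
{code}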



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'

2020-05-26 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31771:
---

Assignee: Kent Yao

> Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
> -
>
> Key: SPARK-31771
> URL: https://issues.apache.org/jira/browse/SPARK-31771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Five consecutive pattern characters of 'G/M/L/E/u/Q/q' mean the Narrow text 
> style in java.time.DateTimeFormatterBuilder, which outputs only the leading 
> single letter of the value, e.g. `December` becomes `D`, while in Spark 2.4 
> they meant the Full text style.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116629#comment-17116629
 ] 

Apache Spark commented on SPARK-31762:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28643

> Fix perf regression of date/timestamp formatting in toHiveString
> 
>
> Key: SPARK-31762
> URL: https://issues.apache.org/jira/browse/SPARK-31762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> HiveResult.toHiveString has to convert incoming Java date/timestamp types to 
> days/microseconds because the existing APIs of DateFormatter/TimestampFormatter 
> don't accept java.sql.Timestamp/java.util.Date or 
> java.time.Instant/java.time.LocalDate. Internally, the formatters convert back 
> to Java types again. This badly impacts performance. The ticket aims to add 
> new APIs to DateFormatter and TimestampFormatter that accept Java types.
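
A rough sketch of the round trip described above; the helper names are 
illustrative stand-ins, not Spark's actual internal API:

{code:scala}
import java.time.Instant
import java.util.concurrent.TimeUnit

// Caller side: a toHiveString-style conversion of an Instant to microseconds.
def instantToMicros(i: Instant): Long =
  TimeUnit.SECONDS.toMicros(i.getEpochSecond) +
    TimeUnit.NANOSECONDS.toMicros(i.getNano.toLong)

// Formatter side: an API that only accepts microseconds has to rebuild the
// java.time value before it can format it.
def formatMicros(micros: Long): String = {
  val secs  = TimeUnit.MICROSECONDS.toSeconds(micros)
  val nanos = TimeUnit.MICROSECONDS.toNanos(micros - TimeUnit.SECONDS.toMicros(secs))
  Instant.ofEpochSecond(secs, nanos).toString
}

// Two conversions for one value; APIs that accept the Java types directly
// would skip this round trip.
formatMicros(instantToMicros(Instant.now()))
{code}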



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116628#comment-17116628
 ] 

Apache Spark commented on SPARK-31762:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28643

> Fix perf regression of date/timestamp formatting in toHiveString
> 
>
> Key: SPARK-31762
> URL: https://issues.apache.org/jira/browse/SPARK-31762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> HiveResult.toHiveString has to convert incoming Java date/timestamp types to 
> days/microseconds because the existing APIs of DateFormatter/TimestampFormatter 
> don't accept java.sql.Timestamp/java.util.Date or 
> java.time.Instant/java.time.LocalDate. Internally, the formatters convert back 
> to Java types again. This badly impacts performance. The ticket aims to add 
> new APIs to DateFormatter and TimestampFormatter that accept Java types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2020-05-26 Thread Martin Andersson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116607#comment-17116607
 ] 

Martin Andersson commented on SPARK-23539:
--

Why does it say {{Affects Version/s: 2.3.0}} when it was only included in 3.0.0?

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Dongjin Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka headers were added in 0.11. We should expose them through our Kafka 
> data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31825) Spark History Server UI does not come up when hosted on a custom path

2020-05-26 Thread Abhishek Rao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Rao updated SPARK-31825:
-
Attachment: Faulty Spark History UI.PNG

> Spark History Server UI does not come up when hosted on a custom path
> -
>
> Key: SPARK-31825
> URL: https://issues.apache.org/jira/browse/SPARK-31825
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5
> Environment: Bring up Spark-History Server on any linux machine using 
> start-history-server.sh script.
>Reporter: Abhishek Rao
>Priority: Major
> Attachments: Faulty Spark History UI.PNG
>
>
> I tried to bring up the Spark History Server using the start-history-server.sh 
> script. The UI works perfectly fine when no path is specified,
> i.e. http://:18080
> But if I bring up the History Server on a custom path, the UI does not work 
> properly.
> Following is my configuration:
> spark.history.fs.logDirectory=
> spark.ui.proxyBase=/test
> When I hit the url  http://:18080/test, I do not 
> see the History Server UI rendering properly. Attaching a screenshot of the 
> faulty UI.
> I wanted to know if I'm missing any configuration.
>  
> !image-2020-05-26-15-26-21-616.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31825) Spark History Server UI does not come up when hosted on a custom path

2020-05-26 Thread Abhishek Rao (Jira)
Abhishek Rao created SPARK-31825:


 Summary: Spark History Server UI does not come up when hosted on a 
custom path
 Key: SPARK-31825
 URL: https://issues.apache.org/jira/browse/SPARK-31825
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5
 Environment: Bring up Spark-History Server on any linux machine using 
start-history-server.sh script.
Reporter: Abhishek Rao


I tried to bring up the Spark History Server using the start-history-server.sh 
script. The UI works perfectly fine when no path is specified,

i.e. http://:18080

But if I bring up the History Server on a custom path, the UI does not work 
properly.

Following is my configuration:

spark.history.fs.logDirectory=
spark.ui.proxyBase=/test

When I hit the url  http://:18080/test, I do not see 
the History Server UI rendering properly. Attaching a screenshot of the faulty 
UI.

I wanted to know if I'm missing any configuration.

 

!image-2020-05-26-15-26-21-616.png!

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116577#comment-17116577
 ] 

Apache Spark commented on SPARK-31809:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/28642

> Infer IsNotNull for all children of NullIntolerant expressions
> --
>
> Key: SPARK-31809
> URL: https://issues.apache.org/jira/browse/SPARK-31809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>  +- *(2) Filter isnotnull(c1#7)
> +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, 
> [id=#95]
>: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
>:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, 
> c2#24], Statistics(sizeInBytes=8.0 EiB)
>+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
>  +- *(3) Filter isnotnull(c1#25)
> +- Scan hive default.t2 [c1#25], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, 
> c2#26], Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> Real performance test case:
>  !default.png!  !infer.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116576#comment-17116576
 ] 

Apache Spark commented on SPARK-31809:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/28642

> Infer IsNotNull for all children of NullIntolerant expressions
> --
>
> Key: SPARK-31809
> URL: https://issues.apache.org/jira/browse/SPARK-31809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>  +- *(2) Filter isnotnull(c1#7)
> +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, 
> [id=#95]
>: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
>:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, 
> c2#24], Statistics(sizeInBytes=8.0 EiB)
>+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
>  +- *(3) Filter isnotnull(c1#25)
> +- Scan hive default.t2 [c1#25], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, 
> c2#26], Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> Real performance test case:
>  !default.png!  !infer.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31809:


Assignee: Yuming Wang  (was: Apache Spark)

> Infer IsNotNull for all children of NullIntolerant expressions
> --
>
> Key: SPARK-31809
> URL: https://issues.apache.org/jira/browse/SPARK-31809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>  +- *(2) Filter isnotnull(c1#7)
> +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, 
> [id=#95]
>: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
>:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, 
> c2#24], Statistics(sizeInBytes=8.0 EiB)
>+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
>  +- *(3) Filter isnotnull(c1#25)
> +- Scan hive default.t2 [c1#25], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, 
> c2#26], Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> Real performance test case:
>  !default.png!  !infer.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31809) Infer IsNotNull for all children of NullIntolerant expressions

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31809:


Assignee: Apache Spark  (was: Yuming Wang)

> Infer IsNotNull for all children of NullIntolerant expressions
> --
>
> Key: SPARK-31809
> URL: https://issues.apache.org/jira/browse/SPARK-31809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>:- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>: +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>+- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>  +- *(2) Filter isnotnull(c1#7)
> +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>:- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, 
> [id=#95]
>: +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
>:+- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, 
> c2#24], Statistics(sizeInBytes=8.0 EiB)
>+- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
>  +- *(3) Filter isnotnull(c1#25)
> +- Scan hive default.t2 [c1#25], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, 
> c2#26], Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> Real performance test case:
>  !default.png!  !infer.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lithiumlee-_- updated SPARK-31822:
--
Labels: HiveMetastoreCatalog orc  (was: )

> Cost too much resources when read orc hive table for infer schema
> -
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>  Labels: HiveMetastoreCatalog, orc
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true._
>  
> And I think it can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lithiumlee-_- updated SPARK-31822:
--
Description: 
When read a hive orc partitioned table without spark schema properties , spark 
read all partitions and all files for infer schema. 

Other settings: native orc mode ; _convertMetastoreOrc = true._

 

And I think it can improved by pass  *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
 

 

  was:
When read a hive orc partitioned table without spark schema properties , spark 
read all partitions and all files for infer schema. 

Other settings: native orc mode ; _convertMetastoreOrc = true._

 

And I think it can improve by pass  *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
 

 


> Cost too much resources when read orc hive table for infer schema
> -
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true._
>  
> And I think it can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lithiumlee-_- updated SPARK-31822:
--
Summary: Cost too much resources when read orc hive table for infer schema  
(was: Cost too much resources when read orc hive table to infer schema)

> Cost too much resources when read orc hive table for infer schema
> -
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true._
>  
> And I think it can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116567#comment-17116567
 ] 

Apache Spark commented on SPARK-31824:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/28641

> DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
> 
>
> Key: SPARK-31824
> URL: https://issues.apache.org/jira/browse/SPARK-31824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to complete a 
> ShuffleMapStage successfully.
> But many test cases call complete directly, as follows:
> complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1))))
> We need to improve completeShuffleMapStageSuccessfully and reuse it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31824:


Assignee: (was: Apache Spark)

> DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
> 
>
> Key: SPARK-31824
> URL: https://issues.apache.org/jira/browse/SPARK-31824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to complete a 
> ShuffleMapStage successfully.
> But many test cases call complete directly, as follows:
> complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1))))
> We need to improve completeShuffleMapStageSuccessfully and reuse it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116565#comment-17116565
 ] 

Apache Spark commented on SPARK-31824:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/28641

> DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
> 
>
> Key: SPARK-31824
> URL: https://issues.apache.org/jira/browse/SPARK-31824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to complete a 
> ShuffleMapStage successfully.
> But many test cases call complete directly, as follows:
> complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1))))
> We need to improve completeShuffleMapStageSuccessfully and reuse it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31824:


Assignee: Apache Spark

> DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully
> 
>
> Key: SPARK-31824
> URL: https://issues.apache.org/jira/browse/SPARK-31824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to complete a 
> ShuffleMapStage successfully.
> But many test cases call complete directly, as follows:
> complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1))))
> We need to improve completeShuffleMapStageSuccessfully and reuse it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31824) DAGSchedulerSuite: Improve and reuse completeShuffleMapStageSuccessfully

2020-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31824:
--

 Summary: DAGSchedulerSuite: Improve and reuse 
completeShuffleMapStageSuccessfully
 Key: SPARK-31824
 URL: https://issues.apache.org/jira/browse/SPARK-31824
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


DAGSchedulerSuite provides completeShuffleMapStageSuccessfully to complete a 
ShuffleMapStage successfully.
But many test cases call complete directly, as follows:
complete(taskSets(0), Seq((Success, makeMapStatus("hostA", 1))))
We need to improve completeShuffleMapStageSuccessfully and reuse it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31823) Improve the current Spark Scheduler test framework

2020-05-26 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31823:
--

 Summary: Improve the current Spark Scheduler test framework
 Key: SPARK-31823
 URL: https://issues.apache.org/jira/browse/SPARK-31823
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: jiaan.geng


The major sources of Spark Scheduler unit test cases are DAGSchedulerSuite, 
TaskSchedulerImplSuite, and TaskSetManagerSuite. These test suites have played 
an important role in ensuring the Spark Scheduler behaves as we expect. 
However, we should significantly improve them now to provide a better organized 
and more extensible test framework, to further support the evolution of the 
Spark Scheduler.

The major limitations of the current Spark Scheduler test framework:
* The test framework was designed at a very early stage of Spark, so it doesn't 
integrate well with features introduced later, e.g. barrier execution, 
indeterminate stages, zombie task sets, and resource profiles.
* Many test cases were added in a hacky way and don't fully utilize or extend 
the original test framework (while they could have), which leads to a heavy 
maintenance burden.
* The test cases are not well organized: many were appended case by case, and 
each test file consists of thousands of lines of code.
* Flaky test cases are frequently introduced because there is no standard way 
to generate test data and verify the results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31821) Remove mssql-jdbc dependencies

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116561#comment-17116561
 ] 

Apache Spark commented on SPARK-31821:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/28640

> Remove mssql-jdbc dependencies
> --
>
> Key: SPARK-31821
> URL: https://issues.apache.org/jira/browse/SPARK-31821
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31821) Remove mssql-jdbc dependencies

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31821:


Assignee: (was: Apache Spark)

> Remove mssql-jdbc dependencies
> --
>
> Key: SPARK-31821
> URL: https://issues.apache.org/jira/browse/SPARK-31821
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31821) Remove mssql-jdbc dependencies

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31821:


Assignee: Apache Spark

> Remove mssql-jdbc dependencies
> --
>
> Key: SPARK-31821
> URL: https://issues.apache.org/jira/browse/SPARK-31821
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table to infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lithiumlee-_- updated SPARK-31822:
--
Component/s: SQL

> Cost too much resources when read orc hive table to infer schema
> 
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true._
>  
> And I think it can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table to infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lithiumlee-_- updated SPARK-31822:
--
Description: 
When read a hive orc partitioned table without spark schema properties , spark 
read all partitions and all files to infer schema. 

Other settings: native orc mode ; _convertMetastoreOrc = true._

 

And I think it can improve by pass  *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
 

 

  was:
When read a hive orc partitioned table without spark schema properties , spark 
read all partitions and all files to infer schema. 

Other settings: native orc mode ; _convertMetastoreOrc = true._

 

And I think it can improve by pass  *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
I think 

 

 

 


> Cost too much resources when read orc hive table to infer schema
> 
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true._
>  
> And I think it can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table to infer schema

2020-05-26 Thread lithiumlee-_- (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lithiumlee-_- updated SPARK-31822:
--
Description: 
When read a hive orc partitioned table without spark schema properties , spark 
read all partitions and all files for infer schema. 

Other settings: native orc mode ; _convertMetastoreOrc = true._

 

And I think it can improve by pass  *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
 

 

  was:
When read a hive orc partitioned table without spark schema properties , spark 
read all partitions and all files to infer schema. 

Other settings: native orc mode ; _convertMetastoreOrc = true._

 

And I think it can improve by pass  *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
 

 


> Cost too much resources when read orc hive table to infer schema
> 
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.3
>Reporter: lithiumlee-_-
>Priority: Major
>
> When reading a Hive ORC partitioned table without Spark schema properties, 
> Spark reads all partitions and all files to infer the schema. 
> Other settings: native ORC mode; _convertMetastoreOrc = true._
>  
> And I think it can be improved by passing *_partitionFilters_* to 
> *_fileIndex.listFiles_*.
> {code:java}
> // code placeholder
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
> sparkSession,
> options,
> fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31822) Cost too much resources when read orc hive table to infer schema

2020-05-26 Thread lithiumlee-_- (Jira)
lithiumlee-_- created SPARK-31822:
-

 Summary: Cost too much resources when read orc hive table to infer 
schema
 Key: SPARK-31822
 URL: https://issues.apache.org/jira/browse/SPARK-31822
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.4.3
Reporter: lithiumlee-_-


When reading a Hive ORC partitioned table without Spark schema properties, Spark 
reads all partitions and all files to infer the schema.

Other settings: native ORC mode; _convertMetastoreOrc = true._

 

And I think it can be improved by passing *_partitionFilters_* to 
*_fileIndex.listFiles_*.
{code:java}
// code placeholder
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
sparkSession,
options,
fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
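
A rough sketch of the reporter's idea, for illustration only: 
*_partitionFilters_* is the suggested parameter here, and whether pruned 
partition filters are actually available at this call site is exactly what the 
ticket would need to work out.

{code:scala}
// Hypothetical variant of the snippet above: list only the partitions that
// survive pruning instead of every partition of the table.
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(partitionFilters, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
{code}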

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31821) Remove mssql-jdbc dependencies

2020-05-26 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-31821:
-

 Summary: Remove mssql-jdbc dependencies
 Key: SPARK-31821
 URL: https://issues.apache.org/jira/browse/SPARK-31821
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0, 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31820) Flaky JavaBeanDeserializationSuite

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31820:


Assignee: Apache Spark

> Flaky JavaBeanDeserializationSuite
> --
>
> Key: SPARK-31820
> URL: https://issues.apache.org/jira/browse/SPARK-31820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The test suite JavaBeanDeserializationSuite sometimes fails with:
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: 
> expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=]]> but 
> was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=]]>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> See https://github.com/apache/spark/pull/28630#issuecomment-633695723



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31820) Flaky JavaBeanDeserializationSuite

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116494#comment-17116494
 ] 

Apache Spark commented on SPARK-31820:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28639

> Flaky JavaBeanDeserializationSuite
> --
>
> Key: SPARK-31820
> URL: https://issues.apache.org/jira/browse/SPARK-31820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The test suite JavaBeanDeserializationSuite sometimes fails with:
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: 
> expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=]]> but 
> was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=]]>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> See https://github.com/apache/spark/pull/28630#issuecomment-633695723



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31820) Flaky JavaBeanDeserializationSuite

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31820:


Assignee: (was: Apache Spark)

> Flaky JavaBeanDeserializationSuite
> --
>
> Key: SPARK-31820
> URL: https://issues.apache.org/jira/browse/SPARK-31820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The test suite JavaBeanDeserializationSuite sometimes fails with:
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: 
> expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17.0,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17.0,nullIntField=]]> but 
> was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
>  12:39:16.999,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
>  12:39:17,nullIntField=], 
> JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
>  12:39:17,nullIntField=]]>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {code}
> See https://github.com/apache/spark/pull/28630#issuecomment-633695723



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31819) Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases

2020-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31819:
--
Summary: Add a workaround for Java 8u251+/K8s 1.17 and update integration 
test cases  (was: Add a workaround for Java 8u251+ and update integration test 
cases)

> Add a workaround for Java 8u251+/K8s 1.17 and update integration test cases
> ---
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases

2020-05-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31819:
--
Priority: Blocker  (was: Major)

> Add a workaround for Java 8u251+ and update integration test cases
> --
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31820) Flaky JavaBeanDeserializationSuite

2020-05-26 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31820:
--

 Summary: Flaky JavaBeanDeserializationSuite
 Key: SPARK-31820
 URL: https://issues.apache.org/jira/browse/SPARK-31820
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


The test suite JavaBeanDeserializationSuite sometimes fails with:
{code}
sbt.ForkMain$ForkError: java.lang.AssertionError: 
expected:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
 12:39:16.999,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
 12:39:17.0,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
 12:39:17.0,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
 12:39:17.0,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
 12:39:17.0,nullIntField=]]> but 
was:<[JavaBeanDeserializationSuite.RecordSpark22000[shortField=0,intField=0,longField=0,floatField=0.0,doubleField=0.0,stringField=0,booleanField=true,timestampField=2020-05-25
 12:39:16.999,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=1,intField=1,longField=1,floatField=1.0,doubleField=1.0,stringField=1,booleanField=false,timestampField=2020-05-25
 12:39:17,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=2,intField=2,longField=2,floatField=2.0,doubleField=2.0,stringField=2,booleanField=true,timestampField=2020-05-25
 12:39:17,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=3,intField=3,longField=3,floatField=3.0,doubleField=3.0,stringField=3,booleanField=false,timestampField=2020-05-25
 12:39:17,nullIntField=], 
JavaBeanDeserializationSuite.RecordSpark22000[shortField=4,intField=4,longField=4,floatField=4.0,doubleField=4.0,stringField=4,booleanField=true,timestampField=2020-05-25
 12:39:17,nullIntField=]]>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
test.org.apache.spark.sql.JavaBeanDeserializationSuite.testSpark22000(JavaBeanDeserializationSuite.java:165)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
{code}

See https://github.com/apache/spark/pull/28630#issuecomment-633695723
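
The expected and actual rows differ only in how whole-second timestamps are
rendered ("12:39:17.0" vs "12:39:17"). A minimal sketch of the two rendering
behaviors, assuming the mismatch comes from java.sql.Timestamp.toString versus
a java.time-based rendering (illustration only, not the suite's actual code):
{code}
import java.sql.Timestamp

val ts = Timestamp.valueOf("2020-05-25 12:39:17")  // whole second, nanos == 0

// java.sql.Timestamp.toString always keeps at least one fractional digit
println(ts.toString)                  // 2020-05-25 12:39:17.0

// java.time drops a zero fraction entirely when printing
println(ts.toLocalDateTime.toString)  // 2020-05-25T12:39:17
{code}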



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31794) Incorrect distribution with repartitionByRange and repartition column expression

2020-05-26 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116470#comment-17116470
 ] 

Jungtaek Lim commented on SPARK-31794:
--

http://spark.apache.org/docs/3.0.0-preview2/api/scala/org/apache/spark/sql/Dataset.html
(The detailed explanation seems to have been added only in 3.0.0 - I hadn't 
noticed it is missing from the Spark 2.4.x docs. My bad. That's only a 
documentation gap, though; the behavior itself applies to all Spark 2.x as 
well.)

Please check the description of the "repartition*" methods - click on a method 
name to expand the description.

Given that Spark documents this limitation of the repartition methods, it isn't 
really a bug. Anyone is welcome to propose a better solution, but any new 
solution should also take the existing considerations into account.

If you fully understand your data distribution, you'll want to get your hands 
dirty with a custom partitioner - though that seems to be available only for 
RDDs (see the sketch below).
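
For RDDs a custom Partitioner gives that exact control. A minimal sketch,
reusing the df/numparts from the sample code in this ticket (the partitioner
class and variable names here are illustrative, not an existing API):
{code}
import org.apache.spark.Partitioner

// Sends each record to the partition named by its precomputed key, avoiding
// the hash collisions that make repartition(n, col) uneven for small n.
class IdentityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % numPartitions
}

val byPartId = df.rdd
  .map(row => (row.getAs[Int]("val") % numparts, row))  // key = target partition
  .partitionBy(new IdentityPartitioner(numparts))
  .values

// Each partition now holds exactly the rows whose key mapped to it.
byPartId.mapPartitions(part => Iterator(part.size)).collect()
{code}
The trade-off is dropping down to the RDD API and converting back to a
DataFrame afterwards if needed.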

> Incorrect distribution with repartitionByRange and repartition column 
> expression
> 
>
> Key: SPARK-31794
> URL: https://issues.apache.org/jira/browse/SPARK-31794
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.5, 3.0.1
> Environment: Sample code for obtaining the above test results.
> import java.io.File 
> import java.io.PrintWriter 
> val logfile="/tmp/sparkdftest.log"
> val writer = new PrintWriter(logfile) 
> writer.println("Spark Version " + sc.version)
> val df= Range(1, 1002).toDF("val")
> writer.println("Default Partition Length:" + df.rdd.partitions.length)
> writer.println("Default Partition getNumPartitions:" + 
> df.rdd.getNumPartitions)
> writer.println("Default Partition groupBy spark_partition_id:" + 
> df.groupBy(spark_partition_id).count().rdd.partitions.length)
> val dfcount = df.mapPartitions { part => Iterator(part.size) }
> writer.println("Default Partition:" + dfcount.collect().toList)
> val numparts=24
> val dfparts_range=df.withColumn("partid", $"val" % 
> numparts).repartitionByRange(numparts, $"partid")
> writer.println("repartitionByRange Length:" + 
> dfparts_range.rdd.partitions.length)
> writer.println("repartitionByRange getNumPartitions:" + 
> dfparts_range.rdd.getNumPartitions)
> writer.println("repartitionByRange groupBy spark_partition_id:" + 
> dfparts_range.groupBy(spark_partition_id).count().rdd.partitions.length)
> val dfpartscount = dfparts_range.mapPartitions { part => Iterator(part.size) }
> writer.println("repartitionByRange: " + dfpartscount.collect().toList)
> val dfparts_expr=df.withColumn("partid", $"val" % 
> numparts).repartition(numparts, $"partid")
> writer.println("repartition by column expr Length:" + 
> dfparts_expr.rdd.partitions.length)
> writer.println("repartition by column expr getNumPartitions:" + 
> dfparts_expr.rdd.getNumPartitions)
> writer.println("repartition by column expr groupBy spark_partitoin_id:" + 
> dfparts_expr.groupBy(spark_partition_id).count().rdd.partitions.length)
> val dfpartscount = dfparts_expr.mapPartitions { part => Iterator(part.size) }
> writer.println("repartition by column expr:" + dfpartscount.collect().toList)
> writer.close()
>Reporter: Ramesha Bhatta
>Priority: Major
>  Labels: performance
>
> Both repartitionByRange and  repartition(, )  resulting in wrong 
> distribution within the resulting partition.  
>  
> In the Range partition one of the partition has 2x volume and last one with 
> zero.  In repartition this is more problematic with some partition with 4x, 
> 2x the avg and many partitions with zero volume.  
>  
> This distribution imbalance can cause performance problem in a concurrent 
> environment.
> Details from testing in 3 different versions.
> |Version 2.3.2|Version 2.4.5|Version 3.0 Preview2|
> |Spark Version 2.3.2.3.1.4.0-315|Spark Version 2.4.5|Spark Version 
> 3.0.0-preview2|
> |Default Partition Length:2|Default Partition Length:2|Default Partition 
> Length:80|
> |Default Partition getNumPartitions:2|Default Partition 
> getNumPartitions:2|Default Partition getNumPartitions:80|
> |Default Partition groupBy spark_partition_id:200|Default Partition groupBy 
> spark_partition_id:200|Default Partition groupBy spark_partition_id:200|
> |repartitionByRange Length:24|repartitionByRange Length:24|repartitionByRange 
> Length:24|
> |repartitionByRange getNumPartitions:24|repartitionByRange 
> getNumPartitions:24|repartitionByRange getNumPartitions:24|
> |repartitionByRange groupBy spark_partition_id:200|repartitionByRange groupBy 
> spark_partition_id:200|repartitionByRange groupBy spark_partition_id:200|
> |repartitionByRange: List(83, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 
> 42, 42, 42, 42, 41, 41, 41, 41, 41, 41, 0)|repartitionByRange: List(83, 42, 
> 42, 42, 42, 

[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116462#comment-17116462
 ] 

Apache Spark commented on SPARK-31819:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28638

> Add a workaround for Java 8u251+ and update integration test cases
> --
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31819:


Assignee: Apache Spark

> Add a workaround for Java 8u251+ and update integration test cases
> --
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases

2020-05-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31819:


Assignee: (was: Apache Spark)

> Add a workaround for Java 8u251+ and update integration test cases
> --
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31819) Add a workaround for Java 8u251+ and update integration test cases

2020-05-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116461#comment-17116461
 ] 

Apache Spark commented on SPARK-31819:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28638

> Add a workaround for Java 8u251+ and update integration test cases
> --
>
> Key: SPARK-31819
> URL: https://issues.apache.org/jira/browse/SPARK-31819
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Tests
>Affects Versions: 2.4.6
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31786) Exception on submitting Spark-Pi to Kubernetes 1.17.3

2020-05-26 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116460#comment-17116460
 ] 

Dongjoon Hyun commented on SPARK-31786:
---

Okay. I'll create a PR for that, [~maver1ck].

> Exception on submitting Spark-Pi to Kubernetes 1.17.3
> -
>
> Key: SPARK-31786
> URL: https://issues.apache.org/jira/browse/SPARK-31786
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Maciej Bryński
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Hi,
> I'm getting exception when submitting Spark-Pi app to Kubernetes cluster.
> Kubernetes version: 1.17.3
> JDK version: openjdk version "1.8.0_252"
> Exception:
> {code}
>  ./bin/spark-submit --master k8s://https://172.31.23.60:8443 --deploy-mode 
> cluster --name spark-pi --conf 
> spark.kubernetes.container.image=spark-py:2.4.5 --conf 
> spark.kubernetes.executor.request.cores=0.1 --conf 
> spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf 
> spark.executor.instances=1 local:///opt/spark/examples/src/main/python/pi.py
> log4j:WARN No appenders could be found for logger 
> (io.fabric8.kubernetes.client.Config).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  
> for kind: [Pod]  with name: [null]  in namespace: [default]  failed.
> at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:337)
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:330)
> at 
> org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:141)
> at 
> org.apache.spark.deploy.k8s.submit.Client$$anonfun$run$2.apply(KubernetesClientApplication.scala:140)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:140)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:250)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:241)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.SocketException: Broken pipe (Write failed)
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
> at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
> at 
> sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:894)
> at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:865)
> at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
> at okio.Okio$1.write(Okio.java:79)
> at okio.AsyncTimeout$1.write(AsyncTimeout.java:180)
> at okio.RealBufferedSink.flush(RealBufferedSink.java:224)
> at okhttp3.internal.http2.Http2Writer.settings(Http2Writer.java:203)
> at 
> okhttp3.internal.http2.Http2Connection.start(Http2Connection.java:515)
> at 
>