[jira] [Commented] (SPARK-27051) Bump Jackson version to 2.9.8

2019-04-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829960#comment-16829960
 ] 

Apache Spark commented on SPARK-27051:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/24493

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Major
> Fix For: 3.0.0, 2.4.3
>
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need to 
> bump the Jackson dependency to 2.9.8.
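
For downstream builds that pin Jackson themselves, a minimal sbt-style sketch 
(illustrative only: Spark's own build manages this version in its Maven pom; 
the coordinates are the standard Jackson ones):

{code}
// Force the patched Jackson line across the dependency tree (sbt 1.x).
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.8",
  "com.fasterxml.jackson.core" % "jackson-core" % "2.9.8",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.9.8"
)
{code}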



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27051) Bump Jackson version to 2.9.8

2019-04-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829959#comment-16829959
 ] 

Apache Spark commented on SPARK-27051:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/24493

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Major
> Fix For: 3.0.0, 2.4.3
>
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs 
> ([https://github.com/FasterXML/jackson-databind/issues/2186]), so we need to 
> bump the Jackson dependency to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27601) Upgrade stream-lib to 2.9.6

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27601:


Assignee: Apache Spark

> Upgrade stream-lib to 2.9.6
> ---
>
> Key: SPARK-27601
> URL: https://issues.apache.org/jira/browse/SPARK-27601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
> native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 
> https://github.com/addthis/stream-lib/pull/98
> 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on 
> appropriately transformed encoded values; this way, boxing and unboxing of 
> integers is avoided. https://github.com/addthis/stream-lib/pull/97



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27601) Upgrade stream-lib to 2.9.6

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27601:


Assignee: (was: Apache Spark)

> Upgrade stream-lib to 2.9.6
> ---
>
> Key: SPARK-27601
> URL: https://issues.apache.org/jira/browse/SPARK-27601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
> native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 
> https://github.com/addthis/stream-lib/pull/98
> 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on 
> appropriately transformed encoded values; this way, boxing and unboxing of 
> integers is avoided. https://github.com/addthis/stream-lib/pull/97



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27602) SparkSQL CBO can't get true size of partition table after partition pruning

2019-04-29 Thread angerszhu (JIRA)
angerszhu created SPARK-27602:
-

 Summary: SparkSQL CBO can't get true size of partition table after 
partition pruning
 Key: SPARK-27602
 URL: https://issues.apache.org/jira/browse/SPARK-27602
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.0, 2.2.0
Reporter: angerszhu


While extracting the cost of a SQL query for my own cost framework, I found 
that the CBO can't get the true size of a partitioned table when partition 
pruning applies: we only need the corresponding partitions' sizes, but it just 
uses the whole table's statistics.
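
For illustration, a minimal Scala sketch of how to observe this (run in 
spark-shell, where {{spark}} is the session; the table and partition names are 
hypothetical):

{code}
// Hypothetical Hive table db.events, partitioned by dt.
spark.sql("SET spark.sql.cbo.enabled=true")

val pruned = spark.table("db.events").where("dt = '2019-04-29'")

// Even though the filter prunes to a single partition, sizeInBytes is
// derived from the table-level statistics, not from the pruned partitions.
println(pruned.queryExecution.optimizedPlan.stats.sizeInBytes)
{code}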



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27601) Upgrade stream-lib to 2.9.6

2019-04-29 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27601:

Description: 
1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 
https://github.com/addthis/stream-lib/pull/98
2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on appropriately 
transformed encoded values; this way, boxing and unboxing of integers is 
avoided. https://github.com/addthis/stream-lib/pull/97

  was:
1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
native arrays instead of ArrayLists to avoid boxing and unboxing of integers.
2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on appropriately 
transformed encoded values; this way, boxing and unboxing of integers is 
avoided.


> Upgrade stream-lib to 2.9.6
> ---
>
> Key: SPARK-27601
> URL: https://issues.apache.org/jira/browse/SPARK-27601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
> native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 
> https://github.com/addthis/stream-lib/pull/98
> 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on 
> appropriately transformed encoded values; this way, boxing and unboxing of 
> integers is avoided. https://github.com/addthis/stream-lib/pull/97



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27601) Upgrade stream-lib to 2.9.6

2019-04-29 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27601:
---

 Summary: Upgrade stream-lib to 2.9.6
 Key: SPARK-27601
 URL: https://issues.apache.org/jira/browse/SPARK-27601
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Yuming Wang


1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using 
native arrays instead of ArrayLists to avoid boxing and unboxing of integers.
2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on appropriately 
transformed encoded values; this way, boxing and unboxing of integers is 
avoided.
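
To make the boxing point concrete, a small standalone Scala sketch (not the 
stream-lib code itself): sorting a primitive array avoids the per-element 
java.lang.Integer objects that a java.util.ArrayList incurs.

{code}
import java.util.{ArrayList, Arrays, Collections}

object BoxingDemo {
  def main(args: Array[String]): Unit = {
    val n = 1000000

    // Boxed path: every element is a java.lang.Integer object, and the
    // sort keeps unboxing/boxing values while comparing them.
    val boxed = new ArrayList[Integer](n)
    var i = n
    while (i > 0) { boxed.add(i); i -= 1 }
    Collections.sort(boxed)

    // Primitive path: Array[Int] compiles to int[] on the JVM, so
    // Arrays.sort works directly on primitives with no boxing at all.
    val primitive: Array[Int] = Array.tabulate(n)(j => n - j)
    Arrays.sort(primitive)

    println(s"boxed head=${boxed.get(0)}, primitive head=${primitive(0)}")
  }
}
{code}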



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27525) Exclude commons-httpclient when interacting with different versions of the HiveMetastoreClient

2019-04-29 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27525:

Description: 
{noformat}
Error Message
java.lang.RuntimeException: [download failed: 
commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar]
Stacktrace
sbt.ForkMain$ForkError: java.lang.RuntimeException: [download failed: 
commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar]
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1321)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:41)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:370)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:41)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:57)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:242)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:238)
at 
org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.testAlterTable(Hive_2_1_DDLSuite.scala:123)
at 
org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.$anonfun$new$1(Hive_2_1_DDLSuite.scala:89)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(Hive_2_1_DDLSuite.scala:41)
at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at 
org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.runTest(Hive_2_1_DDLSuite.scala:41)
at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite.run(Suite.scala:1147)
at org.scalatest.Suite.run$(Suite.scala:1129)
at 

[jira] [Resolved] (SPARK-27525) Exclude commons-httpclient when interacting with different versions of the HiveMetastoreClient

2019-04-29 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-27525.
-
Resolution: Not A Problem

> Exclude commons-httpclient when interacting with different versions of the 
> HiveMetastoreClient 
> ---
>
> Key: SPARK-27525
> URL: https://issues.apache.org/jira/browse/SPARK-27525
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> Error Message
> java.lang.RuntimeException: [download failed: 
> commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar]
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.RuntimeException: [download failed: 
> commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar]
>   at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1321)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:41)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:370)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
>   at 
> scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:41)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:242)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:238)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.testAlterTable(Hive_2_1_DDLSuite.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.$anonfun$new$1(Hive_2_1_DDLSuite.scala:89)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(Hive_2_1_DDLSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.runTest(Hive_2_1_DDLSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at 

[jira] [Commented] (SPARK-27525) Exclude commons-httpclient when interacting with different versions of the HiveMetastoreClient

2019-04-29 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829880#comment-16829880
 ] 

Yuming Wang commented on SPARK-27525:
-

It's a Jenkins issue.

> Exclude commons-httpclient when interacting with different versions of the 
> HiveMetastoreClient 
> ---
>
> Key: SPARK-27525
> URL: https://issues.apache.org/jira/browse/SPARK-27525
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> Error Message
> java.lang.RuntimeException: [download failed: 
> commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar]
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.RuntimeException: [download failed: 
> commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar]
>   at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1321)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:41)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:370)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
>   at 
> scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:41)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:242)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:238)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.testAlterTable(Hive_2_1_DDLSuite.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.$anonfun$new$1(Hive_2_1_DDLSuite.scala:89)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(Hive_2_1_DDLSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at 
> org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.runTest(Hive_2_1_DDLSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>

[jira] [Created] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple hive servers share the same metastore

2019-04-29 Thread pin_zhang (JIRA)
pin_zhang created SPARK-27600:
-

 Summary: Unable to start Spark Hive Thrift Server when multiple 
hive servers share the same metastore
 Key: SPARK-27600
 URL: https://issues.apache.org/jira/browse/SPARK-27600
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: pin_zhang


When starting ten or more Spark Hive Thrift Servers at the same time, more 
than one version row is saved to the VERSION table when the following 
exception occurs:

WARN [DataNucleus.Query] (main:) Query for candidates of 
org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in 
no possible candidates
Exception thrown obtaining schema column information from datastore
org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown 
obtaining schema column information from datastore

Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 
'via_ms.deleteme1556239494724' doesn't exist
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
 at com.mysql.jdbc.Util.getInstance(Util.java:408)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
 at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
 at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
 at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2491)
 at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2449)
 at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1381)
 at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2441)
 at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2339)
 at com.mysql.jdbc.IterateBlock.doForAll(IterateBlock.java:50)
 at com.mysql.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2337)
 at 
org.apache.commons.dbcp.DelegatingDatabaseMetaData.getColumns(DelegatingDatabaseMetaData.java:218)
 at 
org.datanucleus.store.rdbms.adapter.BaseDatastoreAdapter.getColumns(BaseDatastoreAdapter.java:1532)
 at 
org.datanucleus.store.rdbms.schema.RDBMSSchemaHandler.refreshTableData(RDBMSSchemaHandler.java:921)

After that, the Hive server cannot start any more because of:
MetaException(message:Metastore contains multiple versions (2) 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27598:

Description: 
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored as a java.lang.invoke.SerializedLambda and 
cannot be assigned back to a Scala Function1.

Details of how to reproduce it here: 
[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific and is not expected to work anyway, as I 
don't see this issue with a normal jar.

Note that with Spark 2.3.3 this still does not work, but with a different 
error.

  was:
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored as a java.lang.invoke.SerializedLambda and 
cannot be assigned back to a Scala Function1.

Details of how to reproduce it here: 
[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific and is not expected to work anyway, as I 
don't see this issue with a normal jar.
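
For reference, a minimal spark-shell-style Scala sketch of the failing pattern 
(paths and batch interval are illustrative; the complete reproduction is in the 
gist above). The first run writes the checkpoint; a restarted shell reading it 
back fails with the ClassCastException shown above.

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/checkpoint"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(5)) // sc: the shell's SparkContext
  ssc.checkpoint(checkpointDir)
  // FileInputDStream holds a filter Function1 internally; under Scala 2.12
  // it round-trips through the checkpoint as a SerializedLambda.
  ssc.textFileStream("/tmp/input").map(_.length).print()
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
{code}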


> DStreams checkpointing does not work with Scala 2.12
> 
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Critical
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file 
> [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored as a java.lang.invoke.SerializedLambda 
> and cannot be assigned back to a Scala Function1.
> Details of how to reproduce it here: 
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific and is not expected to work anyway, as I 
> don't see this issue with a normal jar. 
> Note that with Spark 2.3.3 this still does not work, but with a different 
> error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27598:

Priority: Major  (was: Critical)

> DStreams checkpointing does not work with Scala 2.12
> 
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file 
> [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored as a java.lang.invoke.SerializedLambda 
> and cannot be assigned back to a Scala Function1.
> Details of how to reproduce it here: 
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific and is not expected to work anyway, as I 
> don't see this issue with a normal jar. 
> Note that with Spark 2.3.3 this still does not work, but with a different 
> error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker

2019-04-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829861#comment-16829861
 ] 

Bryan Cutler edited comment on SPARK-27548 at 4/30/19 12:46 AM:


This is not that easy to fix by itself. Since there is no call from Py4J after 
the initial socket is set up, when a task fails the thread serving the iterator 
to Python just terminates and the serializer stops. This ends up giving partial 
results that look fine because the error is never seen by the Python driver.

Because the serving thread runs asynchronously, the exception has to be caught 
and then transferred to Python after the thread is joined. This would require 
lots of changes, but is pretty easy to do with the changes from SPARK-23961, 
so I will add the fix there.


was (Author: bryanc):
This is not that easy to fix by itself. Since there is no call from Py4J after 
the initial socket is set up, when a task fails the thread serving the iterator 
to Python just terminates and the serializer stops, so the error is never seen 
by the Python driver. Because the serving thread runs asynchronously, the 
exception has to be caught and then transferred to Python after the thread is 
joined. This would require lots of changes, but is pretty easy to do with the 
changes from SPARK-23961, so I will add the fix there.
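
As a generic Scala sketch of that approach (illustrative only, not Spark's 
actual serving code): capture any exception raised on the serving thread and 
rethrow it once the thread is joined, so the consumer sees the failure instead 
of a silently truncated result.

{code}
import java.util.concurrent.atomic.AtomicReference

object ServeWithErrorPropagation {
  def main(args: Array[String]): Unit = {
    val serveError = new AtomicReference[Throwable](null)

    val serving = new Thread(() => {
      try {
        // Stand-in for the real work of serving partitions over a socket.
        throw new RuntimeException("task failed while serving iterator")
      } catch {
        case t: Throwable => serveError.set(t) // capture, don't swallow
      }
    })
    serving.start()
    serving.join()

    // After the join, rethrow so the caller sees the failure instead of a
    // quietly truncated result.
    Option(serveError.get()).foreach(t => throw t)
  }
}
{code}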

> PySpark toLocalIterator does not raise errors from worker
> -
>
> Key: SPARK-27548
> URL: https://issues.apache.org/jira/browse/SPARK-27548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Priority: Major
>
> When using a PySpark RDD local iterator and an error occurs on the worker, it 
> is not picked up by Py4J and raised in the Python driver process. So unless 
> you look at the logs, there is no way for the application to know the worker 
> had an error. This is a test that should pass if the error is raised in the 
> driver:
> {code}
> def test_to_local_iterator_error(self):
> def fail(_):
> raise RuntimeError("local iterator error")
> rdd = self.sc.parallelize(range(10)).map(fail)
> with self.assertRaisesRegexp(Exception, "local iterator error"):
> for _ in rdd.toLocalIterator():
> pass{code}
> but it does not raise an exception:
> {noformat}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 428, in main
>     process()
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 423, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 505, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
>     return f(*args, **kwargs)
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in 
> fail
>     raise RuntimeError("local iterator error")
> RuntimeError: local iterator error
>     at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> FAIL
> ==
> FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests)
> --
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in 
> test_to_local_iterator_error
>     pass
> AssertionError: Exception not raised{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker

2019-04-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829861#comment-16829861
 ] 

Bryan Cutler edited comment on SPARK-27548 at 4/30/19 12:46 AM:


This is not that easy to fix by itself. Since there is no call from Py4J after 
the initial socket is set up, when a task fails the thread serving the iterator 
to Python just terminates and the serializer stops. This ends up giving partial 
results that look fine because the error is never seen by the Python driver.

Because the serving thread runs asynchronously, the exception has to be caught 
and then transferred to Python after the thread is joined. This would require 
lots of modifications, but is pretty easy to do with the changes from 
SPARK-23961, so I will add the fix there.


was (Author: bryanc):
This is not that easy to fix by itself. Since there is no call from Py4J after 
the initial socket is set up, when a task fails the thread serving the iterator 
to Python just terminates and the serializer stops. This ends up giving partial 
results that look fine because the error is never seen by the Python driver.

Because the serving thread runs asynchronously, the exception has to be caught 
and then transferred to Python after the thread is joined. This would require 
lots of changes, but is pretty easy to do with the changes from SPARK-23961, 
so I will add the fix there.

> PySpark toLocalIterator does not raise errors from worker
> -
>
> Key: SPARK-27548
> URL: https://issues.apache.org/jira/browse/SPARK-27548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Priority: Major
>
> When using a PySpark RDD local iterator and an error occurs on the worker, it 
> is not picked up by Py4J and raised in the Python driver process. So unless 
> you look at the logs, there is no way for the application to know the worker 
> had an error. This is a test that should pass if the error is raised in the 
> driver:
> {code}
> def test_to_local_iterator_error(self):
> def fail(_):
> raise RuntimeError("local iterator error")
> rdd = self.sc.parallelize(range(10)).map(fail)
> with self.assertRaisesRegexp(Exception, "local iterator error"):
> for _ in rdd.toLocalIterator():
> pass{code}
> but it does not raise an exception:
> {noformat}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 428, in main
>     process()
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 423, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 505, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
>     return f(*args, **kwargs)
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in 
> fail
>     raise RuntimeError("local iterator error")
> RuntimeError: local iterator error
>     at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> FAIL
> ==
> FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests)
> --
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in 
> test_to_local_iterator_error
>     pass
> AssertionError: Exception not raised{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker

2019-04-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829861#comment-16829861
 ] 

Bryan Cutler commented on SPARK-27548:
--

This is not that easy to fix by itself. Since there is no call from Py4J after 
the initial socket is set up, when a task fails the thread serving the iterator 
to Python just terminates and the serializer stops, so the error is never seen 
by the Python driver. Because the serving thread runs asynchronously, the 
exception has to be caught and then transferred to Python after the thread is 
joined. This would require lots of changes, but is pretty easy to do with the 
changes from SPARK-23961, so I will add the fix there.

> PySpark toLocalIterator does not raise errors from worker
> -
>
> Key: SPARK-27548
> URL: https://issues.apache.org/jira/browse/SPARK-27548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Priority: Major
>
> When using a PySpark RDD local iterator and an error occurs on the worker, it 
> is not picked up by Py4J and raised in the Python driver process. So unless 
> you look at the logs, there is no way for the application to know the worker 
> had an error. This is a test that should pass if the error is raised in the 
> driver:
> {code}
> def test_to_local_iterator_error(self):
> def fail(_):
> raise RuntimeError("local iterator error")
> rdd = self.sc.parallelize(range(10)).map(fail)
> with self.assertRaisesRegexp(Exception, "local iterator error"):
> for _ in rdd.toLocalIterator():
> pass{code}
> but it does not raise an exception:
> {noformat}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 428, in main
>     process()
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 423, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 505, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
>     return f(*args, **kwargs)
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in 
> fail
>     raise RuntimeError("local iterator error")
> RuntimeError: local iterator error
>     at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> FAIL
> ==
> FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests)
> --
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in 
> test_to_local_iterator_error
>     pass
> AssertionError: Exception not raised{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27588) Fail fast if binary file data source will load a file that is bigger than 2GB

2019-04-29 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-27588.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24483
[https://github.com/apache/spark/pull/24483]

> Fail fast if binary file data source will load a file that is bigger than 2GB
> -
>
> Key: SPARK-27588
> URL: https://issues.apache.org/jira/browse/SPARK-27588
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> We use BinaryType to store file content. If a file is bigger than 2GB, we 
> should fail the job without trying to load the file, which only increases the 
> time spent waiting for the error.
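
A rough Scala sketch of the kind of guard this implies (illustrative only, not 
the actual binary file source code): check the file length against the 2GB 
array limit before reading anything.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkException

// Illustrative guard, not the actual data source implementation.
def checkBinaryFileSize(pathStr: String): Unit = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(new Configuration())
  val length = fs.getFileStatus(path).getLen
  if (length > Int.MaxValue) {
    throw new SparkException(
      s"File $pathStr is $length bytes, exceeding the 2GB limit of BinaryType")
  }
}
{code}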



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16367) Wheelhouse Support for PySpark

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16367.

Resolution: Duplicate

This is somewhat similar to SPARK-13587 so let's keep the discussion in one 
place.

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: gsemet
>Priority: Major
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy packages written in Scala, it is recommended to build big fat jar 
> files. This puts all dependencies in one package, so the only "cost" is the 
> time to copy this file to every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to ask IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as 
> "wheel" packages. The Python community strongly advocates this way of 
> packaging and distributing Python applications as the standard way of 
> deploying a Python app. In other words, this is the "Pythonic Way of 
> Deployment".
> *Previous approaches* 
> I based the current proposal on the two following issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge the two, in order to support both 
> wheel installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. For example, look at https://pypi.python.org/pypi/numpy to see 
> all the different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). In other words, a package that requires compilation of a C 
> module, for instance numpy, does *not* compile anything when installed from 
> a wheel file. 
> pypi.python.org already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile the package from source anyway. 
> Mirroring of PyPI is possible through projects such as 
> http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in 
> Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate the wheels of all 
> packages used by a given project inside a "virtualenv". This is called a 
> "wheelhouse". You can even skip the compilation and retrieve the wheels 
> directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then ship the complete "wheelhouse": 
> - you are writing a PySpark script that grows in size and dependencies; 
> deploying it on Spark, for example, requires building numpy or Theano and 
> other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions; 
> you can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or packages. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it 

[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-13587:
--

Assignee: Marcelo Vanzin

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, 
> and not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13587) Support virtualenv in PySpark

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-13587:
---
Target Version/s:   (was: 3.0.0)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, 
> and not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-13587:
--

Assignee: (was: Marcelo Vanzin)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, 
> and not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27547) fix DataFrame self-join problems

2019-04-29 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27547:
--
Labels: correctness  (was: )

> fix DataFrame self-join problems
> 
>
> Key: SPARK-27547
> URL: https://issues.apache.org/jira/browse/SPARK-27547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27599) DataFrameWriter$partitionBy should be optional when writing to a hive table

2019-04-29 Thread Nick Dimiduk (JIRA)
Nick Dimiduk created SPARK-27599:


 Summary: DataFrameWriter$partitionBy should be optional when 
writing to a hive table
 Key: SPARK-27599
 URL: https://issues.apache.org/jira/browse/SPARK-27599
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Nick Dimiduk


When writing to an existing, partitioned table stored in the Hive metastore, 
Spark requires the call to {{saveAsTable}} to provide a value for 
{{partitionBy}}, even though that information is provided by the metastore 
itself. Indeed, that information is available to Spark, as it will error if the 
specified {{partitionBy}} does not match that of the table definition in the 
metastore.

There may be other attributes of the save call that can be retrieved from the 
metastore...
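
For concreteness, a small Scala sketch of the redundancy described (table and 
column names are hypothetical):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// db.events already exists in the Hive metastore, partitioned by dt.
val df = spark.table("staging.events_batch")

// Even though the metastore records the partitioning, the write must
// restate it via partitionBy for saveAsTable to succeed:
df.write
  .mode("append")
  .format("hive")
  .partitionBy("dt") // redundant with the table definition in the metastore
  .saveAsTable("db.events")
{code}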



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27598:

Description: 
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored as a java.lang.invoke.SerializedLambda and 
cannot be assigned back to a Scala Function1.

Details of how to reproduce it here: 
[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific and is not expected to work anyway, as I 
don't see this issue with a normal jar.

  was:
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored in SerializedLambda form and cannot be 
assigned back to a Scala Function1.

Details of how to reproduce it here: 
[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific and is not expected to work anyway. 


> DStreams checkpointing does not work with Scala 2.12
> 
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Critical
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file 
> [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored in SerializedLambda form and cannot be 
> assigned back to a Scala Function1.
> Details of how to reproduce it here: 
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific and is not expected to work anyway, as I 
> don't see this being an issue with a normal jar. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27599) DataFrameWriter.partitionBy should be optional when writing to a hive table

2019-04-29 Thread Nick Dimiduk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk updated SPARK-27599:
-
Summary: DataFrameWriter.partitionBy should be optional when writing to a 
hive table  (was: DataFrameWriter$partitionBy should be optional when writing 
to a hive table)

> DataFrameWriter.partitionBy should be optional when writing to a hive table
> ---
>
> Key: SPARK-27599
> URL: https://issues.apache.org/jira/browse/SPARK-27599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Nick Dimiduk
>Priority: Minor
>
> When writing to an existing, partitioned table stored in the Hive metastore, 
> Spark requires the call to {{saveAsTable}} to provide a value for 
> {{partitionBy}}, even though that information is provided by the metastore 
> itself. Indeed, that information is available to Spark, as it will error if 
> the specified {{partitionBy}} does not match the table definition in the 
> metastore.
> There may be other attributes of the save call that can be retrieved from the 
> metastore...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27598:

Priority: Critical  (was: Blocker)

> DStreams checkpointing does not work with Scala 2.12
> 
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Critical
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file 
> [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored in SerializedLambda form and cannot be 
> assigned back to a Scala Function1.
> Details of how to reproduce it here: 
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific and is not expected to work anyway. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27598:

Description: 
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored in SerializedLambda form and cannot be 
assigned back to a Scala Function1.

Details of how to reproduce it here: 
[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific and is not expected to work anyway. 

  was:
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored in SerializedLambda form and cannot be 
assigned back to a Scala Function1.

Details of how to reproduce it here:

[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific.


> DStreams checkpointing does not work with Scala 2.12
> 
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file 
> [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored in SerializedLambda form and cannot be 
> assigned back to a Scala Function1.
> Details of how to reproduce it here: 
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific and is not expected to work anyway. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27598:

Description: 
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file 
[file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
 java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored in SerializedLambda form and cannot be 
assigned back to a Scala Function1.

Details of how to reproduce it here:

[https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]

Maybe this is spark-shell specific.

  was:
When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file file:/tmp/checkpoint/checkpoint-155656695.bk
java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored in SerializedLambda form and cannot be 
assigned back to a Scala Function1.

Details of how to reproduce it here:

https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6


> DStreams checkpointing does not work with Scala 2.12
> 
>
> Key: SPARK-27598
> URL: https://issues.apache.org/jira/browse/SPARK-27598
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
>
> When I restarted a stream with checkpointing enabled I got this:
> {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
> file 
> [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk]
>  java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
> scala.Function1 in instance of 
> org.apache.spark.streaming.dstream.FileInputDStream
>  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
>  at 
> org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {quote}
> It seems that the closure is stored in SerializedLambda form and cannot be 
> assigned back to a Scala Function1.
> Details of how to reproduce it here:
> [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6]
> Maybe this is spark-shell specific.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12

2019-04-29 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-27598:
---

 Summary: DStreams checkpointing does not work with Scala 2.12
 Key: SPARK-27598
 URL: https://issues.apache.org/jira/browse/SPARK-27598
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Stavros Kontopoulos


When I restarted a stream with checkpointing enabled I got this:
{quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from 
file file:/tmp/checkpoint/checkpoint-155656695.bk
java.io.IOException: java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.streaming.dstream.FileInputDStream.filter of type 
scala.Function1 in instance of 
org.apache.spark.streaming.dstream.FileInputDStream
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322)
 at 
org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{quote}
It seems that the closure is stored in SerializedLambda form and cannot be 
assigned back to a Scala Function1.

Details of how to reproduce it here:

https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6
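
A minimal sketch of the restart pattern that triggers this (paths are 
illustrative; the linked gist has the full details):

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRepro {
  val checkpointDir = "/tmp/checkpoint" // illustrative path

  // textFileStream builds a FileInputDStream whose filter closure is a lambda
  // under Scala 2.12; on restore it comes back as a SerializedLambda.
  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("repro"), Seconds(5))
    ssc.checkpoint(checkpointDir)
    ssc.textFileStream("/tmp/in").print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // The first run writes checkpoints; a restart restores them and fails as above.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}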



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-29 Thread Robert Joseph Evans (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829631#comment-16829631
 ] 

Robert Joseph Evans commented on SPARK-27396:
-

I have updated this SPIP to clarify some things and to reduce its scope.

We are no longer trying to tackle anything around data movement to/from AI/ML 
systems.

We are no longer trying to expose any Arrow APIs or formatting to end users.  
To avoid any stability issues with either the Arrow API or the Arrow memory 
layout specification, we are going to split that part off as a separate SPIP 
that can be added later.  We will work with the Arrow community to see what, 
if any, stability guarantees we can get before putting that SPIP forward.

The only APIs that we are going to expose in this first SPIP are through the 
Spark SQL extensions config/API, so that groups that want to do accelerated 
columnar ETL have an API for it, even if it is an unstable API.

 

Please take another look, and if there is not much in the way of comments we 
will put it up for another vote.
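
For reference, a hedged sketch of the extensions entry point this builds on 
(the class and the injected rule are hypothetical; the columnar hook itself is 
what this SPIP proposes):

{code:java}
import org.apache.spark.sql.SparkSessionExtensions

// Registered via: --conf spark.sql.extensions=com.example.MyColumnarExtensions
class MyColumnarExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // Existing hooks, e.g. ext.injectPlannerStrategy(...), already let advanced
    // users reach the physical plan; the SPIP adds a columnar-rule hook here.
  }
}
{code}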

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *SPIP: Columnar Processing Without Arrow Formatting Guarantees.*
>  
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to
>  # Add to the current sql extensions mechanism so advanced users can have 
> access to the physical SparkPlan and manipulate it to provide columnar 
> processing for existing operators, including shuffle.  This will allow them 
> to implement their own cost based optimizers to decide when processing should 
> be columnar and when it should not.
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the users so operations that are not columnar see the 
> data as rows, and operations that are columnar see the data as columns.
>  
> Not Requirements, but things that would be nice to have.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  
> *Q2.* What problem is this proposal NOT designed to solve? 
> The goal of this is not for ML/AI but to provide APIs for accelerated 
> computing in Spark primarily targeting SQL/ETL like workloads.  ML/AI already 
> have several mechanisms to get data into/out of them. These can be improved 
> but will be covered in a separate SPIP.
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation.
> This does not cover exposing the underlying format of the data.  The only way 
> to get at the data in a ColumnVector is through the public APIs.  Exposing 
> the underlying format to improve efficiency will be covered in a separate 
> SPIP.
> This is not trying to implement new ways of transferring data to external 
> ML/AI applications.  That is covered by separate SPIPs already.
> This is not trying to add in generic code generation for columnar processing. 
>  Currently code generation for columnar processing is only supported when 
> translating columns to rows.  We will continue to support this, but will not 
> extend it as a general solution. That will be covered in a separate SPIP if 
> we find it is helpful.  For now columnar processing will be interpreted.
> This is not trying to expose a way to get columnar data into Spark through 
> DataSource V2 or any other similar API.  That would be covered by a separate 
> SPIP if we find it is needed.
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Internal implementations of FileFormats, optionally can return a 
> ColumnarBatch instead of rows.  The code generation phase knows how to take 
> that columnar data and iterate through it as rows for stages that want rows, 
> which currently is almost everything.  The limitations here are mostly 
> implementation specific. The current standard is to abuse Scala’s type 
> erasure to return ColumnarBatches as the elements of an 

[jira] [Updated] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-29 Thread Robert Joseph Evans (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-27396:

Description: 
*SPIP: Columnar Processing Without Arrow Formatting Guarantees.*

 

*Q1.* What are you trying to do? Articulate your objectives using absolutely no 
jargon.

The Dataset/DataFrame API in Spark currently only exposes to users one row at a 
time when processing data.  The goals of this are to
 # Add to the current sql extensions mechanism so advanced users can have 
access to the physical SparkPlan and manipulate it to provide columnar 
processing for existing operators, including shuffle.  This will allow them to 
implement their own cost based optimizers to decide when processing should be 
columnar and when it should not.
 # Make any transitions between the columnar memory layout and a row based 
layout transparent to the users so operations that are not columnar see the 
data as rows, and operations that are columnar see the data as columns.

 

Not Requirements, but things that would be nice to have.
 # Transition the existing in memory columnar layouts to be compatible with 
Apache Arrow.  This would make the transformations to Apache Arrow format a 
no-op. The existing formats are already very close to those layouts in many 
cases.  This would not be using the Apache Arrow java library, but instead 
being compatible with the memory 
[layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
subset of that layout.

 

*Q2.* What problem is this proposal NOT designed to solve? 

The goal of this is not for ML/AI but to provide APIs for accelerated computing 
in Spark primarily targeting SQL/ETL like workloads.  ML/AI already have 
several mechanisms to get data into/out of them. These can be improved but will 
be covered in a separate SPIP.

This is not trying to implement any of the processing itself in a columnar way, 
with the exception of examples for documentation.

This does not cover exposing the underlying format of the data.  The only way 
to get at the data in a ColumnVector is through the public APIs.  Exposing the 
underlying format to improve efficiency will be covered in a separate SPIP.

This is not trying to implement new ways of transferring data to external ML/AI 
applications.  That is covered by separate SPIPs already.

This is not trying to add in generic code generation for columnar processing.  
Currently code generation for columnar processing is only supported when 
translating columns to rows.  We will continue to support this, but will not 
extend it as a general solution. That will be covered in a separate SPIP if we 
find it is helpful.  For now columnar processing will be interpreted.

This is not trying to expose a way to get columnar data into Spark through 
DataSource V2 or any other similar API.  That would be covered by a separate 
SPIP if we find it is needed.

 

*Q3.* How is it done today, and what are the limits of current practice?

The current columnar support is limited to 3 areas.
 # Internal implementations of FileFormats, optionally can return a 
ColumnarBatch instead of rows.  The code generation phase knows how to take 
that columnar data and iterate through it as rows for stages that want rows, 
which currently is almost everything.  The limitations here are mostly 
implementation specific. The current standard is to abuse Scala’s type erasure 
to return ColumnarBatches as the elements of an RDD[InternalRow]. The code 
generation can handle this because it is generating java code, so it bypasses 
scala’s type checking and just casts the InternalRow to the desired 
ColumnarBatch.  This makes it difficult for others to implement the same 
functionality for different processing because they can only do it through code 
generation. There really is no clean separate path in the code generation for 
columnar vs row based. Additionally, because it is only supported through code 
generation if for any reason code generation would fail there is no backup.  
This is typically fine for input formats but can be problematic when we get 
into more extensive processing.
 # When caching data it can optionally be cached in a columnar format if the 
input is also columnar.  This is similar to the first area and has the same 
limitations because the cache acts as an input, but it is the only piece of 
code that also consumes columnar data as an input.
 # Pandas vectorized processing.  To be able to support Pandas UDFs Spark will 
build up a batch of data and send it to python for processing, and then get a 
batch of data back as a result.  The format of the data being sent to python 
can either be pickle, which is the default, or optionally Arrow. The result 
returned is the same format. The limitations here really are around 
performance.  Transforming the data back and forth can be very expensive.
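
A hedged sketch of the "type erasure abuse" described in item 1 of the list 
above (the method name is illustrative):

{code:java}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// The cast is a runtime no-op because RDD's element type is erased; the
// generated java code later casts each "InternalRow" back to ColumnarBatch.
def smuggle(batches: RDD[ColumnarBatch]): RDD[InternalRow] =
  batches.asInstanceOf[RDD[InternalRow]]
{code}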

 

*Q4.* What is new in 

[jira] [Commented] (SPARK-27597) RuntimeConfig should be serializable

2019-04-29 Thread Nick Dimiduk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829612#comment-16829612
 ] 

Nick Dimiduk commented on SPARK-27597:
--

It would be nice if there were an API built into the {{FunctionalInterface}} 
that let implementations access the fully populated {{SparkSession}}.

> RuntimeConfig should be serializable
> 
>
> Key: SPARK-27597
> URL: https://issues.apache.org/jira/browse/SPARK-27597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Nick Dimiduk
>Priority: Major
>
> When implementing a UDF or similar, it's quite surprising to see that 
> {{SparkSession}} is {{Serializable}} but {{RuntimeConfig}} is not. When 
> modeling UDFs in an object-oriented way, this surfaces as an ugly NPE at the 
> {{call}} site.
> {noformat}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:143)
> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
> at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:170)
> at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:170)
> ...{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27597) RuntimeConfig should be serializable

2019-04-29 Thread Nick Dimiduk (JIRA)
Nick Dimiduk created SPARK-27597:


 Summary: RuntimeConfig should be serializable
 Key: SPARK-27597
 URL: https://issues.apache.org/jira/browse/SPARK-27597
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Nick Dimiduk


When implementing a UDF or similar, it's quite surprising to see that 
{{SparkSession}} is {{Serializable}} but {{RuntimeConfig}} is not. When 
modeling UDFs in an object-oriented way, this surfaces as an ugly NPE at the 
{{call}} site.
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:143)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:170)
at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:170)
...{noformat}
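
A minimal sketch of the pattern that trips this (class name and config key are 
hypothetical): the captured session deserializes on executors with a null 
sessionState, so the lazy {{conf}} blows up inside {{call}}.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1

class MyUdf(session: SparkSession) extends UDF1[String, String] {
  override def call(s: String): String = {
    // NPE here on executors: session.conf needs sessionState, which is not
    // restored when the session is deserialized.
    s + session.conf.get("spark.my.flag", "")
  }
}
{code}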



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'

2019-04-29 Thread Dmitry Goldenberg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829555#comment-16829555
 ] 

Dmitry Goldenberg commented on SPARK-27567:
---

Hi Liang-Chi,

So you must be referring to this comment from that post, I gather:
{noformat}
This is expected behaviour. You have requested that each topic be stored on one 
machine by setting ReplicationFactor to one. When the one machine that happens 
to store the topic normalized-tenant4 is taken down, the consumer cannot find 
the leader of the topic.

See http://kafka.apache.org/documentation.html#intro_guarantees.{noformat}
I believe we do indeed have the replication factor set to 1 for Kafka in this 
particular cluster I'm looking at.

Could it possibly be that the node that happens to hold particular segments of 
Kafka data becomes intermittently inaccessible (e.g. due to a network hiccup) 
and then the consumer is not able to retrieve the data from Kafka? And then the 
issue is really one of Kafka configuration? It sounds like if we increase the 
replication factor on the Kafka side the issue on the Spark side may go away. 
Agree / disagree?

In any case, the error "Couldn't find leaders for Set" is not very direct 
about what the underlying causes may be. Perhaps this can be improved in Spark 
Streaming. At the very least, this Jira may help others.

I'll look into increasing the replication factor on the Kafka side and update 
this ticket.
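
For the record, a hedged sketch of what raising the replication factor at 
topic-creation time looks like with the Kafka 0.8.x Scala API (the ZooKeeper 
address and partition count are illustrative):

{code:java}
import kafka.admin.AdminUtils
import kafka.utils.ZKStringSerializer
import org.I0Itec.zkclient.ZkClient

// Replication factor 3 means a single broker outage no longer leaves
// partitions without a leader.
val zkClient = new ZkClient("zk1:2181", 30000, 30000, ZKStringSerializer)
AdminUtils.createTopic(zkClient, "hdfs.hbase.acme.attachments", 64, 3)
{code}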

> Spark Streaming consumers (from Kafka) intermittently die with 
> 'SparkException: Couldn't find leaders for Set'
> --
>
> Key: SPARK-27567
> URL: https://issues.apache.org/jira/browse/SPARK-27567
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
> Environment: GCP / 170~14.04.1-Ubuntu
>Reporter: Dmitry Goldenberg
>Priority: Major
>
> Some of our consumers intermittently die with the stack traces I'm including. 
> Once restarted they run for a while then die again.
> I can't find any cohesive documentation on what this error means and how to 
> go about troubleshooting it. Any help would be appreciated.
> *Kafka version* is 0.8.2.1 (2.10-0.8.2.1).
> Some of the errors seen look like this:
> {noformat}
> ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on 
> 10.150.0.54: remote Rpc client disassociated{noformat}
> Main error stack trace:
> {noformat}
> 2019-04-23 20:36:54,323 ERROR 
> org.apache.spark.streaming.scheduler.JobScheduler: Error g
> enerating jobs for time 1556066214000 ms
> org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: 
> Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], 
> [hdfs.hbase.acme.attachmen
> ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], 
> [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], 
> [hdfs.hbase.acme.attachme
> nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], 
> [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], 
> [hdfs.hbase.acme.attachme
> nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], 
> [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], 
> [hdfs.hbase.acme.attach
> ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], 
> [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], 
> [hdfs.hbase.acme.attac
> hments,29], [hdfs.hbase.acme.attachments,33], 
> [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], 
> [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att
> achments,21], [hdfs.hbase.acme.attachments,3], 
> [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], 
> [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at
> tachments,61]))
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j
> ar:?]
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145)
>  ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
>  ~[spark-assembly-1.5.0-hadoop2.4.0.ja
> r:1.5.0]
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) 
> ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?]
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>  

[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2019-04-29 Thread Tao Luo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829545#comment-16829545
 ] 

Tao Luo commented on SPARK-21492:
-

The problem is that the task won't complete because of the leaked memory (you 
can see this from the simple example above).
Secondly, it's not just the last page; it's every page holding records from 
unused iterators. 
Can we increase the priority of this bug? SMJ is a pretty integral part of 
Spark SQL, and it seems like no progress is being made on this bug, which is 
causing jobs to fail and has no workaround. 

I don't think that it's a hack: the argument seems to be that limit also needs 
to be fixed, so let's not fix this bug until that is also fixed. Meanwhile, 
this issue has been lingering since at least July 2017. 
This would fix a memory leak and improve performance by not spilling 
unnecessarily. Why don't we target this fix for SMJ first, since it's pretty 
isolated to UnsafeExternalRowIterator in SMJ, run it through all the test 
cases, and make additional changes as necessary in the future?

I've been porting [this PR|https://github.com/apache/spark/pull/23762] onto my 
production Spark cluster for the last 3 months, but I'm hoping we can get some 
sort of fix into 3.0 at least.

I started a discussion thread here, hopefully people can jump in:
http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-leak-in-SortMergeJoin-td27152.html
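
A hedged sketch of the failing shape (sizes are illustrative): the join's 
output iterator is never exhausted, so each task's sort memory is held until 
the task ends.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("smj-leak").getOrCreate()
// Force a sort-merge join rather than a broadcast join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.range(0, 10000000L).toDF("id")
val right = spark.range(0, 10000000L).toDF("id")

// limit() stops consuming the join iterators early, leaving sort pages
// allocated; with enough concurrent tasks this shows up as spills or OOM.
left.join(right, "id").limit(10).collect()
{code}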


> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0
>Reporter: Zhan Zhang
>Priority: Major
>
> In SortMergeJoin, if the iterator is not exhausted, there will be memory leak 
> caused by the Sort. The memory is not released until the task end, and cannot 
> be used by other operators causing performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27575) Spark overwrites existing value of spark.yarn.dist.* instead of merging value

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27575.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24465
[https://github.com/apache/spark/pull/24465]

> Spark overwrites existing value of spark.yarn.dist.* instead of merging value
> -
>
> Key: SPARK-27575
> URL: https://issues.apache.org/jira/browse/SPARK-27575
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> If we specify the `--files` arg when submitting an app whose configuration 
> already has "spark.yarn.dist.files", SparkSubmit replaces the existing value 
> in the configuration with the files provided in the arg instead of merging 
> them.
> The same issue also happens with "spark.yarn.dist.pyFiles", 
> "spark.yarn.dist.jars", and "spark.yarn.dist.archives".
>  
> While I encountered the issue in Spark 2.4.0, I can see it in the codebase on 
> the master branch and on branch-2.3 as well.
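
A hedged sketch of the expected merge semantics (names are illustrative, not 
SparkSubmit's actual code):

{code:java}
import org.apache.spark.SparkConf

// Merge the --files argument into an existing spark.yarn.dist.files value
// instead of replacing it.
def mergeDistFiles(conf: SparkConf, filesArg: String): Unit = {
  val existing = conf.getOption("spark.yarn.dist.files")
    .toSeq.flatMap(_.split(",")).filter(_.nonEmpty)
  val fromArgs = filesArg.split(",").toSeq.filter(_.nonEmpty)
  conf.set("spark.yarn.dist.files", (existing ++ fromArgs).distinct.mkString(","))
}
{code}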



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27575) Spark overwrites existing value of spark.yarn.dist.* instead of merging value

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27575:
--

Assignee: Jungtaek Lim

> Spark overwrites existing value of spark.yarn.dist.* instead of merging value
> -
>
> Key: SPARK-27575
> URL: https://issues.apache.org/jira/browse/SPARK-27575
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> If we specify the `--files` arg when submitting an app whose configuration 
> already has "spark.yarn.dist.files", SparkSubmit replaces the existing value 
> in the configuration with the files provided in the arg instead of merging 
> them.
> The same issue also happens with "spark.yarn.dist.pyFiles", 
> "spark.yarn.dist.jars", and "spark.yarn.dist.archives".
>  
> While I encountered the issue in Spark 2.4.0, I can see it in the codebase on 
> the master branch and on branch-2.3 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23014) Migrate MemorySink fully to v2

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23014.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24403
[https://github.com/apache/spark/pull/24403]

> Migrate MemorySink fully to v2
> --
>
> Key: SPARK-23014
> URL: https://issues.apache.org/jira/browse/SPARK-23014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> There's already a MemorySinkV2, but its use is controlled by a flag. We need 
> to remove the V1 sink and always use the V2 one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23014) Migrate MemorySink fully to v2

2019-04-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23014:
--

Assignee: Gabor Somogyi

> Migrate MemorySink fully to v2
> --
>
> Key: SPARK-23014
> URL: https://issues.apache.org/jira/browse/SPARK-23014
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Gabor Somogyi
>Priority: Major
>
> There's already a MemorySinkV2, but its use is controlled by a flag. We need 
> to remove the V1 sink and always use the V2 one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27596) The JDBC 'query' option doesn't work for Oracle database

2019-04-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-27596:
---

 Summary: The JDBC 'query' option doesn't work for Oracle database
 Key: SPARK-27596
 URL: https://issues.apache.org/jira/browse/SPARK-27596
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.2
Reporter: Xiao Li
Assignee: Dilip Biswal


For the JDBC option `query`, we generate a subquery alias that starts with an 
underscore: s"(${subquery}) 
__SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}". This is not 
supported by Oracle. 

Oracle doesn't seem to support identifier names that start with a 
non-alphabetic character (unless quoted) and has length restrictions as well.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm

{code:java}
Nonquoted identifiers must begin with an alphabetic character from your 
database character set. Quoted identifiers can begin with any character as per 
below documentation - 
Nonquoted identifiers can contain only alphanumeric characters from your 
database character set and the underscore (_), dollar sign ($), and pound sign 
(#). Database links can also contain periods (.) and "at" signs (@). Oracle 
strongly discourages you from using $ and # in nonquoted identifiers.
{code}

The alias name '__SPARK_GEN_JDBC_SUBQUERY_NAME_' should be fixed to remove the 
"__" prefix (or be quoted; not sure if that may impact other sources) to make 
it work for Oracle. The length should also be limited, as even after removing 
the prefix it hits the error below.

{code:java}
java.sql.SQLSyntaxErrorException: ORA-00972: identifier is too long 
{code}

It can be verified using the sqlfiddle link below.

http://www.sqlfiddle.com/#!4/9bbe9a/10050
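
For context, a hedged sketch of the read that generates the failing alias 
(connection details are illustrative):

{code:java}
// Spark wraps the query as "(...) __SPARK_GEN_JDBC_SUBQUERY_NAME_0", and
// Oracle rejects the alias because it starts with a non-alphabetic character.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/XE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("query", "select owner, object_name from all_objects")
  .load()
{code}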





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27571) Spark 3.0 build warnings: reflectiveCalls edition

2019-04-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27571.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24463
[https://github.com/apache/spark/pull/24463]

> Spark 3.0 build warnings: reflectiveCalls edition
> -
>
> Key: SPARK-27571
> URL: https://issues.apache.org/jira/browse/SPARK-27571
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> We have a few places in the code where {{scala.lang.reflectiveCalls}} is 
> imported to avoid warnings about, well, reflective calls in Scala. This 
> usually comes up in tests where an anonymous inner class has been defined 
> with additional methods, which actually can't be accessed directly without 
> reflection. This is a little undesirable and easy enough to avoid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27595) Spark couldn't read partitioned (string type) Orc column correctly if the value contains Float/Double value

2019-04-29 Thread Ameer Basha Pattan (JIRA)
Ameer Basha Pattan created SPARK-27595:
--

 Summary: Spark couldn't read partitioned (string type) Orc column 
correctly if the value contains Float/Double value
 Key: SPARK-27595
 URL: https://issues.apache.org/jira/browse/SPARK-27595
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Ameer Basha Pattan


create external table unique_keys (
key string
,locator_id string
, create_date string
, sequence int
)
partitioned by (source string)
stored as orc location '/user/hive/warehouse/reservation.db/unique_keys';

/user/hive/warehouse/reservation.db/unique_keys contains data like below:

/user/hive/warehouse/reservation.db/unique_keys/source=6S
/user/hive/warehouse/reservation.db/unique_keys/source=7F
/user/hive/warehouse/reservation.db/unique_keys/source=7H
/user/hive/warehouse/reservation.db/unique_keys/source=8D

 

If I try to read the orc files through Spark, 

{code:java}
val masterDF = 
hiveContext.read.orc("/user/hive/warehouse/reservation.db/unique_keys")
{code}

the source value gets changed to *7.0 and 8.0* for 7F and 8D respectively.
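
A hedged workaround sketch: the config key below exists in Spark 1.x/2.x, and 
the behavior matches partition type inference parsing values like "7F" and 
"8D" as Java float/double literals.

{code:java}
// Keep partition values as strings instead of letting Spark infer numeric types.
hiveContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val masterDF =
  hiveContext.read.orc("/user/hive/warehouse/reservation.db/unique_keys")
{code}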

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27536) Code improvements for 3.0: existentials edition

2019-04-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27536.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24431
[https://github.com/apache/spark/pull/24431]

> Code improvements for 3.0: existentials edition
> ---
>
> Key: SPARK-27536
> URL: https://issues.apache.org/jira/browse/SPARK-27536
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
>
> The Spark code base makes use of 'existential types' in Scala, a language 
> feature which is quasi-deprecated -- it generates a warning unless 
> scala.language.existentials is imported, and there is talk of removing it 
> from future Scala versions: 
> https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785
> We can get rid of most usages of this feature with lots of minor changes to 
> the code. A PR is coming to demonstrate what's involved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT

2019-04-29 Thread Artem Kalchenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829379#comment-16829379
 ] 

Artem Kalchenko commented on SPARK-27591:
-

yes, I will later today

> A bug in UnivocityParser prevents using UDT
> ---
>
> Key: SPARK-27591
> URL: https://issues.apache.org/jira/browse/SPARK-27591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Artem Kalchenko
>Priority: Minor
>
> I am trying to define a UserDefinedType based on String but different from 
> StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am 
> doing something incorrectly.
> I define my type as follows:
> {code:java}
> class MyType extends UserDefinedType[MyValue] {
>   override def sqlType: DataType = StringType
>   ...
> }
> @SQLUserDefinedType(udt = classOf[MyType])
> case class MyValue
> {code}
> I expect it to be read and stored as String with just a custom SQL type. In 
> fact Spark can't read the string at all:
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11
>  cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
> {code}
> The problem is that UnivocityParser.makeConverter doesn't return a (String 
> => Any) function but a (String => (String => Any)) one in the case of UDT; 
> see UnivocityParser:184
> {code:java}
> case udt: UserDefinedType[_] => (datum: String) =>
>   makeConverter(name, udt.sqlType, nullable, options)
> {code}
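
A sketch of the one-line fix implied above (in the surrounding match inside 
UnivocityParser.makeConverter): recurse directly instead of wrapping the 
recursive call in an extra lambda.

{code:java}
// Before: returns String => (String => Any), which later gets cast to UTF8String.
// After: returns the converter for the UDT's sqlType itself.
case udt: UserDefinedType[_] =>
  makeConverter(name, udt.sqlType, nullable, options)
{code}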



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27472) Document binary file data source in Spark user guide

2019-04-29 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-27472.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24484
[https://github.com/apache/spark/pull/24484]

> Document binary file data source in Spark user guide
> -
>
> Key: SPARK-27472
> URL: https://issues.apache.org/jira/browse/SPARK-27472
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> We should add binary file data source to 
> https://spark.apache.org/docs/latest/sql-data-sources.html after SPARK-25348.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT

2019-04-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829370#comment-16829370
 ] 

Liang-Chi Hsieh commented on SPARK-27591:
-

oh, you're right. I've misread the description. Want to submit a PR with your 
fix?

> A bug in UnivocityParser prevents using UDT
> ---
>
> Key: SPARK-27591
> URL: https://issues.apache.org/jira/browse/SPARK-27591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Artem Kalchenko
>Priority: Minor
>
> I am trying to define a UserDefinedType based on String but different from 
> StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am 
> doing something incorrectly.
> I define my type as follows:
> {code:java}
> class MyType extends UserDefinedType[MyValue] {
>   override def sqlType: DataType = StringType
>   ...
> }
> @SQLUserDefinedType(udt = classOf[MyType])
> case class MyValue
> {code}
> I expect it to be read and stored as String with just a custom SQL type. In 
> fact Spark can't read the string at all:
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11
>  cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
> {code}
> The problem is that UnivocityParser.makeConverter doesn't return a (String 
> => Any) function but a (String => (String => Any)) one in the case of UDT; 
> see UnivocityParser:184
> {code:java}
> case udt: UserDefinedType[_] => (datum: String) =>
>   makeConverter(name, udt.sqlType, nullable, options)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT

2019-04-29 Thread Artem Kalchenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829354#comment-16829354
 ] 

Artem Kalchenko commented on SPARK-27591:
-

[~viirya], yes, I'm returning a string. I tried creating a UTF8String, but it 
didn't help. As I mentioned in the description, the method is returning String 
=> makeConverter(...) instead of just calling makeConverter on udt.sqlType

> A bug in UnivocityParser prevents using UDT
> ---
>
> Key: SPARK-27591
> URL: https://issues.apache.org/jira/browse/SPARK-27591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Artem Kalchenko
>Priority: Minor
>
> I am trying to define a UserDefinedType based on String but different from 
> StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am 
> doing something incorrectly.
> I define my type as follows:
> {code:java}
> class MyType extends UserDefinedType[MyValue] {
>   override def sqlType: DataType = StringType
>   ...
> }
> @SQLUserDefinedType(udt = classOf[MyType])
> case class MyValue
> {code}
> I expect it to be read and stored as String with just a custom SQL type. In 
> fact Spark can't read the string at all:
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11
>  cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
> {code}
> The problem is that UnivocityParser.makeConverter doesn't return a (String 
> => Any) function but a (String => (String => Any)) one in the case of UDT; 
> see UnivocityParser:184
> {code:java}
> case udt: UserDefinedType[_] => (datum: String) =>
>   makeConverter(name, udt.sqlType, nullable, options)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-29 Thread Will Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Zhang updated SPARK-27574:
---
Description: 
I'm using spark-on-kubernetes to submit a Spark app to Kubernetes.
Most of the time it runs smoothly, but sometimes I see in the logs after 
submitting that the driver pod phase changed from running to pending and that 
a second container started in the pod, even though the first container exited 
successfully.

I use the standard spark-submit to kubernetes like:

/opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster 
--class xxx ...

 

log is below:

19/04/19 09:38:40 INFO LineBufferedStream: stdout: 2019-04-19 09:38:40 INFO 
LoggingPodStatusWatcherImpl:54 - State changed, new state:
19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod name: 
com--cloud-mf-trainer-submit-1555666719424-driver
19/04/19 09:38:40 INFO LineBufferedStream: stdout: namespace: default
19/04/19 09:38:40 INFO LineBufferedStream: stdout: labels: DagTask_ID -> 
54f854e2-0bce-4bd6-50e7-57b521b216f7, spark-app-selector -> 
spark-4343fe80572c4240bd933246efd975da, spark-role -> driver
19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod uid: 
ea4410d5-6286-11e9-ae72-e8611f1fbb2a
19/04/19 09:38:40 INFO LineBufferedStream: stdout: creation time: 
2019-04-19T09:38:40Z
19/04/19 09:38:40 INFO LineBufferedStream: stdout: service account name: default
19/04/19 09:38:40 INFO LineBufferedStream: stdout: volumes: spark-local-dir-1, 
spark-conf-volume, default-token-q7drh
19/04/19 09:38:40 INFO LineBufferedStream: stdout: node name: N/A
19/04/19 09:38:40 INFO LineBufferedStream: stdout: start time: N/A
19/04/19 09:38:40 INFO LineBufferedStream: stdout: container images: N/A
19/04/19 09:38:40 INFO LineBufferedStream: stdout: phase: Pending
19/04/19 09:38:40 INFO LineBufferedStream: stdout: status: []
19/04/19 09:38:40 INFO LineBufferedStream: stdout: 2019-04-19 09:38:40 INFO 
LoggingPodStatusWatcherImpl:54 - State changed, new state:
19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod name: 
com--cloud-mf-trainer-submit-1555666719424-driver
19/04/19 09:38:40 INFO LineBufferedStream: stdout: namespace: default
19/04/19 09:38:40 INFO LineBufferedStream: stdout: labels: DagTask_ID -> 
54f854e2-0bce-4bd6-50e7-57b521b216f7, spark-app-selector -> 
spark-4343fe80572c4240bd933246efd975da, spark-role -> driver
19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod uid: 
ea4410d5-6286-11e9-ae72-e8611f1fbb2a
19/04/19 09:38:40 INFO LineBufferedStream: stdout: creation time: 
2019-04-19T09:38:40Z
19/04/19 09:38:40 INFO LineBufferedStream: stdout: service account name: default
19/04/19 09:38:40 INFO LineBufferedStream: stdout: volumes: spark-local-dir-1, 
spark-conf-volume, default-token-q7drh
19/04/19 09:38:40 INFO LineBufferedStream: stdout: node name: 
yq01-m12-ai2b-service02.yq01..com
19/04/19 09:38:40 INFO LineBufferedStream: stdout: start time: N/A
19/04/19 09:38:40 INFO LineBufferedStream: stdout: container images: N/A
19/04/19 09:38:40 INFO LineBufferedStream: stdout: phase: Pending
19/04/19 09:38:40 INFO LineBufferedStream: stdout: status: []

19/04/19 09:38:41 INFO LineBufferedStream: stdout: 2019-04-19 09:38:41 INFO 
LoggingPodStatusWatcherImpl:54 - State changed, new state:
19/04/19 09:38:41 INFO LineBufferedStream: stdout: pod name: 
com--cloud-mf-trainer-submit-1555666719424-driver
19/04/19 09:38:41 INFO LineBufferedStream: stdout: namespace: default
19/04/19 09:38:41 INFO LineBufferedStream: stdout: labels: DagTask_ID -> 
54f854e2-0bce-4bd6-50e7-57b521b216f7, spark-app-selector -> 
spark-4343fe80572c4240bd933246efd975da, spark-role -> driver
19/04/19 09:38:41 INFO LineBufferedStream: stdout: pod uid: 
ea4410d5-6286-11e9-ae72-e8611f1fbb2a
19/04/19 09:38:41 INFO LineBufferedStream: stdout: creation time: 
2019-04-19T09:38:40Z
19/04/19 09:38:41 INFO LineBufferedStream: stdout: service account name: default
19/04/19 09:38:41 INFO LineBufferedStream: stdout: volumes: spark-local-dir-1, 
spark-conf-volume, default-token-q7drh
19/04/19 09:38:41 INFO LineBufferedStream: stdout: node name: 
yq01-m12-ai2b-service02.yq01..com
19/04/19 09:38:41 INFO LineBufferedStream: stdout: start time: 
2019-04-19T09:38:40Z
19/04/19 09:38:41 INFO LineBufferedStream: stdout: container images: 
10.96.0.100:5000/spark:spark-2.4.0
19/04/19 09:38:41 INFO LineBufferedStream: stdout: phase: Pending
19/04/19 09:38:41 INFO LineBufferedStream: stdout: status: 
[ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, 
imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
restartCount=0, state=ContainerState(running=null, terminated=null, 
waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, 
additionalProperties={}), additionalProperties={}), additionalProperties={})]
19/04/19 09:38:45 INFO LineBufferedStream: stdout: 

[jira] [Comment Edited] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-29 Thread Will Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829337#comment-16829337
 ] 

Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM:
-

Hi [~Udbhav Agrawal], the driver log is nothing special: the first container 
ran successfully and exited. The second failed because it checks the output 
filepath and returns an error if it already exists. What I can see from the log 
is that the second container starts shortly after the first one exits. I 
attached the driver log files. Thank you.

Below is the output of kubectl describe pod; it only contains the second 
container ID:

Name: com--cloud-mf-trainer-submit-1555666719424-driver
 Namespace: default
 Node: yq01-m12-ai2b-service02.yq01..com/10.155.197.12
 Start Time: Fri, 19 Apr 2019 17:38:40 +0800
 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
 spark-app-selector=spark-4343fe80572c4240bd933246efd975da
 spark-role=driver
 Annotations: 
 Status: Failed
 IP: 10.244.12.106
 Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
 Image: 10.96.0.100:5000/spark:spark-2.4.0
 Image ID: 
docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 com..cloud.mf.trainer.Submit
 spark-internal
 --ak
 970f5e4c-7171-4c61-603e-f101b65a573b
 --tracking_server_url
 [http://10.155.197.12:8080|http://10.155.197.12:8080/]
 --graph
 
hdfs://yq01-m12-ai2b-service02.yq01..com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json
 --sk
 56305f9f-b755-4b42-4218-592555f5c4a8
 --mode
 train
 State: Terminated
 Reason: Error
 Exit Code: 1
 Started: Fri, 19 Apr 2019 17:39:57 +0800
 Finished: Fri, 19 Apr 2019 17:40:48 +0800
 Ready: False
 Restart Count: 0
 Limits:
 memory: 2432Mi
 Requests:
 cpu: 1
 memory: 2432Mi
 Environment:
 _KUBERNETES_LOG_ENDPOINT: yq01-m12-ai2b-service02.yq01..com:8070
 _KUBERNETES_LOG_FLUSH_FREQUENCY: 10s
 _KUBERNETES_LOG_PATH: 
/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver
 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
 SPARK_LOCAL_DIRS: /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f
 SPARK_CONF_DIR: /opt/spark/conf
 Mounts:
 /opt/spark/conf from spark-conf-volume (rw)
 /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 
(rw)
 /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro)
 Conditions:
 Type Status
 Initialized True
 Ready False
 PodScheduled True
 Volumes:
 spark-local-dir-1:
 Type: EmptyDir (a temporary directory that shares a pod's lifetime)
 Medium:
 spark-conf-volume:
 Type: ConfigMap (a volume populated by a ConfigMap)
 Name: com--cloud-mf-trainer-submit-1555666719424-driver-conf-map
 Optional: false
 default-token-q7drh:
 Type: Secret (a volume populated by a Secret)
 SecretName: default-token-q7drh
 Optional: false
 QoS Class: Burstable
 Node-Selectors: 
 Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
 node.kubernetes.io/unreachable:NoExecute for 300s
 Events: 

 

 

 


was (Author: zyfo2):
Hi [~Udbhav Agrawal], the driver log is nothing special: the first container 
ran successfully and exited. The second failed because it checks the output 
filepath and returns an error if it already exists. What I can see from the log 
is that the second container starts shortly after the first one exits. I 
attached the driver log files. Thank you.

Below is the output of kubectl describe pod; it only contains the second 
container ID:

Name: com--cloud-mf-trainer-submit-1555666719424-driver
 Namespace: default
 Node: yq01-m12-ai2b-service02.yq01..com/10.155.197.12
 Start Time: Fri, 19 Apr 2019 17:38:40 +0800
 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
 spark-app-selector=spark-4343fe80572c4240bd933246efd975da
 spark-role=driver
 Annotations: 
 Status: Failed
 IP: 10.244.12.106
 Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
 Image: 10.96.0.100:5000/spark:spark-2.4.0
 Image ID: 
docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 com..cloud.mf.trainer.Submit
 spark-internal
 --ak
 970f5e4c-7171-4c61-603e-f101b65a573b
 --tracking_server_url
 

[jira] [Comment Edited] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-29 Thread Will Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829337#comment-16829337
 ] 

Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM:
-

Hi [~Udbhav Agrawal], the driver log is nothing special: the first container 
ran successfully and exited. The second failed because it checks the output 
filepath and returns an error if it already exists. What I can see from the log 
is that the second container starts shortly after the first one exits. I 
attached the driver log files. Thank you.

Below is the output of kubectl describe pod; it only contains the second 
container ID:

Name: com--cloud-mf-trainer-submit-1555666719424-driver
 Namespace: default
 Node: yq01-m12-ai2b-service02.yq01..com/10.155.197.12
 Start Time: Fri, 19 Apr 2019 17:38:40 +0800
 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
 spark-app-selector=spark-4343fe80572c4240bd933246efd975da
 spark-role=driver
 Annotations: 
 Status: Failed
 IP: 10.244.12.106
 Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
 Image: 10.96.0.100:5000/spark:spark-2.4.0
 Image ID: 
docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 com..cloud.mf.trainer.Submit
 spark-internal
 --ak
 970f5e4c-7171-4c61-603e-f101b65a573b
 --tracking_server_url
 [http://10.155.197.12:8080|http://10.155.197.12:8080/]
 --graph
 
hdfs://yq01-m12-ai2b-service02.yq01..com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json
 --sk
 56305f9f-b755-4b42-4218-592555f5c4a8
 --mode
 train
 State: Terminated
 Reason: Error
 Exit Code: 1
 Started: Fri, 19 Apr 2019 17:39:57 +0800
 Finished: Fri, 19 Apr 2019 17:40:48 +0800
 Ready: False
 Restart Count: 0
 Limits:
 memory: 2432Mi
 Requests:
 cpu: 1
 memory: 2432Mi
 Environment:
 _KUBERNETES_LOG_ENDPOINT: yq01-m12-ai2b-service02.yq01..com:8070
 _KUBERNETES_LOG_FLUSH_FREQUENCY: 10s
 _KUBERNETES_LOG_PATH: 
/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver
 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
 SPARK_LOCAL_DIRS: /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f
 SPARK_CONF_DIR: /opt/spark/conf
 Mounts:
 /opt/spark/conf from spark-conf-volume (rw)
 /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 
(rw)
 /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro)
 Conditions:
 Type Status
 Initialized True
 Ready False
 PodScheduled True
 Volumes:
 spark-local-dir-1:
 Type: EmptyDir (a temporary directory that shares a pod's lifetime)
 Medium:
 spark-conf-volume:
 Type: ConfigMap (a volume populated by a ConfigMap)
 Name: com--cloud-mf-trainer-submit-1555666719424-driver-conf-map
 Optional: false
 default-token-q7drh:
 Type: Secret (a volume populated by a Secret)
 SecretName: default-token-q7drh
 Optional: false
 QoS Class: Burstable
 Node-Selectors: 
 Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
 node.kubernetes.io/unreachable:NoExecute for 300s
 Events: 

 

 

 


was (Author: zyfo2):
Hi [~Udbhav Agrawal], the driver log is nothing special: the first container 
ran successfully and exited. The second failed because it checks the output 
filepath and returns an error if it already exists. What I can see from the log 
is that the second container starts shortly after the first one exits. I 
attached the driver log files.

Running kubectl describe pod only shows the second container; the output says:

Name: com--cloud-mf-trainer-submit-1555666719424-driver
Namespace: default
Node: yq01-m12-ai2b-service02.yq01..com/10.155.197.12
Start Time: Fri, 19 Apr 2019 17:38:40 +0800
Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
 spark-app-selector=spark-4343fe80572c4240bd933246efd975da
 spark-role=driver
Annotations: 
Status: Failed
IP: 10.244.12.106
Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
 Image: 10.96.0.100:5000/spark:spark-2.4.0
 Image ID: 
docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 com..cloud.mf.trainer.Submit
 spark-internal
 --ak
 970f5e4c-7171-4c61-603e-f101b65a573b
 --tracking_server_url
 http://10.155.197.12:8080
 --graph
 

[jira] [Commented] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-29 Thread Will Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829337#comment-16829337
 ] 

Will Zhang commented on SPARK-27574:


Hi [~Udbhav Agrawal], the driver log is nothing special: the first container 
ran successfully and exited. The second failed because it checks the output 
filepath and returns an error if it already exists. What I can see from the log 
is that the second container starts shortly after the first one exits. I 
attached the driver log files.

Running kubectl describe pod only shows the second container; the output says:

Name: com--cloud-mf-trainer-submit-1555666719424-driver
Namespace: default
Node: yq01-m12-ai2b-service02.yq01..com/10.155.197.12
Start Time: Fri, 19 Apr 2019 17:38:40 +0800
Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7
 spark-app-selector=spark-4343fe80572c4240bd933246efd975da
 spark-role=driver
Annotations: 
Status: Failed
IP: 10.244.12.106
Containers:
 spark-kubernetes-driver:
 Container ID: 
docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3
 Image: 10.96.0.100:5000/spark:spark-2.4.0
 Image ID: 
docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f
 Ports: 7078/TCP, 7079/TCP, 4040/TCP
 Host Ports: 0/TCP, 0/TCP, 0/TCP
 Args:
 driver
 --properties-file
 /opt/spark/conf/spark.properties
 --class
 com..cloud.mf.trainer.Submit
 spark-internal
 --ak
 970f5e4c-7171-4c61-603e-f101b65a573b
 --tracking_server_url
 http://10.155.197.12:8080
 --graph
 
hdfs://yq01-m12-ai2b-service02.yq01..com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json
 --sk
 56305f9f-b755-4b42-4218-592555f5c4a8
 --mode
 train
 State: Terminated
 Reason: Error
 Exit Code: 1
 Started: Fri, 19 Apr 2019 17:39:57 +0800
 Finished: Fri, 19 Apr 2019 17:40:48 +0800
 Ready: False
 Restart Count: 0
 Limits:
 memory: 2432Mi
 Requests:
 cpu: 1
 memory: 2432Mi
 Environment:
 _KUBERNETES_LOG_ENDPOINT: yq01-m12-ai2b-service02.yq01..com:8070
 _KUBERNETES_LOG_FLUSH_FREQUENCY: 10s
 _KUBERNETES_LOG_PATH: 
/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver
 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
 SPARK_LOCAL_DIRS: /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f
 SPARK_CONF_DIR: /opt/spark/conf
 Mounts:
 /opt/spark/conf from spark-conf-volume (rw)
 /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 
(rw)
 /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro)
Conditions:
 Type Status
 Initialized True
 Ready False
 PodScheduled True
Volumes:
 spark-local-dir-1:
 Type: EmptyDir (a temporary directory that shares a pod's lifetime)
 Medium:
 spark-conf-volume:
 Type: ConfigMap (a volume populated by a ConfigMap)
 Name: com--cloud-mf-trainer-submit-1555666719424-driver-conf-map
 Optional: false
 default-token-q7drh:
 Type: Secret (a volume populated by a Secret)
 SecretName: default-token-q7drh
 Optional: false
QoS Class: Burstable
Node-Selectors: 
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
 node.kubernetes.io/unreachable:NoExecute for 300s
Events: 

 

 

 

> spark on kubernetes driver pod phase changed from running to pending and 
> starts another container in pod
> 
>
> Key: SPARK-27574
> URL: https://issues.apache.org/jira/browse/SPARK-27574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: Kubernetes version (use kubectl version):
> v1.10.0
> OS (e.g: cat /etc/os-release):
> CentOS-7
> Kernel (e.g. uname -a):
> 4.17.11-1.el7.elrepo.x86_64
> Spark-2.4.0
>Reporter: Will Zhang
>Priority: Major
> Attachments: driver-pod-logs.zip
>
>
> I'm using spark-on-kubernetes to submit a Spark app to Kubernetes.
> Most of the time, it runs smoothly.
> But sometimes, I see in the logs after submitting that the driver pod phase 
> changes from running to pending and another container starts in the pod even 
> though the first container exited successfully.
> I use the standard spark-submit to Kubernetes, like:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster 
> --class xxx ...
>  
> log is below:
>  
>  
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 

[jira] [Updated] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-29 Thread Will Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Zhang updated SPARK-27574:
---
Attachment: driver-pod-logs.zip

> spark on kubernetes driver pod phase changed from running to pending and 
> starts another container in pod
> 
>
> Key: SPARK-27574
> URL: https://issues.apache.org/jira/browse/SPARK-27574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: Kubernetes version (use kubectl version):
> v1.10.0
> OS (e.g: cat /etc/os-release):
> CentOS-7
> Kernel (e.g. uname -a):
> 4.17.11-1.el7.elrepo.x86_64
> Spark-2.4.0
>Reporter: Will Zhang
>Priority: Major
> Attachments: driver-pod-logs.zip
>
>
> I'm using spark-on-kubernetes to submit a Spark app to Kubernetes.
> Most of the time, it runs smoothly.
> But sometimes, I see in the logs after submitting that the driver pod phase 
> changes from running to pending and another container starts in the pod even 
> though the first container exited successfully.
> I use the standard spark-submit to Kubernetes, like:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster 
> --class xxx ...
>  
> log is below:
>  
>  
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: N/A
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01..com
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO Client:54 - Waiting for application 
> com..cloud.mf.trainer.Submit to finish...
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01..com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Pending
> status: [ContainerStatus(containerID=null, 
> image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:04 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01..com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Running
> status: 
> [ContainerStatus(containerID=docker://120dbf8cb11cf8ef9b26cff3354e096a979beb35279de34be64b3c06e896b991,
>  image=10.96.0.100:5000/spark:spark-2.4.0, 
> imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f,
>  lastState=ContainerState(running=null, terminated=null, waiting=null, 
> 

[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT

2019-04-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829336#comment-16829336
 ] 

Liang-Chi Hsieh commented on SPARK-27591:
-

Are you returning a string from {{serialize}} in your {{UserDefinedType}}? If 
so, please create a {{UTF8String}} from the string.

Also note that in {{deserialize}}, the data passed in is a {{UTF8String}}.
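
For illustration, a minimal sketch of what that looks like, reusing the 
{{MyType}}/{{MyValue}} names from the report (the single {{s}} field is an 
assumption made for the example):

{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

@SQLUserDefinedType(udt = classOf[MyType])
case class MyValue(s: String) // the "s" field is hypothetical

class MyType extends UserDefinedType[MyValue] {
  override def sqlType: DataType = StringType
  // serialize must hand back a UTF8String, not a java.lang.String
  override def serialize(obj: MyValue): Any = UTF8String.fromString(obj.s)
  // deserialize receives a UTF8String when sqlType is StringType
  override def deserialize(datum: Any): MyValue = datum match {
    case u: UTF8String => MyValue(u.toString)
  }
  override def userClass: Class[MyValue] = classOf[MyValue]
}
{code}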

> A bug in UnivocityParser prevents using UDT
> ---
>
> Key: SPARK-27591
> URL: https://issues.apache.org/jira/browse/SPARK-27591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Artem Kalchenko
>Priority: Minor
>
> I am trying to define a UserDefinedType based on String but different from 
> StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am 
> doing something incorrectly.
> I define my type as follows:
> {code:java}
> class MyType extends UserDefinedType[MyValue] {
>   override def sqlType: DataType = StringType
>   ...
> }
> @SQLUserDefinedType(udt = classOf[MyType])
> case class MyValue
> {code}
> I expect it to be read and stored as String with just a custom SQL type. In 
> fact Spark can't read the string at all:
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11
>  cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
> {code}
> The problem is that UnivocityParser.makeConverter doesn't return a (String 
> => Any) function but a (String => (String => Any)) in the case of UDT; see 
> UnivocityParser:184
> {code:java}
> case udt: UserDefinedType[_] => (datum: String) =>
>   makeConverter(name, udt.sqlType, nullable, options)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27213) Unexpected results when filter is used after distinct

2019-04-29 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829309#comment-16829309
 ] 

Josh Rosen commented on SPARK-27213:


Hmm, this must have been fixed relatively recently. For now, I can confirm that 
the problem still existed as of 5668c42edf20bc577305437622272bf803b6019e (which 
is what I happen to have checked out locally; that commit landed in master on 
March 5, 2019).

It'd be good to try the reproduction on the latest 2.4.x and 2.3.x maintenance 
releases to see if those still have the bug.
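
For convenience, here is a Scala equivalent of the PySpark reproduction from 
the description (a sketch assuming a spark-shell session where 
{{spark.implicits._}} is available):

{code:java}
import spark.implicits._

val df = Seq(("a", "123", "12.3", "n"),
             ("a", "123", "12.3", "y"),
             ("a", "123", "12.4", "y")).toDF("x", "y", "z", "y_n")

// y_n is projected away before distinct(), so this filter should fail
// analysis; the reported bug instead silently resolves y_n via first().
df.select("x", "y", "z").distinct().filter("y_n = 'y'").show()
{code}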

> Unexpected results when filter is used after distinct
> -
>
> Key: SPARK-27213
> URL: https://issues.apache.org/jira/browse/SPARK-27213
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Rinaz Belhaj
>Priority: Major
>  Labels: correctness, distinct, filter
>
> The following code gives unexpected output due to the filter not getting 
> pushed down in catalyst optimizer.
> {code:java}
> df = 
> spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n'])
> df.show(5)
> df.filter("y_n='y'").select('x','y','z').distinct().show()
> df.select('x','y','z').distinct().filter("y_n='y'").show()
> {code}
> {panel:title=Output}
> |x|y|z|y_n|
> |a|123|12.3|n|
> |a|123|12.3|y|
> |a|123|12.4|y|
>  
> |x|y|z|
> |a|123|12.3|
> |a|123|12.4|
>  
> |x|y|z|
> |a|123|12.4|
> {panel}
> Ideally, the second statement should result in an error since the column used 
> in the filter is not present in the preceding select statement. But the 
> catalyst optimizer is using first() on column y_n and then applying the 
> filter.
> Even if the filter had been pushed down, the result would have been accurate.
> {code:java}
> df = 
> spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n'])
> df.filter("y_n='y'").select('x','y','z').distinct().explain(True)
> df.select('x','y','z').distinct().filter("y_n='y'").explain(True) 
> {code}
> {panel:title=Output}
>  
>  == Parsed Logical Plan ==
>  Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- AnalysisBarrier
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (y_n#77 = y)
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Analyzed Logical Plan ==
>  x: string, y: string, z: string
>  Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (y_n#77 = y)
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Optimized Logical Plan ==
>  Aggregate [x#74, y#75, z#76|#74, y#75, z#76], [x#74, y#75, z#76|#74, y#75, 
> z#76]
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Physical Plan ==
>  *(2) HashAggregate(keys=[x#74, y#75, z#76|#74, y#75, z#76], functions=[], 
> output=[x#74, y#75, z#76|#74, y#75, z#76])
>  +- Exchange hashpartitioning(x#74, y#75, z#76, 10)
>  +- *(1) HashAggregate(keys=[x#74, y#75, z#76|#74, y#75, z#76], functions=[], 
> output=[x#74, y#75, z#76|#74, y#75, z#76])
>  +- *(1) Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- *(1) Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- Scan ExistingRDD[x#74,y#75,z#76,y_n#77|#74,y#75,z#76,y_n#77]
>   
>  
> ---
>  
>   
>  == Parsed Logical Plan ==
>  'Filter ('y_n = y)
>  +- AnalysisBarrier
>  +- Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Analyzed Logical Plan ==
>  x: string, y: string, z: string
>  Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (y_n#77 = y)
>  +- Deduplicate [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Project [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77]
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Optimized Logical Plan ==
>  Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- Aggregate [x#74, y#75, z#76|#74, y#75, z#76], [x#74, y#75, z#76, 
> first(y_n#77, false) AS y_n#77|#74, y#75, z#76, first(y_n#77, false) AS 
> y_n#77]
>  +- LogicalRDD [x#74, y#75, z#76, y_n#77|#74, y#75, z#76, y_n#77], false
>   
>  == Physical Plan ==
>  *(3) Project [x#74, y#75, z#76|#74, y#75, z#76]
>  +- *(3) Filter (isnotnull(y_n#77) && (y_n#77 = y))
>  +- SortAggregate(key=[x#74, y#75, z#76|#74, y#75, z#76], 
> functions=[first(y_n#77, false)|#77, false)], output=[x#74, y#75, z#76, 
> y_n#77|#74, y#75, z#76, y_n#77])
>  +- 

[jira] [Commented] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop

2019-04-29 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829303#comment-16829303
 ] 

Josh Rosen commented on SPARK-27586:


Good find! This sounds pretty straightforward to fix; want to submit a pull 
request with the {{while}} loop version?
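
For concreteness, one possible while-loop rewrite (a sketch that keeps the 
existing signed-byte comparison semantics; not necessarily the exact patch 
that ends up merged):

{code:java}
def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
  val minLen = math.min(x.length, y.length)
  var i = 0
  while (i < minLen) {
    // plain index loop avoids the Range/closure overhead of the for-comprehension
    val res = java.lang.Byte.compare(x(i), y(i))
    if (res != 0) return res
    i += 1
  }
  x.length - y.length
}
{code}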

> Improve binary comparison: replace Scala's for-comprehension if statements 
> with while loop
> --
>
> Key: SPARK-27586
> URL: https://issues.apache.org/jira/browse/SPARK-27586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.2
> Environment: benchmark env:
>  * Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>  * Linux 4.4.0-33.bm.1-amd64
>  * java version "1.8.0_131"
>  * Scala 2.11.8
>  * perf version 4.4.0
> Run:
> 40,000,000 times comparison on 32 bytes-length binary
>  
>Reporter: WoudyGao
>Priority: Minor
>
> I found the CPU cost of TypeUtils.compareBinary is noticeable when handling 
> some big Parquet files.
> After some perf work, I found that
> the "for-comprehension if statements" execute ≈15x more instructions than a 
> while loop.
>  
> *'while-loop' version perf:*
>   
>  {{886.687949  task-clock (msec) #    1.257 CPUs 
> utilized}}
>  {{ 3,089  context-switches  #    0.003 M/sec}}
>  {{   265  cpu-migrations#    0.299 K/sec}}
>  {{12,227  page-faults   #    0.014 M/sec}}
>  {{ 2,209,183,920  cycles#    2.492 GHz}}
>  {{     stalled-cycles-frontend}}
>  {{     stalled-cycles-backend}}
>  {{ 6,865,836,114  instructions  #    3.11  insns per 
> cycle}}
>  {{ 1,568,910,228  branches  # 1769.405 M/sec}}
>  {{ 9,172,613  branch-misses #    0.58% of all 
> branches}}
>   
>  {{   0.705671157 seconds time elapsed}}
>   
> *TypeUtils.compareBinary perf:*
>  {{  16347.242313  task-clock (msec) #    1.233 CPUs 
> utilized}}
>  {{ 8,370  context-switches  #    0.512 K/sec}}
>  {{   481  cpu-migrations#    0.029 K/sec}}
>  {{   536,671  page-faults   #    0.033 M/sec}}
>  {{40,857,347,119  cycles#    2.499 GHz}}
>  {{     stalled-cycles-frontend}}
>  {{     stalled-cycles-backend}}
>  {{90,606,381,612  instructions  #    2.22  insns per 
> cycle}}
>  {{18,107,867,151  branches  # 1107.702 M/sec}}
>  {{12,880,296  branch-misses #    0.07% of all 
> branches}}
>   
>  {{  13.257617118 seconds time elapsed}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-29 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829292#comment-16829292
 ] 

Udbhav Agrawal commented on SPARK-27574:


Hey [~zyfo2], can you share the driver pod logs obtained by:

kubectl logs driverpod-name -n namespace-name

> spark on kubernetes driver pod phase changed from running to pending and 
> starts another container in pod
> 
>
> Key: SPARK-27574
> URL: https://issues.apache.org/jira/browse/SPARK-27574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: Kubernetes version (use kubectl version):
> v1.10.0
> OS (e.g: cat /etc/os-release):
> CentOS-7
> Kernel (e.g. uname -a):
> 4.17.11-1.el7.elrepo.x86_64
> Spark-2.4.0
>Reporter: Will Zhang
>Priority: Major
>
> I'm using spark-on-kubernetes to submit a Spark app to Kubernetes.
> Most of the time, it runs smoothly.
> But sometimes, I see in the logs after submitting that the driver pod phase 
> changes from running to pending and another container starts in the pod even 
> though the first container exited successfully.
> I use the standard spark-submit to Kubernetes, like:
> /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster 
> --class xxx ...
>  
> log is below:
>  
>  
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: N/A
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01..com
> start time: N/A
> container images: N/A
> phase: Pending
> status: []
> 2019-04-25 13:37:01 INFO Client:54 - Waiting for application 
> com..cloud.mf.trainer.Submit to finish...
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01..com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Pending
> status: [ContainerStatus(containerID=null, 
> image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 2019-04-25 13:37:04 INFO LoggingPodStatusWatcherImpl:54 - State changed, new 
> state:
> pod name: com--cloud-mf-trainer-submit-1556199419847-driver
> namespace: default
> labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, 
> spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> 
> driver
> pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a
> creation time: 2019-04-25T13:37:01Z
> service account name: default
> volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh
> node name: yq01-m12-ai2b-service02.yq01..com
> start time: 2019-04-25T13:37:01Z
> container images: 10.96.0.100:5000/spark:spark-2.4.0
> phase: Running
> status: 
> [ContainerStatus(containerID=docker://120dbf8cb11cf8ef9b26cff3354e096a979beb35279de34be64b3c06e896b991,
>  image=10.96.0.100:5000/spark:spark-2.4.0, 
> imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f,
>  

[jira] [Created] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly

2019-04-29 Thread Jan-Willem van der Sijp (JIRA)
Jan-Willem van der Sijp created SPARK-27594:
---

 Summary: spark.sql.orc.enableVectorizedReader causes milliseconds 
in Timestamp to be read incorrectly
 Key: SPARK-27594
 URL: https://issues.apache.org/jira/browse/SPARK-27594
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Jan-Willem van der Sijp


Using {{spark.sql.orc.impl=native}} and 
{{spark.sql.orc.enableVectorizedReader=true}} causes TIMESTAMP columns in Hive 
tables stored as ORC to be read incorrectly. Specifically, the milliseconds of 
the timestamp will be doubled.

Input/output of a Zeppelin session to demonstrate:

{code:python}
%pyspark

from pprint import pprint

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

pprint(spark.sparkContext.getConf().getAll())

[('sql.stacktrace', 'false'),
 ('spark.eventLog.enabled', 'true'),
 ('spark.app.id', 'application_1556200632329_0005'),
 ('importImplicit', 'true'),
 ('printREPLOutput', 'true'),
 ('spark.history.ui.port', '18081'),
 ('spark.driver.extraLibraryPath',
  
'/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
 ('spark.driver.extraJavaOptions',
  ' -Dfile.encoding=UTF-8 '
  
'-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties
 '
  
'-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
 ('concurrentSQL', 'false'),
 ('spark.driver.port', '40195'),
 ('spark.executor.extraLibraryPath',
  
'/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
 ('useHiveContext', 'true'),
 ('spark.jars',
  
'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
 ('spark.history.provider',
  'org.apache.spark.deploy.history.FsHistoryProvider'),
 ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.filters',
  'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
 
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
  'sandbox-hdp.hortonworks.com'),
 ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
 ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
 ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
 ('master', 'yarn'),
 ('spark.yarn.dist.archives',
  '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
 ('spark.scheduler.mode', 'FAIR'),
 ('spark.yarn.queue', 'default'),
 ('spark.history.kerberos.keytab',
  '/etc/security/keytabs/spark.headless.keytab'),
 ('spark.executor.id', 'driver'),
 ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
 ('spark.history.kerberos.enabled', 'false'),
 ('spark.master', 'yarn'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.history.kerberos.principal', 'none'),
 ('spark.driver.extraClassPath',
  
':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
 ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
 ('spark.repl.class.outputDir',
  '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
 ('spark.yarn.isPython', 'true'),
 ('spark.app.name', 'Zeppelin'),
 
('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
  
'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
 ('maxResult', '1000'),
 ('spark.executorEnv.PYTHONPATH',
  
'/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip{{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.6-src.zip'),
 ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
{code}

{code:python}
%pyspark

spark.sql("""
DROP TABLE IF EXISTS default.hivetest
""")

spark.sql("""
CREATE TABLE default.hivetest (
day DATE,
time TIMESTAMP,
timestring STRING
)
USING ORC
""")
{code}

{code:python}
%pyspark

df1 = spark.createDataFrame(
[
("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
],
schema=("date", "timestamp", "string")
)

df2 = spark.createDataFrame(
[
("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
],
schema=("date", "timestamp", "string")
)
{code}

{code:python}
%pyspark

spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
df1.write.insertInto("default.hivetest")

spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
df1.write.insertInto("default.hivetest")
{code}

{code:python}
%pyspark


[jira] [Commented] (SPARK-23191) Workers registration failes in case of network drop

2019-04-29 Thread wuyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829272#comment-16829272
 ] 

wuyi commented on SPARK-23191:
--

[~cloud_fan] Ok, I'll have a deep look after the May 1st holiday.

> Workers registration failes in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues of multiple drivers running in 
> some scenarios in production.
> On further investigation we were able to reproduce the scenario, in both 1.6.3 
> and 2.2.1, with the following steps:
>  # Setup a 3 node cluster. Start master and slaves.
>  # On any node where the worker process is running block the connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 secs we get an error on the node that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so. 
> It gives the following error
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at 
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
>     

[jira] [Assigned] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27555:


Assignee: (was: Apache Spark)

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Priority: Major
> Attachments: Try.pdf
>
>
> *You can see details in the attachment called Try.pdf.*
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397,
> and I checked the Spark source code for the change where, with 
> "spark.sql.hive.convertCTAS=true" set, Spark uses 
> "spark.sql.sources.default" (which is parquet) as the storage format in the 
> "create table as select" scenario.
> But my case is just create table, without select. When I set 
> hive.default.fileformat=parquet in hive-site.xml, or set 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after 
> creating a table, when I check the Hive table it still uses the textfile 
> fileformat.
>  
> It seems HiveSerDe gets the value of the hive.default.fileformat parameter 
> from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set into the 
> Hadoop configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27555:


Assignee: Apache Spark

> cannot create table by using the hive default fileformat in both 
> hive-site.xml and spark-defaults.conf
> --
>
> Key: SPARK-27555
> URL: https://issues.apache.org/jira/browse/SPARK-27555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Hui WANG
>Assignee: Apache Spark
>Priority: Major
> Attachments: Try.pdf
>
>
> *You can see details in the attachment called Try.pdf.*
> I have already seen https://issues.apache.org/jira/browse/SPARK-17620 
> and https://issues.apache.org/jira/browse/SPARK-18397,
> and I checked the Spark source code for the change where, with 
> "spark.sql.hive.convertCTAS=true" set, Spark uses 
> "spark.sql.sources.default" (which is parquet) as the storage format in the 
> "create table as select" scenario.
> But my case is just create table, without select. When I set 
> hive.default.fileformat=parquet in hive-site.xml, or set 
> spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after 
> creating a table, when I check the Hive table it still uses the textfile 
> fileformat.
>  
> It seems HiveSerDe gets the value of the hive.default.fileformat parameter 
> from SQLConf.
> The parameter values in SQLConf are copied from the SparkContext's SparkConf at 
> SparkSession initialization, while the configuration parameters in 
> hive-site.xml are loaded into the SparkContext's hadoopConfiguration by 
> SharedState, and all configs prefixed with "spark.hadoop" are set into the 
> Hadoop configuration, so the setting does not take effect.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23191) Workers registration failes in case of network drop

2019-04-29 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829260#comment-16829260
 ] 

Wenchen Fan commented on SPARK-23191:
-

[~Ngone51] can you take a look please?

> Workers registration failes in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues of multiple drivers running in 
> some scenarios in production.
> On further investigation we were able to reproduce the scenario, in both 1.6.3 
> and 2.2.1, with the following steps:
>  # Setup a 3 node cluster. Start master and slaves.
>  # On any node where the worker process is running block the connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 secs we get an error on the node that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so. 
> It gives the following error
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at 
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
>     at 

[jira] [Assigned] (SPARK-27581) DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"

2019-04-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27581:
---

Assignee: Liang-Chi Hsieh

> DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of 
> '*' in expression 'count'"
> ---
>
> Key: SPARK-27581
> URL: https://issues.apache.org/jira/browse/SPARK-27581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> If I have a DataFrame then I can use {{count("*")}} as an expression, e.g.:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = sql("select id % 100 from range(10)")
> df.select(count("*")).first()
> {code}
> However, if I try to do the same thing with {{countDistinct}} I get an error:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = sql("select id % 100 from range(10)")
> df.select(countDistinct("*")).first()
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'count';
> {code}
> As a workaround, I need to use {{expr}}, e.g.
> {code:java}
> import org.apache.spark.sql.functions._
> val df = sql("select id % 100 from range(10)")
> df.select(expr("count(distinct(*))")).first()
> {code}
> You might be wondering "why not just use {{df.count()}} or 
> {{df.distinct().count()}}?" but in my case I'd ultimately like to compute both 
> counts as part of the same aggregation, e.g.
> {code:java}
> val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*")).as[(Long, 
> Long)].first()
> {code}
> I'm reporting this because it's a minor usability annoyance / surprise for 
> inexperienced Spark users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27581) DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"

2019-04-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27581.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24482
[https://github.com/apache/spark/pull/24482]

> DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of 
> '*' in expression 'count'"
> ---
>
> Key: SPARK-27581
> URL: https://issues.apache.org/jira/browse/SPARK-27581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
> Fix For: 3.0.0
>
>
> If I have a DataFrame then I can use {{count("*")}} as an expression, e.g.:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = sql("select id % 100 from range(10)")
> df.select(count("*")).first()
> {code}
> However, if I try to do the same thing with {{countDistinct}} I get an error:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = sql("select id % 100 from range(10)")
> df.select(countDistinct("*")).first()
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'count';
> {code}
> As a workaround, I need to use {{expr}}, e.g.
> {code:java}
> import org.apache.spark.sql.functions._
> val df = sql("select id % 100 from range(10)")
> df.select(expr("count(distinct(*))")).first()
> {code}
> You might be wondering "why not just use {{df.count()}} or 
> {{df.distinct().count()}}?" but in my case I'd ultimately like to compute both 
> counts as part of the same aggregation, e.g.
> {code:java}
> val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*")).as[(Long, 
> Long)].first()
> {code}
> I'm reporting this because it's a minor usability annoyance / surprise for 
> inexperienced Spark users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21827) Task fails due to executor exception when Sasl Encryption is enabled

2019-04-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829089#comment-16829089
 ] 

Sébastien BARNOUD edited comment on SPARK-21827 at 4/29/19 12:15 PM:
-

Hi,

 

I was investigating timeouts on the HBase client (version 1.1.2) on my Hadoop 
cluster with security enabled, using HotSpot JDK 1.8.0_92-b14.

I have found the following message in the logs each time I get a timeout:

*sasl:1481  - DIGEST41:Unmatched MACs*

 

After a look at the code, I understand that the message is simply ignored if an 
invalid MAC is received. In my opinion, this is not normal behavior: at the very 
least, it allows an attacker to flood the connection.

 

But in my case there is no man in the middle, yet I still get this message. It 
looks like there is a bug (probably a non-thread-safe method somewhere) in the 
MAC validation, causing the message to be ignored and leading to my HBase 
timeout.

At the same time, we have found some TEZ jobs stuck on our cluster since we 
enabled security on shuffle (mapreduce, TEZ and Spark). In each hung job, we 
could identify that the SSL handshake never finished:

 

"fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 
nid=0x13dad runnable [0x7fd83beb6000]

   java.lang.Thread.State: RUNNABLE

   at java.net.SocketInputStream.socketRead0(Native Method)

   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

   at java.net.SocketInputStream.read(SocketInputStream.java:170)

   at java.net.SocketInputStream.read(SocketInputStream.java:141)

   at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)

   at sun.security.ssl.InputRecord.read(InputRecord.java:503)

   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)

   - locked <0x0007b997a470> (a java.lang.Object)

   at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)

   - locked <0x0007b997a430> (a java.lang.Object)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)

   at 
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80)

   at 
sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)

   - locked <0x0007b9979ea8> (a 
sun.net.www.protocol.https.HttpsURLConnectionImpl)

   at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)

 

Looking at the TEZ source shows that there is no timeout in the code, leading 
to this infinite wait.

 

After some more investigation, I found:

-) https://issues.apache.org/jira/browse/SPARK-21827

-) [https://issues.cask.co/browse/CDAP-12737]

-) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956]

 

It seems that this issue affects a lot of software, and ForgeRock seems to have 
identified the thread safety issue.

 

To summarize, there are 2 issues:
 # the message shouldn't be ignored when the MAC is invalid; an exception 
should be thrown instead.
 # The thread safety issue should be investigated and corrected in the JDK, 
because relying on a synchronized method at the application layer is not 
viable. Typically, an application like Spark uses multiple SASL implementations 
and can't synchronize all of them.
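
As a sketch of what point 1 proposes, the control flow would fail fast instead 
of silently dropping the message (hypothetical code only; the class and method 
here are assumptions, the real logic lives in the JDK's com.sun.security.sasl 
DIGEST-MD5 implementation):

{code:java}
import java.util.Arrays;
import javax.security.sasl.SaslException;

final class MacCheckSketch {
    // Hypothetical sketch: verify the MAC and throw instead of silently
    // ignoring the message, as the JDK does today ("DIGEST41:Unmatched MACs").
    static void verifyMac(byte[] receivedMac, byte[] computedMac) throws SaslException {
        if (!Arrays.equals(receivedMac, computedMac)) {
            throw new SaslException("DIGEST41: Unmatched MACs");
        }
    }
}
{code}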

 

I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because 
IMO it's a JDK bug.


was (Author: sbarnoud):
Hi,

 

I was investigating timeout on HBase Client (version 1.1.2) on my Hadoop 

[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8.2

2019-04-29 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829155#comment-16829155
 ] 

Steve Loughran commented on SPARK-24771:


Update: Hadoop is going to update its avro version too: HADOOP-13386.

Ultimately this is a good thing; it's just going to be one of those 
transitive-changes-may-break-compiled-code things, which can really hurt 
people. The good news: Spark is ahead of the curve here.

> Upgrade AVRO version from 1.7.7 to 1.8.2
> 
>
> Key: SPARK-24771
> URL: https://issues.apache.org/jira/browse/SPARK-24771
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27593) CSV Parser returns 2 DataFrames - Valid and Malformed DFs

2019-04-29 Thread Ladislav Jech (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ladislav Jech updated SPARK-27593:
--
Issue Type: New Feature  (was: Improvement)

> CSV Parser returns 2 DataFrames - Valid and Malformed DFs
> 
>
> Key: SPARK-27593
> URL: https://issues.apache.org/jira/browse/SPARK-27593
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: Ladislav Jech
>Priority: Major
>
> When we process CSV in any kind of data warehouse, it's common procedure to 
> report corrupted records for audit purposes and as feedback to the vendor, so 
> they can enhance their procedure. CSV is no different from XSD from the 
> perspective that it defines a schema, although in a very limited way (in some 
> cases only as a number of columns, without headers, and without types); but 
> when I check an XML document against an XSD file, I get an exact report of 
> whether the file is completely valid and, if not, an exact report of which 
> records do not follow the schema.
> Such a feature would have big value in Spark for CSV: get malformed records 
> into some dataframe, with a line count (pointer within the data object), so I 
> can log both the pointer and the real data (line/row) and trigger an action 
> on this unfortunate event.
> The load() method could return an Array of DFs (Valid, Invalid).
> PERMISSIVE mode isn't enough, since it fills missing fields with nulls, making 
> it even harder to detect what is really wrong. Another approach at the moment 
> is to read both permissive and dropmalformed modes into 2 dataframes and 
> compare them against each other.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27593) CSV Parser returns 2 DataFrames - Valid and Malformed DFs

2019-04-29 Thread Ladislav Jech (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ladislav Jech updated SPARK-27593:
--
Description: 
When we process CSV in any kind of data warehouse, its common procedure to 
report corrupted records for audit purposes and feedback back to vendor, so 
they can enhance their procedure. CSV is no difference from XSD from 
perspective that it define a schema although in very limited way (in some cases 
only as number of columns without even headers, and we don't have types), but 
when I check XML document against XSD file, I get exact report of if the file 
is completely valid and if not I get exact report of what records are not 
following schema. 

Such feature will have big value in Spark for CSV, get malformed records into 
some dataframe, with line count (pointer within the data object), so I can log 
both pointer and real data (line/row) and trigger action on this unfortunate 
event.

load() method could return Array of DFs (Valid, Invalid)

  was:
When we process CSV in any kind of data warehouse, it's common procedure to 
report corrupted records for audit purposes and as feedback to the vendor, so 
they can enhance their procedure. CSV is no different from XSD from the 
perspective that it defines a schema, although in a very limited way (in some 
cases only as a number of columns, without headers, and without types); but 
when I check an XML document against an XSD file, I get an exact report of 
whether the file is completely valid and, if not, an exact report of which 
records do not follow the schema.

Such a feature would have big value in Spark for CSV: get malformed records 
into some dataframe, with a line count (pointer within the data object), so I 
can log both the pointer and the real data (line/row) and trigger an action on 
this unfortunate event.


> CSV Parser returns 2 DataFrames - Valid and Malformed DFs
> 
>
> Key: SPARK-27593
> URL: https://issues.apache.org/jira/browse/SPARK-27593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: Ladislav Jech
>Priority: Major
>
> When we process CSV in any kind of data warehouse, it's common procedure to 
> report corrupted records for audit purposes and as feedback to the vendor, so 
> they can enhance their procedure. CSV is no different from XSD from the 
> perspective that it defines a schema, although in a very limited way (in some 
> cases only as a number of columns, without headers, and without types); but 
> when I check an XML document against an XSD file, I get an exact report of 
> whether the file is completely valid and, if not, an exact report of which 
> records do not follow the schema.
> Such a feature would have big value in Spark for CSV: get malformed records 
> into some dataframe, with a line count (pointer within the data object), so I 
> can log both the pointer and the real data (line/row) and trigger an action 
> on this unfortunate event.
> The load() method could return an Array of DFs (Valid, Invalid).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27593) CSV Parser returns 2 DataFrames - Valid and Malformed DFs

2019-04-29 Thread Ladislav Jech (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ladislav Jech updated SPARK-27593:
--
Description: 
When we process CSV in any kind of data warehouse, its common procedure to 
report corrupted records for audit purposes and feedback back to vendor, so 
they can enhance their procedure. CSV is no difference from XSD from 
perspective that it define a schema although in very limited way (in some cases 
only as number of columns without even headers, and we don't have types), but 
when I check XML document against XSD file, I get exact report of if the file 
is completely valid and if not I get exact report of what records are not 
following schema. 

Such feature will have big value in Spark for CSV, get malformed records into 
some dataframe, with line count (pointer within the data object), so I can log 
both pointer and real data (line/row) and trigger action on this unfortunate 
event.

load() method could return Array of DFs (Valid, Invalid)

PERMISSIVE MODE isn't enough as soon as it fill missing fields with nulls, so 
it is even harder to detect what is really wrong. Another approach at moment is 
to read both permissive and dropmalformed modes into 2 dataframes and compare 
those one against each other.

  was:
When we process CSV in any kind of data warehouse, it's common procedure to 
report corrupted records for audit purposes and as feedback to the vendor, so 
they can enhance their procedure. CSV is no different from XSD from the 
perspective that it defines a schema, although in a very limited way (in some 
cases only as a number of columns, without headers, and without types); but 
when I check an XML document against an XSD file, I get an exact report of 
whether the file is completely valid and, if not, an exact report of which 
records do not follow the schema.

Such a feature would have big value in Spark for CSV: get malformed records 
into some dataframe, with a line count (pointer within the data object), so I 
can log both the pointer and the real data (line/row) and trigger an action on 
this unfortunate event.

The load() method could return an Array of DFs (Valid, Invalid).


> CSV Parser returns 2 DataFrames - Valid and Malformed DFs
> 
>
> Key: SPARK-27593
> URL: https://issues.apache.org/jira/browse/SPARK-27593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: Ladislav Jech
>Priority: Major
>
> When we process CSV in any kind of data warehouse, it's common procedure to 
> report corrupted records for audit purposes and as feedback to the vendor, so 
> they can enhance their procedure. CSV is no different from XSD from the 
> perspective that it defines a schema, although in a very limited way (in some 
> cases only as a number of columns, without headers, and without types); but 
> when I check an XML document against an XSD file, I get an exact report of 
> whether the file is completely valid and, if not, an exact report of which 
> records do not follow the schema.
> Such a feature would have big value in Spark for CSV: get malformed records 
> into some dataframe, with a line count (pointer within the data object), so I 
> can log both the pointer and the real data (line/row) and trigger an action 
> on this unfortunate event.
> The load() method could return an Array of DFs (Valid, Invalid).
> PERMISSIVE mode isn't enough, since it fills missing fields with nulls, making 
> it even harder to detect what is really wrong. Another approach at the moment 
> is to read both permissive and dropmalformed modes into 2 dataframes and 
> compare them against each other.
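
For what it's worth, the split can already be approximated in a single pass 
with PERMISSIVE mode plus a corrupt-record column (a minimal sketch; the 
schema, path, and column name are illustrative):

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Explicit schema with an extra string column that captures malformed lines:
val schema = new StructType()
  .add("c1", IntegerType)
  .add("c2", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/path/to/input.csv")
  .cache()  // Spark 2.3+ disallows querying only the corrupt column on the raw files

val valid     = df.filter(col("_corrupt_record").isNull).drop("_corrupt_record")
val malformed = df.filter(col("_corrupt_record").isNotNull)
{code}

This still lacks the line pointer the issue asks for, which is part of why a 
first-class API would help.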



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27593) CSV Parser returns 2 DataFrames - Valid and Malformed DFs

2019-04-29 Thread Ladislav Jech (JIRA)
Ladislav Jech created SPARK-27593:
-

 Summary: CSV Parser returns 2 DataFrames - Valid and Malformed DFs
 Key: SPARK-27593
 URL: https://issues.apache.org/jira/browse/SPARK-27593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.2
Reporter: Ladislav Jech


When we process CSV in any kind of data warehouse, it's common procedure to 
report corrupted records for audit purposes and as feedback to the vendor, so 
they can enhance their procedure. CSV is no different from XSD from the 
perspective that it defines a schema, although in a very limited way (in some 
cases only as a number of columns, without headers, and without types); but 
when I check an XML document against an XSD file, I get an exact report of 
whether the file is completely valid and, if not, an exact report of which 
records do not follow the schema.

Such a feature would have big value in Spark for CSV: get malformed records 
into some dataframe, with a line count (pointer within the data object), so I 
can log both the pointer and the real data (line/row) and trigger an action on 
this unfortunate event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20880) When spark SQL is used with Avro-backed HIVE tables, NPE from org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories.

2019-04-29 Thread DineshPandian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829093#comment-16829093
 ] 

DineshPandian commented on SPARK-20880:
---

Hi, any update on this?

> When spark SQL is used with  Avro-backed HIVE tables,  NPE from 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories.
> 
>
> Key: SPARK-20880
> URL: https://issues.apache.org/jira/browse/SPARK-20880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Vinod KC
>Priority: Minor
>
> When Spark SQL is used with Avro-backed HIVE tables, we intermittently get an 
> NPE from 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories.
> The root cause is a race condition in the hive 1.2.1 jar used by Spark SQL.
> In HIVE 2.2 this issue has been fixed (HIVE JIRA: 
> https://issues.apache.org/jira/browse/HIVE-16175); since Spark is still using 
> the Hive 1.2.1 jars, we still run into the race condition.
> One workaround is to run Spark with a single task per executor; however, it 
> will slow down the jobs (see the sketch at the end of this message).
> Exception stack trace
> 13/05/07 09:18:39 WARN scheduler.TaskSetManager: Lost task 18.0 in stage 0.0 
> (TID 18, aiyhyashu.dxc.com): java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:120)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:83)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:56)
> at 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:124)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:251)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:239)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 
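
On the single-task-per-executor workaround mentioned above: one way to get 
there is to give each executor a single core, so at most one task runs per JVM 
at a time (a sketch; the memory setting, class name and jar are illustrative):

{code:java}
spark-submit \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=2g \
  --class com.example.MyApp myapp.jar
{code}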

[jira] [Assigned] (SPARK-27592) Write the data of table write information to metadata

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27592:


Assignee: (was: Apache Spark)

> Write the data of table write information to metadata
> -
>
> Key: SPARK-27592
> URL: https://issues.apache.org/jira/browse/SPARK-27592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27592) Write the data of table write information to metadata

2019-04-29 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27592:

Description: 
We hint Hive to use an incorrect 
InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's 
Parquet datasource bucketed table:
{noformat}
spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) 
SORTED BY (c1) INTO 2 BUCKETS;
 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data 
source table `default`.`t` into Hive metastore in Spark SQL specific format, 
which is NOT compatible with Hive.
 spark-sql> DESC EXTENDED t;
 c1 int NULL
 c2 int NULL
 # Detailed Table Information
 Database default
 Table t
 Owner yumwang
 Created Time Mon Apr 29 17:52:05 CST 2019
 Last Access Thu Jan 01 08:00:00 CST 1970
 Created By Spark 2.4.0
 Type MANAGED
 Provider parquet
 Num Buckets 2
 Bucket Columns [`c1`]
 Sort Columns [`c1`]
 Table Properties [transient_lastDdlTime=1556531525]
 Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t]
 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
 OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 Storage Properties [serialization.format=1]
{noformat}
 We can see the incompatibility warning when creating the table:
{noformat}
WARN HiveExternalCatalog:66 - Persisting bucketed data source table 
`default`.`t` into Hive metastore in Spark SQL specific format, which is NOT 
compatible with Hive.
{noformat}
 But downstream engines don't know about the incompatibility. I'd like to write 
this table's write information to the metadata so that each engine can decide 
compatibility itself.
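
For context, Spark already persists its datasource provider in the table 
properties, so an engine that knows the key can detect the Spark-specific 
format today; what is missing is the explicit write information proposed here. 
A minimal check (the property key is the one Spark's HiveExternalCatalog 
stores; treat the printed value as illustrative):

{noformat}
spark-sql> SHOW TBLPROPERTIES t ('spark.sql.sources.provider');
parquet
{noformat}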

> Write the data of table write information to metadata
> -
>
> Key: SPARK-27592
> URL: https://issues.apache.org/jira/browse/SPARK-27592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We hint Hive to use an incorrect 
> InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) to read Spark's 
> Parquet datasource bucketed table:
> {noformat}
> spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) 
> SORTED BY (c1) INTO 2 BUCKETS;
>  2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data 
> source table `default`.`t` into Hive metastore in Spark SQL specific format, 
> which is NOT compatible with Hive.
>  spark-sql> DESC EXTENDED t;
>  c1 int NULL
>  c2 int NULL
>  # Detailed Table Information
>  Database default
>  Table t
>  Owner yumwang
>  Created Time Mon Apr 29 17:52:05 CST 2019
>  Last Access Thu Jan 01 08:00:00 CST 1970
>  Created By Spark 2.4.0
>  Type MANAGED
>  Provider parquet
>  Num Buckets 2
>  Bucket Columns [`c1`]
>  Sort Columns [`c1`]
>  Table Properties [transient_lastDdlTime=1556531525]
>  Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t]
>  Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
>  OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  Storage Properties [serialization.format=1]
> {noformat}
>  We can see the incompatibility warning when creating the table:
> {noformat}
> WARN HiveExternalCatalog:66 - Persisting bucketed data source table 
> `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT 
> compatible with Hive.
> {noformat}
>  But downstream engines don't know about the incompatibility. I'd like to 
> write this table's write information to the metadata so that each engine can 
> decide compatibility itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27592) Write the data of table write information to metadata

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27592:


Assignee: Apache Spark

> Write the data of table write information to metadata
> -
>
> Key: SPARK-27592
> URL: https://issues.apache.org/jira/browse/SPARK-27592
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21827) Task fails due to executor exception when Sasl Encryption is enabled

2019-04-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829089#comment-16829089
 ] 

Sébastien BARNOUD edited comment on SPARK-21827 at 4/29/19 10:05 AM:
-

Hi,

 

I was investigating timeouts on the HBase client (version 1.1.2) on my Hadoop 
cluster with security enabled, using HotSpot JDK 1.8.0_92-b14.

I have found the following message in the logs each time I get a timeout:

*sasl:1481  - DIGEST41:Unmatched MACs*

 

After a look at the code, I understand that the message is simply ignored if an 
invalid MAC is received. In my opinion, this is not normal behavior: at the very 
least, it allows an attacker to flood the connection.

 

But in my case there is no man in the middle, yet I still get this message. It 
looks like there is a bug (probably a non-thread-safe method somewhere) in the 
MAC validation, causing the message to be ignored and leading to my HBase 
timeout.

At the same time, we have found some TEZ jobs stuck on our cluster since we 
enabled security on shuffle (mapreduce, TEZ and Spark). In each hung job, we 
could identify that the SSL handshake never finished:

 

"fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 
nid=0x13dad runnable [0x7fd83beb6000]

   java.lang.Thread.State: RUNNABLE

   at java.net.SocketInputStream.socketRead0(Native Method)

   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

   at java.net.SocketInputStream.read(SocketInputStream.java:170)

   at java.net.SocketInputStream.read(SocketInputStream.java:141)

   at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)

   at sun.security.ssl.InputRecord.read(InputRecord.java:503)

   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)

   - locked <0x0007b997a470> (a java.lang.Object)

   at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)

   - locked <0x0007b997a430> (a java.lang.Object)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)

   at 
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80)

   at 
sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)

   - locked <0x0007b9979ea8> (a 
sun.net.www.protocol.https.HttpsURLConnectionImpl)

   at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)

 

Looking at the TEZ source shows that there is no timeout in the code, leading 
to this infinite wait.

 

After some more investigation, I found:

-) https://issues.apache.org/jira/browse/SPARK-21827

-) [https://issues.cask.co/browse/CDAP-12737]

-) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956]

 

It seems that this issue affects a lot of software, and ForgeRock seems to have 
identified the thread safety issue.

 

To summarize, there are 2 issues:
 # the message shouldn't be ignored when the MAC is invalid; an exception 
should be thrown instead.
 # The thread safety issue should be investigated and corrected in the JDK, 
because relying on a synchronized method at the application layer is not 
viable. Typically, an application like Spark uses multiple SASL implementations 
and can't synchronize all of them.

 

I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because 
IMO it's a JDK bug.

Regards,

 

Sébastien BARNOUD


was (Author: sbarnoud):
Hi,

 

I was investigating timeout on HBase 

[jira] [Comment Edited] (SPARK-21827) Task fails due to executor exception when Sasl Encryption is enabled

2019-04-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829089#comment-16829089
 ] 

Sébastien BARNOUD edited comment on SPARK-21827 at 4/29/19 10:03 AM:
-

Hi,

 

I was investigating timeouts on the HBase client (version 1.1.2) on my Hadoop 
cluster with security enabled, using HotSpot JDK 1.8.0_92-b14.

I have found the following message in the logs each time I get a timeout:

*sasl:1481  - DIGEST41:Unmatched MACs*

 

After a look at the code, I understand that the message is simply ignored if an 
invalid MAC is received. In my opinion, this is not normal behavior: at the very 
least, it allows an attacker to flood the connection.

 

But in my case there is no man in the middle, yet I still get this message. It 
looks like there is a bug (probably a non-thread-safe method somewhere) in the 
MAC validation, causing the message to be ignored and leading to my HBase 
timeout.

At the same time, we have found some TEZ jobs stuck on our cluster since we 
enabled security on shuffle (mapreduce, TEZ and Spark). In each hung job, we 
could identify that the SSL handshake never finished:

 

"fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 
nid=0x13dad runnable [0x7fd83beb6000]

   java.lang.Thread.State: RUNNABLE

   at java.net.SocketInputStream.socketRead0(Native Method)

   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

   at java.net.SocketInputStream.read(SocketInputStream.java:170)

   at java.net.SocketInputStream.read(SocketInputStream.java:141)

   at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)

   at sun.security.ssl.InputRecord.read(InputRecord.java:503)

   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)

   - locked <0x0007b997a470> (a java.lang.Object)

   at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)

   - locked <0x0007b997a430> (a java.lang.Object)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)

   at 
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80)

   at 
sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)

   - locked <0x0007b9979ea8> (a 
sun.net.www.protocol.https.HttpsURLConnectionImpl)

   at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)

 

Looking at the TEZ source shows that there is no timeout in the code, leading 
to this infinite wait.

 

After some more investigation, I found:

-) https://issues.apache.org/jira/browse/SPARK-21827

-) [https://issues.cask.co/browse/CDAP-12737]

-) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956]

 

It seems that this issue affects a lot of software, and ForgeRock seems to have 
identified the thread safety issue.

 

To summarize, there are 2 issues:
 # the message shouldn't be ignored when the MAC is invalid; an exception 
should be thrown instead.
 # The thread safety issue should be investigated and corrected in the JDK, 
because relying on a synchronized method at the application layer is not 
viable. Typically, an application like Spark uses multiple SASL implementations 
and can't synchronize all of them.

 

I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because 
IMO it's a JDK bug.

Regards,

 

Sébastien BARNOUD


was (Author: sbarnoud):
Hi,

 

I was investigating timeout on HBase 

[jira] [Commented] (SPARK-21827) Task fails due to executor exception when Sasl Encryption is enabled

2019-04-29 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829089#comment-16829089
 ] 

Sébastien BARNOUD commented on SPARK-21827:
---

Hi,

 

I was investigating timeouts on the HBase client (version 1.1.2) on my Hadoop 
cluster with security enabled, using HotSpot JDK 1.8.0_92-b14.

I have found the following message in the logs each time I get a timeout:

*sasl:1481  - DIGEST41:Unmatched MACs*

 

After a look at the code, I understand that the message is simply ignored if an 
invalid MAC is received. In my opinion, this is not normal behavior: at the very 
least, it allows an attacker to flood the connection.

 

But in my case there is no man in the middle, yet I still get this message. It 
looks like there is a bug (probably a non-thread-safe method somewhere) in the 
MAC validation, causing the message to be ignored and leading to my HBase 
timeout.

At the same time, we have found some TEZ jobs stuck on our cluster since we 
enabled security on shuffle (mapreduce, TEZ and Spark). In each hung job, we 
could identify that the SSL handshake never finished:

 

"fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 
nid=0x13dad runnable [0x7fd83beb6000]

   java.lang.Thread.State: RUNNABLE

   at java.net.SocketInputStream.socketRead0(Native Method)

   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

   at java.net.SocketInputStream.read(SocketInputStream.java:170)

   at java.net.SocketInputStream.read(SocketInputStream.java:141)

   at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)

   at sun.security.ssl.InputRecord.read(InputRecord.java:503)

   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)

   - locked <0x0007b997a470> (a java.lang.Object)

   at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)

   - locked <0x0007b997a430> (a java.lang.Object)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)

   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)

   at 
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100)

   at 
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80)

   at 
sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)

   - locked <0x0007b9979f10> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)

   at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)

   - locked <0x0007b9979ea8> (a 
sun.net.www.protocol.https.HttpsURLConnectionImpl)

   at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)

   at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)

 

Looking at the TEZ source shows that there is no timeout in the code, leading 
to this infinite wait.

 

After some more investigation, I found:

-) https://issues.apache.org/jira/browse/SPARK-21827

-) [https://issues.cask.co/browse/CDAP-12737]

-) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956]

 

It seems that this issue affects a lot of software, and ForgeRock seems to have 
identified the thread safety issue.

 

To summarize, there are 2 issues:
 # the message shouldn't be ignored when the MAC is invalid; an exception 
should be thrown instead.
 # The thread safety issue should be investigated and corrected in the JDK, 
because relying on a synchronized method at the application layer is not 
viable. Typically, an application like Spark uses multiple SASL implementations 
and can't synchronize all of them.

 

I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because 
IMO it's a JDK bug. 

Regards,

 

Sébastien BARNOUD

> Task fails due to executor exception when Sasl Encryption is enabled
> 

[jira] [Created] (SPARK-27592) Write the data of table write information to metadata

2019-04-29 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27592:
---

 Summary: Write the data of table write information to metadata
 Key: SPARK-27592
 URL: https://issues.apache.org/jira/browse/SPARK-27592
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27580) Implement `doCanonicalize` in BatchScanExec for comparing query plan results

2019-04-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27580:
---

Assignee: Gengliang Wang

> Implement `doCanonicalize` in BatchScanExec for comparing query plan results
> 
>
> Key: SPARK-27580
> URL: https://issues.apache.org/jira/browse/SPARK-27580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The method `QueryPlan.sameResult` is used for comparing logical plans in 
> order to:
> 1. cache data in CacheManager
> 2. uncache data in CacheManager
> 3. Reuse subqueries
> 4. etc...
> Currently the method `sameResult` always returns false for `BatchScanExec`. We 
> should fix it by implementing `doCanonicalize` for the node.
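
A minimal illustration of the symptom, assuming a source that is read through 
the DSv2 {{BatchScanExec}} path (the format and path are illustrative):

{code:java}
// Two logically identical scans of the same source:
val q1 = spark.read.format("parquet").load("/tmp/data")
val q2 = spark.read.format("parquet").load("/tmp/data")

q1.cache()
q1.count()  // materializes the cached data

// With sameResult always false for BatchScanExec, CacheManager cannot
// recognize q2's plan as the cached one, so this re-scans the source:
q2.count()
{code}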



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27580) Implement `doCanonicalize` in BatchScanExec for comparing query plan results

2019-04-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27580.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24475
[https://github.com/apache/spark/pull/24475]

> Implement `doCanonicalize` in BatchScanExec for comparing query plan results
> 
>
> Key: SPARK-27580
> URL: https://issues.apache.org/jira/browse/SPARK-27580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> The method `QueryPlan.sameResult` is used for comparing logical plans in 
> order to:
> 1. cache data in CacheManager
> 2. uncache data in CacheManager
> 3. Reuse subqueries
> 4. etc...
> Currently the method `sameResult` always returns false for `BatchScanExec`. We 
> should fix it by implementing `doCanonicalize` for the node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT

2019-04-29 Thread Artem Kalchenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829067#comment-16829067
 ] 

Artem Kalchenko commented on SPARK-27591:
-

I changed line UnivocityParser:184 to 

 
{code:java}
case udt: UserDefinedType[_] => 
  makeConverter(name, udt.sqlType, nullable, options)
{code}
and it works as I expected.

 

> A bug in UnivocityParser prevents using UDT
> ---
>
> Key: SPARK-27591
> URL: https://issues.apache.org/jira/browse/SPARK-27591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Artem Kalchenko
>Priority: Minor
>
> I am trying to define a UserDefinedType based on String but different from 
> StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am 
> doing something incorrectly.
> I define my type as follows:
> {code:java}
> class MyType extends UserDefinedType[MyValue] {
>   override def sqlType: DataType = StringType
>   ...
> }
> @SQLUserDefinedType(udt = classOf[MyType])
> case class MyValue
> {code}
> I expect it to be read and stored as String with just a custom SQL type. In 
> fact Spark can't read the string at all:
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11
>  cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
> {code}
> The problem is with UnivocityParser.makeConverter, which doesn't return a 
> (String => Any) function but a (String => (String => Any)) one in the case of 
> UDT; see UnivocityParser:184
> {code:java}
> case udt: UserDefinedType[_] => (datum: String) =>
>   makeConverter(name, udt.sqlType, nullable, options)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27591) A bug in UnivocityParser prevents using UDT

2019-04-29 Thread Artem Kalchenko (JIRA)
Artem Kalchenko created SPARK-27591:
---

 Summary: A bug in UnivocityParser prevents using UDT
 Key: SPARK-27591
 URL: https://issues.apache.org/jira/browse/SPARK-27591
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.2
Reporter: Artem Kalchenko


I am trying to define a UserDefinedType based on String but different from 
StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am 
doing something incorrectly.

I define my type as follows:
{code:java}
class MyType extends UserDefinedType[MyValue] {
  override def sqlType: DataType = StringType
  ...
}

@SQLUserDefinedType(udt = classOf[MyType])
case class MyValue
{code}
I expect it to be read and stored as String with just a custom SQL type. In 
fact Spark can't read the string at all:
{code:java}
java.lang.ClassCastException: 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11
 cannot be cast to org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
{code}
The problem is with UnivocityParser.makeConverter, which doesn't return a 
(String => Any) function but a (String => (String => Any)) one in the case of 
UDT; see UnivocityParser:184
{code:java}
case udt: UserDefinedType[_] => (datum: String) =>
  makeConverter(name, udt.sqlType, nullable, options)

{code}
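
For completeness, a fleshed-out version of the UDT above that compiles (a 
sketch: the field name and (de)serialization choices are illustrative, and 
since the UDT API is non-public in Spark 2.x the class may need to live under 
an org.apache.spark package):

{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

class MyType extends UserDefinedType[MyValue] {
  override def sqlType: DataType = StringType
  override def userClass: Class[MyValue] = classOf[MyValue]
  // Stored as a plain string column:
  override def serialize(v: MyValue): Any = UTF8String.fromString(v.s)
  override def deserialize(datum: Any): MyValue = MyValue(datum.toString)
}

@SQLUserDefinedType(udt = classOf[MyType])
case class MyValue(s: String)
{code}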



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23191) Worker registration fails in case of network drop

2019-04-29 Thread zuotingbing (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829051#comment-16829051
 ] 

zuotingbing edited comment on SPARK-23191 at 4/29/19 9:28 AM:
--

We faced the same issue in standalone HA mode. Could you please take a look at 
this issue?
{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at 
spark://vmax17:7077 
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect.
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax18:7077... 
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax17:7077... 
2019-03-15 20:22:14,865 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register 
with the master, since there is an attempt scheduled already. 
2019-03-15 20:22:14,868 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect. 
2019-03-15 20:22:14,868 INFO Worker: Not spawning another attempt to register 
with the master, since there is an attempt scheduled already. 
2019-03-15 20:22:14,871 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect. 
2019-03-15 20:22:14,871 INFO Worker: Not spawning another attempt to register 
with the master, since there is an attempt scheduled already. 
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate 
worker ID
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process! 
2019-03-15 20:22:14,896 INFO ShutdownHookManager: Shutdown hook called 
2019-03-15 20:22:14,898 INFO ShutdownHookManager: Deleting directory 
/data4/zdh/spark/tmp/spark-c578bf32-6a5e-44a5-843b-c796f44648ee 
2019-03-15 20:22:14,908 INFO ShutdownHookManager: Deleting directory 
/data3/zdh/spark/tmp/spark-7e57e77d-cbb7-47d3-a6dd-737b57788533 
2019-03-15 20:22:14,920 INFO ShutdownHookManager: Deleting directory 
/data2/zdh/spark/tmp/spark-0beebf20-abbd-4d99-a401-3ef0e88e0b05{code}
 

[~andrewor14]  [~cloud_fan] [~vanzin]


was (Author: zuo.tingbing9):
We faced the same issue in standalone HA mode. Could you please take a look at 
this issue?

[~andrewor14]  [~cloud_fan] [~vanzin]

> Worker registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues of multiple drivers running 
> in some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3 
> and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and slaves.
>  # On any node where the worker process is running block the connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 secs we get the error on the node that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> 

[jira] [Commented] (SPARK-23191) Worker registration fails in case of network drop

2019-04-29 Thread zuotingbing (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829051#comment-16829051
 ] 

zuotingbing commented on SPARK-23191:
-

We faced the same issue in standalone HA mode. Could you please take a look at 
this issue?

[~andrewor14]  [~cloud_fan] [~vanzin]

> Worker registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues of multiple drivers running 
> in some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3 
> and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and slaves.
>  # On any node where the worker process is running, block connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds the node reports that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception, we re-enable connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register with the master again but is unable to do 
> so. It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at 

[jira] [Created] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks

2019-04-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27590:
---

 Summary: do not consider skipped tasks when scheduling speculative 
tasks
 Key: SPARK-27590
 URL: https://issues.apache.org/jira/browse/SPARK-27590
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
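
The summary above is terse, so here is a minimal sketch of the idea, assuming 
a 75% speculation quantile; the case class, its fields, and the predicate are 
invented for illustration and are not Spark's actual TaskSetManager API. Tasks 
skipped because their partitions were already computed never run, so they 
should be excluded from both sides of the speculation threshold:

{code:scala}
// StageSnapshot is a hypothetical stand-in for the scheduler's per-stage state.
case class StageSnapshot(numTasks: Int, skipped: Set[Int], finished: Set[Int])

object SpeculationSketch {
  val SpeculationQuantile = 0.75  // assumed value of spark.speculation.quantile

  def readyToSpeculate(s: StageSnapshot): Boolean = {
    val runnable = s.numTasks - s.skipped.size      // skipped tasks never run,
    val done     = (s.finished -- s.skipped).size   // so leave them out entirely
    runnable > 0 && done >= math.floor(SpeculationQuantile * runnable).toInt
  }
}
{code}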



[jira] [Assigned] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27590:


Assignee: Apache Spark  (was: Wenchen Fan)

> do not consider skipped tasks when scheduling speculative tasks
> ---
>
> Key: SPARK-27590
> URL: https://issues.apache.org/jira/browse/SPARK-27590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks

2019-04-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27590:


Assignee: Wenchen Fan  (was: Apache Spark)

> do not consider skipped tasks when scheduling speculative tasks
> ---
>
> Key: SPARK-27590
> URL: https://issues.apache.org/jira/browse/SPARK-27590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27474) avoid retrying a task failed with CommitDeniedException many times

2019-04-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27474.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24375
[https://github.com/apache/spark/pull/24375]

> avoid retrying a task failed with CommitDeniedException many times
> --
>
> Key: SPARK-27474
> URL: https://issues.apache.org/jira/browse/SPARK-27474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
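
A hedged sketch of the improvement named in the title; the exception class and 
predicate below are simplified stand-ins, not Spark's actual classes. A task 
denied the output commit lost the commit race to another attempt and will keep 
losing it, so retrying it burns attempts with no chance of success:

{code:scala}
class CommitDeniedException(msg: String) extends RuntimeException(msg)

object FailureAccountingSketch {
  // A denied commit should not count toward the task-failure limit
  // (spark.task.maxFailures), unlike a genuine task failure.
  def countsTowardFailureLimit(e: Throwable): Boolean = e match {
    case _: CommitDeniedException => false
    case _                        => true
  }
}
{code}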