[jira] [Commented] (SPARK-27051) Bump Jackson version to 2.9.8
[ https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829960#comment-16829960 ] Apache Spark commented on SPARK-27051: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/24493 > Bump Jackson version to 2.9.8 > - > > Key: SPARK-27051 > URL: https://issues.apache.org/jira/browse/SPARK-27051 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Major > Fix For: 3.0.0, 2.4.3 > > > FasterXML Jackson versions before 2.9.8 are affected by multiple CVEs > [[https://github.com/FasterXML/jackson-databind/issues/2186]], so we need to > bump the Jackson dependency to 2.9.8. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
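For projects consuming Jackson transitively, the kind of pin this fix applies can be sketched as a Maven dependency-management entry. This is a hypothetical fragment for illustration only, not Spark's actual pom change (Spark manages the version through a property in its root pom):

```xml
<!-- Hypothetical fragment: force jackson-databind to the patched release
     so no transitive dependency pulls in a CVE-affected version. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.9.8</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```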
[jira] [Assigned] (SPARK-27601) Upgrade stream-lib to 2.9.6
[ https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27601: Assignee: Apache Spark > Upgrade stream-lib to 2.9.6 > --- > > Key: SPARK-27601 > URL: https://issues.apache.org/jira/browse/SPARK-27601 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using > native arrays instead of ArrayLists to avoid boxing and unboxing of integers. > https://github.com/addthis/stream-lib/pull/98 > 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on > appropriately transformed encoded values, in this way boxing and unboxing of > integers is avoided. https://github.com/addthis/stream-lib/pull/97 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27601) Upgrade stream-lib to 2.9.6
[ https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27601: Assignee: (was: Apache Spark) > Upgrade stream-lib to 2.9.6 > --- > > Key: SPARK-27601 > URL: https://issues.apache.org/jira/browse/SPARK-27601 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using > native arrays instead of ArrayLists to avoid boxing and unboxing of integers. > https://github.com/addthis/stream-lib/pull/98 > 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on > appropriately transformed encoded values, in this way boxing and unboxing of > integers is avoided. https://github.com/addthis/stream-lib/pull/97 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27602) SparkSQL CBO can't get true size of partition table after partition pruning
angerszhu created SPARK-27602: - Summary: SparkSQL CBO can't get true size of partition table after partition pruning Key: SPARK-27602 URL: https://issues.apache.org/jira/browse/SPARK-27602 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.0, 2.2.0 Reporter: angerszhu While extracting the cost of a SQL query for my own cost framework, I found that CBO cannot get the true size of a partitioned table once partition pruning has been applied. After pruning, only the sizes of the surviving partitions are relevant, but CBO still uses the whole table's statistics.
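The reporter's point can be sketched in a few lines. The names and structure below are invented for illustration and are not Spark's actual CBO code: the estimate after pruning should be the sum of the surviving partitions' sizes, not the statistic for the whole table.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sizing sketch (not Spark internals): contrast the
// whole-table statistic with a post-pruning estimate.
public class PrunedSizeEstimate {
    // What CBO effectively uses today: the full table statistic.
    static long tableSize(Map<String, Long> partitionSizes) {
        return partitionSizes.values().stream().mapToLong(Long::longValue).sum();
    }

    // What the reporter asks for: only the partitions that survive pruning.
    static long prunedSize(Map<String, Long> partitionSizes, List<String> surviving) {
        return surviving.stream()
                .mapToLong(p -> partitionSizes.getOrDefault(p, 0L))
                .sum();
    }
}
```

With per-partition statistics available (e.g. via `ANALYZE TABLE ... PARTITION ... COMPUTE STATISTICS`), a plan that prunes down to one partition would be costed against that partition's size rather than the table total.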
[jira] [Updated] (SPARK-27601) Upgrade stream-lib to 2.9.6
[ https://issues.apache.org/jira/browse/SPARK-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27601: Description: 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using native arrays instead of ArrayLists to avoid boxing and unboxing of integers. https://github.com/addthis/stream-lib/pull/98 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on appropriately transformed encoded values, in this way boxing and unboxing of integers is avoided. https://github.com/addthis/stream-lib/pull/97 was: 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on appropriately transformed encoded values, in this way boxing and unboxing of integers is avoided. > Upgrade stream-lib to 2.9.6 > --- > > Key: SPARK-27601 > URL: https://issues.apache.org/jira/browse/SPARK-27601 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using > native arrays instead of ArrayLists to avoid boxing and unboxing of integers. > https://github.com/addthis/stream-lib/pull/98 > 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on > appropriately transformed encoded values, in this way boxing and unboxing of > integers is avoided. https://github.com/addthis/stream-lib/pull/97 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27601) Upgrade stream-lib to 2.9.6
Yuming Wang created SPARK-27601: --- Summary: Upgrade stream-lib to 2.9.6 Key: SPARK-27601 URL: https://issues.apache.org/jira/browse/SPARK-27601 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Yuming Wang 1. Improve HyperLogLogPlus.merge and HyperLogLogPlus.mergeEstimators by using native arrays instead of ArrayLists to avoid boxing and unboxing of integers. 2. Improve HyperLogLogPlus.sortEncodedSet by using Arrays.sort on appropriately transformed encoded values; this avoids boxing and unboxing of integers.
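The optimization idea behind both stream-lib changes can be sketched as follows. This is hypothetical illustrative code, not the actual HyperLogLogPlus implementation: working on a primitive `int[]` with `Arrays.sort` avoids creating an `Integer` wrapper object per element, which the `ArrayList<Integer>` version cannot avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the boxing-avoidance idea (not stream-lib's real code):
// merge two sorted sets of encoded values, boxed vs. primitive.
public class PrimitiveMerge {
    // Boxed version: every add/get boxes or unboxes an Integer,
    // and sorting goes through Comparable<Integer>.
    static List<Integer> mergeBoxed(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a);
        out.addAll(b);
        out.sort(null); // natural ordering
        return out;
    }

    // Primitive version: two array copies plus Arrays.sort on int[],
    // which uses a primitive dual-pivot quicksort with no boxing at all.
    static int[] mergePrimitive(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        Arrays.sort(out);
        return out;
    }
}
```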
[jira] [Updated] (SPARK-27525) Exclude commons-httpclient when interacting with different versions of the HiveMetastoreClient
[ https://issues.apache.org/jira/browse/SPARK-27525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27525: Description: {noformat} Error Message java.lang.RuntimeException: [download failed: commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar] Stacktrace sbt.ForkMain$ForkError: java.lang.RuntimeException: [download failed: commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1321) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:41) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:370) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217) at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:41) at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40) at 
org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:57) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:242) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:238) at org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.testAlterTable(Hive_2_1_DDLSuite.scala:123) at org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.$anonfun$new$1(Hive_2_1_DDLSuite.scala:89) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105) at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(Hive_2_1_DDLSuite.scala:41) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) at org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.runTest(Hive_2_1_DDLSuite.scala:41) at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) at scala.collection.immutable.List.foreach(List.scala:392) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) at org.scalatest.Suite.run(Suite.scala:1147) at org.scalatest.Suite.run$(Suite.scala:1129) at
[jira] [Resolved] (SPARK-27525) Exclude commons-httpclient when interacting with different versions of the HiveMetastoreClient
[ https://issues.apache.org/jira/browse/SPARK-27525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-27525. - Resolution: Not A Problem > Exclude commons-httpclient when interacting with different versions of the > HiveMetastoreClient > --- > > Key: SPARK-27525 > URL: https://issues.apache.org/jira/browse/SPARK-27525 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > Error Message > java.lang.RuntimeException: [download failed: > commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar] > Stacktrace > sbt.ForkMain$ForkError: java.lang.RuntimeException: [download failed: > commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1321) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:41) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:370) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217) > at > scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:41) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40) > at > org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:57) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:242) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:238) > at > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.testAlterTable(Hive_2_1_DDLSuite.scala:123) > at > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.$anonfun$new$1(Hive_2_1_DDLSuite.scala:89) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > 
org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(Hive_2_1_DDLSuite.scala:41) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.runTest(Hive_2_1_DDLSuite.scala:41) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > at
[jira] [Commented] (SPARK-27525) Exclude commons-httpclient when interacting with different versions of the HiveMetastoreClient
[ https://issues.apache.org/jira/browse/SPARK-27525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829880#comment-16829880 ] Yuming Wang commented on SPARK-27525: - It's jenkins issue. > Exclude commons-httpclient when interacting with different versions of the > HiveMetastoreClient > --- > > Key: SPARK-27525 > URL: https://issues.apache.org/jira/browse/SPARK-27525 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {noformat} > Error Message > java.lang.RuntimeException: [download failed: > commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar] > Stacktrace > sbt.ForkMain$ForkError: java.lang.RuntimeException: [download failed: > commons-httpclient#commons-httpclient;3.0.1!commons-httpclient.jar] > at > org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1321) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122) > at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:41) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:65) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:64) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:370) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:305) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217) > at > scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:41) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40) > at > org.apache.spark.sql.hive.HiveExternalCatalog.requireDbExists(HiveExternalCatalog.scala:57) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:242) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:238) > at > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.testAlterTable(Hive_2_1_DDLSuite.scala:123) > at > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.$anonfun$new$1(Hive_2_1_DDLSuite.scala:89) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:105) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > 
org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(Hive_2_1_DDLSuite.scala:41) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at > org.apache.spark.sql.hive.execution.Hive_2_1_DDLSuite.runTest(Hive_2_1_DDLSuite.scala:41) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) >
[jira] [Created] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple Hive servers share the same metastore
pin_zhang created SPARK-27600: - Summary: Unable to start Spark Hive Thrift Server when multiple Hive servers share the same metastore Key: SPARK-27600 URL: https://issues.apache.org/jira/browse/SPARK-27600 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: pin_zhang When starting ten or more Spark Hive Thrift Servers at the same time, more than one version row is saved to the VERSION table when the following exception occurs: WARN [DataNucleus.Query] (main:) Query for candidates of org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in no possible candidates Exception thrown obtaining schema column information from datastore org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown obtaining schema column information from datastore Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'via_ms.deleteme1556239494724' doesn't exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at com.mysql.jdbc.Util.handleNewInstance(Util.java:425) at com.mysql.jdbc.Util.getInstance(Util.java:408) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2491) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2449) at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1381) at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2441) at 
com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2339) at com.mysql.jdbc.IterateBlock.doForAll(IterateBlock.java:50) at com.mysql.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2337) at org.apache.commons.dbcp.DelegatingDatabaseMetaData.getColumns(DelegatingDatabaseMetaData.java:218) at org.datanucleus.store.rdbms.adapter.BaseDatastoreAdapter.getColumns(BaseDatastoreAdapter.java:1532) at org.datanucleus.store.rdbms.schema.RDBMSSchemaHandler.refreshTableData(RDBMSSchemaHandler.java:921) After that, the Hive server can no longer start because of MetaException(message:Metastore contains multiple versions (2))
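The failure mode described above is a classic check-then-act race: each server checks whether VERSION contains a row and inserts one if it does not, so several servers starting together can all observe an empty table. A minimal in-memory sketch (a hypothetical stand-in, not Hive metastore code) of why the check and the insert must be mutually exclusive:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the metastore VERSION table, illustrating
// why concurrent schema initialization needs mutual exclusion.
public class VersionTable {
    private final List<String> rows = new ArrayList<>();

    // Unsynchronized check-then-insert: with N concurrent callers, two
    // can both observe an empty table and both insert, producing the
    // "Metastore contains multiple versions" state.
    public void initUnsafe(String version) {
        if (rows.isEmpty()) {
            rows.add(version);
        }
    }

    // Guarded version: the check and the insert happen atomically, so at
    // most one row is ever inserted no matter how many callers race.
    public synchronized void initSafe(String version) {
        if (rows.isEmpty()) {
            rows.add(version);
        }
    }

    public synchronized int count() {
        return rows.size();
    }
}
```

In a shared database the equivalent guard would be a transaction or lock around the version check and insert; with many JVMs racing through `initUnsafe`-style logic, duplicate VERSION rows are exactly what one would expect.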
[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27598: Description: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in its SerializedLambda form and cannot be assigned back to a Scala Function1. Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific and is not expected to work anyway, as I don't see this issue with a normal jar. Note that with Spark 2.3.3 this still does not work, but with a different error. 
was: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific and is not expected to work anyway, as I dont see this to be an issues with a normal jar. 
> DStreams checkpointing does not work with Scala 2.12 > > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Critical > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the closure is stored in the Serialized format and cannot be > assigned back to a scala function1 > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific and is not expected to work anyway, as I > dont see this to be an issues with a normal jar. > Note that with Spark 2.3.3 the error is different and this still does not > work but with a different error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
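The mechanism behind the ClassCastException can be shown in plain Java, since Scala 2.12 compiles closures to Java lambdas. This is hypothetical illustrative code, not Spark's: a lambda is serialized as a `java.lang.invoke.SerializedLambda` and can only be turned back into a function when the capturing class is available to resolve it (via its synthetic `$deserializeLambda$` method). In spark-shell the capturing class is REPL-generated, which is the suspected reason the raw SerializedLambda ends up assigned to a `scala.Function1` field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.IntUnaryOperator;

// Plain-Java sketch of lambda serialization round-tripping (not Spark code).
public class LambdaRoundTrip {
    // A lambda is serializable only if its target type is Serializable.
    interface SerIntOp extends IntUnaryOperator, Serializable {}

    static SerIntOp increment() {
        // Captured by LambdaRoundTrip, which resolves it on deserialization.
        return x -> x + 1;
    }

    static int roundTripApply(SerIntOp op, int arg) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(op); // written as a SerializedLambda
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                // Resolution back to SerIntOp succeeds here because the
                // capturing class is on the classpath of the reader.
                return ((SerIntOp) ois.readObject()).applyAsInt(arg);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

When the reading JVM cannot resolve the capturing class (as with REPL-generated classes restored from a checkpoint), deserialization cannot replace the SerializedLambda with a real function object, matching the error in the report.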
[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27598: Priority: Major (was: Critical) > DStreams checkpointing does not work with Scala 2.12 > > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the closure is stored in the Serialized format and cannot be > assigned back to a scala function1 > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific and is not expected to work anyway, as I > dont see this to be an issues with a normal jar. > Note that with Spark 2.3.3 the error is different and this still does not > work but with a different error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker
[ https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829861#comment-16829861 ] Bryan Cutler edited comment on SPARK-27548 at 4/30/19 12:46 AM: This is not that easy to fix by itself. Since there is no call from Py4J after the initial socket is setup, when a task fails the Thread serving the iterator to python just terminates and the serializer stops. This ends up giving partial results that look fine because the error is never seen by the Python driver. Because the serving thread is being run asynchronously, the exception would have to be caught and then be transferred to Python after being joined. This would require lots of changes, but is pretty easy to do with the changes from SPARK-23961, so I will add the fix there. was (Author: bryanc): This is not that easy to fix by itself. Since there is no call from Py4J after the initial socket is setup, when a task fails the Thread serving the iterator to python just terminates and the serializer stops, so the error is never seen by the Python driver. Because the serving thread is being run asynchronously, the exception would have to be caught and then be transferred to Python after being joined. This would require lots of changes, but is pretty easy to do with the changes from SPARK-23961, so I will add the fix there. > PySpark toLocalIterator does not raise errors from worker > - > > Key: SPARK-27548 > URL: https://issues.apache.org/jira/browse/SPARK-27548 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.1 >Reporter: Bryan Cutler >Priority: Major > > When using a PySpark RDD local iterator and an error occurs on the worker, it > is not picked up by Py4J and raised in the Python driver process. So unless > looking at logs, there is no way for the application to know the worker had > an error. 
This is a test that should pass if the error is raised in the > driver: > {code} > def test_to_local_iterator_error(self): > def fail(_): > raise RuntimeError("local iterator error") > rdd = self.sc.parallelize(range(10)).map(fail) > with self.assertRaisesRegexp(Exception, "local iterator error"): > for _ in rdd.toLocalIterator(): > pass{code} > but it does not raise an exception: > {noformat} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 428, in main > process() > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 505, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line > 99, in wrapper > return f(*args, **kwargs) > File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in > fail > raise RuntimeError("local iterator error") > RuntimeError: local iterator error > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453) > ... > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > FAIL > == > FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests) > -- > Traceback (most recent call last): > File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in > test_to_local_iterator_error > pass > AssertionError: Exception not raised{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker
[ https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829861#comment-16829861 ] Bryan Cutler edited comment on SPARK-27548 at 4/30/19 12:46 AM: This is not that easy to fix by itself. Since there is no call from Py4J after the initial socket is setup, when a task fails the Thread serving the iterator to python just terminates and the serializer stops. This ends up giving partial results that look fine because the error is never seen by the Python driver. Because the serving thread is being run asynchronously, the exception would have to be caught and then be transferred to Python after being joined. This would require lots of modifications, but is pretty easy to do with the changes from SPARK-23961, so I will add the fix there. was (Author: bryanc): This is not that easy to fix by itself. Since there is no call from Py4J after the initial socket is setup, when a task fails the Thread serving the iterator to python just terminates and the serializer stops. This ends up giving partial results that look fine because the error is never seen by the Python driver. Because the serving thread is being run asynchronously, the exception would have to be caught and then be transferred to Python after being joined. This would require lots of changes, but is pretty easy to do with the changes from SPARK-23961, so I will add the fix there. > PySpark toLocalIterator does not raise errors from worker > - > > Key: SPARK-27548 > URL: https://issues.apache.org/jira/browse/SPARK-27548 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.1 >Reporter: Bryan Cutler >Priority: Major > > When using a PySpark RDD local iterator and an error occurs on the worker, it > is not picked up by Py4J and raised in the Python driver process. So unless > looking at logs, there is no way for the application to know the worker had > an error. 
This is a test that should pass if the error is raised in the > driver: > {code} > def test_to_local_iterator_error(self): > def fail(_): > raise RuntimeError("local iterator error") > rdd = self.sc.parallelize(range(10)).map(fail) > with self.assertRaisesRegexp(Exception, "local iterator error"): > for _ in rdd.toLocalIterator(): > pass{code} > but it does not raise an exception: > {noformat} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 428, in main > process() > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 505, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line > 99, in wrapper > return f(*args, **kwargs) > File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in > fail > raise RuntimeError("local iterator error") > RuntimeError: local iterator error > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453) > ... > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > FAIL > == > FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests) > -- > Traceback (most recent call last): > File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in > test_to_local_iterator_error > pass > AssertionError: Exception not raised{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker
[ https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829861#comment-16829861 ] Bryan Cutler commented on SPARK-27548: -- This is not that easy to fix by itself. Since there is no call from Py4J after the initial socket is set up, when a task fails the thread serving the iterator to Python just terminates and the serializer stops, so the error is never seen by the Python driver. Because the serving thread is run asynchronously, the exception would have to be caught and then transferred to Python after the thread is joined. This would require lots of changes, but is pretty easy to do with the changes from SPARK-23961, so I will add the fix there. > PySpark toLocalIterator does not raise errors from worker > - > > Key: SPARK-27548 > URL: https://issues.apache.org/jira/browse/SPARK-27548 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.1 >Reporter: Bryan Cutler >Priority: Major > > When using a PySpark RDD local iterator and an error occurs on the worker, it > is not picked up by Py4J and raised in the Python driver process. So unless > you look at the logs, there is no way for the application to know the worker > had an error. 
This is a test that should pass if the error is raised in the > driver: > {code} > def test_to_local_iterator_error(self): > def fail(_): > raise RuntimeError("local iterator error") > rdd = self.sc.parallelize(range(10)).map(fail) > with self.assertRaisesRegexp(Exception, "local iterator error"): > for _ in rdd.toLocalIterator(): > pass{code} > but it does not raise an exception: > {noformat} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 428, in main > process() > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 505, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line > 99, in wrapper > return f(*args, **kwargs) > File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in > fail > raise RuntimeError("local iterator error") > RuntimeError: local iterator error > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453) > ... > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > FAIL > == > FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests) > -- > Traceback (most recent call last): > File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in > test_to_local_iterator_error > pass > AssertionError: Exception not raised{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
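The fix described in the comment — catch the exception inside the asynchronous serving thread and re-raise it to the consumer after the join — is a general pattern. A minimal Python sketch of the idea (this is not the actual PySpark/Py4J code; the names `ServingThread` and `local_iterator` are invented for illustration):

```python
import threading


class ServingThread(threading.Thread):
    """Serves items from a source and remembers any exception
    instead of letting it die silently with the thread."""

    def __init__(self, source):
        super().__init__()
        self.source = source
        self.results = []
        self.error = None

    def run(self):
        try:
            for item in self.source():
                self.results.append(item)
        except Exception as exc:
            # Without this, the thread just terminates and the caller
            # sees partial results that look perfectly fine.
            self.error = exc


def local_iterator(source):
    t = ServingThread(source)
    t.start()
    t.join()
    if t.error is not None:
        raise t.error  # surface the worker-side failure to the caller
    return iter(t.results)


def failing_source():
    yield 1
    yield 2
    raise RuntimeError("local iterator error")


try:
    list(local_iterator(failing_source))
    raised = False
except RuntimeError:
    raised = True
print(raised)
```

Without the re-raise after the join, the caller would silently receive the partial results `[1, 2]`, which is exactly the symptom the test in this issue demonstrates.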
[jira] [Resolved] (SPARK-27588) Fail fast if binary file data source will load a file that is bigger than 2GB
[ https://issues.apache.org/jira/browse/SPARK-27588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27588. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24483 [https://github.com/apache/spark/pull/24483] > Fail fast if binary file data source will load a file that is bigger than 2GB > - > > Key: SPARK-27588 > URL: https://issues.apache.org/jira/browse/SPARK-27588 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > We use BinaryType to store file content, if a file is bigger than 2GB, we > should fail the job without trying to load the file that increases the error > waiting time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16367) Wheelhouse Support for PySpark
[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16367. Resolution: Duplicate This is somewhat similar to SPARK-13587 so let's keep the discussion in one place. > Wheelhouse Support for PySpark > -- > > Key: SPARK-16367 > URL: https://issues.apache.org/jira/browse/SPARK-16367 > Project: Spark > Issue Type: New Feature > Components: Deploy, PySpark >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: gsemet >Priority: Major > Labels: newbie, python, python-wheel, wheelhouse > Original Estimate: 168h > Remaining Estimate: 168h > > *Rationale* > The recommended way to deploy Scala packages is to build big fat jar files. > This puts all dependencies in one package, so the only "cost" is the copy > time to deploy this file to every Spark node. > On the other hand, Python deployment is more difficult once you want to use > external packages, and you don't really want to involve IT to deploy > the packages into the virtualenv of each node. > This ticket proposes to give users the ability to deploy their job as > "wheel" packages. The Python community strongly advocates this way of > packaging and distributing Python applications as the "standard way > of deploying Python apps". In other words, this is the "Pythonic Way of > Deployment". > *Previous approaches* > I based the current proposal on the two following issues related to this > point: > - SPARK-6764 ("Wheel support for PySpark") > - SPARK-13587 ("Support virtualenv in PySpark") > The first part of my proposal is to merge the two, in order to support wheel > installs and virtualenv creation. > *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* > In Python, the packaging standard is now the "wheel" file format, which goes > further than the good old ".egg" files. With a wheel file (".whl"), the package > is already prepared for a given architecture. 
You can have several wheels for > a given package version, each specific to an architecture or environment. > For example, look at https://pypi.python.org/pypi/numpy to see all the > different wheel versions available. > The {{pip}} tool knows how to select the right wheel file for the > current system, and how to install the package very quickly (without > compilation). In other words, a package that requires compilation of a C > module, for instance "numpy", does *not* compile anything when installed > from a wheel file. > {{pypi.python.org}} already provides wheels for the major Python versions. If the > wheel is not available, pip will compile it from source anyway. Mirroring of > PyPI is possible through projects such as http://doc.devpi.net/latest/ > (untested) or the PyPI mirror support in Artifactory (tested personally). > {{pip}} can also easily generate all the wheels of all > packages used by a given project inside a "virtualenv". This is > called a "wheelhouse". You can even skip this compilation and > retrieve the wheels directly from pypi.python.org. > *Use Case 1: no internet connectivity* > Here is my first proposal for a deployment workflow, for the case where the Spark > cluster does not have any internet connectivity or access to a PyPI mirror. > In this case the simplest way to deploy a project with several dependencies > is to build and then ship the complete "wheelhouse": > - you are writing a PySpark script that grows in size and > dependencies. Deploying on Spark, for example, requires building numpy or > Theano and other dependencies > - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn the script > into a standard Python package: > -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the > requirements.txt > {code} > astroid==1.4.6 # via pylint > autopep8==1.2.4 > click==6.6 # via pip-tools > colorama==0.3.7 # via pylint > enum34==1.1.6 # via hypothesis > findspark==1.0.0 # via spark-testing-base > first==2.0.1 # via pip-tools > hypothesis==3.4.0 # via spark-testing-base > lazy-object-proxy==1.2.2 # via astroid > linecache2==1.0.0 # via traceback2 > pbr==1.10.0 > pep8==1.7.0 # via autopep8 > pip-tools==1.6.5 > py==1.4.31 # via pytest > pyflakes==1.2.3 > pylint==1.5.6 > pytest==2.9.2 # via spark-testing-base > six==1.10.0 # via astroid, pip-tools, pylint, unittest2 > spark-testing-base==0.0.7.post2 > traceback2==1.4.0 # via unittest2 > unittest2==1.1.0 # via spark-testing-base > wheel==0.29.0 > wrapt==1.10.8 # via astroid > {code} > -- write a setup.py with some entry points or package. Use > [PBR|http://docs.openstack.org/developer/pbr/] it
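The wheel format this proposal leans on is deliberately simple: a {{.whl}} file is just a zip archive containing the package files plus a {{<name>-<version>.dist-info/}} metadata directory, which is why installing one is mostly unzipping rather than running a build. A stdlib-only sketch that assembles a toy wheel by hand and lists its contents (the package name and contents are made up for illustration; a real wheel also carries a {{RECORD}} file and installers should be left to build real ones):

```python
import os
import tempfile
import zipfile

# A wheel is a zip archive: package files plus a dist-info/ metadata directory.
tmp = tempfile.mkdtemp()
wheel_path = os.path.join(tmp, "demo_pkg-0.1.0-py3-none-any.whl")

with zipfile.ZipFile(wheel_path, "w") as whl:
    whl.writestr("demo_pkg/__init__.py", "VERSION = '0.1.0'\n")
    whl.writestr("demo_pkg-0.1.0.dist-info/METADATA",
                 "Metadata-Version: 2.1\nName: demo-pkg\nVersion: 0.1.0\n")
    whl.writestr("demo_pkg-0.1.0.dist-info/WHEEL",
                 "Wheel-Version: 1.0\nRoot-Is-Purelib: true\nTag: py3-none-any\n")

# "Installing" is essentially unpacking this archive: no compilation step.
with zipfile.ZipFile(wheel_path) as whl:
    names = whl.namelist()
print(sorted(names))
```

The filename tag ({{py3-none-any}} here) is what lets pip pick the wheel matching the current interpreter and platform, which is the selection behavior described above.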
[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-13587: -- Assignee: Marcelo Vanzin > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Jeff Zhang >Assignee: Marcelo Vanzin >Priority: Major > > Currently, it's not easy for users to add third-party Python packages in > pyspark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially transitive ones) > * Another way is to install packages manually on each node (time-consuming, and > not easy when switching between environments) > Python now has 2 different virtualenv implementations: one is native > virtualenv, the other is through conda. This JIRA is about bringing these 2 > tools to the distributed environment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-13587: --- Target Version/s: (was: 3.0.0) > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Jeff Zhang >Priority: Major > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-13587: -- Assignee: (was: Marcelo Vanzin) > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Jeff Zhang >Priority: Major > > Currently, it's not easy for user to add third party python packages in > pyspark. > * One way is to using --py-files (suitable for simple dependency, but not > suitable for complicated dependency, especially with transitive dependency) > * Another way is install packages manually on each node (time wasting, and > not easy to switch to different environment) > Python has now 2 different virtualenv implementation. One is native > virtualenv another is through conda. This jira is trying to migrate these 2 > tools to distributed environment -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
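Whichever backend is chosen, the feature ultimately has to do the same thing on every node: create an isolated environment and install the job's dependencies into it before the worker starts. The environment-creation half already exists in the Python standard library; a small sketch using plain {{venv}} (this is not the proposed Spark integration, just the primitive it would automate per node; the directory name is arbitrary):

```python
import os
import tempfile
import venv

# Create an isolated environment, as a per-node setup step would.
# with_pip=False keeps this fast; a real setup would install the
# job's dependencies into the environment afterwards.
env_dir = os.path.join(tempfile.mkdtemp(), "job-env")
venv.create(env_dir, with_pip=False)

# The environment carries its own interpreter, so a worker launched
# from it sees only the packages installed into this environment.
bindir = "Scripts" if os.name == "nt" else "bin"
python_name = "python.exe" if os.name == "nt" else "python"
interpreter = os.path.join(env_dir, bindir, python_name)
print(os.path.exists(interpreter))
```

The conda path would differ only in how the environment is created and populated; the per-node lifecycle (create, install, point the worker at the environment's interpreter, tear down) is the same.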
[jira] [Updated] (SPARK-27547) fix DataFrame self-join problems
[ https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27547: -- Labels: correctness (was: ) > fix DataFrame self-join problems > > > Key: SPARK-27547 > URL: https://issues.apache.org/jira/browse/SPARK-27547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27599) DataFrameWriter$partitionBy should be optional when writing to a hive table
Nick Dimiduk created SPARK-27599: Summary: DataFrameWriter$partitionBy should be optional when writing to a hive table Key: SPARK-27599 URL: https://issues.apache.org/jira/browse/SPARK-27599 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Nick Dimiduk When writing to an existing, partitioned table stored in the Hive metastore, Spark requires the call to {{saveAsTable}} to provide a value for {{partitionBy}}, even though that information is provided by the metastore itself. Indeed, that information is available to Spark, as it will error if the specified {{partitionBy}} does not match that of the table definition in the metastore. There may be other attributes of the save call that can be retrieved from the metastore... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27598: Description: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific and is not expected to work anyway, as I dont see this to be an issues with a normal jar. 
was: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific and is not expected to work anyway. 
> DStreams checkpointing does not work with Scala 2.12 > > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Critical > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the closure is stored in the Serialized format and cannot be > assigned back to a scala function1 > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific and is not expected to work anyway, as I > dont see this to be an issues with a normal jar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27599) DataFrameWriter.partitionBy should be optional when writing to a hive table
[ https://issues.apache.org/jira/browse/SPARK-27599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Dimiduk updated SPARK-27599: - Summary: DataFrameWriter.partitionBy should be optional when writing to a hive table (was: DataFrameWriter$partitionBy should be optional when writing to a hive table) > DataFrameWriter.partitionBy should be optional when writing to a hive table > --- > > Key: SPARK-27599 > URL: https://issues.apache.org/jira/browse/SPARK-27599 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Nick Dimiduk >Priority: Minor > > When writing to an existing, partitioned table stored in the Hive metastore, > Spark requires the call to {{saveAsTable}} to provide a value for > {{partitionedBy}}, even though that information is provided by the metastore > itself. Indeed, that information is available to Spark, as it will error if > the specified {{partitionBy}} does not match that of the table definition in > metastore. > There may be other attributes of the save call that can be retrieved from the > metastore... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27598: Priority: Critical (was: Blocker) > DStreams checkpointing does not work with Scala 2.12 > > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Critical > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the closure is stored in the Serialized format and cannot be > assigned back to a scala function1 > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific and is not expected to work anyway. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27598: Description: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific and is not expected to work anyway. 
was: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific. > DStreams checkpointing does not work with Scala 2.12 > > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Blocker > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the closure is stored in the Serialized format and cannot be > assigned back to a scala function1 > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific and is not expected to work anyway. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27598: Description: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] Maybe this is spark-shell specific. 
was: When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file file:/tmp/checkpoint/checkpoint-155656695.bk java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6 > DStreams checkpointing does not work with Scala 2.12 > > > Key: SPARK-27598 > URL: https://issues.apache.org/jira/browse/SPARK-27598 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Blocker > > When I restarted a stream with checkpointing enabled I got this: > {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from > file > [file:/tmp/checkpoint/checkpoint-155656695.bk|file:///tmp/checkpoint/checkpoint-155656695.bk] > java.io.IOException: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > org.apache.spark.streaming.dstream.FileInputDStream.filter of type > scala.Function1 in instance of > org.apache.spark.streaming.dstream.FileInputDStream > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) > at > org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {quote} > It seems that the 
closure is stored in the Serialized format and cannot be > assigned back to a scala function1 > Details of how to reproduce it here: > [https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6] > Maybe this is spark-shell specific. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27598) DStreams checkpointing does not work with Scala 2.12
Stavros Kontopoulos created SPARK-27598: --- Summary: DStreams checkpointing does not work with Scala 2.12 Key: SPARK-27598 URL: https://issues.apache.org/jira/browse/SPARK-27598 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.4.2, 2.4.1, 2.4.0, 3.0.0 Reporter: Stavros Kontopoulos When I restarted a stream with checkpointing enabled I got this: {quote}19/04/29 22:45:06 WARN CheckpointReader: Error reading checkpoint from file file:/tmp/checkpoint/checkpoint-155656695.bk java.io.IOException: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.streaming.dstream.FileInputDStream.filter of type scala.Function1 in instance of org.apache.spark.streaming.dstream.FileInputDStream at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1322) at org.apache.spark.streaming.dstream.FileInputDStream.readObject(FileInputDStream.scala:314) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {quote} It seems that the closure is stored in the Serialized format and cannot be assigned back to a scala function1 Details of how to reproduce it here: https://gist.github.com/skonto/87d5b2368b0bf7786d9dd992a710e4e6 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
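The ClassCastException above arises because, under Scala 2.12, the filter closure is checkpointed as a java.lang.invoke.SerializedLambda, which cannot be assigned back into the Function1-typed field on deserialization. A commonly suggested mitigation (a sketch under assumptions, not a confirmed fix for this ticket) is to use a named, serializable Function1 class instead of an anonymous lambda, so Java serialization records a plain class; the filter below and the round-trip helper are illustrative only:

```scala
import java.io._

// Hypothetical named filter: a plain class serializes as itself,
// not as a SerializedLambda.
class TmpFileFilter extends (String => Boolean) with Serializable {
  override def apply(path: String): Boolean = !path.endsWith(".tmp")
}

object CheckpointRoundTrip {
  // Mimic what checkpoint write/read does to the closure:
  // a plain Java-serialization round trip.
  def roundTrip[A](a: A): A = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[A]
  }

  def main(args: Array[String]): Unit = {
    val restored: String => Boolean = roundTrip(new TmpFileFilter)
    assert(restored("part-00000"))       // kept
    assert(!restored("part-00000.tmp"))  // filtered out
  }
}
```

The gist linked above suggests the failure may be spark-shell specific; the named-class shape sidesteps the lambda-serialization path either way.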
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829631#comment-16829631 ] Robert Joseph Evans commented on SPARK-27396: - I have updated this SPIP to clarify some things and to reduce the scope of it. We are no longer trying to tackle anything around data movement to/from AI/ML systems. We are no longer trying to expose any Arrow APIs or formatting to end users. To avoid any stability issues with either the Arrow API or the Arrow memory layout specification we are going to split that part off as a separate SPIP that can be added in later. We will work with the Arrow community to see what if any stability guarantees we can get before putting that SPIP forward. The only APIs that we are going to expose in this first SPIP is through the spark sql extensions config/API so that groups that want to do accelerated columnar ETL can have an API to do it, even if it is an unstable API. Please take another look, and if there is not much in the way of comments we will put it up for another vote. > SPIP: Public APIs for extended Columnar Processing Support > -- > > Key: SPARK-27396 > URL: https://issues.apache.org/jira/browse/SPARK-27396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > *SPIP: Columnar Processing Without Arrow Formatting Guarantees.* > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > The Dataset/DataFrame API in Spark currently only exposes to users one row at > a time when processing data. The goals of this are to > # Add to the current sql extensions mechanism so advanced users can have > access to the physical SparkPlan and manipulate it to provide columnar > processing for existing operators, including shuffle. This will allow them > to implement their own cost based optimizers to decide when processing should > be columnar and when it should not. 
> # Make any transitions between the columnar memory layout and a row based > layout transparent to the users so operations that are not columnar see the > data as rows, and operations that are columnar see the data as columns. > > Not Requirements, but things that would be nice to have. > # Transition the existing in memory columnar layouts to be compatible with > Apache Arrow. This would make the transformations to Apache Arrow format a > no-op. The existing formats are already very close to those layouts in many > cases. This would not be using the Apache Arrow java library, but instead > being compatible with the memory > [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a > subset of that layout. > > *Q2.* What problem is this proposal NOT designed to solve? > The goal of this is not for ML/AI but to provide APIs for accelerated > computing in Spark primarily targeting SQL/ETL like workloads. ML/AI already > have several mechanisms to get data into/out of them. These can be improved > but will be covered in a separate SPIP. > This is not trying to implement any of the processing itself in a columnar > way, with the exception of examples for documentation. > This does not cover exposing the underlying format of the data. The only way > to get at the data in a ColumnVector is through the public APIs. Exposing > the underlying format to improve efficiency will be covered in a separate > SPIP. > This is not trying to implement new ways of transferring data to external > ML/AI applications. That is covered by separate SPIPs already. > This is not trying to add in generic code generation for columnar processing. > Currently code generation for columnar processing is only supported when > translating columns to rows. We will continue to support this, but will not > extend it as a general solution. That will be covered in a separate SPIP if > we find it is helpful. For now columnar processing will be interpreted. 
> This is not trying to expose a way to get columnar data into Spark through > DataSource V2 or any other similar API. That would be covered by a separate > SPIP if we find it is needed. > > *Q3.* How is it done today, and what are the limits of current practice? > The current columnar support is limited to 3 areas. > # Internal implementations of FileFormats, optionally can return a > ColumnarBatch instead of rows. The code generation phase knows how to take > that columnar data and iterate through it as rows for stages that wants rows, > which currently is almost everything. The limitations here are mostly > implementation specific. The current standard is to abuse Scala’s type > erasure to return ColumnarBatches as the elements of an
[jira] [Updated] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Joseph Evans updated SPARK-27396: Description: *SPIP: Columnar Processing Without Arrow Formatting Guarantees.* *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. The Dataset/DataFrame API in Spark currently only exposes to users one row at a time when processing data. The goals of this are to # Add to the current sql extensions mechanism so advanced users can have access to the physical SparkPlan and manipulate it to provide columnar processing for existing operators, including shuffle. This will allow them to implement their own cost based optimizers to decide when processing should be columnar and when it should not. # Make any transitions between the columnar memory layout and a row based layout transparent to the users so operations that are not columnar see the data as rows, and operations that are columnar see the data as columns. Not Requirements, but things that would be nice to have. # Transition the existing in memory columnar layouts to be compatible with Apache Arrow. This would make the transformations to Apache Arrow format a no-op. The existing formats are already very close to those layouts in many cases. This would not be using the Apache Arrow java library, but instead being compatible with the memory [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a subset of that layout. *Q2.* What problem is this proposal NOT designed to solve? The goal of this is not for ML/AI but to provide APIs for accelerated computing in Spark primarily targeting SQL/ETL like workloads. ML/AI already have several mechanisms to get data into/out of them. These can be improved but will be covered in a separate SPIP. This is not trying to implement any of the processing itself in a columnar way, with the exception of examples for documentation. 
This does not cover exposing the underlying format of the data. The only way to get at the data in a ColumnVector is through the public APIs. Exposing the underlying format to improve efficiency will be covered in a separate SPIP. This is not trying to implement new ways of transferring data to external ML/AI applications. That is covered by separate SPIPs already. This is not trying to add in generic code generation for columnar processing. Currently code generation for columnar processing is only supported when translating columns to rows. We will continue to support this, but will not extend it as a general solution. That will be covered in a separate SPIP if we find it is helpful. For now columnar processing will be interpreted. This is not trying to expose a way to get columnar data into Spark through DataSource V2 or any other similar API. That would be covered by a separate SPIP if we find it is needed. *Q3.* How is it done today, and what are the limits of current practice? The current columnar support is limited to 3 areas. # Internal implementations of FileFormats, optionally can return a ColumnarBatch instead of rows. The code generation phase knows how to take that columnar data and iterate through it as rows for stages that wants rows, which currently is almost everything. The limitations here are mostly implementation specific. The current standard is to abuse Scala’s type erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The code generation can handle this because it is generating java code, so it bypasses scala’s type checking and just casts the InternalRow to the desired ColumnarBatch. This makes it difficult for others to implement the same functionality for different processing because they can only do it through code generation. There really is no clean separate path in the code generation for columnar vs row based. 
Additionally, because it is only supported through code generation if for any reason code generation would fail there is no backup. This is typically fine for input formats but can be problematic when we get into more extensive processing. # When caching data it can optionally be cached in a columnar format if the input is also columnar. This is similar to the first area and has the same limitations because the cache acts as an input, but it is the only piece of code that also consumes columnar data as an input. # Pandas vectorized processing. To be able to support Pandas UDFs Spark will build up a batch of data and send it to python for processing, and then get a batch of data back as a result. The format of the data being sent to python can either be pickle, which is the default, or optionally Arrow. The result returned is the same format. The limitations here really are around performance. Transforming the data back and forth can be very expensive. *Q4.* What is new in
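The "abuse of Scala's type erasure" described in Q3 can be illustrated in plain Scala (a minimal, self-contained sketch; Row, IntRow, and IntBatch are stand-ins for illustration, not Spark's InternalRow/ColumnarBatch): a collection typed over rows actually carries batches, and the consumer recovers them with an unchecked cast.

```scala
// Stand-in types (assumptions for illustration only).
trait Row
final case class IntRow(value: Int) extends Row
// A columnar batch of one int column; note it is NOT itself a Row.
final case class IntBatch(column: Vector[Int]) {
  def rowIterator: Iterator[Row] = column.iterator.map(IntRow)
}

object ErasureAbuse {
  // The "RDD[InternalRow]" that really holds batches: erasure lets the
  // cast through at construction, deferring any failure to the consumer.
  val disguised: Seq[Row] = Seq(IntBatch(Vector(1, 2, 3))).asInstanceOf[Seq[Row]]

  // Generated code "knows" the elements are batches and casts them back
  // before iterating them as rows.
  def rows: Seq[Int] =
    disguised.flatMap(_.asInstanceOf[IntBatch].rowIterator).collect {
      case IntRow(v) => v
    }
}
```

Any consumer that does not know the convention and treats an element as a genuine Row fails with a ClassCastException at runtime, which is exactly why the SPIP calls this difficult for others to build on.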
[jira] [Commented] (SPARK-27597) RuntimeConfig should be serializable
[ https://issues.apache.org/jira/browse/SPARK-27597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829612#comment-16829612 ] Nick Dimiduk commented on SPARK-27597: -- It would be nice if there were an API built into the {{FunctionalInterface}} that let its implementation access the fully populated {{SparkSession}}. > RuntimeConfig should be serializable > > > Key: SPARK-27597 > URL: https://issues.apache.org/jira/browse/SPARK-27597 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Nick Dimiduk >Priority: Major > > When implementing a UDF or similar, it's quite surprising to see that the > {{SparkSession}} is {{Serializable}} but {{RuntimeConf}} is not. When > modeling UDFs in an object-oriented way, this leads to quite a surprise, an > ugly NPE from the {{call}} site. > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:143) > at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141) > at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:170) > at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:170) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27597) RuntimeConfig should be serializable
Nick Dimiduk created SPARK-27597: Summary: RuntimeConfig should be serializable Key: SPARK-27597 URL: https://issues.apache.org/jira/browse/SPARK-27597 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Nick Dimiduk When implementing a UDF or similar, it's quite surprising to see that the {{SparkSession}} is {{Serializable}} but {{RuntimeConf}} is not. When modeling UDFs in an object-oriented way, this leads to quite a surprise, an ugly NPE from the {{call}} site. {noformat} Caused by: java.lang.NullPointerException at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:143) at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141) at org.apache.spark.sql.SparkSession.conf$lzycompute(SparkSession.scala:170) at org.apache.spark.sql.SparkSession.conf(SparkSession.scala:170) ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
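Until RuntimeConfig (or a usable executor-side session handle) is serializable, the usual workaround is to read configuration on the driver and capture only plain values in the UDF object, so nothing lazily touches SparkSession.sessionState on an executor. A sketch (the config key and UDF class are hypothetical, not from this ticket):

```scala
// Object-oriented UDF that carries a plain, serializable value instead of a
// reference to a SparkSession that NPEs when used on executors.
final case class ThresholdFilter(threshold: Double)
    extends (Double => Boolean) with Serializable {
  override def apply(x: Double): Boolean = x >= threshold
}

object Driver {
  def main(args: Array[String]): Unit = {
    // Driver side. With a real session this would read the (hypothetical)
    // key first: spark.conf.get("myapp.threshold", "0.5").toDouble
    val threshold = 0.5
    val filter = ThresholdFilter(threshold)
    assert(filter(0.7))
    assert(!filter(0.3))
  }
}
```

The case class closes over a Double rather than the session, so Java serialization ships only the value and the {{call}}-site NPE never arises.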
[jira] [Commented] (SPARK-27567) Spark Streaming consumers (from Kafka) intermittently die with 'SparkException: Couldn't find leaders for Set'
[ https://issues.apache.org/jira/browse/SPARK-27567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829555#comment-16829555 ] Dmitry Goldenberg commented on SPARK-27567: --- Hi Liang-Chi, So you must be referring to this comment from that post, I gather: {noformat} This is expected behaviour. You have requested that each topic be stored on one machine by setting ReplicationFactor to one. When the one machine that happens to store the topic normalized-tenant4 is taken down, the consumer cannot find the leader of the topic. See http://kafka.apache.org/documentation.html#intro_guarantees.{noformat} I believe that indeed we have replication factor set to 1 for Kafka in this particular cluster I'm looking at. Could it possibly be that the node that happens to hold particular segments of Kafka data becomes intermittently inaccessible (e.g. due to a network hiccup) and then the consumer is not able to retrieve the data from Kafka? And then the issue is really one of Kafka configuration? It sounds like if we increase the replication factor on the Kafka side the issue on the Spark side may go away. Agree / disagree? In any case, the error "Couldn't find leaders for Set" is not very direct in terms of what the error causes may be. Perhaps this can be improved in Spark Streaming. At the very least, this Jira incident may help others. I'll look into increasing the replication factor on the Kafka side and update this ticket. > Spark Streaming consumers (from Kafka) intermittently die with > 'SparkException: Couldn't find leaders for Set' > -- > > Key: SPARK-27567 > URL: https://issues.apache.org/jira/browse/SPARK-27567 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.5.0 > Environment: GCP / 170~14.04.1-Ubuntu >Reporter: Dmitry Goldenberg >Priority: Major > > Some of our consumers intermittently die with the stack traces I'm including. > Once restarted they run for a while then die again. 
> I can't find any cohesive documentation on what this error means and how to > go about troubleshooting it. Any help would be appreciated. > *Kafka version* is 0.8.2.1 (2.10-0.8.2.1). > Some of the errors seen look like this: > {noformat} > ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Lost executor 2 on > 10.150.0.54: remote Rpc client disassociated{noformat} > Main error stack trace: > {noformat} > 2019-04-23 20:36:54,323 ERROR > org.apache.spark.streaming.scheduler.JobScheduler: Error g > enerating jobs for time 1556066214000 ms > org.apache.spark.SparkException: ArrayBuffer(org.apache.spark.SparkException: > Couldn't find leaders for Set([hdfs.hbase.acme.attachments,49], > [hdfs.hbase.acme.attachmen > ts,63], [hdfs.hbase.acme.attachments,31], [hdfs.hbase.acme.attachments,9], > [hdfs.hbase.acme.attachments,25], [hdfs.hbase.acme.attachments,55], > [hdfs.hbase.acme.attachme > nts,5], [hdfs.hbase.acme.attachments,37], [hdfs.hbase.acme.attachments,7], > [hdfs.hbase.acme.attachments,47], [hdfs.hbase.acme.attachments,13], > [hdfs.hbase.acme.attachme > nts,43], [hdfs.hbase.acme.attachments,19], [hdfs.hbase.acme.attachments,15], > [hdfs.hbase.acme.attachments,23], [hdfs.hbase.acme.attachments,53], > [hdfs.hbase.acme.attach > ments,1], [hdfs.hbase.acme.attachments,27], [hdfs.hbase.acme.attachments,57], > [hdfs.hbase.acme.attachments,39], [hdfs.hbase.acme.attachments,11], > [hdfs.hbase.acme.attac > hments,29], [hdfs.hbase.acme.attachments,33], > [hdfs.hbase.acme.attachments,35], [hdfs.hbase.acme.attachments,51], > [hdfs.hbase.acme.attachments,45], [hdfs.hbase.acme.att > achments,21], [hdfs.hbase.acme.attachments,3], > [hdfs.hbase.acme.attachments,59], [hdfs.hbase.acme.attachments,41], > [hdfs.hbase.acme.attachments,17], [hdfs.hbase.acme.at > tachments,61])) > at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:123) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.j > ar:?] 
> at > org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:145) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350) > ~[spark-assembly-1.5.0-hadoop2.4.0.ja > r:1.5.0] > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > ~[acmedsc-ingest-kafka-spark-2.0.0-SNAPSHOT.jar:?] > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) >
[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829545#comment-16829545 ] Tao Luo commented on SPARK-21492: - The problem is that the task won't complete because of memory being leaked (you can see this from the simple example above). Second, it's not just the last page; it's every page with records from unused iterators. Can we increase the priority of this bug? SMJ is a pretty integral part of Spark SQL, and it seems like no progress is being made on this bug, which is causing jobs to fail and has no workaround. I don't think that it's a hack: the argument seems to be that limit also needs to be fixed, so let's not fix this bug until that is also fixed; meanwhile this issue has been lingering since at least July 2017. This would fix a memory leak and improve performance from not spilling unnecessarily. Why don't we target this fix for SMJ first, since it's pretty isolated to UnsafeExternalRowIterator in SMJ, run it through all the test cases, and make additional changes as necessary in the future? I've been porting [this PR|https://github.com/apache/spark/pull/23762] onto my production Spark cluster for the last 3 months, but I'm hoping we can get some sort of fix into 3.0 at least. I started a discussion thread here, hopefully people can jump in: http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-leak-in-SortMergeJoin-td27152.html > Memory leak in SortMergeJoin > > > Key: SPARK-21492 > URL: https://issues.apache.org/jira/browse/SPARK-21492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0 >Reporter: Zhan Zhang >Priority: Major > > In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak > caused by the sort. The memory is not released until the task ends, and cannot > be used by other operators, causing a performance drop or OOM. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27575) Spark overwrites existing value of spark.yarn.dist.* instead of merging value
[ https://issues.apache.org/jira/browse/SPARK-27575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27575. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24465 [https://github.com/apache/spark/pull/24465] > Spark overwrites existing value of spark.yarn.dist.* instead of merging value > - > > Key: SPARK-27575 > URL: https://issues.apache.org/jira/browse/SPARK-27575 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > If we specify `--files` arg when submitting app where configuration has > "spark.yarn.dist.files", SparkSubmit overwrites the new files (files provided > in arg) instead of merging existing value in configuration. > Same issue happens also on "spark.yarn.dist.pyFiles", "spark.yarn.dist.jars", > "spark.yarn.dist.archives". > > While I encountered the issue in Spark 2.4.0, I can see the issue from > codebase in master branch and also branch-2.3 as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
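The intended behavior — combining the `--files` argument with any existing spark.yarn.dist.files value instead of replacing it — amounts to merging comma-separated lists. A sketch of those semantics (an illustration under assumptions, not the code from the resolved PR):

```scala
object DistFilesMerge {
  // Merge several optional comma-separated file lists, dropping empty
  // entries and duplicates while preserving first-seen order.
  def mergeFileLists(lists: Option[String]*): Option[String] = {
    val merged = lists.flatten
      .flatMap(_.split(",").map(_.trim))
      .filter(_.nonEmpty)
      .distinct
    if (merged.nonEmpty) Some(merged.mkString(",")) else None
  }
}
```

With this shape, mergeFileLists(Some("a.txt,b.txt"), Some("b.txt,c.txt")) keeps both sources and deduplicates the overlap, rather than letting the argument overwrite the configuration value; the same merge would apply to pyFiles, jars, and archives.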
[jira] [Assigned] (SPARK-27575) Spark overwrites existing value of spark.yarn.dist.* instead of merging value
[ https://issues.apache.org/jira/browse/SPARK-27575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27575: -- Assignee: Jungtaek Lim > Spark overwrites existing value of spark.yarn.dist.* instead of merging value > - > > Key: SPARK-27575 > URL: https://issues.apache.org/jira/browse/SPARK-27575 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > If we specify `--files` arg when submitting app where configuration has > "spark.yarn.dist.files", SparkSubmit overwrites the new files (files provided > in arg) instead of merging existing value in configuration. > Same issue happens also on "spark.yarn.dist.pyFiles", "spark.yarn.dist.jars", > "spark.yarn.dist.archives". > > While I encountered the issue in Spark 2.4.0, I can see the issue from > codebase in master branch and also branch-2.3 as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23014) Migrate MemorySink fully to v2
[ https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23014. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24403 [https://github.com/apache/spark/pull/24403] > Migrate MemorySink fully to v2 > -- > > Key: SPARK-23014 > URL: https://issues.apache.org/jira/browse/SPARK-23014 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > There's already a MemorySinkV2, but its use is controlled by a flag. We need > to remove the V1 sink and always use it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23014) Migrate MemorySink fully to v2
[ https://issues.apache.org/jira/browse/SPARK-23014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23014: -- Assignee: Gabor Somogyi > Migrate MemorySink fully to v2 > -- > > Key: SPARK-23014 > URL: https://issues.apache.org/jira/browse/SPARK-23014 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Gabor Somogyi >Priority: Major > > There's already a MemorySinkV2, but its use is controlled by a flag. We need > to remove the V1 sink and always use it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27596) The JDBC 'query' option doesn't work for Oracle database
Xiao Li created SPARK-27596: --- Summary: The JDBC 'query' option doesn't work for Oracle database Key: SPARK-27596 URL: https://issues.apache.org/jira/browse/SPARK-27596 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.2 Reporter: Xiao Li Assignee: Dilip Biswal For the JDBC option `query`, we generate an identifier name that starts with an underscore: s"(${subquery}) __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}". This is not supported by Oracle. Oracle doesn't seem to support identifier names that start with a non-alphabetic character (unless they are quoted) and has length restrictions as well. https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm {code:java} Nonquoted identifiers must begin with an alphabetic character from your database character set. Quoted identifiers can begin with any character as per below documentation - Nonquoted identifiers can contain only alphanumeric characters from your database character set and the underscore (_), dollar sign ($), and pound sign (#). Database links can also contain periods (.) and "at" signs (@). Oracle strongly discourages you from using $ and # in nonquoted identifiers. {code} The alias name '_SPARK_GEN_JDBC_SUBQUERY_NAME' should be fixed to remove the "__" prefix (or make it quoted; not sure if that may impact other sources) to make it work for Oracle. Also, the length should be limited, as removing the prefix alone hits the error below: {code:java} java.sql.SQLSyntaxErrorException: ORA-00972: identifier is too long {code} It can be verified using the sqlfiddle link below. http://www.sqlfiddle.com/#!4/9bbe9a/10050 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
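A fix along the lines described — an alias that begins with an alphabetic character and stays within Oracle's 30-character limit for nonquoted identifiers — might look like the following sketch (the alias prefix and helper are assumptions, not the merged change):

```scala
import java.util.concurrent.atomic.AtomicLong

object JdbcSubqueryAlias {
  private val curId = new AtomicLong(0)

  // Generate an alias satisfying Oracle's rules for nonquoted identifiers:
  // begins with a letter, at most 30 characters, only [A-Z0-9_$#].
  def next(): String = {
    val alias = s"SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
    require(alias.length <= 30, s"identifier too long for Oracle: $alias")
    alias
  }
}
```

Dropping the leading underscores keeps the identifier valid without quoting, and the length guard addresses the ORA-00972 error hit once the prefix is removed.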
[jira] [Resolved] (SPARK-27571) Spark 3.0 build warnings: reflectiveCalls edition
[ https://issues.apache.org/jira/browse/SPARK-27571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27571. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24463 [https://github.com/apache/spark/pull/24463] > Spark 3.0 build warnings: reflectiveCalls edition > - > > Key: SPARK-27571 > URL: https://issues.apache.org/jira/browse/SPARK-27571 > Project: Spark > Issue Type: Improvement > Components: Examples, Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > We have a few places in the code where {{scala.lang.reflectiveCalls}} is > imported to avoid warnings about, well, reflective calls in Scala. This > usually comes up in tests where an anonymous inner class has been defined > with additional methods, which actually can't be accessed directly without > reflection. This is a little undesirable and easy enough to avoid. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27595) Spark couldn't read partitioned(string type) Orc column correctly if the value contains Float/Double value
Ameer Basha Pattan created SPARK-27595: -- Summary: Spark couldn't read partitioned(string type) Orc column correctly if the value contains Float/Double value Key: SPARK-27595 URL: https://issues.apache.org/jira/browse/SPARK-27595 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Ameer Basha Pattan create external table unique_keys ( key string ,locator_id string , create_date string , sequence int ) partitioned by (source string) stored as orc location '/user/hive/warehouse/reservation.db/unique_keys'; /user/hive/warehouse/reservation.db/unique_keys contains data like below: /user/hive/warehouse/reservation.db/unique_keys/source=6S /user/hive/warehouse/reservation.db/unique_keys/source=7F /user/hive/warehouse/reservation.db/unique_keys/source=7H /user/hive/warehouse/reservation.db/unique_keys/source=8D If I try to read the orc files through Spark, val masterDF = hiveContext.read.orc("/user/hive/warehouse/reservation.db/unique_keys") the source value gets changed to *7.0 and 8.0* for 7F and 8D respectively. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
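One plausible root cause (an assumption consistent with the reported values, not confirmed in the ticket): partition-column type inference attempts numeric parses on the directory value, and Java's Double parser accepts a trailing F or D type suffix, so "7F" and "8D" parse as doubles while "6S" and "7H" do not. A minimal sketch of that inference shape:

```scala
object PartitionValueInference {
  // Return the value as a Double when Java's parser accepts it, otherwise
  // keep the string -- mirroring (loosely) partition type inference.
  // Note: Double.parseDouble("7F") succeeds because F is a valid
  // FloatTypeSuffix in Java's floating-point literal grammar.
  def infer(raw: String): Any =
    try java.lang.Double.parseDouble(raw)
    catch { case _: NumberFormatException => raw }
}
```

A practical workaround, if your Spark version supports it, is to disable inference with spark.sql.sources.partitionColumnTypeInference.enabled=false so partition values stay strings.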
[jira] [Resolved] (SPARK-27536) Code improvements for 3.0: existentials edition
[ https://issues.apache.org/jira/browse/SPARK-27536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27536. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24431 [https://github.com/apache/spark/pull/24431] > Code improvements for 3.0: existentials edition > --- > > Key: SPARK-27536 > URL: https://issues.apache.org/jira/browse/SPARK-27536 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core, SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 3.0.0 > > > The Spark code base makes use of 'existential types' in Scala, a language > feature which is quasi-deprecated -- it generates a warning unless > scala.language.existentials is imported, and there is talk of removing it > from future Scala versions: > https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785 > We can get rid of most usages of this feature with lots of minor changes to > the code. A PR is coming to demonstrate what's involved. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
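As a sketch of the kind of change involved (illustrative only, not taken from the PR): most explicit existential types can be replaced by a plain wildcard, which needs no language import.

```scala
// Before: an explicit existential, which warns without
// `import scala.language.existentials`.
// def classNames(classes: Seq[Class[T] forSome { type T }]): Seq[String] =
//   classes.map(_.getName)

// After: a wildcard expresses the same thing with no warning.
def classNames(classes: Seq[Class[_]]): Seq[String] =
  classes.map(_.getName)

assert(classNames(Seq(classOf[String])) == Seq("java.lang.String"))
```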
[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT
[ https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829379#comment-16829379 ] Artem Kalchenko commented on SPARK-27591: - yes, I will later today > A bug in UnivocityParser prevents using UDT > --- > > Key: SPARK-27591 > URL: https://issues.apache.org/jira/browse/SPARK-27591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: Artem Kalchenko >Priority: Minor > > I am trying to define a UserDefinedType based on String but different from > StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am > doing something incorrectly. > I define my type as follows: > {code:java} > class MyType extends UserDefinedType[MyValue] { > override def sqlType: DataType = StringType > ... > } > @SQLUserDefinedType(udt = classOf[MyType]) > case class MyValue > {code} > I expect it to be read and stored as a String with just a custom SQL type. In > fact Spark can't read the string at all: > {code:java} > java.lang.ClassCastException: > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11 > cannot be cast to org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) > {code} > The problem is with UnivocityParser.makeConverter, which doesn't return a (String > => Any) function but a (String => (String => Any)) one in the case of a UDT; see > UnivocityParser:184 > {code:java} > case udt: UserDefinedType[_] => (datum: String) => > makeConverter(name, udt.sqlType, nullable, options) > {code}
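The one-line fix the description points at (a sketch against the quoted UnivocityParser.makeConverter clause, not the merged patch itself) is to recurse on the UDT's underlying SQL type directly instead of wrapping the recursive call in another function:

```scala
// Buggy: the result has type String => (String => Any) -- each parsed datum
// is "converted" into yet another converter function.
// case udt: UserDefinedType[_] => (datum: String) =>
//   makeConverter(name, udt.sqlType, nullable, options)

// Sketch of the fix: return the converter for the underlying type, giving
// the expected String => Any.
case udt: UserDefinedType[_] =>
  makeConverter(name, udt.sqlType, nullable, options)
```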
[jira] [Resolved] (SPARK-27472) Document binary file data source in Spark user guide
[ https://issues.apache.org/jira/browse/SPARK-27472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-27472. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24484 [https://github.com/apache/spark/pull/24484] > Document binary file data source in Spark user guide > - > > Key: SPARK-27472 > URL: https://issues.apache.org/jira/browse/SPARK-27472 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 3.0.0 > > > We should add the binary file data source to > https://spark.apache.org/docs/latest/sql-data-sources.html after SPARK-25348.
[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT
[ https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829370#comment-16829370 ] Liang-Chi Hsieh commented on SPARK-27591: - oh, you're right. I've misread the description. Want to submit a PR with your fix? > A bug in UnivocityParser prevents using UDT > ---
[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT
[ https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829354#comment-16829354 ] Artem Kalchenko commented on SPARK-27591: - [~viirya], yes, I'm returning a string. I tried creating a UTF8String, but it didn't help. As I mentioned in the description, the method is returning String => makeConverter(...) instead of just calling makeConverter on udt.sqlType > A bug in UnivocityParser prevents using UDT > ---
[jira] [Updated] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Zhang updated SPARK-27574: --- Description: I'm using spark-on-kubernetes to submit spark app to kubernetes. most of the time, it runs smoothly. but sometimes, I see logs after submitting: the driver pod phase changed from running to pending and starts another container in the pod though the first container exited successfully. I use the standard spark-submit to kubernetes like: /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster --class xxx ... log is below: 19/04/19 09:38:40 INFO LineBufferedStream: stdout: 2019-04-19 09:38:40 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state: 19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod name: com--cloud-mf-trainer-submit-1555666719424-driver 19/04/19 09:38:40 INFO LineBufferedStream: stdout: namespace: default 19/04/19 09:38:40 INFO LineBufferedStream: stdout: labels: DagTask_ID -> 54f854e2-0bce-4bd6-50e7-57b521b216f7, spark-app-selector -> spark-4343fe80572c4240bd933246efd975da, spark-role -> driver 19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod uid: ea4410d5-6286-11e9-ae72-e8611f1fbb2a 19/04/19 09:38:40 INFO LineBufferedStream: stdout: creation time: 2019-04-19T09:38:40Z 19/04/19 09:38:40 INFO LineBufferedStream: stdout: service account name: default 19/04/19 09:38:40 INFO LineBufferedStream: stdout: volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh 19/04/19 09:38:40 INFO LineBufferedStream: stdout: node name: N/A 19/04/19 09:38:40 INFO LineBufferedStream: stdout: start time: N/A 19/04/19 09:38:40 INFO LineBufferedStream: stdout: container images: N/A 19/04/19 09:38:40 INFO LineBufferedStream: stdout: phase: Pending 19/04/19 09:38:40 INFO LineBufferedStream: stdout: status: [] 19/04/19 09:38:40 INFO LineBufferedStream: stdout: 2019-04-19 09:38:40 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state: 19/04/19 09:38:40 INFO 
LineBufferedStream: stdout: pod name: com--cloud-mf-trainer-submit-1555666719424-driver 19/04/19 09:38:40 INFO LineBufferedStream: stdout: namespace: default 19/04/19 09:38:40 INFO LineBufferedStream: stdout: labels: DagTask_ID -> 54f854e2-0bce-4bd6-50e7-57b521b216f7, spark-app-selector -> spark-4343fe80572c4240bd933246efd975da, spark-role -> driver 19/04/19 09:38:40 INFO LineBufferedStream: stdout: pod uid: ea4410d5-6286-11e9-ae72-e8611f1fbb2a 19/04/19 09:38:40 INFO LineBufferedStream: stdout: creation time: 2019-04-19T09:38:40Z 19/04/19 09:38:40 INFO LineBufferedStream: stdout: service account name: default 19/04/19 09:38:40 INFO LineBufferedStream: stdout: volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh 19/04/19 09:38:40 INFO LineBufferedStream: stdout: node name: yq01-m12-ai2b-service02.yq01..com 19/04/19 09:38:40 INFO LineBufferedStream: stdout: start time: N/A 19/04/19 09:38:40 INFO LineBufferedStream: stdout: container images: N/A 19/04/19 09:38:40 INFO LineBufferedStream: stdout: phase: Pending 19/04/19 09:38:40 INFO LineBufferedStream: stdout: status: [] 19/04/19 09:38:41 INFO LineBufferedStream: stdout: 2019-04-19 09:38:41 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state: 19/04/19 09:38:41 INFO LineBufferedStream: stdout: pod name: com--cloud-mf-trainer-submit-1555666719424-driver 19/04/19 09:38:41 INFO LineBufferedStream: stdout: namespace: default 19/04/19 09:38:41 INFO LineBufferedStream: stdout: labels: DagTask_ID -> 54f854e2-0bce-4bd6-50e7-57b521b216f7, spark-app-selector -> spark-4343fe80572c4240bd933246efd975da, spark-role -> driver 19/04/19 09:38:41 INFO LineBufferedStream: stdout: pod uid: ea4410d5-6286-11e9-ae72-e8611f1fbb2a 19/04/19 09:38:41 INFO LineBufferedStream: stdout: creation time: 2019-04-19T09:38:40Z 19/04/19 09:38:41 INFO LineBufferedStream: stdout: service account name: default 19/04/19 09:38:41 INFO LineBufferedStream: stdout: volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh 
19/04/19 09:38:41 INFO LineBufferedStream: stdout: node name: yq01-m12-ai2b-service02.yq01..com 19/04/19 09:38:41 INFO LineBufferedStream: stdout: start time: 2019-04-19T09:38:40Z 19/04/19 09:38:41 INFO LineBufferedStream: stdout: container images: 10.96.0.100:5000/spark:spark-2.4.0 19/04/19 09:38:41 INFO LineBufferedStream: stdout: phase: Pending 19/04/19 09:38:41 INFO LineBufferedStream: stdout: status: [ContainerStatus(containerID=null, image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})] 19/04/19 09:38:45 INFO LineBufferedStream: stdout:
[jira] [Comment Edited] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829337#comment-16829337 ] Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM: - Hi [~Udbhav Agrawal], the driver log is nothing special, the first container ran successfully and exited. The second failed cause it checks the filepath of the output and returns error if already existed. What I can see from the log is that the second container starts shortly after the first one exited. I attached the driver log files. Thank you. below is the output of kubectl describe pod, it only contains the second container id: Name: com--cloud-mf-trainer-submit-1555666719424-driver Namespace: default Node: yq01-m12-ai2b-service02.yq01..com/10.155.197.12 Start Time: Fri, 19 Apr 2019 17:38:40 +0800 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7 spark-app-selector=spark-4343fe80572c4240bd933246efd975da spark-role=driver Annotations: Status: Failed IP: 10.244.12.106 Containers: spark-kubernetes-driver: Container ID: docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3 Image: 10.96.0.100:5000/spark:spark-2.4.0 Image ID: docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class com..cloud.mf.trainer.Submit spark-internal --ak 970f5e4c-7171-4c61-603e-f101b65a573b --tracking_server_url [http://10.155.197.12:8080|http://10.155.197.12:8080/] --graph hdfs://yq01-m12-ai2b-service02.yq01..com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json --sk 56305f9f-b755-4b42-4218-592555f5c4a8 --mode train State: Terminated Reason: Error Exit Code: 1 Started: Fri, 19 Apr 2019 17:39:57 +0800 Finished: Fri, 19 Apr 2019 17:40:48 +0800 
Ready: False Restart Count: 0 Limits: memory: 2432Mi Requests: cpu: 1 memory: 2432Mi Environment: _KUBERNETES_LOG_ENDPOINT: yq01-m12-ai2b-service02.yq01..com:8070 _KUBERNETES_LOG_FLUSH_FREQUENCY: 10s _KUBERNETES_LOG_PATH: /project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f SPARK_CONF_DIR: /opt/spark/conf Mounts: /opt/spark/conf from spark-conf-volume (rw) /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 (rw) /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro) Conditions: Type Status Initialized True Ready False PodScheduled True Volumes: spark-local-dir-1: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: spark-conf-volume: Type: ConfigMap (a volume populated by a ConfigMap) Name: com--cloud-mf-trainer-submit-1555666719424-driver-conf-map Optional: false default-token-q7drh: Type: Secret (a volume populated by a Secret) SecretName: default-token-q7drh Optional: false QoS Class: Burstable Node-Selectors: Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: was (Author: zyfo2): Hi [~Udbhav Agrawal], the driver log is nothing special, the first container ran successfully and exited. The second failed cause it checks the filepath of the output and returns error if already existed. What I can see from the log is that the second container starts shortly after the first one exited. I attached the driver log files. Thank you. 
below is the output of kubectl describe pod, it only contains the second container id: Name: com--cloud-mf-trainer-submit-1555666719424-driver Namespace: default Node: yq01-m12-ai2b-service02.yq01.[^driver-pod-logs.zip].com/10.155.197.12 Start Time: Fri, 19 Apr 2019 17:38:40 +0800 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7 spark-app-selector=spark-4343fe80572c4240bd933246efd975da spark-role=driver Annotations: Status: Failed IP: 10.244.12.106 Containers: spark-kubernetes-driver: Container ID: docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3 Image: 10.96.0.100:5000/spark:spark-2.4.0 Image ID: docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class com..cloud.mf.trainer.Submit spark-internal --ak 970f5e4c-7171-4c61-603e-f101b65a573b --tracking_server_url
[jira] [Comment Edited] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829337#comment-16829337 ] Will Zhang edited comment on SPARK-27574 at 4/29/19 3:21 PM: - Hi [~Udbhav Agrawal], the driver log is nothing special, the first container ran successfully and exited. The second failed cause it checks the filepath of the output and returns error if already existed. What I can see from the log is that the second container starts shortly after the first one exited. I attached the driver log files. Thank you. below is the output of kubectl describe pod, it only contains the second container id: Name: com--cloud-mf-trainer-submit-1555666719424-driver Namespace: default Node: yq01-m12-ai2b-service02.yq01.[^driver-pod-logs.zip].com/10.155.197.12 Start Time: Fri, 19 Apr 2019 17:38:40 +0800 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7 spark-app-selector=spark-4343fe80572c4240bd933246efd975da spark-role=driver Annotations: Status: Failed IP: 10.244.12.106 Containers: spark-kubernetes-driver: Container ID: docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3 Image: 10.96.0.100:5000/spark:spark-2.4.0 Image ID: docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class com..cloud.mf.trainer.Submit spark-internal --ak 970f5e4c-7171-4c61-603e-f101b65a573b --tracking_server_url [http://10.155.197.12:8080|http://10.155.197.12:8080/] --graph hdfs://yq01-m12-ai2b-service02.yq01..com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json --sk 56305f9f-b755-4b42-4218-592555f5c4a8 --mode train State: Terminated Reason: Error Exit Code: 1 Started: Fri, 19 Apr 2019 17:39:57 +0800 Finished: Fri, 19 
Apr 2019 17:40:48 +0800 Ready: False Restart Count: 0 Limits: memory: 2432Mi Requests: cpu: 1 memory: 2432Mi Environment: _KUBERNETES_LOG_ENDPOINT: yq01-m12-ai2b-service02.yq01..com:8070 _KUBERNETES_LOG_FLUSH_FREQUENCY: 10s _KUBERNETES_LOG_PATH: /project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f SPARK_CONF_DIR: /opt/spark/conf Mounts: /opt/spark/conf from spark-conf-volume (rw) /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 (rw) /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro) Conditions: Type Status Initialized True Ready False PodScheduled True Volumes: spark-local-dir-1: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: spark-conf-volume: Type: ConfigMap (a volume populated by a ConfigMap) Name: com--cloud-mf-trainer-submit-1555666719424-driver-conf-map Optional: false default-token-q7drh: Type: Secret (a volume populated by a Secret) SecretName: default-token-q7drh Optional: false QoS Class: Burstable Node-Selectors: Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: was (Author: zyfo2): Hi [~Udbhav Agrawal], the driver log is nothing special, the first container ran successfully and exited. The second failed cause it checks the filepath of the output and returns error if already existed. What I can see from the log is that the second container starts shortly after the first one exited. I attached the driver log files. 
kubectl describe pod, it only contains the second container, the describe says: Name: com--cloud-mf-trainer-submit-1555666719424-driver Namespace: default Node: yq01-m12-ai2b-service02.yq01.[^driver-pod-logs.zip].com/10.155.197.12 Start Time: Fri, 19 Apr 2019 17:38:40 +0800 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7 spark-app-selector=spark-4343fe80572c4240bd933246efd975da spark-role=driver Annotations: Status: Failed IP: 10.244.12.106 Containers: spark-kubernetes-driver: Container ID: docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3 Image: 10.96.0.100:5000/spark:spark-2.4.0 Image ID: docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class com..cloud.mf.trainer.Submit spark-internal --ak 970f5e4c-7171-4c61-603e-f101b65a573b --tracking_server_url http://10.155.197.12:8080 --graph
[jira] [Commented] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829337#comment-16829337 ] Will Zhang commented on SPARK-27574: Hi [~Udbhav Agrawal], the driver log is nothing special, the first container ran successfully and exited. The second failed cause it checks the filepath of the output and returns error if already existed. What I can see from the log is that the second container starts shortly after the first one exited. I attached the driver log files. kubectl describe pod, it only contains the second container, the describe says: Name: com--cloud-mf-trainer-submit-1555666719424-driver Namespace: default Node: yq01-m12-ai2b-service02.yq01.[^driver-pod-logs.zip].com/10.155.197.12 Start Time: Fri, 19 Apr 2019 17:38:40 +0800 Labels: DagTask_ID=54f854e2-0bce-4bd6-50e7-57b521b216f7 spark-app-selector=spark-4343fe80572c4240bd933246efd975da spark-role=driver Annotations: Status: Failed IP: 10.244.12.106 Containers: spark-kubernetes-driver: Container ID: docker://23c9ea6767a274f8e8759da39dee90f403d9d28b1fec97c1fa4cd8746b41c8c3 Image: 10.96.0.100:5000/spark:spark-2.4.0 Image ID: docker-pullable://10.96.0.100:5000/spark-2.4.0@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class com..cloud.mf.trainer.Submit spark-internal --ak 970f5e4c-7171-4c61-603e-f101b65a573b --tracking_server_url http://10.155.197.12:8080 --graph hdfs://yq01-m12-ai2b-service02.yq01..com:9000/project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/meta/node1555661669082/graph.json --sk 56305f9f-b755-4b42-4218-592555f5c4a8 --mode train State: Terminated Reason: Error Exit Code: 1 Started: Fri, 19 Apr 2019 17:39:57 +0800 Finished: Fri, 19 Apr 2019 17:40:48 +0800 Ready: False Restart Count: 0 Limits: memory: 
2432Mi Requests: cpu: 1 memory: 2432Mi Environment: _KUBERNETES_LOG_ENDPOINT: yq01-m12-ai2b-service02.yq01..com:8070 _KUBERNETES_LOG_FLUSH_FREQUENCY: 10s _KUBERNETES_LOG_PATH: /project/62247e3a-e322-4456-6387-a66e9490652e/exp/62c37ae9-12aa-43f7-671f-d187e1bf1f84/graph/08e1dfad-c272-45ca-4201-1a8bc691a56e/log/driver SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f SPARK_CONF_DIR: /opt/spark/conf Mounts: /opt/spark/conf from spark-conf-volume (rw) /var/data/spark-b7e8109a-57c8-439d-b5a8-c0135a7a6e7f from spark-local-dir-1 (rw) /var/run/secrets/kubernetes.io/serviceaccount from default-token-q7drh (ro) Conditions: Type Status Initialized True Ready False PodScheduled True Volumes: spark-local-dir-1: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: spark-conf-volume: Type: ConfigMap (a volume populated by a ConfigMap) Name: com--cloud-mf-trainer-submit-1555666719424-driver-conf-map Optional: false default-token-q7drh: Type: Secret (a volume populated by a Secret) SecretName: default-token-q7drh Optional: false QoS Class: Burstable Node-Selectors: Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: > spark on kubernetes driver pod phase changed from running to pending and > starts another container in pod > > > Key: SPARK-27574 > URL: https://issues.apache.org/jira/browse/SPARK-27574 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: Kubernetes version (use kubectl version): > v1.10.0 > OS (e.g: cat /etc/os-release): > CentOS-7 > Kernel (e.g. uname -a): > 4.17.11-1.el7.elrepo.x86_64 > Spark-2.4.0 >Reporter: Will Zhang >Priority: Major > Attachments: driver-pod-logs.zip > > > I'm using spark-on-kubernetes to submit spark app to kubernetes. > most of the time, it runs smoothly. 
> but sometimes, I see logs after submitting: the driver pod phase changed from > running to pending and starts another container in the pod though the first > container exited successfully. > I use the standard spark-submit to kubernetes like: > /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster > --class xxx ... > > log is below: > > > 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new > state: > pod name: com--cloud-mf-trainer-submit-1556199419847-driver > namespace: default > labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, > spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> > driver > pod uid:
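One way to tell a kubelet container restart apart from a genuinely new pod is to compare the pod UID and the container restart count across the two runs; a hedged debugging sketch (the pod name is taken from the log above, the flags are standard kubectl):

```shell
# A restartCount > 0 with the same pod UID means the kubelet restarted the
# container inside the same pod (e.g. restartPolicy is not Never); a new UID
# means the pod itself was recreated.
kubectl get pod com--cloud-mf-trainer-submit-1555666719424-driver \
  -o jsonpath='{.metadata.uid} {.spec.restartPolicy} {.status.containerStatuses[0].restartCount}{"\n"}'
```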
[jira] [Updated] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Zhang updated SPARK-27574: --- Attachment: driver-pod-logs.zip > spark on kubernetes driver pod phase changed from running to pending and > starts another container in pod > > > Key: SPARK-27574 > URL: https://issues.apache.org/jira/browse/SPARK-27574 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: Kubernetes version (use kubectl version): > v1.10.0 > OS (e.g: cat /etc/os-release): > CentOS-7 > Kernel (e.g. uname -a): > 4.17.11-1.el7.elrepo.x86_64 > Spark-2.4.0 >Reporter: Will Zhang >Priority: Major > Attachments: driver-pod-logs.zip > > > I'm using spark-on-kubernetes to submit spark app to kubernetes. > most of the time, it runs smoothly. > but sometimes, I see logs after submitting: the driver pod phase changed from > running to pending and starts another container in the pod though the first > container exited successfully. > I use the standard spark-submit to kubernetes like: > /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster > --class xxx ... 
[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT
[ https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829336#comment-16829336 ] Liang-Chi Hsieh commented on SPARK-27591: - Are you returning a string in {{serialize}} in your {{UserDefinedType}}? If so, please create a {{UTF8String}} based on the string. Also, in {{deserialize}}, the passed-in data is a {{UTF8String}}. > A bug in UnivocityParser prevents using UDT > --- > > Key: SPARK-27591 > URL: https://issues.apache.org/jira/browse/SPARK-27591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: Artem Kalchenko >Priority: Minor > > I am trying to define a UserDefinedType based on String but different from > StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am > doing something incorrectly. > I define my type as follows: > {code:java} > class MyType extends UserDefinedType[MyValue] { > override def sqlType: DataType = StringType > ... > } > @SQLUserDefinedType(udt = classOf[MyType]) > case class MyValue > {code} > I expect it to be read and stored as String with just a custom SQL type. 
In > fact Spark can't read the string at all: > {code:java} > java.lang.ClassCastException: > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11 > cannot be cast to org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) > {code} > The problem is that UnivocityParser.makeConverter doesn't return a (String > => Any) function but a (String => (String => Any)) one in the case of UDT; see > UnivocityParser:184 > {code:java} > case udt: UserDefinedType[_] => (datum: String) => > makeConverter(name, udt.sqlType, nullable, options) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
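The converter-factory bug described above is language-neutral. A minimal Python sketch (an illustration of the pattern only, not Spark's actual code — `make_converter` and the `"udt:"` type tags here are hypothetical) shows how wrapping the recursive call in an extra lambda hands callers a converter-returning function instead of a converted value:

```python
# Buggy version: for the "UDT" case the recursive result (already a
# converter) is wrapped in one more lambda, mirroring the reported
# `(datum: String) => makeConverter(...)` clause.
def make_converter_buggy(data_type):
    if data_type == "int":
        return lambda datum: int(datum)
    if data_type.startswith("udt:"):
        # Bug: returns a function that yields a converter, not a value.
        return lambda datum: make_converter_buggy(data_type[len("udt:"):])
    return lambda datum: datum

# Fixed version: delegate directly to the converter for the UDT's
# underlying sql type, so callers get str -> value as intended.
def make_converter_fixed(data_type):
    if data_type == "int":
        return lambda datum: int(datum)
    if data_type.startswith("udt:"):
        return make_converter_fixed(data_type[len("udt:"):])
    return lambda datum: datum

buggy = make_converter_buggy("udt:int")("42")   # a function, not 42
fixed = make_converter_fixed("udt:int")("42")   # 42
print(callable(buggy), fixed)
```

The ClassCastException follows: the row ends up holding the inner function object where a UTF8String value was expected.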
[jira] [Commented] (SPARK-27213) Unexpected results when filter is used after distinct
[ https://issues.apache.org/jira/browse/SPARK-27213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829309#comment-16829309 ] Josh Rosen commented on SPARK-27213: Hmm, this must have been fixed relatively recently. For now, I can confirm that the problem still existed as of 5668c42edf20bc577305437622272bf803b6019e (which is what I happen to have checked out locally; that commit landed in master on March 5, 2019). It'd be good to try the reproduction against the latest 2.4.x and 2.3.x maintenance releases to see if those still have the bug. > Unexpected results when filter is used after distinct > - > > Key: SPARK-27213 > URL: https://issues.apache.org/jira/browse/SPARK-27213 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Rinaz Belhaj >Priority: Major > Labels: correctness, distinct, filter > > The following code gives unexpected output due to the filter not getting > pushed down in the catalyst optimizer. > {code:java} > df = > spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n']) > df.show(5) > df.filter("y_n='y'").select('x','y','z').distinct().show() > df.select('x','y','z').distinct().filter("y_n='y'").show() > {code} > {panel:title=Output} > |x|y|z|y_n| > |a|123|12.3|n| > |a|123|12.3|y| > |a|123|12.4|y| > > |x|y|z| > |a|123|12.3| > |a|123|12.4| > > |x|y|z| > |a|123|12.4| > {panel} > Ideally, the second statement should result in an error since the column used > in the filter is not present in the preceding select statement. But the > catalyst optimizer is using first() on column y_n and then applying the > filter. > Even if the filter had been pushed down, the result would have been accurate. 
> {code:java} > df = > spark.createDataFrame([['a','123','12.3','n'],['a','123','12.3','y'],['a','123','12.4','y']],['x','y','z','y_n']) > df.filter("y_n='y'").select('x','y','z').distinct().explain(True) > df.select('x','y','z').distinct().filter("y_n='y'").explain(True) > {code} > {panel:title=Output} > > == Parsed Logical Plan == > Deduplicate [x#74, y#75, z#76] > +- AnalysisBarrier > +- Project [x#74, y#75, z#76] > +- Filter (y_n#77 = y) > +- LogicalRDD [x#74, y#75, z#76, y_n#77], false > > == Analyzed Logical Plan == > x: string, y: string, z: string > Deduplicate [x#74, y#75, z#76] > +- Project [x#74, y#75, z#76] > +- Filter (y_n#77 = y) > +- LogicalRDD [x#74, y#75, z#76, y_n#77], false > > == Optimized Logical Plan == > Aggregate [x#74, y#75, z#76], [x#74, y#75, z#76] > +- Project [x#74, y#75, z#76] > +- Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- LogicalRDD [x#74, y#75, z#76, y_n#77], false > > == Physical Plan == > *(2) HashAggregate(keys=[x#74, y#75, z#76], functions=[], > output=[x#74, y#75, z#76]) > +- Exchange hashpartitioning(x#74, y#75, z#76, 10) > +- *(1) HashAggregate(keys=[x#74, y#75, z#76], functions=[], > output=[x#74, y#75, z#76]) > +- *(1) Project [x#74, y#75, z#76] > +- *(1) Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- Scan ExistingRDD[x#74,y#75,z#76,y_n#77] > > > --- > > > == Parsed Logical Plan == > 'Filter ('y_n = y) > +- AnalysisBarrier > +- Deduplicate [x#74, y#75, z#76] > +- Project [x#74, y#75, z#76] > +- LogicalRDD [x#74, y#75, z#76, y_n#77], false > > == Analyzed Logical Plan == > x: string, y: string, z: string > Project [x#74, y#75, z#76] > +- Filter (y_n#77 = y) > +- Deduplicate [x#74, y#75, z#76] > +- Project [x#74, y#75, z#76, y_n#77] > +- LogicalRDD [x#74, y#75, z#76, y_n#77], false > > == Optimized Logical Plan == > Project [x#74, y#75, z#76] > +- Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- Aggregate [x#74, y#75, z#76], [x#74, y#75, z#76, > first(y_n#77, false) AS y_n#77] > +- LogicalRDD [x#74, y#75, z#76, y_n#77], false > > == Physical Plan == > *(3) Project [x#74, y#75, z#76] > +- *(3) Filter (isnotnull(y_n#77) && (y_n#77 = y)) > +- SortAggregate(key=[x#74, y#75, z#76], > functions=[first(y_n#77, false)], output=[x#74, y#75, z#76, > y_n#77]) > +-
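The optimized plan for the second statement keeps first(y_n) per distinct (x, y, z) key and then filters on that first value. A plain-Python simulation (assumed semantics for illustration, not Spark code) reproduces the surprising one-row result:

```python
# Rows from the report: (x, y, z, y_n)
rows = [("a", "123", "12.3", "n"),
        ("a", "123", "12.3", "y"),
        ("a", "123", "12.4", "y")]

def filter_then_distinct(rows):
    # filter("y_n='y'").select(x, y, z).distinct()
    kept = [r[:3] for r in rows if r[3] == "y"]
    return sorted(set(kept))

def distinct_then_filter(rows):
    # select(x, y, z).distinct().filter("y_n='y'") as optimized:
    # keep first(y_n) per distinct key, then filter on it.
    first_y_n = {}
    for x, y, z, y_n in rows:
        first_y_n.setdefault((x, y, z), y_n)
    return sorted(k for k, v in first_y_n.items() if v == "y")

print(filter_then_distinct(rows))  # two rows: 12.3 and 12.4
print(distinct_then_filter(rows))  # one row: 12.4 (first y_n for 12.3 is 'n')
```

Because first() is order-dependent, the second form silently drops the (a, 123, 12.3) key, matching the output panel above.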
[jira] [Commented] (SPARK-27586) Improve binary comparison: replace Scala's for-comprehension if statements with while loop
[ https://issues.apache.org/jira/browse/SPARK-27586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829303#comment-16829303 ] Josh Rosen commented on SPARK-27586: Good find! This sounds pretty straightforward to fix; want to submit a pull request with the {{while}} loop version? > Improve binary comparison: replace Scala's for-comprehension if statements > with while loop > -- > > Key: SPARK-27586 > URL: https://issues.apache.org/jira/browse/SPARK-27586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.2 > Environment: benchmark env: > * Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz > * Linux 4.4.0-33.bm.1-amd64 > * java version "1.8.0_131" > * Scala 2.11.8 > * perf version 4.4.0 > Run: > 40,000,000 times comparison on 32 bytes-length binary > >Reporter: WoudyGao >Priority: Minor > > I found the cpu cost of TypeUtils.compareBinary is noticeable when handling > some big parquet files; > After some perf work, I found: > the "for-comprehension if statements" will execute ≈15X as many instructions as the > while loop > > *'while-loop' version perf:* > > {{886.687949 task-clock (msec) # 1.257 CPUs > utilized}} > {{ 3,089 context-switches # 0.003 M/sec}} > {{ 265 cpu-migrations# 0.299 K/sec}} > {{12,227 page-faults # 0.014 M/sec}} > {{ 2,209,183,920 cycles# 2.492 GHz}} > {{ stalled-cycles-frontend}} > {{ stalled-cycles-backend}} > {{ 6,865,836,114 instructions # 3.11 insns per > cycle}} > {{ 1,568,910,228 branches # 1769.405 M/sec}} > {{ 9,172,613 branch-misses # 0.58% of all > branches}} > > {{ 0.705671157 seconds time elapsed}} > > *TypeUtils.compareBinary perf:* > {{ 16347.242313 task-clock (msec) # 1.233 CPUs > utilized}} > {{ 8,370 context-switches # 0.512 K/sec}} > {{ 481 cpu-migrations# 0.029 K/sec}} > {{ 536,671 page-faults # 0.033 M/sec}} > {{40,857,347,119 cycles# 2.499 GHz}} > {{ stalled-cycles-frontend}} > {{ stalled-cycles-backend}} > {{90,606,381,612 instructions # 2.22 insns per > cycle}} > {{18,107,867,151 
branches # 1107.702 M/sec}} > {{12,880,296 branch-misses # 0.07% of all > branches}} > > {{ 13.257617118 seconds time elapsed}}
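For reference, the comparison being benchmarked is plain element-wise lexicographic byte comparison; a short while-loop sketch (in Python for illustration — Spark's TypeUtils.compareBinary is Scala, and the actual fix would be the Scala while-loop equivalent) looks like:

```python
def compare_binary(a: bytes, b: bytes) -> int:
    """Lexicographic byte comparison: negative, zero, or positive,
    like the comparator the while-loop rewrite would implement."""
    i, n = 0, min(len(a), len(b))
    while i < n:
        if a[i] != b[i]:
            return a[i] - b[i]   # first differing byte decides
        i += 1
    return len(a) - len(b)       # shared prefix: shorter input sorts first

print(compare_binary(b"abc", b"abd"))   # negative
print(compare_binary(b"abc", b"abc"))   # 0
print(compare_binary(b"abcd", b"abc"))  # positive
```

The point of the ticket is that a tight loop like this avoids the iterator and closure allocations a Scala for-comprehension with guards generates per comparison.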
[jira] [Commented] (SPARK-27574) spark on kubernetes driver pod phase changed from running to pending and starts another container in pod
[ https://issues.apache.org/jira/browse/SPARK-27574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829292#comment-16829292 ] Udbhav Agrawal commented on SPARK-27574: Hey [~zyfo2] can you share driver pod logs obtained by : kubectl logs driverpod-name -n namespace-name > spark on kubernetes driver pod phase changed from running to pending and > starts another container in pod > > > Key: SPARK-27574 > URL: https://issues.apache.org/jira/browse/SPARK-27574 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: Kubernetes version (use kubectl version): > v1.10.0 > OS (e.g: cat /etc/os-release): > CentOS-7 > Kernel (e.g. uname -a): > 4.17.11-1.el7.elrepo.x86_64 > Spark-2.4.0 >Reporter: Will Zhang >Priority: Major > > I'm using spark-on-kubernetes to submit spark app to kubernetes. > most of the time, it runs smoothly. > but sometimes, I see logs after submitting: the driver pod phase changed from > running to pending and starts another container in the pod though the first > container exited successfully. > I use the standard spark-submit to kubernetes like: > /opt/spark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster > --class xxx ... 
> > log is below: > > > 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new > state: > pod name: com--cloud-mf-trainer-submit-1556199419847-driver > namespace: default > labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, > spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> > driver > pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a > creation time: 2019-04-25T13:37:01Z > service account name: default > volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh > node name: N/A > start time: N/A > container images: N/A > phase: Pending > status: [] > 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new > state: > pod name: com--cloud-mf-trainer-submit-1556199419847-driver > namespace: default > labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, > spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> > driver > pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a > creation time: 2019-04-25T13:37:01Z > service account name: default > volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh > node name: yq01-m12-ai2b-service02.yq01..com > start time: N/A > container images: N/A > phase: Pending > status: [] > 2019-04-25 13:37:01 INFO Client:54 - Waiting for application > com..cloud.mf.trainer.Submit to finish... 
> 2019-04-25 13:37:01 INFO LoggingPodStatusWatcherImpl:54 - State changed, new > state: > pod name: com--cloud-mf-trainer-submit-1556199419847-driver > namespace: default > labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, > spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> > driver > pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a > creation time: 2019-04-25T13:37:01Z > service account name: default > volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh > node name: yq01-m12-ai2b-service02.yq01..com > start time: 2019-04-25T13:37:01Z > container images: 10.96.0.100:5000/spark:spark-2.4.0 > phase: Pending > status: [ContainerStatus(containerID=null, > image=10.96.0.100:5000/spark:spark-2.4.0, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, terminated=null, > waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 2019-04-25 13:37:04 INFO LoggingPodStatusWatcherImpl:54 - State changed, new > state: > pod name: com--cloud-mf-trainer-submit-1556199419847-driver > namespace: default > labels: DagTask_ID -> 5fd12b90-fbbb-41f0-41ad-7bc5bd0abfe0, > spark-app-selector -> spark-3c8350a62ab44c139ce073d654fddebb, spark-role -> > driver > pod uid: 348cdcf5-675f-11e9-ae72-e8611f1fbb2a > creation time: 2019-04-25T13:37:01Z > service account name: default > volumes: spark-local-dir-1, spark-conf-volume, default-token-q7drh > node name: yq01-m12-ai2b-service02.yq01..com > start time: 2019-04-25T13:37:01Z > container images: 10.96.0.100:5000/spark:spark-2.4.0 > phase: Running > status: > [ContainerStatus(containerID=docker://120dbf8cb11cf8ef9b26cff3354e096a979beb35279de34be64b3c06e896b991, > image=10.96.0.100:5000/spark:spark-2.4.0, > 
imageID=docker-pullable://10.96.0.100:5000/spark@sha256:5b47e2a29aeb1c644fc3853933be2ad08f9cd233dec0977908803e9a1f870b0f, >
[jira] [Created] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
Jan-Willem van der Sijp created SPARK-27594: --- Summary: spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly Key: SPARK-27594 URL: https://issues.apache.org/jira/browse/SPARK-27594 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Jan-Willem van der Sijp Using {{spark.sql.orc.impl=native}} and {{spark.sql.orc.enableVectorizedReader=true}} causes reading of TIMESTAMP columns in HIVE stored as ORC to be interpreted incorrectly. Specifically, the milliseconds of time timestamp will be doubled. Input/output of a Zeppelin session to demonstrate: {code:python} %pyspark from pprint import pprint spark.conf.set("spark.sql.orc.impl", "native") spark.conf.set("spark.sql.orc.enableVectorizedReader", "true") pprint(spark.sparkContext.getConf().getAll()) [('sql.stacktrace', 'false'), ('spark.eventLog.enabled', 'true'), ('spark.app.id', 'application_1556200632329_0005'), ('importImplicit', 'true'), ('printREPLOutput', 'true'), ('spark.history.ui.port', '18081'), ('spark.driver.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'), ('spark.driver.extraJavaOptions', ' -Dfile.encoding=UTF-8 ' '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties ' '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'), ('concurrentSQL', 'false'), ('spark.driver.port', '40195'), ('spark.executor.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'), ('useHiveContext', 'true'), ('spark.jars', 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'), ('spark.history.provider', 'org.apache.spark.deploy.history.FsHistoryProvider'), ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'), ('spark.submit.deployMode', 'client'), ('spark.ui.filters', 
'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'), ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', 'sandbox-hdp.hortonworks.com'), ('spark.eventLog.dir', 'hdfs:///spark2-history/'), ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'), ('spark.driver.host', 'sandbox-hdp.hortonworks.com'), ('master', 'yarn'), ('spark.yarn.dist.archives', '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'), ('spark.scheduler.mode', 'FAIR'), ('spark.yarn.queue', 'default'), ('spark.history.kerberos.keytab', '/etc/security/keytabs/spark.headless.keytab'), ('spark.executor.id', 'driver'), ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'), ('spark.history.kerberos.enabled', 'false'), ('spark.master', 'yarn'), ('spark.sql.catalogImplementation', 'hive'), ('spark.history.kerberos.principal', 'none'), ('spark.driver.extraClassPath', ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'), ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'), ('spark.repl.class.outputDir', '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'), ('spark.yarn.isPython', 'true'), ('spark.app.name', 'Zeppelin'), ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'), ('maxResult', '1000'), ('spark.executorEnv.PYTHONPATH', '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip{{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.6-src.zip'), ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')] {code} {code:python} %pyspark spark.sql(""" DROP TABLE IF EXISTS default.hivetest """) spark.sql(""" CREATE TABLE 
default.hivetest ( day DATE, time TIMESTAMP, timestring STRING ) USING ORC """) {code} {code:python} %pyspark df1 = spark.createDataFrame( [ ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123") ], schema=("date", "timestamp", "string") ) df2 = spark.createDataFrame( [ ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234") ], schema=("date", "timestamp", "string") ) {code} {code:python} %pyspark spark.conf.set("spark.sql.orc.enableVectorizedReader", "true") df1.write.insertInto("default.hivetest") spark.conf.set("spark.sql.orc.enableVectorizedReader", "false") df1.write.insertInto("default.hivetest") {code} {code:python} %pyspark
[jira] [Commented] (SPARK-23191) Workers registration fails in case of network drop
[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829272#comment-16829272 ] wuyi commented on SPARK-23191: -- [~cloud_fan] Ok, I'll have a deep look after the 5.1 holiday. > Workers registration fails in case of network drop > --- > > Key: SPARK-23191 > URL: https://issues.apache.org/jira/browse/SPARK-23191 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.2.1, 2.3.0 > Environment: OS:- Centos 6.9(64 bit) > >Reporter: Neeraj Gupta >Priority: Critical > > We have a 3 node cluster. We were facing issues with multiple drivers running in > some scenarios in production. > On further investigation we were able to reproduce in both the 1.6.3 and 2.2.1 > versions the scenario with the following steps:- > # Setup a 3 node cluster. Start master and slaves. > # On any node where the worker process is running, block the connections on > port 7077 using iptables. > {code:java} > iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # After about 10-15 secs we get the error on the node that it is unable to > connect to the master. 
> {code:java} > 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN > org.apache.spark.network.server.TransportChannelHandler - Exception in > connection from > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > {code} > # Once we get this exception we re-enable the connections to port 7077 using > {code:java} > iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # Worker tries to register again with the master but is unable to do so. 
It > gives following error > {code:java} > 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN > org.apache.spark.deploy.worker.Worker - Failed to connect to master > :7077 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108) > at > org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to :7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182) > at > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197) >
[jira] [Assigned] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27555: Assignee: (was: Apache Spark) > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Priority: Major > Attachments: Try.pdf > > > *You can see details in the attachment called Try.pdf.* > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397 > and I checked the Spark source code for the effect of setting > "spark.sql.hive.convertCTAS=true", which makes Spark use > "spark.sql.sources.default" (parquet) as the storage format in the "create > table as select" scenario. > But my case is just creating a table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table, when I check the hive table, it still uses the textfile > fileformat. > It seems HiveSerDe gets the value of the hive.default.fileformat parameter > from SQLConf. The parameter values in SQLConf are copied from SparkContext's > SparkConf at SparkSession initialization, while the configuration parameters > in hive-site.xml are loaded into SparkContext's hadoopConfiguration by > SharedState, and all configs prefixed with "spark.hadoop" are set into the > Hadoop configuration, so the setting does not take effect.
[jira] [Assigned] (SPARK-27555) cannot create table by using the hive default fileformat in both hive-site.xml and spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-27555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27555: Assignee: Apache Spark > cannot create table by using the hive default fileformat in both > hive-site.xml and spark-defaults.conf > -- > > Key: SPARK-27555 > URL: https://issues.apache.org/jira/browse/SPARK-27555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Hui WANG >Assignee: Apache Spark >Priority: Major > Attachments: Try.pdf > > > *You can see details in the attachment called Try.pdf.* > I have already seen https://issues.apache.org/jira/browse/SPARK-17620 > and https://issues.apache.org/jira/browse/SPARK-18397 > and I checked the Spark source code for the effect of setting > "spark.sql.hive.convertCTAS=true", which makes Spark use > "spark.sql.sources.default" (parquet) as the storage format in the "create > table as select" scenario. > But my case is just creating a table, without a select. When I set > hive.default.fileformat=parquet in hive-site.xml or set > spark.hadoop.hive.default.fileformat=parquet in spark-defaults.conf, after > creating a table, when I check the hive table, it still uses the textfile > fileformat. > It seems HiveSerDe gets the value of the hive.default.fileformat parameter > from SQLConf. The parameter values in SQLConf are copied from SparkContext's > SparkConf at SparkSession initialization, while the configuration parameters > in hive-site.xml are loaded into SparkContext's hadoopConfiguration by > SharedState, and all configs prefixed with "spark.hadoop" are set into the > Hadoop configuration, so the setting does not take effect.
[jira] [Commented] (SPARK-23191) Workers registration fails in case of network drop
[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829260#comment-16829260 ] Wenchen Fan commented on SPARK-23191: - [~Ngone51] can you take a look please? > Workers registration fails in case of network drop > --- > > Key: SPARK-23191 > URL: https://issues.apache.org/jira/browse/SPARK-23191 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.2.1, 2.3.0 > Environment: OS:- Centos 6.9(64 bit) > >Reporter: Neeraj Gupta >Priority: Critical > > We have a 3 node cluster. We were facing issues with multiple drivers running in > some scenarios in production. > On further investigation we were able to reproduce in both the 1.6.3 and 2.2.1 > versions the scenario with the following steps:- > # Setup a 3 node cluster. Start master and slaves. > # On any node where the worker process is running, block the connections on > port 7077 using iptables. > {code:java} > iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # After about 10-15 secs we get the error on the node that it is unable to > connect to the master. 
> {code:java} > 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN > org.apache.spark.network.server.TransportChannelHandler - Exception in > connection from > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > {code} > # Once we get this exception we re-enable the connections to port 7077 using > {code:java} > iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # Worker tries to register again with the master but is unable to do so. 
It > gives following error > {code:java} > 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN > org.apache.spark.deploy.worker.Worker - Failed to connect to master > :7077 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108) > at > org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to :7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182) > at > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197) > at
[jira] [Assigned] (SPARK-27581) DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"
[ https://issues.apache.org/jira/browse/SPARK-27581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27581: --- Assignee: Liang-Chi Hsieh > DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of > '*' in expression 'count'" > --- > > Key: SPARK-27581 > URL: https://issues.apache.org/jira/browse/SPARK-27581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > If I have a DataFrame then I can use {{count("*")}} as an expression, e.g.: > {code:java} > import org.apache.spark.sql.functions._ > val df = sql("select id % 100 from range(10)") > df.select(count("*")).first() > {code} > However, if I try to do the same thing with {{countDistinct}} I get an error: > {code:java} > import org.apache.spark.sql.functions._ > val df = sql("select id % 100 from range(10)") > df.select(countDistinct("*")).first() > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'count'; > {code} > As a workaround, I need to use {{expr}}, e.g. > {code:java} > import org.apache.spark.sql.functions._ > val df = sql("select id % 100 from range(10)") > df.select(expr("count(distinct(*))")).first() > {code} > You might be wondering "why not just use {{df.count()}} or > {{df.distinct().count()}}?", but in my case I'd ultimately like to compute both > counts as part of the same aggregation, e.g. > {code:java} > val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*")).as[(Long, > Long)].first() > {code} > I'm reporting this because it's a minor usability annoyance / surprise for > inexperienced Spark users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27581) DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of '*' in expression 'count'"
[ https://issues.apache.org/jira/browse/SPARK-27581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27581. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24482 [https://github.com/apache/spark/pull/24482] > DataFrame countDistinct("*") fails with AnalysisException: "Invalid usage of > '*' in expression 'count'" > --- > > Key: SPARK-27581 > URL: https://issues.apache.org/jira/browse/SPARK-27581 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Major > Fix For: 3.0.0 > > > If I have a DataFrame then I can use {{count("*")}} as an expression, e.g.: > {code:java} > import org.apache.spark.sql.functions._ > val df = sql("select id % 100 from range(10)") > df.select(count("*")).first() > {code} > However, if I try to do the same thing with {{countDistinct}} I get an error: > {code:java} > import org.apache.spark.sql.functions._ > val df = sql("select id % 100 from range(10)") > df.select(countDistinct("*")).first() > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'count'; > {code} > As a workaround, I need to use {{expr}}, e.g. > {code:java} > import org.apache.spark.sql.functions._ > val df = sql("select id % 100 from range(10)") > df.select(expr("count(distinct(*))")).first() > {code} > You might be wondering "why not just use {{df.count()}} or > {{df.distinct().count()}}?", but in my case I'd ultimately like to compute both > counts as part of the same aggregation, e.g. > {code:java} > val (cnt, distinctCnt) = df.select(count("*"), countDistinct("*")).as[(Long, > Long)].first() > {code} > I'm reporting this because it's a minor usability annoyance / surprise for > inexperienced Spark users.
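The reporter's motivation above — getting the total count and the distinct count from a single aggregation rather than two separate jobs — can be illustrated outside Spark. The following is a minimal plain-Java sketch (an analogy only, not the Spark API) that computes both aggregates in one pass over the same `id % 100` data:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class CountAgg {
    // One pass over the rows yields both aggregates — the same economy the
    // reporter wants from a single df.select(count("*"), countDistinct("*")).
    static long[] countAndDistinct(List<Long> rows) {
        long count = 0;
        Set<Long> seen = new HashSet<>();
        for (Long row : rows) {
            count++;
            seen.add(row);
        }
        return new long[] { count, seen.size() };
    }

    public static void main(String[] args) {
        // Mirrors "select id % 100 from range(10)" from the report.
        List<Long> rows = LongStream.range(0, 10).map(id -> id % 100)
                .boxed().collect(Collectors.toList());
        long[] result = countAndDistinct(rows);
        System.out.println(result[0] + "," + result[1]);
    }
}
```

In Spark itself, the `expr("count(distinct(*))")` workaround quoted above achieves the same single-job aggregation until the `countDistinct("*")` fix in 3.0.0.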
[jira] [Comment Edited] (SPARK-21827) Task fail due to executor exception when enable Sasl Encryption
[ https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829960#comment-16829089 ] Sébastien BARNOUD edited comment on SPARK-21827 at 4/29/19 12:15 PM: - Hi, I was investigating timeouts on the HBase client (version 1.1.2) on my Hadoop cluster with security enabled, using HotSpot JDK 1.8.0_92-b14. I found the following message in the logs each time I got a timeout: *sasl:1481 - DIGEST41:Unmatched MACs* After a look at the code, I understand that the message is simply ignored if an invalid MAC is received. In my opinion, this is not normal behavior: it at least allows an attacker to flood the connection. But in my case there is no man in the middle, yet I get this message. It looks like there is a bug (probably a non-thread-safe method somewhere) in the MAC validation, leading to the message being ignored, and to my HBase timeout. At the same time, we have found some TEZ jobs stuck on our cluster since we enabled security on shuffle (mapreduce, TEZ and Spark). 
In each hanged job, we could identify that the SSL handshake never finished: "fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 nid=0x13dad runnable [0x7fd83beb6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) at sun.security.ssl.InputRecord.read(InputRecord.java:503) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973) - locked <0x0007b997a470> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) - locked <0x0007b997a430> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80) at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534) - locked <0x0007b9979f10> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441) - locked <0x0007b9979f10> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254) - locked <0x0007b9979ea8> (a sun.net.www.protocol.https.HttpsURLConnectionImpl) at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191) Looking at the TEZ source shows that there is no timeout in the code, leading to this infinite wait. After some more investigation, I found: -) https://issues.apache.org/jira/browse/SPARK-21827 -) [https://issues.cask.co/browse/CDAP-12737] -) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956] It seems that this issue affects a lot of software, and ForgeRock seems to have identified the thread safety issue. To summarize, there are 2 issues: # the message shouldn’t be ignored when the MAC is invalid; an exception should be thrown. # The thread safety issue should be investigated and corrected in the JDK, because relying on a synchronized method at the application layer is not viable. Typically, an application like Spark uses multiple SASL implementations and can’t synchronize all of them. I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because IMO it's a JDK bug. was (Author: sbarnoud): Hi, I was investigating timeout on HBase
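The first fix the commenter proposes — fail loudly on an unmatched MAC instead of silently dropping the message — can be sketched in plain Java. This is a hypothetical illustration, not the JDK's actual DIGEST-MD5 code: the method name `verifyWrappedMessage` and the use of plain `HmacMD5` are assumptions for the sketch, and a real SASL layer would throw `javax.security.sasl.SaslException` rather than an unchecked `SecurityException`:

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class StrictMacCheck {
    // Recompute the integrity MAC over the payload. Using a fresh Mac
    // instance per call sidesteps the thread-safety problem discussed above:
    // Mac objects must not be shared across threads without synchronization.
    static byte[] hmac(byte[] key, byte[] payload) {
        try {
            Mac mac = Mac.getInstance("HmacMD5");
            mac.init(new SecretKeySpec(key, "HmacMD5"));
            return mac.doFinal(payload);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical strict unwrap step: throw on mismatch instead of ignoring
    // the message, so a corrupted or forged packet surfaces immediately
    // rather than as a downstream timeout.
    static byte[] verifyWrappedMessage(byte[] key, byte[] payload, byte[] receivedMac) {
        byte[] expected = hmac(key, payload);
        if (!MessageDigest.isEqual(expected, receivedMac)) { // constant-time compare
            throw new SecurityException("Unmatched MACs");
        }
        return payload;
    }
}
```

With this behavior, the "Unmatched MACs" condition would turn a silent hang into an immediate, diagnosable failure on the receiving side.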
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8.2
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829155#comment-16829155 ] Steve Loughran commented on SPARK-24771: Update: Hadoop is going to update its Avro version too (HADOOP-13386). Ultimately this is a good thing; it's just going to be one of those transitive-changes-may-break-compiled-code things, which can really hurt people. The good news: Spark is ahead of the curve here. > Upgrade AVRO version from 1.7.7 to 1.8.2 > > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > 
[jira] [Updated] (SPARK-27593) CSV Parser returns 2 DataFrame - Valid and Malformed DFs
[ https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ladislav Jech updated SPARK-27593: -- Issue Type: New Feature (was: Improvement) > CSV Parser returns 2 DataFrame - Valid and Malformed DFs > > > Key: SPARK-27593 > URL: https://issues.apache.org/jira/browse/SPARK-27593 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.2 >Reporter: Ladislav Jech >Priority: Major > > When we process CSV in any kind of data warehouse, it's common procedure to > report corrupted records for audit purposes and as feedback to the vendor, so > they can improve their process. CSV is no different from XSD in the sense that > it defines a schema, although in a very limited way (in some cases only as a > number of columns, without even headers, and we don't have types); but when I > check an XML document against an XSD file, I get an exact report of whether the > file is completely valid and, if not, an exact report of which records do not > follow the schema. > Such a feature would have big value in Spark for CSV: get malformed records into > some dataframe, with a line count (pointer within the data object), so I can > log both the pointer and the real data (line/row) and trigger an action on this > unfortunate event. > The load() method could return an Array of DFs (Valid, Invalid). > PERMISSIVE mode isn't enough, since it fills missing fields with nulls, so it > is even harder to detect what is really wrong. Another approach at the moment > is to read with both permissive and dropmalformed modes into 2 dataframes and > compare them against each other.
[jira] [Updated] (SPARK-27593) CSV Parser returns 2 DataFrame - Valid and Malformed DFs
[ https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ladislav Jech updated SPARK-27593: -- Description: When we process CSV in any kind of data warehouse, it's common procedure to report corrupted records for audit purposes and as feedback to the vendor, so they can improve their process. CSV is no different from XSD in the sense that it defines a schema, although in a very limited way (in some cases only as a number of columns, without even headers, and we don't have types); but when I check an XML document against an XSD file, I get an exact report of whether the file is completely valid and, if not, an exact report of which records do not follow the schema. Such a feature would have big value in Spark for CSV: get malformed records into some dataframe, with a line count (pointer within the data object), so I can log both the pointer and the real data (line/row) and trigger an action on this unfortunate event. The load() method could return an Array of DFs (Valid, Invalid). was: When we process CSV in any kind of data warehouse, it's common procedure to report corrupted records for audit purposes and as feedback to the vendor, so they can improve their process. CSV is no different from XSD in the sense that it defines a schema, although in a very limited way (in some cases only as a number of columns, without even headers, and we don't have types); but when I check an XML document against an XSD file, I get an exact report of whether the file is completely valid and, if not, an exact report of which records do not follow the schema. Such a feature would have big value in Spark for CSV: get malformed records into some dataframe, with a line count (pointer within the data object), so I can log both the pointer and the real data (line/row) and trigger an action on this unfortunate event. 
> CSV Parser returns 2 DataFrame - Valid and Malformed DFs > > > Key: SPARK-27593 > URL: https://issues.apache.org/jira/browse/SPARK-27593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.2 >Reporter: Ladislav Jech >Priority: Major > > When we process CSV in any kind of data warehouse, it's common procedure to > report corrupted records for audit purposes and as feedback to the vendor, so > they can improve their process. CSV is no different from XSD in the sense that > it defines a schema, although in a very limited way (in some cases only as a > number of columns, without even headers, and we don't have types); but when I > check an XML document against an XSD file, I get an exact report of whether the > file is completely valid and, if not, an exact report of which records do not > follow the schema. > Such a feature would have big value in Spark for CSV: get malformed records into > some dataframe, with a line count (pointer within the data object), so I can > log both the pointer and the real data (line/row) and trigger an action on this > unfortunate event. > The load() method could return an Array of DFs (Valid, Invalid).
[jira] [Updated] (SPARK-27593) CSV Parser returns 2 DataFrame - Valid and Malformed DFs
[ https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ladislav Jech updated SPARK-27593: -- Description: When we process CSV in any kind of data warehouse, it's common procedure to report corrupted records for audit purposes and as feedback to the vendor, so they can improve their process. CSV is no different from XSD in the sense that it defines a schema, although in a very limited way (in some cases only as a number of columns, without even headers, and we don't have types); but when I check an XML document against an XSD file, I get an exact report of whether the file is completely valid and, if not, an exact report of which records do not follow the schema. Such a feature would have big value in Spark for CSV: get malformed records into some dataframe, with a line count (pointer within the data object), so I can log both the pointer and the real data (line/row) and trigger an action on this unfortunate event. The load() method could return an Array of DFs (Valid, Invalid). PERMISSIVE mode isn't enough, since it fills missing fields with nulls, so it is even harder to detect what is really wrong. Another approach at the moment is to read with both permissive and dropmalformed modes into 2 dataframes and compare them against each other. was: When we process CSV in any kind of data warehouse, it's common procedure to report corrupted records for audit purposes and as feedback to the vendor, so they can improve their process. CSV is no different from XSD in the sense that it defines a schema, although in a very limited way (in some cases only as a number of columns, without even headers, and we don't have types); but when I check an XML document against an XSD file, I get an exact report of whether the file is completely valid and, if not, an exact report of which records do not follow the schema. 
Such a feature would have big value in Spark for CSV: get malformed records into some dataframe, with a line count (pointer within the data object), so I can log both the pointer and the real data (line/row) and trigger an action on this unfortunate event. The load() method could return an Array of DFs (Valid, Invalid). > CSV Parser returns 2 DataFrame - Valid and Malformed DFs > > > Key: SPARK-27593 > URL: https://issues.apache.org/jira/browse/SPARK-27593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.2 >Reporter: Ladislav Jech >Priority: Major > > When we process CSV in any kind of data warehouse, it's common procedure to > report corrupted records for audit purposes and as feedback to the vendor, so > they can improve their process. CSV is no different from XSD in the sense that > it defines a schema, although in a very limited way (in some cases only as a > number of columns, without even headers, and we don't have types); but when I > check an XML document against an XSD file, I get an exact report of whether the > file is completely valid and, if not, an exact report of which records do not > follow the schema. > Such a feature would have big value in Spark for CSV: get malformed records into > some dataframe, with a line count (pointer within the data object), so I can > log both the pointer and the real data (line/row) and trigger an action on this > unfortunate event. > The load() method could return an Array of DFs (Valid, Invalid). > PERMISSIVE mode isn't enough, since it fills missing fields with nulls, so it > is even harder to detect what is really wrong. Another approach at the moment > is to read with both permissive and dropmalformed modes into 2 dataframes and > compare them against each other.
[jira] [Created] (SPARK-27593) CSV Parser returns 2 DataFrame - Valid and Malformed DFs
Ladislav Jech created SPARK-27593: - Summary: CSV Parser returns 2 DataFrame - Valid and Malformed DFs Key: SPARK-27593 URL: https://issues.apache.org/jira/browse/SPARK-27593 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.2 Reporter: Ladislav Jech When we process CSV in any kind of data warehouse, it's common procedure to report corrupted records for audit purposes and as feedback to the vendor, so they can improve their process. CSV is no different from XSD in the sense that it defines a schema, although in a very limited way (in some cases only as a number of columns, without even headers, and we don't have types); but when I check an XML document against an XSD file, I get an exact report of whether the file is completely valid and, if not, an exact report of which records do not follow the schema. Such a feature would have big value in Spark for CSV: get malformed records into some dataframe, with a line count (pointer within the data object), so I can log both the pointer and the real data (line/row) and trigger an action on this unfortunate event.
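The requested valid/malformed split can be sketched in plain Java. This is a hypothetical illustration of the proposed behavior, not an existing Spark API: real Spark would return two DataFrames, and real validation would check the full schema, not just a column count, which stands in for it here:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvSplit {
    // Hypothetical splitter: rows with the expected number of columns go to
    // the "valid" list; everything else goes to "malformed" together with its
    // 1-based line number — the "pointer" the reporter asks for.
    static List<List<String>> split(List<String> lines, int expectedCols) {
        List<String> valid = new ArrayList<>();
        List<String> malformed = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // limit -1 keeps trailing empty fields, so "a,," counts as 3 columns
            int cols = line.split(",", -1).length;
            if (cols == expectedCols) {
                valid.add(line);
            } else {
                malformed.add((i + 1) + ": " + line); // line pointer + raw row
            }
        }
        return Arrays.asList(valid, malformed);
    }

    public static void main(String[] args) {
        List<List<String>> result = split(Arrays.asList("a,b,c", "x,y", "1,2,3"), 3);
        System.out.println("valid=" + result.get(0) + " malformed=" + result.get(1));
    }
}
```

Until something like this exists in Spark, the workaround from the description — reading once in permissive mode and once in dropmalformed mode, then diffing the two dataframes — is the closest approximation.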
[jira] [Commented] (SPARK-20880) When spark SQL is used with Avro-backed HIVE tables, NPE from org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories.
[ https://issues.apache.org/jira/browse/SPARK-20880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829093#comment-16829093 ] DineshPandian commented on SPARK-20880: --- Hi, any update on this? > When spark SQL is used with Avro-backed HIVE tables, NPE from > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories. > > > Key: SPARK-20880 > URL: https://issues.apache.org/jira/browse/SPARK-20880 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Vinod KC >Priority: Minor > > When Spark SQL is used with Avro-backed Hive tables, we intermittently get an > NPE from > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories. > The root cause is a race condition in the Hive 1.2.1 jar used by Spark SQL. > In Hive 2.2 this issue has been fixed (HIVE JIRA: > https://issues.apache.org/jira/browse/HIVE-16175); since Spark is still > using Hive 1.2.1 jars, we still hit the race condition. > One workaround is to run Spark with a single task per executor; however, it > will slow down the jobs. 
> Exception stack trace > 13/05/07 09:18:39 WARN scheduler.TaskSetManager: Lost task 18.0 in stage 0.0 > (TID 18, aiyhyashu.dxc.com): java.lang.NullPointerException > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142) > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91) > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104) > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104) > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:120) > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:83) > at > org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:56) > at > org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:124) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:251) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:239) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:785) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Assigned] (SPARK-27592) Write the data of table write information to metadata
[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27592: Assignee: (was: Apache Spark) > Write the data of table write information to metadata > - > > Key: SPARK-27592 > URL: https://issues.apache.org/jira/browse/SPARK-27592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27592) Write the data of table write information to metadata
[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27592: Description: We give Hive an incorrect InputFormat (org.apache.hadoop.mapred.SequenceFileInputFormat) with which to read Spark's Parquet datasource bucketed table: {noformat} spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS; 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. spark-sql> DESC EXTENDED t; c1 int NULL c2 int NULL # Detailed Table Information Database default Table t Owner yumwang Created Time Mon Apr 29 17:52:05 CST 2019 Last Access Thu Jan 01 08:00:00 CST 1970 Created By Spark 2.4.0 Type MANAGED Provider parquet Num Buckets 2 Bucket Columns [`c1`] Sort Columns [`c1`] Table Properties [transient_lastDdlTime=1556531525] Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t] Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [serialization.format=1] {noformat} We can see the incompatibility warning when creating the table: {noformat} WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. {noformat} But downstream engines don't know about this incompatibility. I'd like to write this table's write information to the metadata so that each engine can decide compatibility for itself. 
> Write the data of table write information to metadata > - > > Key: SPARK-27592 > URL: https://issues.apache.org/jira/browse/SPARK-27592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We give Hive an incorrect InputFormat > (org.apache.hadoop.mapred.SequenceFileInputFormat) with which to read Spark's > Parquet datasource bucketed table: > {noformat} > spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) > SORTED BY (c1) INTO 2 BUCKETS; > 2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data > source table `default`.`t` into Hive metastore in Spark SQL specific format, > which is NOT compatible with Hive. > spark-sql> DESC EXTENDED t; > c1 int NULL > c2 int NULL > # Detailed Table Information > Database default > Table t > Owner yumwang > Created Time Mon Apr 29 17:52:05 CST 2019 > Last Access Thu Jan 01 08:00:00 CST 1970 > Created By Spark 2.4.0 > Type MANAGED > Provider parquet > Num Buckets 2 > Bucket Columns [`c1`] > Sort Columns [`c1`] > Table Properties [transient_lastDdlTime=1556531525] > Location [file:/user/hive/warehouse/t|file:///user/hive/warehouse/t] > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > Storage Properties [serialization.format=1] > {noformat} > We can see the incompatibility warning when creating the table: > {noformat} > WARN HiveExternalCatalog:66 - Persisting bucketed data source table > `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT > compatible with Hive. > {noformat} > But downstream engines don't know about this incompatibility. I'd like to > write this table's write information to the metadata so that each engine can > decide compatibility for itself. 
[jira] [Assigned] (SPARK-27592) Write the data of table write information to metadata
[ https://issues.apache.org/jira/browse/SPARK-27592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27592: Assignee: Apache Spark > Write the data of table write information to metadata > - > > Key: SPARK-27592 > URL: https://issues.apache.org/jira/browse/SPARK-27592 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21827) Task fail due to executor exception when enable Sasl Encryption
[ https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829089#comment-16829089 ] Sébastien BARNOUD edited comment on SPARK-21827 at 4/29/19 10:05 AM: - Hi, I was investigating timeouts on the HBase client (version 1.1.2) on my Hadoop cluster with security enabled, using HotSpot JDK 1.8.0_92-b14. I found the following message in the logs each time I got a timeout: *sasl:1481 - DIGEST41:Unmatched MACs* After a look at the code, I understand that the message is simply ignored if an invalid MAC is received. In my opinion, this is not normal behavior: it at least allows an attacker to flood the connection. But in my case there is no man in the middle, yet I get this message. It looks like there is a bug (probably a non-thread-safe method somewhere) in the MAC validation, leading to the message being ignored, and to my HBase timeout. At the same time, we have found some TEZ jobs stuck on our cluster since we enabled security on shuffle (mapreduce, TEZ and Spark). 
In each hanged job, we could identify that the SSL handshake never finished: "fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 nid=0x13dad runnable [0x7fd83beb6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) at sun.security.ssl.InputRecord.read(InputRecord.java:503) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973) - locked <0x0007b997a470> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) - locked <0x0007b997a430> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80) at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534) - locked <0x0007b9979f10> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441) - locked <0x0007b9979f10> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254) - locked <0x0007b9979ea8> (a sun.net.www.protocol.https.HttpsURLConnectionImpl) at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191) Looking a TEZ source, shows that there are no timeout in the code leading to this infinite wait. After some more investigation, I found: -) https://issues.apache.org/jira/browse/SPARK-21827 -) [https://issues.cask.co/browse/CDAP-12737] -) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956] It seems that this issue affects a lot of software, and ForgeRock seems to have identified the thread safety issue. To summarize, there are 2 issues: # the message shouldn’t be ignored when the MAC is invalid, an exception should be throwed. # The thread safety issue should be investigated and corrected in the JDK, because relying on a synchronized method at the application layer is not viable. Typically, an application like Spark uses multiple SASL implementation and can’t synchronize all of them. I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because IMO it's a JDK bug. Regards, Sébastien BARNOUD was (Author: sbarnoud): Hi, I was investigating timeout on HBase
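The first of the two issues above, silently dropping a record whose MAC does not verify, can be sketched outside SASL. The following is a standalone illustration using HMAC-SHA256 (the class, method names, and record layout are invented for the example; this is not the JDK's DIGEST-MD5 code): a "lenient" reader drops the bad record and leaves the peer blocked waiting for data, while a "strict" reader fails fast, which is the behavior the comment argues for.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.util.Arrays;

public class MacCheck {
    static final int MAC_LEN = 32; // HMAC-SHA256 output size

    static byte[] hmac(byte[] key, byte[] payload) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(payload);
    }

    /** record = payload || mac. Returns the payload, or null if lenient and the MAC is bad. */
    static byte[] unwrap(byte[] key, byte[] record, boolean strict)
            throws GeneralSecurityException {
        byte[] payload = Arrays.copyOfRange(record, 0, record.length - MAC_LEN);
        byte[] trailer = Arrays.copyOfRange(record, record.length - MAC_LEN, record.length);
        if (!Arrays.equals(trailer, hmac(key, payload))) {
            if (strict) throw new SecurityException("Unmatched MAC");
            return null; // silently dropped: the caller keeps blocking for data that never "arrives"
        }
        return payload;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "secret".getBytes();
        byte[] payload = "hello".getBytes();
        byte[] mac = hmac(key, payload);
        byte[] record = new byte[payload.length + MAC_LEN];
        System.arraycopy(payload, 0, record, 0, payload.length);
        System.arraycopy(mac, 0, record, payload.length, MAC_LEN);
        System.out.println(new String(unwrap(key, record, true))); // hello
        record[0] ^= 1; // corrupt one payload byte
        System.out.println(unwrap(key, record, false)); // null: the record vanished without a diagnostic
    }
}
```

In the lenient mode a single corrupted (or mis-MAC'd, e.g. by a thread-safety bug) record is indistinguishable from data that simply has not arrived yet, which matches the observed hangs.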
[jira] [Commented] (SPARK-21827) Task fail due to executor exception when enable Sasl Encryption
[ https://issues.apache.org/jira/browse/SPARK-21827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829089#comment-16829089 ] Sébastien BARNOUD commented on SPARK-21827: --- Hi, I was investigating timeout on HBase Client (version 1.1.2) on my Hadoop cluster with security enabled using hotspot jdk 1.8.0_92-b14. I have found the following message in logs each time a get a timeout: *sasl:1481 - DIGEST41:Unmatched MACs* After a look at the code, if understand that the message is simply ignored if an invalid MAC is received. In my opinion, this is not a normal behavior. It allows at least an attacker to flood the connection. But, in my case, there is no men in the middle, but a get this message. It looks like there is bug (probably a not thread safe method somewhere) in the MAC validation, leading to the message to be ignored, and to my HBase timeout. In the same time, we have found some TEZ job stuck on our cluster since we have enabled security on shuffle (mapreduce, TEZ and Spark). 
In each hanged job, we could identify that the SSL handshake never finished: "fetcher \{Map_4} #34" #78 daemon prio=5 os_prio=0 tid=0x7fd86905d000 nid=0x13dad runnable [0x7fd83beb6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) at sun.security.ssl.InputRecord.read(InputRecord.java:503) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973) - locked <0x0007b997a470> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) - locked <0x0007b997a430> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:100) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.setNewClient(AbstractDelegateHttpsURLConnection.java:80) at sun.net.www.protocol.http.HttpURLConnection.writeRequests(HttpURLConnection.java:672) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534) - locked <0x0007b9979f10> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441) - locked <0x0007b9979f10> (a sun.net.www.protocol.https.DelegateHttpsURLConnection) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254) - locked <0x0007b9979ea8> (a sun.net.www.protocol.https.HttpsURLConnectionImpl) at 
org.apache.tez.runtime.library.common.shuffle.HttpConnection.getInputStream(HttpConnection.java:253) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:356) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:264) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176) at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191) Looking a TEZ source, shows that there are no timeout in the code leading to this infinite wait. After some more investigation, I found: -) https://issues.apache.org/jira/browse/SPARK-21827 -) [https://issues.cask.co/browse/CDAP-12737] -) [https://bugster.forgerock.org/jira/browse/OPENDJ-4956] It seems that this issue affects a lot of software, and ForgeRock seems to have identified the thread safety issue. To summarize, there are 2 issues: # the message shouldn’t be ignored when the MAC is invalid, an exception should be throwed. # The thread safety issue should be investigated and corrected in the JDK, because relying on a synchronized method at the application layer is not viable. Typically, an application like Spark uses multiple SASL implementation and can’t synchronize all of them. I sent this to [secalert...@oracle.com|mailto:secalert...@oracle.com] because IMO it's a JDK bug. Regards, Sébastien BARNOUD > Task fail due to executor exception when enable Sasl Encryption >
[jira] [Created] (SPARK-27592) Write the data of table write information to metadata
Yuming Wang created SPARK-27592: --- Summary: Write the data of table write information to metadata Key: SPARK-27592 URL: https://issues.apache.org/jira/browse/SPARK-27592 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27580) Implement `doCanonicalize` in BatchScanExec for comparing query plan results
[ https://issues.apache.org/jira/browse/SPARK-27580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27580: --- Assignee: Gengliang Wang > Implement `doCanonicalize` in BatchScanExec for comparing query plan results > > > Key: SPARK-27580 > URL: https://issues.apache.org/jira/browse/SPARK-27580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > The method `QueryPlan.sameResult` is used for comparing logical plans in > order to: > 1. cache data in CacheManager > 2. uncache data in CacheManager > 3. Reuse subqueries > 4. etc... > Currently the method `sameResult` always returns false for `BatchScanExec`. We > should fix it by implementing `doCanonicalize` for the node. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27580) Implement `doCanonicalize` in BatchScanExec for comparing query plan results
[ https://issues.apache.org/jira/browse/SPARK-27580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27580. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24475 [https://github.com/apache/spark/pull/24475] > Implement `doCanonicalize` in BatchScanExec for comparing query plan results > > > Key: SPARK-27580 > URL: https://issues.apache.org/jira/browse/SPARK-27580 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > The method `QueryPlan.sameResult` is used for comparing logical plans in > order to: > 1. cache data in CacheManager > 2. uncache data in CacheManager > 3. Reuse subqueries > 4. etc... > Currently the method `sameResult` always returns false for `BatchScanExec`. We > should fix it by implementing `doCanonicalize` for the node. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
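The canonicalization idea behind `doCanonicalize` can be sketched in miniature. In the sketch below (class and method names are illustrative, not Spark's API), two analyses of the same query assign different expression IDs to the same attributes, so the plans only compare equal after those IDs are normalized to positional ones — which is roughly what `sameResult` relies on canonicalization to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class Attr {
    final String name;
    final long exprId; // freshly allocated per analysis, so it differs between runs
    Attr(String name, long exprId) { this.name = name; this.exprId = exprId; }
    @Override public boolean equals(Object o) {
        return o instanceof Attr && ((Attr) o).name.equals(name) && ((Attr) o).exprId == exprId;
    }
    @Override public int hashCode() { return Objects.hash(name, exprId); }
}

class Scan {
    final String table;
    final List<Attr> output;
    Scan(String table, List<Attr> output) { this.table = table; this.output = output; }

    /** Replace per-analysis exprIds with stable positional ids. */
    Scan canonicalized() {
        List<Attr> normalized = new ArrayList<>();
        for (int i = 0; i < output.size(); i++) {
            normalized.add(new Attr(output.get(i).name, i));
        }
        return new Scan(table, normalized);
    }

    /** Without canonicalization this would compare raw exprIds and always differ. */
    boolean sameResult(Scan other) {
        Scan a = canonicalized(), b = other.canonicalized();
        return a.table.equals(b.table) && a.output.equals(b.output);
    }
}
```

A node that skips this normalization step behaves like the pre-fix `BatchScanExec`: semantically identical scans never compare equal, so caching and subquery reuse silently miss.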
[jira] [Commented] (SPARK-27591) A bug in UnivocityParser prevents using UDT
[ https://issues.apache.org/jira/browse/SPARK-27591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829067#comment-16829067 ] Artem Kalchenko commented on SPARK-27591: - I changed the line UnivocityParser:184 to {code:java} case udt: UserDefinedType[_] => makeConverter(name, udt.sqlType, nullable, options) {code} and it works as I expected. > A bug in UnivocityParser prevents using UDT > --- > > Key: SPARK-27591 > URL: https://issues.apache.org/jira/browse/SPARK-27591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: Artem Kalchenko >Priority: Minor > > I am trying to define a UserDefinedType based on String but different from > StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am > doing something incorrectly. > I define my type as follows: > {code:java} > class MyType extends UserDefinedType[MyValue] { > override def sqlType: DataType = StringType > ... > } > @SQLUserDefinedType(udt = classOf[MyType]) > case class MyValue > {code} > I expect it to be read and stored as a String with just a custom SQL type. > In fact, Spark can't read the string at all: > {code:java} > java.lang.ClassCastException: > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11 > cannot be cast to org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) > at > org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195) > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) > {code} > The problem is with UnivocityParser.makeConverter, which in the UDT case doesn't > return a (String => Any) function but a (String => (String => Any)) one; see > UnivocityParser:184 > {code:java} > case udt: UserDefinedType[_] => (datum: String) => > makeConverter(name, udt.sqlType, nullable, options) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27591) A bug in UnivocityParser prevents using UDT
Artem Kalchenko created SPARK-27591: --- Summary: A bug in UnivocityParser prevents using UDT Key: SPARK-27591 URL: https://issues.apache.org/jira/browse/SPARK-27591 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.2 Reporter: Artem Kalchenko I am trying to define a UserDefinedType based on String but different from StringType in Spark 2.4.1, but it looks like there is a bug in Spark or I am doing something incorrectly. I define my type as follows: {code:java} class MyType extends UserDefinedType[MyValue] { override def sqlType: DataType = StringType ... } @SQLUserDefinedType(udt = classOf[MyType]) case class MyValue {code} I expect it to be read and stored as a String with just a custom SQL type. In fact, Spark can't read the string at all: {code:java} java.lang.ClassCastException: org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$11 cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:195) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) {code} The problem is with UnivocityParser.makeConverter, which in the UDT case doesn't return a (String => Any) function but a (String => (String => Any)) one; see UnivocityParser:184 {code:java} case udt: UserDefinedType[_] => (datum: String) => makeConverter(name, udt.sqlType, nullable, options) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
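The failure mode reported above can be reproduced outside Spark. The sketch below (class and method names are invented for the illustration) shows the same mistake in plain Java: wrapping the recursive converter construction in an extra lambda means applying the converter yields another function object rather than the converted value, which is exactly why the downstream cast to UTF8String blows up.

```java
import java.util.function.Function;

public class UdtConverterDemo {
    // Stand-in for the converter built from the UDT's underlying sqlType.
    static Function<String, Object> underlyingConverter() {
        return datum -> datum.trim();
    }

    // Buggy shape, mirroring the extra "(datum: String) =>" at UnivocityParser:184:
    // applying it returns another function, not the converted value.
    static Function<String, Object> makeConverterBuggy() {
        return datum -> underlyingConverter();
    }

    // Fixed shape, mirroring the one-line fix: delegate directly.
    static Function<String, Object> makeConverterFixed() {
        return underlyingConverter();
    }

    public static void main(String[] args) {
        Object buggy = makeConverterBuggy().apply(" x ");
        Object fixed = makeConverterFixed().apply(" x ");
        System.out.println(buggy instanceof Function); // true: a lambda, so any cast to a value type fails
        System.out.println(fixed); // x
    }
}
```

The one-line fix in the comment above has the same shape as `makeConverterFixed`: drop the wrapping lambda and return the underlying converter itself.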
[jira] [Comment Edited] (SPARK-23191) Workers registration failes in case of network drop
[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829051#comment-16829051 ] zuotingbing edited comment on SPARK-23191 at 4/29/19 9:28 AM: -- we faced the same issue in standalone HA mode. Could you please take a look at this issue?
{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at spark://vmax17:7077
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax18:7077...
2019-03-15 20:22:14,863 INFO Worker: Connecting to master vmax17:7077...
2019-03-15 20:22:14,865 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,868 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,868 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,871 INFO Worker: Master with url spark://vmax18:7077 requested this worker to reconnect.
2019-03-15 20:22:14,871 INFO Worker: Not spawning another attempt to register with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate worker ID
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,891 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,893 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,894 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,896 INFO ShutdownHookManager: Shutdown hook called
2019-03-15 20:22:14,898 INFO ShutdownHookManager: Deleting directory /data4/zdh/spark/tmp/spark-c578bf32-6a5e-44a5-843b-c796f44648ee
2019-03-15 20:22:14,908 INFO ShutdownHookManager: Deleting directory /data3/zdh/spark/tmp/spark-7e57e77d-cbb7-47d3-a6dd-737b57788533
2019-03-15 20:22:14,920 INFO ShutdownHookManager: Deleting directory /data2/zdh/spark/tmp/spark-0beebf20-abbd-4d99-a401-3ef0e88e0b05{code}
[~andrewor14] [~cloud_fan] [~vanzin] was (Author: zuo.tingbing9): we faced the same issue in standalone HA mode. Could you please take a view on this issue? [~andrewor14] [~cloud_fan] [~vanzin] > Workers registration failes in case of network drop > --- > > Key: SPARK-23191 > URL: https://issues.apache.org/jira/browse/SPARK-23191 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.2.1, 2.3.0 > Environment: OS:- Centos 6.9(64 bit) > >Reporter: Neeraj Gupta >Priority: Critical > > We have a 3 node cluster. We were facing issues of multiple drivers running in > some scenarios in production. > On further investigation we were able to reproduce in both 1.6.3 and 2.2.1 > versions the scenario with the following steps: > # Setup a 3 node cluster. Start master and slaves. 
> # On any node where the worker process is running block the connections on > port 7077 using iptables. > {code:java} > iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # After about 10-15 secs we get the error on node that it is unable to > connect to master. > {code:java} > 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN > org.apache.spark.network.server.TransportChannelHandler - Exception in > connection from > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at >
[jira] [Commented] (SPARK-23191) Workers registration failes in case of network drop
[ https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829051#comment-16829051 ] zuotingbing commented on SPARK-23191: - we faced the same issue in standalone HA mode. Could you please take a look at this issue? [~andrewor14] [~cloud_fan] [~vanzin] > Workers registration failes in case of network drop > --- > > Key: SPARK-23191 > URL: https://issues.apache.org/jira/browse/SPARK-23191 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.2.1, 2.3.0 > Environment: OS:- Centos 6.9(64 bit) > >Reporter: Neeraj Gupta >Priority: Critical > > We have a 3 node cluster. We were facing issues of multiple drivers running in > some scenarios in production. > On further investigation we were able to reproduce in both 1.6.3 and 2.2.1 > versions the scenario with the following steps: > # Setup a 3 node cluster. Start master and slaves. > # On any node where the worker process is running block the connections on > port 7077 using iptables. > {code:java} > iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # After about 10-15 secs we get the error on the node that it is unable to > connect to master. 
> {code:java} > 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN > org.apache.spark.network.server.TransportChannelHandler - Exception in > connection from > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) > at java.lang.Thread.run(Thread.java:745) > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR > org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting > for master to reconnect... > {code} > # Once we get this exception we re-enable the connections to port 7077 using > {code:java} > iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # Worker tries to register again with master but is unable to do so. 
It > gives following error > {code:java} > 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN > org.apache.spark.deploy.worker.Worker - Failed to connect to master > :7077 > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100) > at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108) > at > org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to :7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182) > at
[jira] [Created] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks
Wenchen Fan created SPARK-27590: --- Summary: do not consider skipped tasks when scheduling speculative tasks Key: SPARK-27590 URL: https://issues.apache.org/jira/browse/SPARK-27590 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27590: Assignee: Apache Spark (was: Wenchen Fan) > do not consider skipped tasks when scheduling speculative tasks > --- > > Key: SPARK-27590 > URL: https://issues.apache.org/jira/browse/SPARK-27590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27590: Assignee: Wenchen Fan (was: Apache Spark) > do not consider skipped tasks when scheduling speculative tasks > --- > > Key: SPARK-27590 > URL: https://issues.apache.org/jira/browse/SPARK-27590 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27474) avoid retrying a task failed with CommitDeniedException many times
[ https://issues.apache.org/jira/browse/SPARK-27474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27474. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24375 [https://github.com/apache/spark/pull/24375] > avoid retrying a task failed with CommitDeniedException many times > -- > > Key: SPARK-27474 > URL: https://issues.apache.org/jira/browse/SPARK-27474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org