[jira] [Created] (SPARK-25960) Support subpath mounting with Kubernetes
Timothy Chen created SPARK-25960: Summary: Support subpath mounting with Kubernetes Key: SPARK-25960 URL: https://issues.apache.org/jira/browse/SPARK-25960 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.5.0 Reporter: Timothy Chen Currently we support mounting volumes into the driver and executors, but there is no option to mount only a subpath of the volume. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
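For illustration, a sketch of how the requested option could surface to users, building on the volume-mount properties that already exist in Spark 2.4 on Kubernetes; the {{mount.subPath}} key below is a hypothetical name for the requested feature, not a released option:
{code:scala}
import org.apache.spark.SparkConf

// The first two keys are existing Spark 2.4 volume-mount options; the subPath
// key is hypothetical and only illustrates the feature this issue requests.
val conf = new SparkConf()
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/checkpoints")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "data-pvc")
  // Hypothetical: mount only the "app1" directory from the claimed volume.
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.subPath", "app1")
{code}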
[jira] [Updated] (SPARK-18466) Missing withFilter method causes errors when using for comprehensions in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18466: -- Affects Version/s: (was: 2.3.0) (was: 2.0.2) (was: 2.0.1) (was: 1.6.3) (was: 1.6.2) (was: 1.6.1) (was: 1.5.2) (was: 1.5.1) (was: 1.6.0) (was: 1.4.1) (was: 1.5.0) (was: 2.0.0) (was: 1.4.0) > Missing withFilter method causes errors when using for comprehensions in > Scala 2.12 > --- > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18466) Missing withFilter method causes errors when using for comprehensions
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18466: -- Summary: Missing withFilter method causes errors when using for comprehensions (was: Missing withFilter method causes warnings when using for comprehensions) > Missing withFilter method causes errors when using for comprehensions > - > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18466) Missing withFilter method causes errors when using for comprehensions in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18466: -- Summary: Missing withFilter method causes errors when using for comprehensions in Scala 2.12 (was: Missing withFilter method causes errors when using for comprehensions) > Missing withFilter method causes errors when using for comprehensions in > Scala 2.12 > --- > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models
[ https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677752#comment-16677752 ] shahid commented on SPARK-25959: Thanks. I will analyze the issue. > Difference in featureImportances results on computed vs saved models > > > Key: SPARK-25959 > URL: https://issues.apache.org/jira/browse/SPARK-25959 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Suraj Nayak >Priority: Major >
> I tried to implement GBT and found that the feature importances computed when the model was fit differ once the same model is saved to storage and loaded back.
> I also found that once the persisted model is loaded and saved back again and loaded, the feature importances remain the same.
> Not sure if it's a bug in storing and reading the model the first time, or whether I am missing some parameter that needs to be set before saving the model (so the model picks up some defaults, causing the feature importances to change).
> *Below is the test code:*
> val testDF = Seq(
>   (1, 3, 2, 1, 1),
>   (3, 2, 1, 2, 0),
>   (2, 2, 1, 1, 0),
>   (3, 4, 2, 2, 0),
>   (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
>   .setLabelCol("e")
>   .setFeaturesCol("features")
>   .setMaxDepth(2)
>   .setMaxBins(5)
>   .setMaxIter(10)
>   .setSeed(10)
>   .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-18466. - > Missing withFilter method causes warnings when using for comprehensions > --- > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-18466. --- Resolution: Won't Fix Thank you for the report and contribution, [~reggert1980]. We know this has been a long-pending PR. Although this issue causes a regression with Scala 2.12 in Spark 2.4, the Apache Spark community will not add this feature due to the risk. Please see the discussion on the PR.
{code}
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1541571276105).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> (for (n <- sc.parallelize(Seq(1,2,3)) if n > 2) yield n).toDebugString
<console>:25: error: value withFilter is not a member of org.apache.spark.rdd.RDD[Int]
       (for (n <- sc.parallelize(Seq(1,2,3)) if n > 2) yield n).toDebugString
{code}
> Missing withFilter method causes warnings when using for comprehensions > --- > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
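Since the feature was declined, a minimal user-side workaround sketch (not part of Spark's API): enrich RDD with a {{withFilter}} alias so that for-comprehension guards compile under Scala 2.12. As the description notes, RDDs are lazy, so delegating to {{filter}} preserves the semantics.
{code:scala}
import org.apache.spark.rdd.RDD

// Bring this implicit into scope wherever for-comprehensions over RDDs are used.
implicit class RddWithFilter[T](val rdd: RDD[T]) extends AnyVal {
  def withFilter(p: T => Boolean): RDD[T] = rdd.filter(p)
}

// With the implicit in scope, the failing example from the discussion compiles:
// (for (n <- sc.parallelize(Seq(1, 2, 3)) if n > 2) yield n).toDebugString
{code}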
[jira] [Updated] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18466: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-24417 > Missing withFilter method causes warnings when using for comprehensions > --- > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18466) Missing withFilter method causes warnings when using for comprehensions
[ https://issues.apache.org/jira/browse/SPARK-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-18466: -- Affects Version/s: 2.4.0 2.3.0 > Missing withFilter method causes warnings when using for comprehensions > --- > > Key: SPARK-18466 > URL: https://issues.apache.org/jira/browse/SPARK-18466 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.3.0, 2.4.0 >Reporter: Richard W. Eggert II >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The fact that the RDD class has a {{filter}} method but not a {{withFilter}} > method results in compiler warnings when using RDDs in {{for}} > comprehensions. As of Scala 2.12, falling back to use of {{filter}} is no > longer supported, so {{for}} comprehensions that use filters will no longer > compile. Semantically, the only difference between {{withFilter}} and > {{filter}} is that {{withFilter}} is lazy, and since RDDs are lazy by nature, > one can simply be aliased to the other. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character(s)
[ https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25098. --- Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/22943 > ‘Cast’ will return NULL when input string starts/ends with special > character(s) > --- > > Key: SPARK-25098 > URL: https://issues.apache.org/jira/browse/SPARK-25098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
> UDF ‘Cast’ will return NULL when the input string starts/ends with special character(s), but Hive doesn't.
> For example, we get the hour from a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);    // starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');   // ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);    // starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');   // ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
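A minimal sketch of the reported behavior and a user-side workaround (trim before casting); per the fix above, in 3.0.0 the cast itself trims the input:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

val spark = SparkSession.builder().appName("cast-trim-sketch").getOrCreate()
import spark.implicits._

val df = Seq(" 2018-08-13").toDF("s")
df.select(col("s").cast("date")).show()        // pre-fix: NULL for the padded string
df.select(trim(col("s")).cast("date")).show()  // workaround: 2018-08-13
{code}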
[jira] [Updated] (SPARK-25098) Trim the string when cast stringToTimestamp and stringToDate
[ https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25098: -- Summary: Trim the string when cast stringToTimestamp and stringToDate (was: ‘Cast’ will return NULL when input string starts/ends with special character(s)) > Trim the string when cast stringToTimestamp and stringToDate > > > Key: SPARK-25098 > URL: https://issues.apache.org/jira/browse/SPARK-25098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: ice bai >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
> UDF ‘Cast’ will return NULL when the input string starts/ends with special character(s), but Hive doesn't.
> For example, we get the hour from a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);    // starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');   // ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);    // starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');   // ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25950: Assignee: Maxim Gekk > from_csv should respect to spark.sql.columnNameOfCorruptRecord > -- > > Key: SPARK-25950 > URL: https://issues.apache.org/jira/browse/SPARK-25950 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > The from_csv() function should respect the SQL config > *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes > into account only the CSV option *columnNameOfCorruptRecord*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25950. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22956 [https://github.com/apache/spark/pull/22956] > from_csv should respect to spark.sql.columnNameOfCorruptRecord > -- > > Key: SPARK-25950 > URL: https://issues.apache.org/jira/browse/SPARK-25950 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > The from_csv() function should respect the SQL config > *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes > into account only the CSV option *columnNameOfCorruptRecord*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
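A sketch of the behavior this fix enables (assuming Spark 3.0.0, where from_csv and the fix are available): with the SQL config set and the configured column present in the target schema, malformed input lands in the corrupt-record column, as it already did for from_json.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().appName("from-csv-sketch").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_broken")

// Include the configured corrupt-record column in the target schema.
val schema = new StructType()
  .add("a", IntegerType)
  .add("_broken", StringType)

Seq("1", "oops").toDF("value")
  .select(from_csv($"value", schema, Map.empty[String, String]).as("parsed"))
  .show(false)
{code}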
[jira] [Created] (SPARK-25959) Difference in featureImportances results on computed vs saved models
Suraj Nayak created SPARK-25959: --- Summary: Difference in featureImportances results on computed vs saved models Key: SPARK-25959 URL: https://issues.apache.org/jira/browse/SPARK-25959 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.2.0 Reporter: Suraj Nayak
I tried to implement GBT and found that the feature importances computed when the model was fit differ once the same model is saved to storage and loaded back.
I also found that once the persisted model is loaded and saved back again and loaded, the feature importances remain the same.
Not sure if it's a bug in storing and reading the model the first time, or whether I am missing some parameter that needs to be set before saving the model (so the model picks up some defaults, causing the feature importances to change).
*Below is the test code:*
val testDF = Seq(
  (1, 3, 2, 1, 1),
  (3, 2, 1, 2, 0),
  (2, 2, 1, 1, 0),
  (3, 4, 2, 2, 0),
  (2, 2, 1, 3, 1)
).toDF("a", "b", "c", "d", "e")
val featureColumns = testDF.columns.filter(_ != "e")
// Assemble the features into a vector
val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
// Transform the data to get the feature data set
val featureDF = assembler.transform(testDF)
// Train a GBT model.
val gbt = new GBTClassifier()
  .setLabelCol("e")
  .setFeaturesCol("features")
  .setMaxDepth(2)
  .setMaxBins(5)
  .setMaxIter(10)
  .setSeed(10)
  .fit(featureDF)
gbt.transform(featureDF).show(false)
// Write out the model
featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
/* Prints
(d,0.5931875075767403)
(a,0.3747184548362353)
(b,0.03209403758702444)
(c,0.0)
*/
gbt.write.overwrite().save("file:///tmp/test123")
println("Reading model again")
val gbtload = GBTClassificationModel.load("file:///tmp/test123")
featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
/* Prints
(d,0.6455841215290767)
(a,0.3316126797964181)
(b,0.022803198674505094)
(c,0.0)
*/
gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
/* prints
(d,0.6455841215290767)
(a,0.3316126797964181)
(b,0.022803198674505094)
(c,0.0)
*/
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
[ https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25947: Assignee: Apache Spark > Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns > - > > Key: SPARK-25947 > URL: https://issues.apache.org/jira/browse/SPARK-25947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Shuheng Dai >Assignee: Apache Spark >Priority: Major > > When sorting rows, ShuffleExchangeExec uses the entire row instead of just > the columns referenced in SortOrder to create the RangePartitioner. This > causes the RangePartitioner to sample entire rows to create rangeBounds and > can cause OOM issues on the driver when rows contain large fields. > Create a projection and only use columns involved in the SortOrder for the > RangePartitioner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
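To make the proposed change concrete, a simplified RDD-level sketch of the idea (Spark's actual fix lives in ShuffleExchangeExec and uses catalyst projections; the names below are illustrative): key each row by only its sort columns, so the range partitioner samples small keys rather than whole rows when computing rangeBounds.
{code:scala}
import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD

case class Record(key: Int, payload: Array[Byte]) // payload may be very large

def rangePartition(records: RDD[Record], partitions: Int): RDD[(Int, Record)] = {
  // Project out just the sort key; the partitioner samples keys, not payloads.
  val keyed = records.map(r => (r.key, r))
  keyed.partitionBy(new RangePartitioner(partitions, keyed))
}
{code}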
[jira] [Commented] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
[ https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677513#comment-16677513 ] Apache Spark commented on SPARK-25947: -- User 'mu5358271' has created a pull request for this issue: https://github.com/apache/spark/pull/22961 > Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns > - > > Key: SPARK-25947 > URL: https://issues.apache.org/jira/browse/SPARK-25947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Shuheng Dai >Priority: Major > > When sorting rows, ShuffleExchangeExec uses the entire row instead of just > the columns referenced in SortOrder to create the RangePartitioner. This > causes the RangePartitioner to sample entire rows to create rangeBounds and > can cause OOM issues on the driver when rows contain large fields. > Create a projection and only use columns involved in the SortOrder for the > RangePartitioner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
[ https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677514#comment-16677514 ] Apache Spark commented on SPARK-25947: -- User 'mu5358271' has created a pull request for this issue: https://github.com/apache/spark/pull/22961 > Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns > - > > Key: SPARK-25947 > URL: https://issues.apache.org/jira/browse/SPARK-25947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Shuheng Dai >Priority: Major > > When sorting rows, ShuffleExchangeExec uses the entire row instead of just > the columns referenced in SortOrder to create the RangePartitioner. This > causes the RangePartitioner to sample entire rows to create rangeBounds and > can cause OOM issues on the driver when rows contain large fields. > Create a projection and only use columns involved in the SortOrder for the > RangePartitioner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25947) Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns
[ https://issues.apache.org/jira/browse/SPARK-25947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25947: Assignee: (was: Apache Spark) > Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns > - > > Key: SPARK-25947 > URL: https://issues.apache.org/jira/browse/SPARK-25947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2 >Reporter: Shuheng Dai >Priority: Major > > When sorting rows, ShuffleExchangeExec uses the entire row instead of just > the columns referenced in SortOrder to create the RangePartitioner. This > causes the RangePartitioner to sample entire rows to create rangeBounds and > can cause OOM issues on the driver when rows contain large fields. > Create a projection and only use columns involved in the SortOrder for the > RangePartitioner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources
[ https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677486#comment-16677486 ] Hyukjin Kwon commented on SPARK-17967: -- It's a rough idea, but I was also thinking of allowing a binary type and sending some CSV settings object directly ({{CsvWriterSettings}}) from Scala and Java. The current Univocity parser allows too many options, and it's kind of troublesome to judge which ones should be added or not (https://github.com/apache/spark/pull/22590). > Support for list or other types as an option for datasources > > > Key: SPARK-17967 > URL: https://issues.apache.org/jira/browse/SPARK-17967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Hyukjin Kwon >Priority: Major > > This was discussed in SPARK-17878 > For other datasources, it seems okay to have a string/long/boolean/double value as > an option, but it seems that is not enough for a datasource such as CSV. As it > is an interface for other external datasources, I guess it'd affect several > ones out there. > I took a first look, but it seems it'd be difficult to support this (it needs > a lot of changes). > One suggestion is to support this as a JSON array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
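For illustration, a hypothetical sketch of the JSON-array suggestion from the description; today {{DataFrameReader.option}} accepts only scalar values, so a datasource receiving the string below would have to parse the array itself:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("list-option-sketch").getOrCreate()

// Hypothetical: a list-valued option encoded as a JSON array inside a string.
val df = spark.read
  .format("csv")
  .option("nullValue", """["NA", "null", "-"]""")
  .load("/tmp/data.csv")
{code}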
[jira] [Created] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()
Ruslan Dautkhanov created SPARK-25958: - Summary: error: [Errno 97] Address family not supported by protocol in dataframe.take() Key: SPARK-25958 URL: https://issues.apache.org/jira/browse/SPARK-25958 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core Affects Versions: 2.3.2, 2.3.1 Reporter: Ruslan Dautkhanov
The following error happens on a heavy Spark job after 4 hours of runtime:
{code:python}
2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: [Errno 97] Address family not supported by protocol
Traceback (most recent call last):
  File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
    item.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", line 53, in create_persistent_data
    single_obj.create_persistent_data()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 21, in create_persistent_data
    main_df = self.generate_dataframe_main()
  File "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 98, in generate_dataframe_main
    raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 16, in get_raw_data_with_metadata_and_aggregation
    main_df = self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
    return_df = self.get_dataframe_from_binary_value_iteration(input_df)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 136, in get_dataframe_from_binary_value_iteration
    combine_df = self.get_dataframe_from_binary_value(input_df=input_df, binary_value=count)
  File "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py", line 154, in get_dataframe_from_binary_value
    if len(results_of_filter_df.take(1)) == 0:
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 504, in take
    return self.limit(num).collect()
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", line 467, in collect
    return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))
  File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 148, in _load_from_socket
    sock = socket.socket(af, socktype, proto)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in __init__
    _sock = _realsocket(family, type, proto)
error: [Errno 97] Address family not supported by protocol
{code}
Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
{code:python}
def _load_from_socket(sock_info, serializer):
    port, auth_secret = sock_info
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(15)
            sock.connect(sa)
        except socket.error:
            sock.close()
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    # The RDD materialization time is unpredicable, if we set a timeout for socket reading
    # operation, it will very possibly fail. See SPARK-18281.
    sock.settimeout(None)
    sockfile = sock.makefile("rwb", 65536)
    do_server_auth(sockfile, auth_secret)
    # The socket will be automatically closed when garbage-collected.
    return serializer.load_stream(sockfile)
{code}
the culprit is in the line
{code:python}
socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
{code}
so the error "error: [Errno 97] *Address family* not supported by protocol" seems to be caused by socket.AF_UNSPEC being passed as the third argument to the socket.getaddrinfo() call. I tried to make a similar socket.getaddrinfo call locally outside of PySpark and it worked fine. RHEL 7.5.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources
[ https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677478#comment-16677478 ] Hyukjin Kwon commented on SPARK-17967: -- For CSV itself, yeah, there are workarounds and I agree - for CSV, it would just be good to have. However, other cases, like specifying a binary format (https://github.com/apache/spark/pull/21192), ideally need this. It is also needed to specify multiple delimiters or date formats (there are already some JIRAs open). > Support for list or other types as an option for datasources > > > Key: SPARK-17967 > URL: https://issues.apache.org/jira/browse/SPARK-17967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Hyukjin Kwon >Priority: Major > > This was discussed in SPARK-17878 > For other datasources, it seems okay to have a string/long/boolean/double value as > an option, but it seems that is not enough for a datasource such as CSV. As it > is an interface for other external datasources, I guess it'd affect several > ones out there. > I took a first look, but it seems it'd be difficult to support this (it needs > a lot of changes). > One suggestion is to support this as a JSON array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17166) CTAS lost table properties after conversion to data source tables.
[ https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-17166. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.4.0 It was decided to support Spark-related Parquet/ORC properties; this is implemented in Spark 2.4.0. Properties unknown to Spark will be ignored intentionally. > CTAS lost table properties after conversion to data source tables. > -- > > Key: SPARK-17166 > URL: https://issues.apache.org/jira/browse/SPARK-17166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > >
> CTAS lost table properties after conversion to data source tables. For example,
> {noformat}
> CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 1 as b
> {noformat}
> The output of `DESC FORMATTED t` does not have the related properties.
> {noformat}
> |Table Parameters:           |            |       |
> | rawDataSize                |-1          |       |
> | numFiles                   |1           |       |
> | transient_lastDdlTime      |1471670983  |       |
> | totalSize                  |496         |       |
> | spark.sql.sources.provider |parquet     |       |
> | EXTERNAL                   |FALSE       |       |
> | COLUMN_STATS_ACCURATE      |false       |       |
> | numRows                    |-1          |       |
> |                            |            |       |
> |# Storage Information       |            |       |
> |SerDe Library:              |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |       |
> |InputFormat:                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |       |
> |Compressed:                 |No          |       |
> |Storage Desc Parameters:    |            |       |
> | serialization.format       |1           |       |
> | path                       |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t|       |
> {noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8.2
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24771: -- Summary: Upgrade AVRO version from 1.7.7 to 1.8.2 (was: Upgrade AVRO version from 1.7.7 to 1.8) > Upgrade AVRO version from 1.7.7 to 1.8.2 > > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24422) Add JDK11 in our Jenkins' build servers
[ https://issues.apache.org/jira/browse/SPARK-24422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-24422: Summary: Add JDK11 in our Jenkins' build servers (was: Add JDK9+ in our Jenkins' build servers) > Add JDK11 in our Jenkins' build servers > --- > > Key: SPARK-24422 > URL: https://issues.apache.org/jira/browse/SPARK-24422 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nagaram Prasad Addepally updated SPARK-25957: - Description: [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] script by default tries to build spark-r image. We may not always build spark distribution with R support. It would be good to skip building and publishing spark-r images if R support is not available in the spark distribution. (was: [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] script by default tries to build spark-r image by default. We may not always build spark distribution with R support. It would be good to skip building and publishing spark-r images if R support is not available in the spark distribution.) > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Priority: Major > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build spark-r image. We may not always build > spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the spark > distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
Nagaram Prasad Addepally created SPARK-25957: Summary: Skip building spark-r docker image if spark distribution does not have R support Key: SPARK-25957 URL: https://issues.apache.org/jira/browse/SPARK-25957 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Nagaram Prasad Addepally [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] script by default tries to build spark-r image by default. We may not always build spark distribution with R support. It would be good to skip building and publishing spark-r images if R support is not available in the spark distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25676) Refactor BenchmarkWideTable to use main method
[ https://issues.apache.org/jira/browse/SPARK-25676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25676. --- Resolution: Fixed Assignee: yucai Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/22823 . > Refactor BenchmarkWideTable to use main method > -- > > Key: SPARK-25676 > URL: https://issues.apache.org/jira/browse/SPARK-25676 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: yucai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources
[ https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677418#comment-16677418 ] Reynold Xin commented on SPARK-17967: - BTW how important is this? Seems like for CSV people can just replace the null-marker values with null themselves using the programmatic API. > Support for list or other types as an option for datasources > > > Key: SPARK-17967 > URL: https://issues.apache.org/jira/browse/SPARK-17967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Hyukjin Kwon >Priority: Major > > This was discussed in SPARK-17878 > For other datasources, it seems okay to have a string/long/boolean/double value as > an option, but it seems that is not enough for a datasource such as CSV. As it > is an interface for other external datasources, I guess it'd affect several > ones out there. > I took a first look, but it seems it'd be difficult to support this (it needs > a lot of changes). > One suggestion is to support this as a JSON array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0
DB Tsai created SPARK-25956: --- Summary: Make Scala 2.12 as default Scala version in Spark 3.0 Key: SPARK-25956 URL: https://issues.apache.org/jira/browse/SPARK-25956 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 2.4.0 Reporter: DB Tsai Scala 2.11 is unlikely to support Java 11 (https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence, we will make Scala 2.12 the default Scala version in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK9+
[ https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677280#comment-16677280 ] Sean Owen commented on SPARK-24421: --- BTW thanks for the tip [~akorzhuev]; I checked and unfortunately DirectByteBuffer in Java 9 holds an instance of jdk.internal.ref.Cleaner, not java.lang.ref.Cleaner. The latter isn't available in Java 8 anyway. So we'd have to hack this with reflection regardless. > sun.misc.Unsafe in JDK9+ > > > Key: SPARK-24421 > URL: https://issues.apache.org/jira/browse/SPARK-24421 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major >
> Many internal APIs such as Unsafe are encapsulated in JDK9+; see http://openjdk.java.net/jeps/260 for details.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module declaration:
> {code:java}
> module java9unsafe {
>     requires jdk.unsupported;
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
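For context, a hedged sketch of the reflection approach mentioned in the comment above: Java 9 added {{sun.misc.Unsafe.invokeCleaner(ByteBuffer)}}, which frees a direct buffer without naming {{jdk.internal.ref.Cleaner}} at compile time (Java 9+ only; on Java 8 the buffer's own cleaner must be used instead).
{code:scala}
import java.nio.ByteBuffer

// Clean a direct buffer reflectively so nothing here references jdk.internal types.
def cleanDirectBuffer(buf: ByteBuffer): Unit = {
  val unsafeClass = Class.forName("sun.misc.Unsafe")
  val theUnsafe = unsafeClass.getDeclaredField("theUnsafe")
  theUnsafe.setAccessible(true)
  val unsafe = theUnsafe.get(null)
  // invokeCleaner(ByteBuffer) exists on sun.misc.Unsafe since Java 9.
  unsafeClass.getMethod("invokeCleaner", classOf[ByteBuffer]).invoke(unsafe, buf)
}
{code}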
[jira] [Commented] (SPARK-25922) [K8] Spark Driver/Executor "spark-app-selector" label mismatch
[ https://issues.apache.org/jira/browse/SPARK-25922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677276#comment-16677276 ] Yinan Li commented on SPARK-25922: -- The application ID used to set the {{spark-app-selector}} label for the driver pod is from this line: https://github.com/apache/spark/blob/3404a73f4cf7be37e574026d08ad5cf82cfac871/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L217. The application ID used to set the {{spark-app-selector}} label for the executor pods is from this line: https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L87. Agreed that it's problematic that two different label values are used. > [K8] Spark Driver/Executor "spark-app-selector" label mismatch > -- > > Key: SPARK-25922 > URL: https://issues.apache.org/jira/browse/SPARK-25922 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 RC4 >Reporter: Anmol Khurana >Priority: Major >
> Hi,
> I have been testing Spark 2.4.0 RC4 on Kubernetes to run Python Spark applications and am running into an issue where the AppId label on the driver and executors mismatch. I am using the https://github.com/GoogleCloudPlatform/spark-on-k8s-operator to run these applications.
> I see a spark.app.id of the form spark-* as the "spark-app-selector" label on the driver, as well as in the K8 config-map which gets created for the driver via spark-submit. My guess is this is coming from https://github.com/apache/spark/blob/f6cc354d83c2c9a757f9b507aadd4dbdc5825cca/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L211
> But when the driver actually comes up and brings up executors etc., I see that the "spark-app-selector" label on the executors, as well as the spark.app.Id config within the user-code on the driver, is something of the form spark-application-* (probably from https://github.com/apache/spark/blob/b19a28dea098c7d6188f8540429c50f42952d678/core/src/main/scala/org/apache/spark/SparkContext.scala#L511 & https://github.com/apache/spark/blob/bfb74394a5513134ea1da9fcf4a1783b77dd64e4/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L26)
> We were consuming this "spark-app-selector" label on the Driver Pod to get the App Id and use it to look up the app in the Spark History server (among other use-cases). But due to this mismatch, this logic no longer works. This was working fine in the Spark 2.2 fork for Kubernetes which I was using earlier. Is this expected behavior, and if yes, what's the correct way to fetch the applicationId from outside the application? Let me know if I can provide any more details or if I am doing something wrong. Here is an example run with different *spark-app-selector* labels on the driver/executor:
> {code:java}
> Name: pyfiles-driver
> Namespace: default
> Priority: 0
> PriorityClassName:
> Start Time: Thu, 01 Nov 2018 18:19:46 -0700
> Labels: spark-app-selector=spark-b78bb10feebf4e2d98c11d7b6320e18f
> spark-role=driver
> sparkoperator.k8s.io/app-name=pyfiles
> sparkoperator.k8s.io/launched-by-spark-operator=true
> version=2.4.0
> Status: Running
>
> Name: pyfiles-1541121585642-exec-1
> Namespace: default
> Priority: 0
> PriorityClassName:
> Start Time: Thu, 01 Nov 2018 18:24:02 -0700
> Labels: spark-app-selector=spark-application-1541121829445
> spark-exec-id=1
> spark-role=executor
> sparkoperator.k8s.io/app-name=pyfiles
> sparkoperator.k8s.io/launched-by-spark-operator=true
> version=2.4.0
> Status: Pending
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25955) Porting JSON test for CSV functions
[ https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25955: Assignee: (was: Apache Spark) > Porting JSON test for CSV functions > --- > > Key: SPARK-25955 > URL: https://issues.apache.org/jira/browse/SPARK-25955 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor >
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25955) Porting JSON test for CSV functions
[ https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677272#comment-16677272 ] Apache Spark commented on SPARK-25955: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22960 > Porting JSON test for CSV functions > --- > > Key: SPARK-25955 > URL: https://issues.apache.org/jira/browse/SPARK-25955 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor >
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25955) Porting JSON test for CSV functions
[ https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25955: Assignee: Apache Spark > Porting JSON test for CSV functions > --- > > Key: SPARK-25955 > URL: https://issues.apache.org/jira/browse/SPARK-25955 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor >
> JsonFunctionsSuite contains tests that are applicable and useful for the CSV functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers the schema of a CSV string and passes it to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25955) Porting JSON test for CSV functions
Maxim Gekk created SPARK-25955: -- Summary: Porting JSON test for CSV functions Key: SPARK-25955 URL: https://issues.apache.org/jira/browse/SPARK-25955 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk
JsonFunctionsSuite contains tests that are applicable and useful for the CSV functions - from_csv, to_csv and schema_of_csv:
* uses DDL strings for defining a schema - java
* roundtrip to_csv -> from_csv
* roundtrip from_csv -> to_csv
* infers the schema of a CSV string and passes it to from_csv
* Support to_csv in SQL
* Support from_csv in SQL
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
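As one concrete example of such a roundtrip test, a sketch assuming the Spark 3.0 CSV functions (illustrative; not the suite's actual code):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_csv, struct, to_csv}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().appName("csv-roundtrip-sketch").getOrCreate()
import spark.implicits._

val schema = new StructType().add("i", IntegerType).add("s", StringType)

// Roundtrip: struct -> CSV string -> struct should preserve the values.
Seq((1, "a")).toDF("i", "s")
  .select(to_csv(struct($"i", $"s")).as("csv"))
  .select(from_csv($"csv", schema, Map.empty[String, String]).as("roundtrip"))
  .show(false)
{code}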
[jira] [Updated] (SPARK-25222) Spark on Kubernetes Pod Watcher dumps raw container status
[ https://issues.apache.org/jira/browse/SPARK-25222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-25222: --- Fix Version/s: 3.0.0 > Spark on Kubernetes Pod Watcher dumps raw container status > -- > > Key: SPARK-25222 > URL: https://issues.apache.org/jira/browse/SPARK-25222 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Minor > Fix For: 3.0.0 >
> Spark on Kubernetes provides logging of the pod/container status as a monitor of the job progress. However, the logger just dumps the raw container status object, leading to fairly unreadable output like so:
> {noformat}
> 18/08/24 09:03:27 INFO LoggingPodStatusWatcherImpl: State changed, new state:
> pod name: spark-groupby-1535101393784-driver
> namespace: default
> labels: spark-app-selector -> spark-47f7248122b9444b8d5fd3701028a1e8, spark-role -> driver
> pod uid: 88de6467-a77c-11e8-b9da-a4bf0128b75b
> creation time: 2018-08-24T09:03:14Z
> service account name: spark
> volumes: spark-local-dir-1, spark-conf-volume, spark-token-kjxkv
> node name: tab-cmp4
> start time: 2018-08-24T09:03:14Z
> container images: rvesse/spark:latest
> phase: Running
> status: [ContainerStatus(containerID=docker://23ae58571f59505e837dca40455d0347fb90e9b88e2a2b145a38e2919fceb447, image=rvesse/spark:latest, imageID=docker-pullable://rvesse/spark@sha256:92abf0b718743d0f5a26068fc94ec42233db0493c55a8570dc8c851c62a4bc0a, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-08-24T09:03:26Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> {noformat}
> The {{LoggingPodStatusWatcher}} actually already includes code to nicely format this information but only invokes it at the end of the job:
> {noformat}
> 18/08/24 09:04:07 INFO LoggingPodStatusWatcherImpl: Container final statuses:
> Container name: spark-kubernetes-driver
> Container image: rvesse/spark:latest
> Container state: Terminated
> Exit code: 0
> {noformat}
> It would be nice if we continually used the nice formatting throughout the logging.
> We already have patched this on our internal fork and will upstream a fix shortly.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25222) Spark on Kubernetes Pod Watcher dumps raw container status
[ https://issues.apache.org/jira/browse/SPARK-25222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25222: -- Assignee: Rob Vesse > Spark on Kubernetes Pod Watcher dumps raw container status > -- > > Key: SPARK-25222 > URL: https://issues.apache.org/jira/browse/SPARK-25222 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Assignee: Rob Vesse >Priority: Minor > Fix For: 3.0.0 >
> Spark on Kubernetes provides logging of the pod/container status as a monitor of the job progress. However, the logger just dumps the raw container status object, leading to fairly unreadable output like so:
> {noformat}
> 18/08/24 09:03:27 INFO LoggingPodStatusWatcherImpl: State changed, new state:
> pod name: spark-groupby-1535101393784-driver
> namespace: default
> labels: spark-app-selector -> spark-47f7248122b9444b8d5fd3701028a1e8, spark-role -> driver
> pod uid: 88de6467-a77c-11e8-b9da-a4bf0128b75b
> creation time: 2018-08-24T09:03:14Z
> service account name: spark
> volumes: spark-local-dir-1, spark-conf-volume, spark-token-kjxkv
> node name: tab-cmp4
> start time: 2018-08-24T09:03:14Z
> container images: rvesse/spark:latest
> phase: Running
> status: [ContainerStatus(containerID=docker://23ae58571f59505e837dca40455d0347fb90e9b88e2a2b145a38e2919fceb447, image=rvesse/spark:latest, imageID=docker-pullable://rvesse/spark@sha256:92abf0b718743d0f5a26068fc94ec42233db0493c55a8570dc8c851c62a4bc0a, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=true, restartCount=0, state=ContainerState(running=ContainerStateRunning(startedAt=Time(time=2018-08-24T09:03:26Z, additionalProperties={}), additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})]
> {noformat}
> The {{LoggingPodStatusWatcher}} actually already includes code to nicely format this information but only invokes it at the end of the job:
> {noformat}
> 18/08/24 09:04:07 INFO LoggingPodStatusWatcherImpl: Container final statuses:
> Container name: spark-kubernetes-driver
> Container image: rvesse/spark:latest
> Container state: Terminated
> Exit code: 0
> {noformat}
> It would be nice if we continually used the nice formatting throughout the logging.
> We already have patched this on our internal fork and will upstream a fix shortly.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
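A simplified sketch of the kind of per-container formatting the issue asks for (field names follow the fabric8 kubernetes-client model that Spark on K8s uses; this is illustrative, not the actual patch):
{code:scala}
import io.fabric8.kubernetes.api.model.ContainerStatus

// Render one container status in the compact style shown in the "final statuses" log.
def describe(status: ContainerStatus): String = {
  val state = status.getState
  val phase =
    if (state.getRunning != null) s"Running since ${state.getRunning.getStartedAt}"
    else if (state.getTerminated != null) s"Terminated, exit code ${state.getTerminated.getExitCode}"
    else "Waiting"
  s"""|Container name: ${status.getName}
      |Container image: ${status.getImage}
      |Container state: $phase""".stripMargin
}
{code}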
[jira] [Resolved] (SPARK-25871) Streaming WAL should not use hdfs erasure coding, regardless of FS defaults
[ https://issues.apache.org/jira/browse/SPARK-25871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25871. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22882 [https://github.com/apache/spark/pull/22882] > Streaming WAL should not use hdfs erasure coding, regardless of FS defaults > --- > > Key: SPARK-25871 > URL: https://issues.apache.org/jira/browse/SPARK-25871 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 3.0.0 > > > The {{FileBasedWriteAheadLogWriter}} expects the output stream for the WAL to > support {{hflush()}}, but hdfs erasure coded files do not support that. > https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations > otherwise you get exceptions like: > {noformat} > 17/10/17 17:31:34 ERROR executor.Executor: Exception in task 0.2 in stage 6.0 > (TID 85) > org.apache.spark.SparkException: Could not read data from write ahead log > record > FileBasedWriteAheadLogSegment(hdfs://quasar-yxckyb-1.vpc.cloudera.com:8020/tmp/__spark__a10be3a3-85ec-4d4f-8782-a4760df4cc4c/88657/checkpoints/receivedData/0/log-1508286672978-1508286732978,1321921,189000) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.EOFException: Cannot seek after EOF > at > org.apache.hadoop.hdfs.DFSStripedInputStream.seek(DFSStripedInputStream.java:331) > at > org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:120) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:142) > ... 
18 more > {noformat} > HDFS allows you to force a file to be replicated, regardless of the FS > defaults -- we should do that for the WAL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
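As a rough sketch of the proposed direction (assuming the Hadoop 3.x output-stream builder API; this is not the merged patch), the WAL writer could force a replicated, non-erasure-coded file whenever the target filesystem is HDFS:
{code:scala}
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

// replicate() overrides any erasure-coding policy inherited from the parent
// directory, so the returned stream keeps supporting hflush()/hsync().
def createWalFile(fs: FileSystem, path: Path): FSDataOutputStream = fs match {
  case dfs: DistributedFileSystem => dfs.createFile(path).replicate().build()
  case _ => fs.create(path) // non-HDFS filesystems: no erasure-coding concern
}
{code}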
[jira] [Assigned] (SPARK-25871) Streaming WAL should not use hdfs erasure coding, regardless of FS defaults
[ https://issues.apache.org/jira/browse/SPARK-25871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25871: -- Assignee: Imran Rashid > Streaming WAL should not use hdfs erasure coding, regardless of FS defaults > --- > > Key: SPARK-25871 > URL: https://issues.apache.org/jira/browse/SPARK-25871 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 3.0.0 > > > The {{FileBasedWriteAheadLogWriter}} expects the output stream for the WAL to > support {{hflush()}}, but hdfs erasure coded files do not support that. > https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations > otherwise you get exceptions like: > {noformat} > 17/10/17 17:31:34 ERROR executor.Executor: Exception in task 0.2 in stage 6.0 > (TID 85) > org.apache.spark.SparkException: Could not read data from write ahead log > record > FileBasedWriteAheadLogSegment(hdfs://quasar-yxckyb-1.vpc.cloudera.com:8020/tmp/__spark__a10be3a3-85ec-4d4f-8782-a4760df4cc4c/88657/checkpoints/receivedData/0/log-1508286672978-1508286732978,1321921,189000) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.EOFException: Cannot seek after EOF > at > org.apache.hadoop.hdfs.DFSStripedInputStream.seek(DFSStripedInputStream.java:331) > at > org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:120) > at > org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:142) > ... 
18 more > {noformat} > HDFS allows you to force a file to be replicated, regardless of the FS > defaults -- we should do that for the WAL. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677163#comment-16677163 ] Dongjoon Hyun commented on SPARK-25954: --- Thanks. Of course, it's for Spark 3.0 next year. > Upgrade to Kafka 2.1.0 > -- > > Key: SPARK-25954 > URL: https://issues.apache.org/jira/browse/SPARK-25954 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Kafka 2.1.0 RC0 has started. Since it includes official JDK 11 support (KAFKA-7264), we had > better use it. > - > https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677152#comment-16677152 ] Ted Yu commented on SPARK-25954: Looking at the Kafka thread, a message from Satish indicated there may be another RC coming. > Upgrade to Kafka 2.1.0 > -- > > Key: SPARK-25954 > URL: https://issues.apache.org/jira/browse/SPARK-25954 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Kafka 2.1.0 RC0 has started. Since it includes official JDK 11 support (KAFKA-7264), we had > better use it. > - > https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25876) Simplify configuration types in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677138#comment-16677138 ] Apache Spark commented on SPARK-25876: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22959 > Simplify configuration types in k8s backend > --- > > Key: SPARK-25876 > URL: https://issues.apache.org/jira/browse/SPARK-25876 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a child of SPARK-25874 to deal with the current issues with the > different configuration objects used in the k8s backend. Please refer to the > parent for further discussion of what this means. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25876) Simplify configuration types in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25876: Assignee: (was: Apache Spark) > Simplify configuration types in k8s backend > --- > > Key: SPARK-25876 > URL: https://issues.apache.org/jira/browse/SPARK-25876 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a child of SPARK-25874 to deal with the current issues with the > different configuration objects used in the k8s backend. Please refer to the > parent for further discussion of what this means. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25876) Simplify configuration types in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25876: Assignee: Apache Spark > Simplify configuration types in k8s backend > --- > > Key: SPARK-25876 > URL: https://issues.apache.org/jira/browse/SPARK-25876 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > This is a child of SPARK-25874 to deal with the current issues with the > different configuration objects used in the k8s backend. Please refer to the > parent for further discussion of what this means. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677107#comment-16677107 ] Dongjoon Hyun commented on SPARK-25954: --- Ping, [~te...@apache.org]. Are you interested in this since you did SPARK-18057? > Upgrade to Kafka 2.1.0 > -- > > Key: SPARK-25954 > URL: https://issues.apache.org/jira/browse/SPARK-25954 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Kafka 2.1.0 RC0 has started. Since it includes official JDK 11 support (KAFKA-7264), we had > better use it. > - > https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25954) Upgrade to Kafka 2.1.0
Dongjoon Hyun created SPARK-25954: - Summary: Upgrade to Kafka 2.1.0 Key: SPARK-25954 URL: https://issues.apache.org/jira/browse/SPARK-25954 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Kafka 2.1.0 RC0 has started. Since it includes official JDK 11 support (KAFKA-7264), we had better use it. - https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25953) install jdk11 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-25953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp updated SPARK-25953: Description: once we pin down exactly what we want installed on the jenkins workers, I will add it to our ansible and deploy. > install jdk11 on jenkins workers > > > Key: SPARK-25953 > URL: https://issues.apache.org/jira/browse/SPARK-25953 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Critical > > once we pin down exactly what we want installed on the jenkins workers, I will > add it to our ansible and deploy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema
[ https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677073#comment-16677073 ] Apache Spark commented on SPARK-25952: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22958 > from_json returns wrong result if corrupt record column is in the middle of > schema > -- > > Key: SPARK-25952 > URL: https://issues.apache.org/jira/browse/SPARK-25952 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > If an user specifies a corrupt record column via > spark.sql.columnNameOfCorruptRecord or JSON options > columnNameOfCorruptRecord, schema with the column is propagated to Jackson > parser. This breaks an assumption inside of FailureSafeParser that a row > returned from Jackson Parser contains only actual data. As a consequence of > that FailureSafeParser writes a bad record in wrong position. > For example: > {code:scala} > val schema = new StructType() > .add("a", IntegerType) > .add("_unparsed", StringType) > .add("b", IntegerType) > val badRec = """{"a" 1, "b": 11}""" > val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS() > {code} > the collect() action below > {code:scala} > df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> > "_unparsed"))).collect() > {code} > loses 12: > {code} > Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null))) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema
[ https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25952: Assignee: Apache Spark > from_json returns wrong result if corrupt record column is in the middle of > schema > -- > > Key: SPARK-25952 > URL: https://issues.apache.org/jira/browse/SPARK-25952 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > If an user specifies a corrupt record column via > spark.sql.columnNameOfCorruptRecord or JSON options > columnNameOfCorruptRecord, schema with the column is propagated to Jackson > parser. This breaks an assumption inside of FailureSafeParser that a row > returned from Jackson Parser contains only actual data. As a consequence of > that FailureSafeParser writes a bad record in wrong position. > For example: > {code:scala} > val schema = new StructType() > .add("a", IntegerType) > .add("_unparsed", StringType) > .add("b", IntegerType) > val badRec = """{"a" 1, "b": 11}""" > val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS() > {code} > the collect() action below > {code:scala} > df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> > "_unparsed"))).collect() > {code} > loses 12: > {code} > Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null))) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema
[ https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25952: Assignee: (was: Apache Spark) > from_json returns wrong result if corrupt record column is in the middle of > schema > -- > > Key: SPARK-25952 > URL: https://issues.apache.org/jira/browse/SPARK-25952 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > If an user specifies a corrupt record column via > spark.sql.columnNameOfCorruptRecord or JSON options > columnNameOfCorruptRecord, schema with the column is propagated to Jackson > parser. This breaks an assumption inside of FailureSafeParser that a row > returned from Jackson Parser contains only actual data. As a consequence of > that FailureSafeParser writes a bad record in wrong position. > For example: > {code:scala} > val schema = new StructType() > .add("a", IntegerType) > .add("_unparsed", StringType) > .add("b", IntegerType) > val badRec = """{"a" 1, "b": 11}""" > val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS() > {code} > the collect() action below > {code:scala} > df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> > "_unparsed"))).collect() > {code} > loses 12: > {code} > Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null))) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25953) install jdk11 on jenkins workers
shane knapp created SPARK-25953: --- Summary: install jdk11 on jenkins workers Key: SPARK-25953 URL: https://issues.apache.org/jira/browse/SPARK-25953 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.0.0 Reporter: shane knapp Assignee: shane knapp -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema
Maxim Gekk created SPARK-25952: -- Summary: from_json returns wrong result if corrupt record column is in the middle of schema Key: SPARK-25952 URL: https://issues.apache.org/jira/browse/SPARK-25952 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk If a user specifies a corrupt record column via spark.sql.columnNameOfCorruptRecord or the JSON option columnNameOfCorruptRecord, the schema with that column is propagated to the Jackson parser. This breaks an assumption inside FailureSafeParser that a row returned from the Jackson parser contains only actual data. As a consequence, FailureSafeParser writes the bad record in the wrong position. For example: {code:scala} val schema = new StructType() .add("a", IntegerType) .add("_unparsed", StringType) .add("b", IntegerType) val badRec = """{"a" 1, "b": 11}""" val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS() {code} the collect() action below {code:scala} df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> "_unparsed"))).collect() {code} loses 12: {code} Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null))) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
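For reference, the expected result (derived by hand from the schema above, not output from a fixed build) would keep b = 12 for the valid row:
{code}
Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, 12)))
{code}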
[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677040#comment-16677040 ] Dongjoon Hyun commented on SPARK-24417: --- +1 for moving to Scala 2.12 as default Scala version. > Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK11 > As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per > community discussion, we will skip JDK9 and 10 to support JDK 11 directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25951) Redundant shuffle if column is renamed
[ https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25951: Assignee: (was: Apache Spark) > Redundant shuffle if column is renamed > -- > > Key: SPARK-25951 > URL: https://issues.apache.org/jira/browse/SPARK-25951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Minor > > we've noticed that sometimes a column rename causes extra shuffle: > {code:java} > val N = 1 << 12 > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > val t1 = spark.range(N).selectExpr("floor(id/4) as key1") > val t2 = spark.range(N).selectExpr("floor(id/4) as key2") > import org.apache.spark.sql.functions._ > t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) > .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", > "key3"), > col("key1")===col("key3")) > .explain() > {code} > results in: > {noformat} > == Physical Plan == > *(6) SortMergeJoin [key1#6L], [key3#22L], Inner > :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 > : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], > output=[key1#6L, cnt1#14L]) > : +- Exchange hashpartitioning(key1#6L, 2) > :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], > output=[key1#6L, count#39L]) > : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] > : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) > : +- *(1) Range (0, 4096, step=1, splits=1) > +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(key3#22L, 2) > +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], > output=[key3#22L, cnt2#19L]) > +- Exchange hashpartitioning(key2#10L, 2) > +- *(3) HashAggregate(keys=[key2#10L], > functions=[partial_count(1)], output=[key2#10L, count#41L]) >+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS > key2#10L] > +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / > 4.0))) > +- *(3) Range (0, 4096, step=1, splits=1) > {noformat} > I was able to track it down to this code in class HashPartitioning: > {code:java} > case h: HashClusteredDistribution => > expressions.length == h.expressions.length && > expressions.zip(h.expressions).forall { > case (l, r) => l.semanticEquals(r) > } > {code} > the semanticEquals returns false as it compares key2 and key3 eventhough key3 > is just a rename of key2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25951) Redundant shuffle if column is renamed
[ https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25951: Assignee: Apache Spark > Redundant shuffle if column is renamed > -- > > Key: SPARK-25951 > URL: https://issues.apache.org/jira/browse/SPARK-25951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Assignee: Apache Spark >Priority: Minor > > we've noticed that sometimes a column rename causes extra shuffle: > {code:java} > val N = 1 << 12 > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > val t1 = spark.range(N).selectExpr("floor(id/4) as key1") > val t2 = spark.range(N).selectExpr("floor(id/4) as key2") > import org.apache.spark.sql.functions._ > t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) > .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", > "key3"), > col("key1")===col("key3")) > .explain() > {code} > results in: > {noformat} > == Physical Plan == > *(6) SortMergeJoin [key1#6L], [key3#22L], Inner > :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 > : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], > output=[key1#6L, cnt1#14L]) > : +- Exchange hashpartitioning(key1#6L, 2) > :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], > output=[key1#6L, count#39L]) > : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] > : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) > : +- *(1) Range (0, 4096, step=1, splits=1) > +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(key3#22L, 2) > +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], > output=[key3#22L, cnt2#19L]) > +- Exchange hashpartitioning(key2#10L, 2) > +- *(3) HashAggregate(keys=[key2#10L], > functions=[partial_count(1)], output=[key2#10L, count#41L]) >+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS > key2#10L] > +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / > 4.0))) > +- *(3) Range (0, 4096, step=1, splits=1) > {noformat} > I was able to track it down to this code in class HashPartitioning: > {code:java} > case h: HashClusteredDistribution => > expressions.length == h.expressions.length && > expressions.zip(h.expressions).forall { > case (l, r) => l.semanticEquals(r) > } > {code} > the semanticEquals returns false as it compares key2 and key3 eventhough key3 > is just a rename of key2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25951) Redundant shuffle if column is renamed
[ https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676941#comment-16676941 ] Apache Spark commented on SPARK-25951: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22957 > Redundant shuffle if column is renamed > -- > > Key: SPARK-25951 > URL: https://issues.apache.org/jira/browse/SPARK-25951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Minor > > we've noticed that sometimes a column rename causes extra shuffle: > {code:java} > val N = 1 << 12 > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > val t1 = spark.range(N).selectExpr("floor(id/4) as key1") > val t2 = spark.range(N).selectExpr("floor(id/4) as key2") > import org.apache.spark.sql.functions._ > t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) > .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", > "key3"), > col("key1")===col("key3")) > .explain() > {code} > results in: > {noformat} > == Physical Plan == > *(6) SortMergeJoin [key1#6L], [key3#22L], Inner > :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 > : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], > output=[key1#6L, cnt1#14L]) > : +- Exchange hashpartitioning(key1#6L, 2) > :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], > output=[key1#6L, count#39L]) > : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] > : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) > : +- *(1) Range (0, 4096, step=1, splits=1) > +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(key3#22L, 2) > +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], > output=[key3#22L, cnt2#19L]) > +- Exchange hashpartitioning(key2#10L, 2) > +- *(3) HashAggregate(keys=[key2#10L], > functions=[partial_count(1)], output=[key2#10L, count#41L]) >+- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS > key2#10L] > +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / > 4.0))) > +- *(3) Range (0, 4096, step=1, splits=1) > {noformat} > I was able to track it down to this code in class HashPartitioning: > {code:java} > case h: HashClusteredDistribution => > expressions.length == h.expressions.length && > expressions.zip(h.expressions).forall { > case (l, r) => l.semanticEquals(r) > } > {code} > the semanticEquals returns false as it compares key2 and key3 eventhough key3 > is just a rename of key2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25951) Redundant shuffle if column is renamed
[ https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ohad Raviv updated SPARK-25951: --- Description: we've noticed that sometimes a column rename causes extra shuffle: {code:java} val N = 1 << 12 spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") val t1 = spark.range(N).selectExpr("floor(id/4) as key1") val t2 = spark.range(N).selectExpr("floor(id/4) as key2") import org.apache.spark.sql.functions._ t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"), col("key1")===col("key3")) .explain() {code} results in: {noformat} == Physical Plan == *(6) SortMergeJoin [key1#6L], [key3#22L], Inner :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L]) : +- Exchange hashpartitioning(key1#6L, 2) :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L]) : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) : +- *(1) Range (0, 4096, step=1, splits=1) +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key3#22L, 2) +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L]) +- Exchange hashpartitioning(key2#10L, 2) +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L]) +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L] +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0))) +- *(3) Range (0, 4096, step=1, splits=1) {noformat} I was able to track it down to this code in class HashPartitioning: {code:java} case h: HashClusteredDistribution => expressions.length == h.expressions.length && expressions.zip(h.expressions).forall { case (l, r) => l.semanticEquals(r) } {code} the semanticEquals returns false as it compares key2 and key3 eventhough key3 is just a rename of key2 was: we've noticed that sometimes a column rename causes extra shuffle: {code} val N = 1 << 12 spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") val t1 = spark.range(N).selectExpr("floor(id/4) as key1") val t2 = spark.range(N).selectExpr("floor(id/4) as key2") t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"), col("key1")===col("key3")) .explain() {code} results in: {noformat} == Physical Plan == *(6) SortMergeJoin [key1#6L], [key3#22L], Inner :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L]) : +- Exchange hashpartitioning(key1#6L, 2) :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L]) : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) : +- *(1) Range (0, 4096, step=1, splits=1) +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key3#22L, 2) +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L]) +- Exchange hashpartitioning(key2#10L, 2) +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L]) +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L] +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0))) +- *(3) Range (0, 4096, 
step=1, splits=1) {noformat} I was able to track it down to this code in class HashPartitioning: {code} case h: HashClusteredDistribution => expressions.length == h.expressions.length && expressions.zip(h.expressions).forall { case (l, r) => l.semanticEquals(r) } {code} the semanticEquals returns false as it compares key2 and key3 eventhough key3 is just a rename of key2 > Redundant shuffle if column is renamed > -- > > Key: SPARK-25951 > URL: https://issues.apache.org/jira/browse/SPARK-25951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Minor > > we've noticed that sometimes a column rename causes extra shuffle: > {code:java} > val N = 1 << 12 > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > val t1 = spark.range(N).selectExpr("floor(id/4) as key1") > val t2 = spark.range(N).selectExpr("floor(id/4) as key2") > import org.apache.spark.sql.functions._
[jira] [Resolved] (SPARK-25866) Update KMeans formatVersion
[ https://issues.apache.org/jira/browse/SPARK-25866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25866. - Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 Issue resolved by pull request 22873 [https://github.com/apache/spark/pull/22873] > Update KMeans formatVersion > --- > > Key: SPARK-25866 > URL: https://issues.apache.org/jira/browse/SPARK-25866 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > KMeans's {{formatVersion}} has not been updated to 2.0, when the > distanceMeasure parameter has been added. Despite this causes no issue, as > {{formatVersion}} is not used anywhere, the information returned is wrong. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25866) Update KMeans formatVersion
[ https://issues.apache.org/jira/browse/SPARK-25866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25866: --- Assignee: Marco Gaido > Update KMeans formatVersion > --- > > Key: SPARK-25866 > URL: https://issues.apache.org/jira/browse/SPARK-25866 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > KMeans's {{formatVersion}} has not been updated to 2.0, when the > distanceMeasure parameter has been added. Despite this causes no issue, as > {{formatVersion}} is not used anywhere, the information returned is wrong. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25951) Redundant shuffle if column is renamed
[ https://issues.apache.org/jira/browse/SPARK-25951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ohad Raviv updated SPARK-25951: --- Description: we've noticed that sometimes a column rename causes extra shuffle: {code} val N = 1 << 12 spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") val t1 = spark.range(N).selectExpr("floor(id/4) as key1") val t2 = spark.range(N).selectExpr("floor(id/4) as key2") t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"), col("key1")===col("key3")) .explain() {code} results in: {noformat} == Physical Plan == *(6) SortMergeJoin [key1#6L], [key3#22L], Inner :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L]) : +- Exchange hashpartitioning(key1#6L, 2) :+- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L]) : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) : +- *(1) Range (0, 4096, step=1, splits=1) +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key3#22L, 2) +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L]) +- Exchange hashpartitioning(key2#10L, 2) +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L]) +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L] +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0))) +- *(3) Range (0, 4096, step=1, splits=1) {noformat} I was able to track it down to this code in class HashPartitioning: {code} case h: HashClusteredDistribution => expressions.length == h.expressions.length && expressions.zip(h.expressions).forall { case (l, r) => l.semanticEquals(r) } {code} the semanticEquals returns false as it compares key2 and key3 eventhough key3 is just a rename of key2 was: we've noticed that sometimes a column rename causes extra shuffle: {code} val N = 1 << 12 spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") val t1 = spark.range(N).selectExpr("floor(id/4) as key1") val t2 = spark.range(N).selectExpr("floor(id/4) as key2") t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"), col("key1")===col("key3")) .explain(true) {code} results in: {code} == Physical Plan == *(6) SortMergeJoin [key1#6L], [key3#22L], Inner :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L]) : +- Exchange hashpartitioning(key1#6L, 2) : +- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L]) : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) : +- *(1) Range (0, 4096, step=1, splits=1) +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key3#22L, 2) +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L]) +- Exchange hashpartitioning(key2#10L, 2) +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L]) +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L] +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0))) +- *(3) Range (0, 4096, step=1, splits=1) {code} I was able to track it down to 
this code in class HashPartitioning: {code} case h: HashClusteredDistribution => expressions.length == h.expressions.length && expressions.zip(h.expressions).forall { case (l, r) => l.semanticEquals(r) } {code} the semanticEquals returns false as it compares key2 and key3 eventhough key3 is just a rename of key2 > Redundant shuffle if column is renamed > -- > > Key: SPARK-25951 > URL: https://issues.apache.org/jira/browse/SPARK-25951 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Minor > > we've noticed that sometimes a column rename causes extra shuffle: > {code} > val N = 1 << 12 > spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") > val t1 = spark.range(N).selectExpr("floor(id/4) as key1") > val t2 = spark.range(N).selectExpr("floor(id/4) as key2") > t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) > .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", > "key3"), > col("key1")===col("key3")) > .explain() > {code} > results in: > {noformat} > ==
[jira] [Created] (SPARK-25951) Redundant shuffle if column is renamed
Ohad Raviv created SPARK-25951: -- Summary: Redundant shuffle if column is renamed Key: SPARK-25951 URL: https://issues.apache.org/jira/browse/SPARK-25951 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Ohad Raviv We've noticed that sometimes a column rename causes an extra shuffle: {code} val N = 1 << 12 spark.sql("set spark.sql.autoBroadcastJoinThreshold=0") val t1 = spark.range(N).selectExpr("floor(id/4) as key1") val t2 = spark.range(N).selectExpr("floor(id/4) as key2") import org.apache.spark.sql.functions._ t1.groupBy("key1").agg(count(lit("1")).as("cnt1")) .join(t2.groupBy("key2").agg(count(lit("1")).as("cnt2")).withColumnRenamed("key2", "key3"), col("key1")===col("key3")) .explain(true) {code} results in: {code} == Physical Plan == *(6) SortMergeJoin [key1#6L], [key3#22L], Inner :- *(2) Sort [key1#6L ASC NULLS FIRST], false, 0 : +- *(2) HashAggregate(keys=[key1#6L], functions=[count(1)], output=[key1#6L, cnt1#14L]) : +- Exchange hashpartitioning(key1#6L, 2) : +- *(1) HashAggregate(keys=[key1#6L], functions=[partial_count(1)], output=[key1#6L, count#39L]) : +- *(1) Project [FLOOR((cast(id#4L as double) / 4.0)) AS key1#6L] : +- *(1) Filter isnotnull(FLOOR((cast(id#4L as double) / 4.0))) : +- *(1) Range (0, 4096, step=1, splits=1) +- *(5) Sort [key3#22L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key3#22L, 2) +- *(4) HashAggregate(keys=[key2#10L], functions=[count(1)], output=[key3#22L, cnt2#19L]) +- Exchange hashpartitioning(key2#10L, 2) +- *(3) HashAggregate(keys=[key2#10L], functions=[partial_count(1)], output=[key2#10L, count#41L]) +- *(3) Project [FLOOR((cast(id#8L as double) / 4.0)) AS key2#10L] +- *(3) Filter isnotnull(FLOOR((cast(id#8L as double) / 4.0))) +- *(3) Range (0, 4096, step=1, splits=1) {code} I was able to track it down to this code in class HashPartitioning: {code} case h: HashClusteredDistribution => expressions.length == h.expressions.length && expressions.zip(h.expressions).forall { case (l, r) => l.semanticEquals(r) } {code} The semanticEquals check returns false as it compares key2 and key3, even though key3 is just a rename of key2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
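Until the optimizer recognizes such aliases, one workaround is to perform the rename before the aggregation, so the aggregate's output partitioning is already expressed in terms of the final attribute. A minimal sketch reusing N, t1 and t2 from the example above (a hedged suggestion, not part of the reported issue):
{code:scala}
import org.apache.spark.sql.functions._

// Rename first, then aggregate: the HashAggregate now outputs key3 directly,
// so its hashpartitioning(key3) satisfies the join's required distribution.
val agg2 = t2.withColumnRenamed("key2", "key3")
  .groupBy("key3").agg(count(lit("1")).as("cnt2"))

t1.groupBy("key1").agg(count(lit("1")).as("cnt1"))
  .join(agg2, col("key1") === col("key3"))
  .explain()
{code}
With this ordering the extra Exchange under the join's right side disappears, leaving only the one shuffle per side introduced by the aggregations.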
[jira] [Resolved] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-22148. --- Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Dhruve Ashar >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. At that point in time, other > executors would have been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors from the cluster manager in this > situation. This could be retried a configurable number of times after a > configurable wait time between request attempts, so if the cluster manager > fails to provide a suitable executor then the job is aborted like in the > previous case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
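To make the timing in the description concrete, here is an illustrative setup for reproducing the scenario (the keys are real Spark settings; the values are examples matching the defaults mentioned above):
{code:scala}
import org.apache.spark.SparkConf

// Blacklisting plus dynamic allocation with aggressive executor release:
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.network.timeout", "120s")                      // window before failed-node tasks time out
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // idle executors are released well before that
{code}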
[jira] [Assigned] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-22148: - Assignee: Dhruve Ashar > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Dhruve Ashar >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. At that point in time, other > executors would have been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors from the cluster manager in this > situation. This could be retried a configurable number of times after a > configurable wait time between request attempts, so if the cluster manager > fails to provide a suitable executor then the job is aborted like in the > previous case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator
[ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-15815. --- Resolution: Duplicate > Hang while enable blacklistExecutor and DynamicExecutorAllocator > - > > Key: SPARK-15815 > URL: https://issues.apache.org/jira/browse/SPARK-15815 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > Enable executor blacklisting with a timeout larger than 120s, and enable dynamic > allocation with minExecutors = 0. > 1. Assume there is only 1 task left, running in Executor A, and all other > executors have timed out. > 2. The task fails, so it will not be scheduled on Executor A again while the > blacklist timeout is in effect. > 3. The ExecutorAllocationManager keeps requesting targetNumExecutor = 1 executors; > because we already have Executor A, oldTargetNumExecutor == targetNumExecutor = 1, > so it will never add more executors, even if Executor A has timed out. It becomes > an endless loop of requesting delta = 0 executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25950: Assignee: (was: Apache Spark) > from_csv should respect to spark.sql.columnNameOfCorruptRecord > -- > > Key: SPARK-25950 > URL: https://issues.apache.org/jira/browse/SPARK-25950 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > The from_csv() function should respect the SQL config > *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes > into account only the CSV option *columnNameOfCorruptRecord*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25950) from_csv should respect to spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676751#comment-16676751 ] Apache Spark commented on SPARK-25950: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22956 > from_csv should respect to spark.sql.columnNameOfCorruptRecord > -- > > Key: SPARK-25950 > URL: https://issues.apache.org/jira/browse/SPARK-25950 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > The from_csv() function should respect the SQL config > *spark.sql.columnNameOfCorruptRecord* as from_json() does. Currently it takes > into account only the CSV option *columnNameOfCorruptRecord*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25950) from_csv should respect spark.sql.columnNameOfCorruptRecord
[ https://issues.apache.org/jira/browse/SPARK-25950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25950: Assignee: Apache Spark > from_csv should respect spark.sql.columnNameOfCorruptRecord > -- > > Key: SPARK-25950 > URL: https://issues.apache.org/jira/browse/SPARK-25950 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The from_csv() function should respect the SQL config > *spark.sql.columnNameOfCorruptRecord*, as from_json() does. Currently it only takes > into account the CSV option *columnNameOfCorruptRecord*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25950) from_csv should respect spark.sql.columnNameOfCorruptRecord
Maxim Gekk created SPARK-25950: -- Summary: from_csv should respect spark.sql.columnNameOfCorruptRecord Key: SPARK-25950 URL: https://issues.apache.org/jira/browse/SPARK-25950 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The from_csv() function should respect the SQL config *spark.sql.columnNameOfCorruptRecord*, as from_json() does. Currently it only takes into account the CSV option *columnNameOfCorruptRecord*. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
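A sketch of the asymmetry in question, assuming a DataFrame {{df}} with a string {{value}} column; the option-based call reflects current behavior, while the conf-based call is what the issue asks for:
{code}
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)

// Honored today: the corrupt-record column is named via the per-call CSV option.
df.select(from_csv($"value", schema,
  Map("columnNameOfCorruptRecord" -> "_corrupt_record")))

// Requested here: the session-wide conf alone should suffice, as for from_json().
spark.conf.set("spark.sql.columnNameOfCorruptRecord", "_corrupt_record")
df.select(from_csv($"value", schema, Map.empty[String, String]))
{code}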
[jira] [Assigned] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition
[ https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25949: Assignee: Apache Spark > Add test for optimizer rule PullOutPythonUDFInJoinCondition > --- > > Key: SPARK-25949 > URL: https://issues.apache.org/jira/browse/SPARK-25949 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > As commented in the [PR that introduced > PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], the > newly added optimizer rule is currently only tested end-to-end on the Python side; we need to > add suites under org.apache.spark.sql.catalyst.optimizer like the other optimizer > rules. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition
[ https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676703#comment-16676703 ] Apache Spark commented on SPARK-25949: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/22955 > Add test for optimizer rule PullOutPythonUDFInJoinCondition > --- > > Key: SPARK-25949 > URL: https://issues.apache.org/jira/browse/SPARK-25949 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Priority: Major > > As commented in the [PR that introduced > PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], the > newly added optimizer rule is currently only tested end-to-end on the Python side; we need to > add suites under org.apache.spark.sql.catalyst.optimizer like the other optimizer > rules. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition
[ https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25949: Assignee: (was: Apache Spark) > Add test for optimizer rule PullOutPythonUDFInJoinCondition > --- > > Key: SPARK-25949 > URL: https://issues.apache.org/jira/browse/SPARK-25949 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Priority: Major > > As commented in the [PR that introduced > PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], the > newly added optimizer rule is currently only tested end-to-end on the Python side; we need to > add suites under org.apache.spark.sql.catalyst.optimizer like the other optimizer > rules. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25949) Add test for optimizer rule PullOutPythonUDFInJoinCondition
[ https://issues.apache.org/jira/browse/SPARK-25949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-25949: Summary: Add test for optimizer rule PullOutPythonUDFInJoinCondition (was: Add test for optimize rule PullOutPythonUDFInJoinCondition) > Add test for optimizer rule PullOutPythonUDFInJoinCondition > --- > > Key: SPARK-25949 > URL: https://issues.apache.org/jira/browse/SPARK-25949 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Priority: Major > > As commented in the [PR that introduced > PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], the > newly added optimizer rule is currently only tested end-to-end on the Python side; we need to > add suites under org.apache.spark.sql.catalyst.optimizer like the other optimizer > rules. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25949) Add test for optimize rule PullOutPythonUDFInJoinCondition
Yuanjian Li created SPARK-25949: --- Summary: Add test for optimize rule PullOutPythonUDFInJoinCondition Key: SPARK-25949 URL: https://issues.apache.org/jira/browse/SPARK-25949 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Yuanjian Li As commented in the [PR that introduced PullOutPythonUDFInJoinCondition|https://github.com/apache/spark/pull/22326#issuecomment-424923967], the newly added optimizer rule is currently only tested end-to-end on the Python side; we need to add suites under org.apache.spark.sql.catalyst.optimizer like the other optimizer rules. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
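A hypothetical skeleton of the kind of suite being asked for, modeled on the existing rule suites under org.apache.spark.sql.catalyst.optimizer; the class name follows the rule, but the batch and test body are illustrative only, not the merged code:
{code}
package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.RuleExecutor

class PullOutPythonUDFInJoinConditionSuite extends PlanTest {

  // Run only the rule under test, following the pattern of other rule suites.
  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches = Batch("PullOutPythonUDFInJoinCondition", Once,
      PullOutPythonUDFInJoinCondition) :: Nil
  }

  test("python UDF in join condition is pulled out of the join") {
    // Build a join whose condition contains a PythonUDF, call
    // Optimize.execute(plan.analyze), and comparePlans against a plan
    // where the UDF has been pulled into a Filter above the join.
  }
}
{code}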
[jira] [Resolved] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-20389. - Resolution: Duplicate Fixed by https://github.com/apache/spark/pull/22179 > Upgrade kryo to fix NegativeArraySizeException > -- > > Key: SPARK-20389 > URL: https://issues.apache.org/jira/browse/SPARK-20389 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.1.0, 2.2.1 > Environment: Linux, Centos7, jdk8 >Reporter: Georg Heiler >Priority: Major > > I am experiencing an issue with Kryo when writing Parquet files. Similar to > https://github.com/broadinstitute/gatk/issues/1524, a > NegativeArraySizeException occurs. Apparently this is fixed in a more recent Kryo > version, but Spark is still using the very old Kryo 3.3. > Can you please upgrade to a fixed Kryo version? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25937) Support user-defined schema in Kafka Source & Sink
[ https://issues.apache.org/jira/browse/SPARK-25937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676374#comment-16676374 ] Jackey Lee commented on SPARK-25937: Not exactly the same thing. The current implementation receives the format and schema, and then parses or merges the user's data. The following is a simple example.
{code}
val format = "json"
val schema: StructType = new StructType().add("name", StringType).add("age", IntegerType)
df.withFormat(format, schema).parseValue(df.value).select("name", "age")
{code}
For other, unsupported formats such as "protobuf" or "avro", users could redefine parseValue/combineValue and use the same code to parse the data. > Support user-defined schema in Kafka Source & Sink > -- > > Key: SPARK-25937 > URL: https://issues.apache.org/jira/browse/SPARK-25937 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jackey Lee >Priority: Major > > Kafka Source & Sink are widely used in Spark and have the highest frequency > in streaming production environments. But at present, both the Kafka Source and > Sink use a fixed schema, which forces users to do data conversion when > reading and writing Kafka. So why not use a FileFormat-like mechanism to do this, just as > Hive does? > Flink has implemented JSON/CSV/Avro extended Kafka Sources & Sinks; we > can also support this in Spark. > *Main Goals:* > 1. Provide a Source and Sink that support a user-defined schema. Users can read > and write Kafka directly in their programs without additional data conversion. > 2. Provide a read-write mechanism based on FileFormat. Users' data conversion > is similar to FileFormat's read and write process, so we can provide a mechanism > similar to FileFormat that offers common read-write format conversion and > also allows users to customize the format conversion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
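For contrast with the proposed {{withFormat}}/{{parseValue}} API above, a sketch of the manual conversion that exists today and that the proposal would fold into the source itself; the bootstrap address and topic name are placeholders:
{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("name", StringType).add("age", IntegerType)

// Status quo: the Kafka source exposes a fixed schema (key, value, topic, ...),
// so the value bytes must be cast and parsed by hand.
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder address
  .option("subscribe", "events")                  // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).as("data"))
  .select("data.*")
{code}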