[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189848#comment-17189848 ] Takeshi Yamamuro commented on SPARK-32778: -- Have you tried the latest releases, v2.4.6 or v3.0.0? > Accidental Data Deletion on calling saveAsTable > --- > > Key: SPARK-32778 > URL: https://issues.apache.org/jira/browse/SPARK-32778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Aman Rastogi >Priority: Major > > {code:java} > df.write.option("path", > "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) > {code} > The code above deleted the data present at path "/already/existing/path". This > happened because the table was not yet in the Hive metastore, even though the given > path contained data. If the table is not present in the Hive metastore, the SaveMode > is internally changed to SaveMode.Overwrite regardless of what the user provided, > which leads to data deletion. This behavior was introduced as part of > https://issues.apache.org/jira/browse/SPARK-19583. > Now suppose the user is not using an external Hive metastore (the metastore is > associated with a cluster) and the cluster goes down, or the user has to migrate to a > new cluster for some other reason. Once the user tries to save data using the code > above on the new cluster, it will first delete the existing data. That could be > production data, and the user is completely unaware of the deletion because they > provided SaveMode.Append or ErrorIfExists. This amounts to accidental data deletion. > > Repro steps: > > # Save data through a Hive table as in the code above > # Create another cluster and save data into a new table on the new cluster, giving > the same path > > Proposed fix: > Instead of changing the SaveMode to Overwrite, change it to > ErrorIfExists in the class CreateDataSourceTableAsSelectCommand. 
> Change (line 154) > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = > false) > > {code} > to > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, > tableExists = false){code} > This should not break CTAS. Even in the CTAS case, the user may not want to delete > data that already exists, as doing so could be accidental. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
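The mode-rewriting behavior described in this ticket can be sketched as plain decision logic. This is an illustrative simplification, not Spark's actual code: the object, method names, and the miniature SaveMode ADT below are hypothetical stand-ins for what CreateDataSourceTableAsSelectCommand does internally.

```scala
// Illustrative sketch only: models how the effective SaveMode can diverge
// from the user's request when the table is missing from the metastore.
object SaveModeSketch {
  sealed trait SaveMode
  case object Append extends SaveMode
  case object Overwrite extends SaveMode
  case object ErrorIfExists extends SaveMode

  // Behavior reported in the ticket: a missing table forces Overwrite,
  // silently discarding whatever mode the user asked for.
  def effectiveModeCurrent(tableExists: Boolean, requested: SaveMode): SaveMode =
    if (!tableExists) Overwrite else requested

  // The proposed fix: error out instead of silently overwriting path data.
  def effectiveModeProposed(tableExists: Boolean, requested: SaveMode): SaveMode =
    if (!tableExists) ErrorIfExists else requested
}
```

Under the current logic, a user who asked for Append against a missing table still ends up with Overwrite, which is exactly the accidental-deletion scenario above.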
[jira] [Updated] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Rastogi updated SPARK-32778: - Description: {code:java} df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) {code} The code above deleted the data present at path "/already/existing/path". This happened because the table was not yet in the Hive metastore, even though the given path contained data. If the table is not present in the Hive metastore, the SaveMode is internally changed to SaveMode.Overwrite regardless of what the user provided, which leads to data deletion. This behavior was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. Now suppose the user is not using an external Hive metastore (the metastore is associated with a cluster) and the cluster goes down, or the user has to migrate to a new cluster for some other reason. Once the user tries to save data using the code above on the new cluster, it will first delete the existing data. That could be production data, and the user is completely unaware of the deletion because they provided SaveMode.Append or ErrorIfExists. This amounts to accidental data deletion. Repro steps: # Save data through a Hive table as in the code above # Create another cluster and save data into a new table on the new cluster, giving the same path Proposed fix: Instead of changing the SaveMode to Overwrite, change it to ErrorIfExists in the class CreateDataSourceTableAsSelectCommand. Change (line 154) {code:java} val result = saveDataIntoTable( sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false) {code} to {code:java} val result = saveDataIntoTable( sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false){code} This should not break CTAS. Even in the CTAS case, the user may not want to delete data that already exists, as doing so could be accidental. 
was: {code:java} df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) {code} Above code deleted the data present in path "/already/existing/path". This happened because table was already not there in hive metastore however, path given had data. And if table is not present in Hive Metastore, SaveMode gets modified internally to SaveMode.Overwrite irrespective of what user has provided, which leads to data deletion. This change was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. Now, suppose if user is not using external hive metastore (hive metastore is associated with a cluster) and if cluster goes down or due to some reason user has to migrate to a new cluster. Once user tries to save data using above code in new cluster, it will first delete the data. It could be a production data and user is completely unaware of it as they have provided SaveMode.Append or ErrorIfExists. This will be an accidental data deletion. Proposed Fix: Instead of modifying SaveMode to Overwrite, we should modify it to ErrorIfExists in class CreateDataSourceTableAsSelectCommand. Change (line 154) {code:java} val result = saveDataIntoTable( sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false) {code} to {code:java} val result = saveDataIntoTable( sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false){code} This should not break CTAS. Even in case of CTAS, user may not want to delete data if already exists as it could be accidental. 
> Accidental Data Deletion on calling saveAsTable > --- > > Key: SPARK-32778 > URL: https://issues.apache.org/jira/browse/SPARK-32778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Aman Rastogi >Priority: Major > > {code:java} > df.write.option("path", > "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) > {code} > Above code deleted the data present in path "/already/existing/path". This > happened because table was already not there in hive metastore however, path > given had data. And if table is not present in Hive Metastore, SaveMode gets > modified internally to SaveMode.Overwrite irrespective of what user has > provided, which leads to data deletion. This change was introduced as part of > https://issues.apache.org/jira/browse/SPARK-19583. > Now, suppose if user is not using external hive metastore (hive metastore is > associated with a cluster) and if cluster goes down or due to some reason > user has to migrate to a new cluster. Once user tries to save data using > above code in new cluster, it will first delete the data. It could be a > production data and user is completely unaware of it as they have provided > SaveMode.Append or
[jira] [Commented] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R
[ https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189841#comment-17189841 ] Takeshi Yamamuro commented on SPARK-32747: -- Yay! (Note: I've checked all the related PRs/Tickets) > Deduplicate configuration set/unset in test_sparkSQL_arrow.R > > > Key: SPARK-32747 > URL: https://issues.apache.org/jira/browse/SPARK-32747 > Project: Spark > Issue Type: Test > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > Currently, there are many set/unset duplicated in `test_sparkSQL_arrow.R` > test cases. We can just set once in globally and deduplicate such try-catch > logics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
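The set-then-restore pattern being deduplicated in test_sparkSQL_arrow.R is the usual "override a config for the duration of a block, then put it back" helper. A Spark-free sketch of that pattern in Scala (using a plain mutable map as a hypothetical stand-in for the SQL conf; the helper name `withConf` is illustrative, not SparkR's API):

```scala
import scala.collection.mutable

// Set `overrides` on the conf for the duration of `body`, then restore the
// previous values (or remove keys that were absent), even if `body` throws.
def withConf[T](conf: mutable.Map[String, String],
                overrides: (String, String)*)(body: => T): T = {
  val saved = overrides.map { case (k, _) => k -> conf.get(k) }
  overrides.foreach { case (k, v) => conf(k) = v }
  try body
  finally saved.foreach {
    case (k, Some(v)) => conf(k) = v   // restore previous value
    case (k, None)    => conf.remove(k) // key did not exist before
  }
}
```

Hoisting one such helper (or a single global set) out of each test case is what removes the duplicated try/catch blocks the ticket describes.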
[jira] [Commented] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R
[ https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189835#comment-17189835 ] Hyukjin Kwon commented on SPARK-32747: -- Thank you Takeshi for correcting it and in other JIRAs! > Deduplicate configuration set/unset in test_sparkSQL_arrow.R > > > Key: SPARK-32747 > URL: https://issues.apache.org/jira/browse/SPARK-32747 > Project: Spark > Issue Type: Test > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > Currently, there are many set/unset duplicated in `test_sparkSQL_arrow.R` > test cases. We can just set once in globally and deduplicate such try-catch > logics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache
[ https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189834#comment-17189834 ] Hyukjin Kwon commented on SPARK-32774: -- Thanks [~maropu]! > Don't track docs/.jekyll-cache > -- > > Key: SPARK-32774 > URL: https://issues.apache.org/jira/browse/SPARK-32774 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0, 3.0.2 > > > I noticed sometimes, docs/.jekyll-cache can be created and it should not be > tracked. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
[ https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32771: - Fix Version/s: (was: 3.0.1) 3.0.2 > The example of expressions.Aggregator in Javadoc / Scaladoc is wrong > > > Key: SPARK-32771 > URL: https://issues.apache.org/jira/browse/SPARK-32771 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 2.4.7, 3.1.0, 3.0.2 > > > There is an example of expressions.Aggregator in Javadoc and Scaladoc like as > follows. > {code:java} > val customSummer = new Aggregator[Data, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Data): Int = b + a.i > def merge(b1: Int, b2: Int): Int = b1 + b2 > def finish(r: Int): Int = r > }.toColumn(){code} > But this example doesn't work because it doesn't define bufferEncoder and > outputEncoder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
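For reference, a version of that doc example that satisfies the `Aggregator` contract also defines the two encoder members the broken example omitted, and uses `toColumn` without parentheses (it is a parameterless member). This is a sketch based on the Spark 3.x `Aggregator` API, not the exact text of the fixed Javadoc/Scaladoc; it needs a Spark dependency to compile:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Data(i: Int)

// Same summing aggregator as in the doc example, completed with the
// bufferEncoder and outputEncoder members required by Aggregator.
val customSummer = new Aggregator[Data, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Data): Int = b + a.i
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(r: Int): Int = r
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}.toColumn
```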
[jira] [Commented] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst
[ https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189826#comment-17189826 ] Apache Spark commented on SPARK-32782: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/29633 > Refactory StreamingRelationV2 and move it to catalyst > - > > Key: SPARK-32782 > URL: https://issues.apache.org/jira/browse/SPARK-32782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Priority: Major > > Currently, the StreamingRelationV2 is bind with TableProvider. To make it > more flexible and have better expansibility, it should be moved to the > catalyst module and bind with the Table interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32774) Don't track docs/.jekyll-cache
[ https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32774: - Fix Version/s: (was: 3.0.1) 3.0.2 > Don't track docs/.jekyll-cache > -- > > Key: SPARK-32774 > URL: https://issues.apache.org/jira/browse/SPARK-32774 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0, 3.0.2 > > > I noticed sometimes, docs/.jekyll-cache can be created and it should not be > tracked. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst
[ https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32782: Assignee: (was: Apache Spark) > Refactory StreamingRelationV2 and move it to catalyst > - > > Key: SPARK-32782 > URL: https://issues.apache.org/jira/browse/SPARK-32782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Priority: Major > > Currently, the StreamingRelationV2 is bind with TableProvider. To make it > more flexible and have better expansibility, it should be moved to the > catalyst module and bind with the Table interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst
[ https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189829#comment-17189829 ] Apache Spark commented on SPARK-32782: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/29633 > Refactory StreamingRelationV2 and move it to catalyst > - > > Key: SPARK-32782 > URL: https://issues.apache.org/jira/browse/SPARK-32782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Priority: Major > > Currently, the StreamingRelationV2 is bind with TableProvider. To make it > more flexible and have better expansibility, it should be moved to the > catalyst module and bind with the Table interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst
[ https://issues.apache.org/jira/browse/SPARK-32782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32782: Assignee: Apache Spark > Refactory StreamingRelationV2 and move it to catalyst > - > > Key: SPARK-32782 > URL: https://issues.apache.org/jira/browse/SPARK-32782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > Currently, the StreamingRelationV2 is bind with TableProvider. To make it > more flexible and have better expansibility, it should be moved to the > catalyst module and bind with the Table interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
[ https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189828#comment-17189828 ] Takeshi Yamamuro commented on SPARK-32771: -- Since v3.0.1 does not include this fix, I reset "Target Version/s" from 3.0.1 to 3.0.2. > The example of expressions.Aggregator in Javadoc / Scaladoc is wrong > > > Key: SPARK-32771 > URL: https://issues.apache.org/jira/browse/SPARK-32771 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > There is an example of expressions.Aggregator in Javadoc and Scaladoc like as > follows. > {code:java} > val customSummer = new Aggregator[Data, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Data): Int = b + a.i > def merge(b1: Int, b2: Int): Int = b1 + b2 > def finish(r: Int): Int = r > }.toColumn(){code} > But this example doesn't work because it doesn't define bufferEncoder and > outputEncoder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache
[ https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189827#comment-17189827 ] Takeshi Yamamuro commented on SPARK-32774: -- Since v3.0.1 does not include this fix, I reset "Target Version/s" from 3.0.1 to 3.0.2. > Don't track docs/.jekyll-cache > -- > > Key: SPARK-32774 > URL: https://issues.apache.org/jira/browse/SPARK-32774 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > I noticed sometimes, docs/.jekyll-cache can be created and it should not be > tracked. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R
[ https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189825#comment-17189825 ] Takeshi Yamamuro commented on SPARK-32747: -- Since v3.0.1 does not include this fix, I reset "Target Version/s" from 3.0.1 to 3.0.2. > Deduplicate configuration set/unset in test_sparkSQL_arrow.R > > > Key: SPARK-32747 > URL: https://issues.apache.org/jira/browse/SPARK-32747 > Project: Spark > Issue Type: Test > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > Currently, there are many set/unset duplicated in `test_sparkSQL_arrow.R` > test cases. We can just set once in globally and deduplicate such try-catch > logics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R
[ https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32747: - Fix Version/s: (was: 3.0.1) 3.0.2 > Deduplicate configuration set/unset in test_sparkSQL_arrow.R > > > Key: SPARK-32747 > URL: https://issues.apache.org/jira/browse/SPARK-32747 > Project: Spark > Issue Type: Test > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > Currently, there are many set/unset duplicated in `test_sparkSQL_arrow.R` > test cases. We can just set once in globally and deduplicate such try-catch > logics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32693: - Fix Version/s: (was: 3.0.1) 3.0.2 > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.7, 3.1.0, 3.0.2 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", 
ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. > * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). 
If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
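The failure reported here boils down to schema equality being checked with nullability included even inside nested array/struct types, so `except` sees "different" join key types. A minimal, Spark-free sketch of the kind of comparison that would treat such schemas as equal (the toy `DType` hierarchy below is a hypothetical stand-in for Spark's StructType/ArrayType, used only to show that the nullable flag must be ignored at every nesting level):

```scala
// Toy model of nested SQL types: (name, type, nullable) per struct field.
sealed trait DType
case object IntT extends DType
case object TimestampT extends DType
case class ArrayT(element: DType) extends DType
case class StructT(fields: Seq[(String, DType, Boolean)]) extends DType

// Structural equality that deliberately ignores the nullable flag,
// recursing through arrays and structs.
def sameIgnoringNullability(a: DType, b: DType): Boolean = (a, b) match {
  case (ArrayT(ea), ArrayT(eb)) => sameIgnoringNullability(ea, eb)
  case (StructT(fa), StructT(fb)) =>
    fa.length == fb.length && fa.zip(fb).forall { case ((na, ta, _), (nb, tb, _)) =>
      na == nb && sameIgnoringNullability(ta, tb) // nullability not compared
    }
  case _ => a == b
}
```

A comparison done only at the top level (as the ticket suggests happens for flat schemas) misses the nested case, which is why df1/df2 work while df3/df4 throw.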
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189823#comment-17189823 ] Takeshi Yamamuro commented on SPARK-32693: -- Unfortunately, v3.0.1 does not include this fix, so I reset it back: [http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-RESULT-Release-Spark-3-0-1-RC3-td30114.html] > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > My aim is to compare two dataframes with very close schemas : same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that i used : > {code:java} > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val 
schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("b", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema3) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceeding code shows what seems to me a bug in Spark : > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except fields are not nullable for the first dataframe and are > nullable for the second. Then, doing df1.except(df2).count() works well. > * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other). 
If these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception : > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32782) Refactory StreamingRelationV2 and move it to catalyst
Yuanjian Li created SPARK-32782: --- Summary: Refactory StreamingRelationV2 and move it to catalyst Key: SPARK-32782 URL: https://issues.apache.org/jira/browse/SPARK-32782 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Yuanjian Li Currently, StreamingRelationV2 is bound to TableProvider. To make it more flexible and extensible, it should be moved to the catalyst module and bound to the Table interface. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32739) support prune right for left semi join in DPP
[ https://issues.apache.org/jira/browse/SPARK-32739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189819#comment-17189819 ] Zhenhua Wang commented on SPARK-32739: -- [~maropu] Added description, thanks for reminding. > support prune right for left semi join in DPP > - > > Key: SPARK-32739 > URL: https://issues.apache.org/jira/browse/SPARK-32739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > Fix For: 3.1.0 > > > Currently in DPP, left semi can only prune left, it should also support prune > right. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32739) support prune right for left semi join in DPP
[ https://issues.apache.org/jira/browse/SPARK-32739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32739: - Description: Currently in DPP, a left semi join can only prune its left side; it should also support pruning the right side. > support prune right for left semi join in DPP > - > > Key: SPARK-32739 > URL: https://issues.apache.org/jira/browse/SPARK-32739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > Fix For: 3.1.0 > > > Currently in DPP, a left semi join can only prune its left side; it should also > support pruning the right side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals
[ https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189808#comment-17189808 ] Apache Spark commented on SPARK-32781: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29632 > Non-ASCII characters are mistakenly omitted in the middle of intervals > -- > > Key: SPARK-32781 > URL: https://issues.apache.org/jira/browse/SPARK-32781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > select interval '1中国day'; > we should fail case above -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
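The bug here is that the interval parser silently skips characters it does not recognize, so `interval '1中国day'` parses as `1 day` instead of failing. A standalone sketch of the stricter validation being asked for (the allowed character set and the function name are assumptions for illustration, not Spark's actual parser):

```scala
// Illustrative: accept only ASCII letters, digits, signs, dots, and
// whitespace inside an interval payload; anything else should be an error
// rather than being silently dropped.
def validateIntervalBody(s: String): Either[String, String] =
  if (s.matches("""[A-Za-z0-9+\-.\s]*""")) Right(s)
  else Left(s"invalid character in interval string: $s")
```

With a check like this, `"1 day"` passes through while `"1中国day"` is rejected up front instead of parsing as if the non-ASCII characters were never there.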
[jira] [Assigned] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals
[ https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32781: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals
[ https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32781: Assignee: Apache Spark
[jira] [Commented] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals
[ https://issues.apache.org/jira/browse/SPARK-32781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189807#comment-17189807 ] Apache Spark commented on SPARK-32781: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29632
[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error
[ https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32738: - Description: Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in `Inbox`. If any fatal error happens during `Inbox.process`, 'numActiveThreads' is not reduced, so other threads cannot process messages in that inbox, which causes the endpoint to "hang". This problem is more serious in earlier Spark 2.x versions, since the driver, executor and block manager endpoints are all thread-safe endpoints. was: Processing for `ThreadSafeRpcEndpoint` is controlled by 'numActiveThreads' in `Inbox`. Now if any fatal exception happens during `Inbox.process`, 'numActiveThreads' is not reduced. Then other threads can not process messages in that inbox, which causes the endpoint to hang. This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints.
[jira] [Created] (SPARK-32781) Non-ASCII characters are mistakenly omitted in the middle of intervals
Kent Yao created SPARK-32781: Summary: Non-ASCII characters are mistakenly omitted in the middle of intervals Key: SPARK-32781 URL: https://issues.apache.org/jira/browse/SPARK-32781 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao select interval '1中国day'; We should fail the case above.
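A minimal, hypothetical sketch of the stricter validation SPARK-32781 calls for (Spark's real interval parsing lives in its SQL grammar, not a regex; the class name and pattern below are illustrative only). The point is to reject payloads containing characters outside the expected digit/unit alphabet instead of silently dropping them:

```java
import java.util.regex.Pattern;

public class IntervalCheckSketch {
    // Hypothetical strict pattern: an optional sign followed by one or
    // more "<digits> <ascii unit>" groups. Anything else (including
    // non-ASCII characters such as '中国') fails the match.
    private static final Pattern STRICT =
        Pattern.compile("[+-]?\\s*(\\d+\\s*[a-zA-Z]+\\s*)+");

    public static boolean isValidIntervalPayload(String s) {
        return s != null && STRICT.matcher(s.trim()).matches();
    }
}
```

With this kind of check, `'1 day'` passes while `'1中国day'` is rejected outright rather than being parsed as `'1 day'`.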
[jira] [Updated] (SPARK-32738) thread safe endpoints may hang due to fatal error
[ https://issues.apache.org/jira/browse/SPARK-32738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32738: - Summary: thread safe endpoints may hang due to fatal error (was: thread safe endpoints may hang due to fatal exception)
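The hang described in SPARK-32738 can be sketched with a toy inbox (hypothetical names, not Spark's actual `Inbox`): if the active-thread counter is not decremented in a `finally` block, a fatal error thrown while processing a message leaves the inbox looking permanently busy, and no other thread will ever drain it.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InboxSketch {
    private final AtomicInteger numActiveThreads = new AtomicInteger(0);

    // Broken variant: a fatal Error skips the decrement, so the inbox
    // appears busy forever and other threads refuse to process it.
    public void processUnsafe(Runnable message) {
        numActiveThreads.incrementAndGet();
        message.run();                      // may throw a fatal Error
        numActiveThreads.decrementAndGet(); // never reached on throw
    }

    // Fixed variant: the decrement is guaranteed by finally, so the
    // counter returns to zero even when processing dies fatally.
    public void processSafe(Runnable message) {
        numActiveThreads.incrementAndGet();
        try {
            message.run();
        } finally {
            numActiveThreads.decrementAndGet();
        }
    }

    public int activeThreads() {
        return numActiveThreads.get();
    }
}
```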
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189785#comment-17189785 ] huangtianhua commented on SPARK-32691: -- [~dongjoon], ok, thanks. And if anyone wants to reproduce the failure on ARM, I can provide an arm instance:) > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Priority: Major > Attachments: failure.log, success.log > > > Tests of org.apache.spark.DistributedSuite failed on the arm64 Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ...
[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189784#comment-17189784 ] Kazuaki Ishizaki commented on SPARK-32312: -- I think [this|https://github.com/apache/arrow/pull/7746] is the work needed to make the build succeed. What further work do we need? > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow will soon release v1.0.0, which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and the PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up the minimum supported > version and CI testing. > TBD: list of important improvements and fixes
[jira] [Commented] (SPARK-32772) Reduce log messages for spark-sql CLI
[ https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189779#comment-17189779 ] Apache Spark commented on SPARK-32772: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/29631 > Reduce log messages for spark-sql CLI > - > > Key: SPARK-32772 > URL: https://issues.apache.org/jira/browse/SPARK-32772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0 > > > When we launch the spark-sql CLI, too many log messages are shown and it is > sometimes difficult to find the result of a query. > So it would be better to reduce the log messages, as the spark-shell and pyspark > CLIs do.
[jira] [Commented] (SPARK-32772) Reduce log messages for spark-sql CLI
[ https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189778#comment-17189778 ] Apache Spark commented on SPARK-32772: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/29631
[jira] [Commented] (SPARK-32637) SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure SQL DB
[ https://issues.apache.org/jira/browse/SPARK-32637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189771#comment-17189771 ] Takeshi Yamamuro commented on SPARK-32637: -- +1 on Maxim's comment; I'll close this. > SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure > SQL DB > - > > Key: SPARK-32637 > URL: https://issues.apache.org/jira/browse/SPARK-32637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Mohit Dave >Priority: Major > > Spark JDBC truncates TIMESTAMP values to microsecond precision when the datetime2 > data type is used with the Microsoft SQL Server JDBC driver. > > Source data (datetime2): '2007-08-08 12:35:29.1234567' > > After loading to the target using Spark DataFrames: > > Target data (datetime2): '2007-08-08 12:35:29.1234560'
[jira] [Resolved] (SPARK-32637) SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure SQL DB
[ https://issues.apache.org/jira/browse/SPARK-32637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32637. -- Resolution: Invalid
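The loss reported in SPARK-32637 is consistent with datetime2's seven fractional digits (100 ns units) being truncated to microsecond (six digit) precision somewhere in the write path. A minimal arithmetic sketch of that truncation (illustrative only, not Spark's or the driver's actual code):

```java
public class PrecisionSketch {
    // A datetime2 fraction carries up to 7 digits, in units of 100 ns.
    // Truncating to microsecond precision zeroes the 7th digit:
    // .1234567 (7 digits) -> .1234560
    public static long truncateToMicros(long hundredNanoTicks) {
        return (hundredNanoTicks / 10) * 10;
    }
}
```

This reproduces exactly the reported difference between '…29.1234567' and '…29.1234560'.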
[jira] [Updated] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission
[ https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32734: - Labels: (was: pull-request-available) > RDD actions in DStream.transfrom delays batch submission > > > Key: SPARK-32734 > URL: https://issues.apache.org/jira/browse/SPARK-32734 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Liechuan Ou >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > h4. Issue > Some Spark applications show a batch-creation delay after running for some > time. For instance, batch 10:03 is submitted at 10:06. In the Spark UI, the > latest batch doesn't match the current time. > > ||Clock||BatchTime|| > |10:00|10:00| > |10:02|10:01| > |10:04|10:02| > |10:06|10:03| > h4. Investigation > Such applications share a commonality: RDD actions exist in > dstream.transform. Those actions are executed in dstream.compute, which > is called by JobGenerator. JobGenerator runs a single-threaded event loop, > so any blocking operation stalls event processing. > h4. Proposal > Delegate dstream.compute to JobScheduler. > > {code:java} > // class ForEachDStream > override def generateJob(time: Time): Option[Job] = { > val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) { > parent.getOrCompute(time).foreach(rdd => foreachFunc(rdd, time)) > } > Some(new Job(time, jobFunc)) > } > {code}
[jira] [Updated] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page
[ https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32735: - Labels: (was: pull-request-available) > RDD actions in DStream.transfrom don't show at batch page > - > > Key: SPARK-32735 > URL: https://issues.apache.org/jira/browse/SPARK-32735 > Project: Spark > Issue Type: Bug > Components: DStreams, Web UI >Affects Versions: 3.0.0 >Reporter: Liechuan Ou >Priority: Major > > h4. Issue > {code:java} > val lines = ssc.socketTextStream("localhost", ) > val words = lines.flatMap(_.split(" ")) > val mappedStream = words.transform(rdd => { > val c = rdd.count() > rdd.map(x => s"$c x") > }) > mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x))){code} > Two Spark jobs are created every batch. Only the second one is associated > with the streaming output operation and shown on the batch page. > h4. Investigation > The first action, rdd.count(), is invoked by JobGenerator.generateJobs. Batch > time and output op id are not available in the Spark context because they are set > in JobScheduler later. > h4. Proposal > Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run > in the Spark context with correct local properties.
[jira] [Commented] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page
[ https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189766#comment-17189766 ] Apache Spark commented on SPARK-32735: -- User 'Olwn' has created a pull request for this issue: https://github.com/apache/spark/pull/29578
[jira] [Assigned] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page
[ https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32735: Assignee: Apache Spark
[jira] [Commented] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page
[ https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189765#comment-17189765 ] Apache Spark commented on SPARK-32735: -- User 'Olwn' has created a pull request for this issue: https://github.com/apache/spark/pull/29578
[jira] [Assigned] (SPARK-32735) RDD actions in DStream.transfrom don't show at batch page
[ https://issues.apache.org/jira/browse/SPARK-32735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32735: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission
[ https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189764#comment-17189764 ] Apache Spark commented on SPARK-32734: -- User 'Olwn' has created a pull request for this issue: https://github.com/apache/spark/pull/29578
[jira] [Assigned] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission
[ https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32734: Assignee: Apache Spark
[jira] [Assigned] (SPARK-32734) RDD actions in DStream.transfrom delays batch submission
[ https://issues.apache.org/jira/browse/SPARK-32734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32734: Assignee: (was: Apache Spark)
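The proposal in SPARK-32734, keeping the single-threaded generator free of user work by handing computation to the scheduler, can be illustrated with a toy event loop (hypothetical names; not Spark's actual JobGenerator or JobScheduler). The first event delegates its blocking work to a separate pool, so the second event is processed immediately instead of queueing behind it:

```java
import java.util.concurrent.*;

public class EventLoopSketch {
    // Returns true when the second event completes even though the first
    // event's heavy work is still blocked. That is only possible because
    // the generator merely hands the work off to a separate scheduler pool.
    public static boolean nextEventProcessedPromptly() {
        ExecutorService generator = Executors.newSingleThreadExecutor();
        ExecutorService scheduler = Executors.newFixedThreadPool(1);
        CountDownLatch release = new CountDownLatch(1);

        // Stand-in for a transform containing an RDD action: blocks until released.
        Runnable slowCompute = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };

        try {
            // Event 1: delegate the blocking work instead of running it inline.
            generator.submit(() -> scheduler.submit(slowCompute));
            // Event 2: the next "batch" event.
            Future<?> next = generator.submit(() -> { });
            next.get(2, TimeUnit.SECONDS); // would time out if slowCompute ran inline
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            release.countDown();
            generator.shutdownNow();
            scheduler.shutdownNow();
        }
    }
}
```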
[jira] [Created] (SPARK-32780) Fill since fields for all the expressions
Takeshi Yamamuro created SPARK-32780: Summary: Fill since fields for all the expressions Key: SPARK-32780 URL: https://issues.apache.org/jira/browse/SPARK-32780 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro Some since fields in ExpressionDescription are missing; it is worth filling them in to improve the documentation:
{code:java}
test("Since has a valid value") {
  val badExpressions = spark.sessionState.functionRegistry.listFunction()
    .map(spark.sessionState.catalog.lookupFunctionInfo)
    .filter(funcInfo => !funcInfo.getSince.matches("[0-9]+\\.[0-9]+\\.[0-9]+"))
    .map(_.getClassName)
    .distinct
    .sorted
  if (badExpressions.nonEmpty) {
    fail(s"${badExpressions.length} expressions with invalid 'since':\n" +
      badExpressions.mkString("\n"))
  }
}
{code}
{noformat}
[info] - Since has a valid value *** FAILED *** (16 milliseconds)
[info] 67 expressions with invalid 'since':
[info] org.apache.spark.sql.catalyst.expressions.Abs
[info] org.apache.spark.sql.catalyst.expressions.Add
[info] org.apache.spark.sql.catalyst.expressions.And
[info] org.apache.spark.sql.catalyst.expressions.ArrayContains
[info] org.apache.spark.sql.catalyst.expressions.AssertTrue
[info] org.apache.spark.sql.catalyst.expressions.BitwiseAnd
[info] org.apache.spark.sql.catalyst.expressions.BitwiseNot
[info] org.apache.spark.sql.catalyst.expressions.BitwiseOr
[info] org.apache.spark.sql.catalyst.expressions.BitwiseXor
[info] org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
[info] org.apache.spark.sql.catalyst.expressions.CaseWhen
[info] org.apache.spark.sql.catalyst.expressions.Cast
[info] org.apache.spark.sql.catalyst.expressions.Concat
[info] org.apache.spark.sql.catalyst.expressions.Crc32
[info] org.apache.spark.sql.catalyst.expressions.CreateArray
[info] org.apache.spark.sql.catalyst.expressions.CreateMap
[info] org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
[info] org.apache.spark.sql.catalyst.expressions.CurrentDatabase
[info] org.apache.spark.sql.catalyst.expressions.Divide
[info] org.apache.spark.sql.catalyst.expressions.EqualNullSafe
[info] org.apache.spark.sql.catalyst.expressions.EqualTo
[info] org.apache.spark.sql.catalyst.expressions.Explode
[info] org.apache.spark.sql.catalyst.expressions.GetJsonObject
[info] org.apache.spark.sql.catalyst.expressions.GreaterThan
[info] org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual
[info] org.apache.spark.sql.catalyst.expressions.Greatest
[info] org.apache.spark.sql.catalyst.expressions.If
[info] org.apache.spark.sql.catalyst.expressions.In
[info] org.apache.spark.sql.catalyst.expressions.Inline
[info] org.apache.spark.sql.catalyst.expressions.InputFileBlockLength
[info] org.apache.spark.sql.catalyst.expressions.InputFileBlockStart
[info] org.apache.spark.sql.catalyst.expressions.InputFileName
[info] org.apache.spark.sql.catalyst.expressions.JsonTuple
[info] org.apache.spark.sql.catalyst.expressions.Least
[info] org.apache.spark.sql.catalyst.expressions.LessThan
[info] org.apache.spark.sql.catalyst.expressions.LessThanOrEqual
[info] org.apache.spark.sql.catalyst.expressions.MapKeys
[info] org.apache.spark.sql.catalyst.expressions.MapValues
[info] org.apache.spark.sql.catalyst.expressions.Md5
[info] org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
[info] org.apache.spark.sql.catalyst.expressions.Multiply
[info] org.apache.spark.sql.catalyst.expressions.Murmur3Hash
[info] org.apache.spark.sql.catalyst.expressions.Not
[info] org.apache.spark.sql.catalyst.expressions.Or
[info] org.apache.spark.sql.catalyst.expressions.Overlay
[info] org.apache.spark.sql.catalyst.expressions.Pmod
[info] org.apache.spark.sql.catalyst.expressions.PosExplode
[info] org.apache.spark.sql.catalyst.expressions.Remainder
[info] org.apache.spark.sql.catalyst.expressions.Sha1
[info] org.apache.spark.sql.catalyst.expressions.Sha2
[info] org.apache.spark.sql.catalyst.expressions.Size
[info] org.apache.spark.sql.catalyst.expressions.SortArray
[info] org.apache.spark.sql.catalyst.expressions.SparkPartitionID
[info] org.apache.spark.sql.catalyst.expressions.Stack
[info] org.apache.spark.sql.catalyst.expressions.Subtract
[info] org.apache.spark.sql.catalyst.expressions.TimeWindow
[info] org.apache.spark.sql.catalyst.expressions.UnaryMinus
[info] org.apache.spark.sql.catalyst.expressions.UnaryPositive
[info] org.apache.spark.sql.catalyst.expressions.Uuid
[info] org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean
[info] org.apache.spark.sql.catalyst.expressions.xml.XPathDouble
[info] org.apache.spark.sql.catalyst.expressions.xml.XPathFloat
[info] org.apache.spark.sql.catalyst.expressions.xml.XPathInt
[info]
{noformat}
[jira] [Commented] (SPARK-32097) Allow reading history log files from multiple directories
[ https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189758#comment-17189758 ] Apache Spark commented on SPARK-32097: -- User 'Gaurangi94' has created a pull request for this issue: https://github.com/apache/spark/pull/29630 > Allow reading history log files from multiple directories > - > > Key: SPARK-32097 > URL: https://issues.apache.org/jira/browse/SPARK-32097 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > > We would like to configure SparkHistoryServer to display applications from > multiple clusters/environments. Data displayed on this UI comes from > directory configured as log-directory. It would be nice if this log-directory > also accepted regex. This way we will be able to read and display > applications from multiple directories. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
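The wish above is essentially glob expansion of the configured log directory. As a rough sketch only (plain Java NIO with an invented helper; Spark's History Server actually goes through Hadoop's FileSystem API), expanding a pattern like {{/logs/cluster-*}} into concrete event-log directories could look like:

```java
// Hypothetical helper: expand a glob on the last path segment of a
// log-directory setting into the matching directories. Not Spark code.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LogDirGlob {
    static List<Path> expand(String pattern) throws IOException {
        Path p = Paths.get(pattern);
        Path parent = p.getParent();          // e.g. /logs
        String glob = p.getFileName().toString(); // e.g. cluster-*
        List<Path> dirs = new ArrayList<>();
        // newDirectoryStream accepts a glob and filters the parent's entries.
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(parent, glob)) {
            for (Path d : ds) {
                if (Files.isDirectory(d)) {
                    dirs.add(d);
                }
            }
        }
        Collections.sort(dirs);
        return dirs;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("history");
        Files.createDirectory(root.resolve("cluster-a"));
        Files.createDirectory(root.resolve("cluster-b"));
        List<Path> dirs = expand(root + "/cluster-*");
        System.out.println(dirs.size()); // matches the two cluster dirs
    }
}
```

Each matched directory would then be scanned for event logs exactly as a single configured directory is today.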
[jira] [Assigned] (SPARK-32097) Allow reading history log files from multiple directories
[ https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32097: Assignee: (was: Apache Spark) > Allow reading history log files from multiple directories > - > > Key: SPARK-32097 > URL: https://issues.apache.org/jira/browse/SPARK-32097 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > > We would like to configure SparkHistoryServer to display applications from > multiple clusters/environments. Data displayed on this UI comes from > directory configured as log-directory. It would be nice if this log-directory > also accepted regex. This way we will be able to read and display > applications from multiple directories. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32097) Allow reading history log files from multiple directories
[ https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32097: Assignee: Apache Spark > Allow reading history log files from multiple directories > - > > Key: SPARK-32097 > URL: https://issues.apache.org/jira/browse/SPARK-32097 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Assignee: Apache Spark >Priority: Minor > > We would like to configure SparkHistoryServer to display applications from > multiple clusters/environments. Data displayed on this UI comes from > directory configured as log-directory. It would be nice if this log-directory > also accepted regex. This way we will be able to read and display > applications from multiple directories. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
Bruce Robbins created SPARK-32779: - Summary: Spark/Hive3 interaction potentially causes deadlock Key: SPARK-32779 URL: https://issues.apache.org/jira/browse/SPARK-32779 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Bruce Robbins This is an issue for applications that share a Spark Session across multiple threads.

sessionCatalog.loadPartition (after checking that the table exists) grabs locks in this order:
- HiveExternalCatalog
- HiveSessionCatalog (in Shim_v3_0)

Other operations (e.g., sessionCatalog.tableExists) grab locks in this order:
- HiveSessionCatalog
- HiveExternalCatalog

[This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] appears to be the culprit. Maybe the db name should be defaulted _before_ the call to HiveClient so that Shim_v3_0 doesn't have to call back into SessionCatalog. Or possibly this is not needed at all, since loadPartition in Shim_v2_1 doesn't worry about the default db name, but that might be because of differences between Hive client libraries.
Reproduction case:
- You need to have a running Hive 3.x HMS instance and the appropriate hive-site.xml for your Spark instance
- Adjust your spark.sql.hive.metastore.version accordingly
- It might take more than one try to hit the deadlock

Launch Spark:
{noformat}
bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" --conf spark.sql.hive.metastore.version=3.1
{noformat}
Then use the following code:
{noformat}
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val tableCount = 4
for (i <- 0 until tableCount) {
  val tableName = s"partitioned${i+1}"
  sql(s"drop table if exists $tableName")
  sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored as orc")
}

val threads = new ArrayBuffer[Thread]
for (i <- 0 until tableCount) {
  threads.append(new Thread(
    new Runnable {
      override def run: Unit = {
        val tableName = s"partitioned${i + 1}"
        val rand = Random
        val df = spark.range(0, 2).toDF("a")
        val location = s"/tmp/${rand.nextLong.abs}"
        df.write.mode("overwrite").orc(location)
        sql(s"""
          LOAD DATA LOCAL INPATH '$location'
          INTO TABLE $tableName partition (b=$i)""")
      }
    }, s"worker$i"))
  threads(i).start()
}

for (i <- 0 until tableCount) {
  println(s"Joining with thread $i")
  threads(i).join()
}
println("All done")
{noformat}
The job often gets stuck after one or two "Joining..." lines. {{kill -3}} shows something like this:
{noformat}
Found one Java-level deadlock:
==============================
"worker3":
  waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a org.apache.spark.sql.hive.HiveSessionCatalog),
  which is held by "worker0"
"worker0":
  waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a org.apache.spark.sql.hive.HiveExternalCatalog),
  which is held by "worker3"
{noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
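Stripped of the Hive specifics, the thread dump above shows a classic lock-order inversion: one code path locks HiveExternalCatalog then HiveSessionCatalog, another locks them in the opposite order. The standard fix is to make every path acquire the monitors in the same global order. A minimal sketch with plain Java monitors (the two `Object` fields are hypothetical stand-ins, not Spark's classes):

```java
// Sketch of the fix for the lock-order inversion: both code paths below
// acquire the two monitors in the SAME global order, so no cycle can form.
public class LockOrder {
    private static final Object externalCatalog = new Object(); // stand-in for HiveExternalCatalog
    private static final Object sessionCatalog = new Object();  // stand-in for HiveSessionCatalog

    // Corresponds to the loadPartition path: external -> session.
    static void loadPartition() {
        synchronized (externalCatalog) {
            synchronized (sessionCatalog) {
                // ... load partition metadata ...
            }
        }
    }

    // The buggy variant would lock session -> external here; mirroring
    // loadPartition's order removes the deadlock.
    static void tableExists() {
        synchronized (externalCatalog) {
            synchronized (sessionCatalog) {
                // ... check table existence ...
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(LockOrder::loadPartition, "worker0");
        Thread b = new Thread(LockOrder::tableExists, "worker3");
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("no deadlock");
    }
}
```

The suggestion in the report (defaulting the db name before calling into HiveClient) achieves the same thing by removing the nested lock acquisition entirely, which is usually the better fix when feasible.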
[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189744#comment-17189744 ] Apache Spark commented on SPARK-32135: -- User 'Gaurangi94' has created a pull request for this issue: https://github.com/apache/spark/pull/29629 > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32135: Assignee: (was: Apache Spark) > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32135: Assignee: Apache Spark > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Assignee: Apache Spark >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189743#comment-17189743 ] Apache Spark commented on SPARK-32135: -- User 'Gaurangi94' has created a pull request for this issue: https://github.com/apache/spark/pull/29629 > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32764: -- Description:
{code:scala}
val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._

val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
  .withColumn("comp", col("neg") < col("pos"))
df.show(false)
==
+----+---+----+
|neg |pos|comp|
+----+---+----+
|-0.0|0.0|true|
+----+---+----+
{code}
I think that result should be false.

**Apache Spark 2.4.6 RESULT**
{code}
scala> spark.version
res0: String = 2.4.6

scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < col("pos")).show
+----+---+-----+
| neg|pos| comp|
+----+---+-----+
|-0.0|0.0|false|
+----+---+-----+
{code}

was:
{code:scala}
val spark: SparkSession = SparkSession
  .builder()
  .master("local")
  .appName("SparkByExamples.com")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._

val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
  .withColumn("comp", col("neg") < col("pos"))
df.show(false)
==
+----+---+----+
|neg |pos|comp|
+----+---+----+
|-0.0|0.0|true|
+----+---+----+
{code}
I think that result should be false.

> compare of -0.0 < 0.0 return true
> ---------------------------------
>
> Key: SPARK-32764
> URL: https://issues.apache.org/jira/browse/SPARK-32764
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0, 3.0.1
> Reporter: Izek Greenfield
> Priority: Major
> Labels: correctness
>
> {code:scala}
> val spark: SparkSession = SparkSession
>   .builder()
>   .master("local")
>   .appName("SparkByExamples.com")
>   .getOrCreate()
>
> spark.sparkContext.setLogLevel("ERROR")
> import spark.sqlContext.implicits._
>
> val df = Seq((-0.0, 0.0)).toDF("neg", "pos")
>   .withColumn("comp", col("neg") < col("pos"))
> df.show(false)
> ==
> +----+---+----+
> |neg |pos|comp|
> +----+---+----+
> |-0.0|0.0|true|
> +----+---+----+
> {code}
> I think that result should be false.
> **Apache Spark 2.4.6 RESULT**
> {code}
> scala> spark.version
> res0: String = 2.4.6
>
> scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < col("pos")).show
> +----+---+-----+
> | neg|pos| comp|
> +----+---+-----+
> |-0.0|0.0|false|
> +----+---+-----+
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
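For reference, the JVM's own floating-point semantics agree with the expected 2.4.6 result: IEEE 754 comparison treats -0.0 and 0.0 as equal, while the total ordering used by Double.compare (and the raw bit patterns) distinguishes them. That mismatch is a plausible source of this kind of regression if a total-order comparison is used where an IEEE comparison is intended (an assumption about the cause, not a confirmed diagnosis):

```java
// IEEE 754 semantics on the JVM: -0.0 and 0.0 compare equal with the
// primitive operators, but Double.compare and the raw bits distinguish them.
public class NegZero {
    public static void main(String[] args) {
        System.out.println(-0.0 < 0.0);   // false: IEEE comparison treats them as equal
        System.out.println(-0.0 == 0.0);  // true
        // The total ordering used for sorting puts -0.0 strictly before 0.0:
        System.out.println(Double.compare(-0.0, 0.0));  // -1
        // The two values have different bit patterns (only the sign bit differs):
        System.out.println(
            Double.doubleToLongBits(-0.0) == Double.doubleToLongBits(0.0));  // false
    }
}
```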
[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189720#comment-17189720 ] Dongjoon Hyun commented on SPARK-32764: --- cc [~cloud_fan] and [~smilegator] > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32764: -- Affects Version/s: 3.0.1 > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189719#comment-17189719 ] Dongjoon Hyun commented on SPARK-32764: --- Thank you for reporting, [~igreenfi]. I also confirm this regression. > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Izek Greenfield >Priority: Major > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32764: -- Target Version/s: 3.0.2 > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32764: -- Labels: correctness (was: ) > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Izek Greenfield >Priority: Major > Labels: correctness > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32766) s3a: bucket names with dots cannot be used
[ https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189701#comment-17189701 ] Dongjoon Hyun commented on SPARK-32766: --- Thank you for the pointer, [~ste...@apache.org].
> s3a: bucket names with dots cannot be used
> ------------------------------------------
>
> Key: SPARK-32766
> URL: https://issues.apache.org/jira/browse/SPARK-32766
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.0.0
> Reporter: Ondrej Kokes
> Priority: Minor
>
> Running vanilla spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
> I cannot read from S3, if the bucket name contains a dot (a valid name).
> A minimal reproducible example looks like this
> {{from pyspark.sql import SparkSession}}
> {{import pyspark.sql.functions as f}}
> {{if __name__ == '__main__':}}
> {{    spark = (SparkSession}}
> {{        .builder}}
> {{        .appName('my_app')}}
> {{        .master("local[*]")}}
> {{        .getOrCreate()}}
> {{    )}}
> {{    spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")}}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read that CSV. I created the same bucket without the period and it worked fine.
> *Now I'm not sure whether this is a thing of prepping the path names and passing them to the aws-sdk, or whether the fault is within the SDK itself. I am not Java savvy to investigate the issue further, but I tried to make the repro as short as possible.*
>
> I get different errors depending on which Hadoop distributions I use. If I use the default PySpark distribution (which includes Hadoop 2), I get the following (using hadoop-aws:2.7.4)
> {{scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()}}
> {{java.lang.IllegalArgumentException: The bucketName parameter must be specified.}}
> {{  at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)}}
> {{  at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)}}
> {{  at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)}}
> {{  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)}}
> {{  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)}}
> {{  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)}}
> {{  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)}}
> {{  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)}}
> {{  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)}}
> {{  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)}}
> {{  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{  at scala.Option.getOrElse(Option.scala:189)}}
> {{  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}}
> {{  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}}
> {{  ... 47 elided}}
> When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this error (with hadoop-aws:3.2.0):
> {{java.lang.NullPointerException: null uri host.}}
> {{  at java.base/java.util.Objects.requireNonNull(Objects.java:246)}}
> {{  at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)}}
> {{  at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)}}
> {{  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)}}
> {{  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)}}
> {{  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)}}
> {{  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)}}
> {{  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)}}
> {{  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)}}
> {{  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)}}
> {{  at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{  at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{  at scala.Option.getOrElse(Option.scala:189)}}
> {{  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
>
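The "null uri host" NPE in the Hadoop 3 trace is consistent with how java.net.URI parses authorities: a host whose last label starts with a digit (the "0" in "v1.0") is not a valid RFC 2396 hostname, so the authority is treated as registry-based, getHost() returns null, and only getAuthority() retains the bucket name. A small demonstration with plain java.net.URI, independent of Hadoop (this illustrates the likely mechanism, not necessarily the full Hadoop/SDK behavior):

```java
// java.net.URI host parsing with a dotted bucket name. The dotted
// authority fails RFC 2396 hostname validation, so getHost() is null.
import java.net.URI;

public class BucketHost {
    public static void main(String[] args) {
        URI plain = URI.create("s3a://test-bucket-name/foo.csv");
        URI dotted = URI.create("s3a://test-bucket-name-v1.0/foo.csv");

        System.out.println(plain.getHost());      // test-bucket-name
        System.out.println(dotted.getHost());     // null: last label "0" starts with a digit
        System.out.println(dotted.getAuthority()); // test-bucket-name-v1.0 (registry-based)
    }
}
```

Hadoop's S3xLoginHelper.buildFSURI requires a non-null host, which matches the requireNonNull frame at the top of the trace.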
[jira] [Resolved] (SPARK-32772) Reduce log messages for spark-sql CLI
[ https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32772. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29619 [https://github.com/apache/spark/pull/29619] > Reduce log messages for spark-sql CLI > - > > Key: SPARK-32772 > URL: https://issues.apache.org/jira/browse/SPARK-32772 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.0 > > > When we launch the spark-sql CLI, too many log messages are shown and it's > sometimes difficult to find the result of a query. > So I think it's better to reduce log messages like the spark-shell and pyspark > CLIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
[ https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32077. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28911 [https://github.com/apache/spark/pull/28911] > Support host-local shuffle data reading with external shuffle service disabled > -- > > Key: SPARK-32077 > URL: https://issues.apache.org/jira/browse/SPARK-32077 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > After SPARK-27651, Spark can read host-local shuffle data directly from disk > with external shuffle service enabled. To extend the feature, we can also > support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32077) Support host-local shuffle data reading with external shuffle service disabled
[ https://issues.apache.org/jira/browse/SPARK-32077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32077: - Assignee: wuyi > Support host-local shuffle data reading with external shuffle service disabled > -- > > Key: SPARK-32077 > URL: https://issues.apache.org/jira/browse/SPARK-32077 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > After SPARK-27651, Spark can read host-local shuffle data directly from disk > with external shuffle service enabled. To extend the feature, we can also > support it with external shuffle service disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189668#comment-17189668 ] Gaurangi Saxena edited comment on SPARK-32135 at 9/2/20, 7:45 PM: -- [~tgraves] Apologies for the late response. We would like to see the driver host of an application on the history server page. We could see it in executors page, but we would need 2 clicks to get there. This feature is useful in a multi-cluster environment with a single JHS used to index history files for all clusters to understand on what cluster a job was executed. Driver-host will help identify the cluster. I have added another Jira (https://issues.apache.org/jira/browse/SPARK-32097) that will allow log-directory to accept wild-cards. Together these changes can help configuring a single UI for multiple spark clusters. !image-2020-09-02-12-37-55-860.png! was (Author: gaurangi): [~tgraves] Apologies for the late response. We would like to see the driver host of an application on the history server page. We could see it in executors page, but we would need 2 clicks to get there. This feature is useful in a multi-cluster environment with a single JHS used to index history files for all clusters to understand on what cluster a job was executed. Driver-host will help identify the cluster. !image-2020-09-02-12-37-55-860.png! > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32097) Allow reading history log files from multiple directories
[ https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurangi Saxena updated SPARK-32097: Description: We would like to configure SparkHistoryServer to display applications from multiple clusters/environments. Data displayed on this UI comes from directory configured as log-directory. It would be nice if this log-directory also accepted regex. This way we will be able to read and display applications from multiple directories. was:We would like to have a regex kind support in displaying log files on the Spark history server. This way we will be able to see applications that were run on multiple clusters instead of just one. > Allow reading history log files from multiple directories > - > > Key: SPARK-32097 > URL: https://issues.apache.org/jira/browse/SPARK-32097 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > > We would like to configure SparkHistoryServer to display applications from > multiple clusters/environments. Data displayed on this UI comes from > directory configured as log-directory. It would be nice if this log-directory > also accepted regex. This way we will be able to read and display > applications from multiple directories. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32097) Allow reading history log files from multiple directories
[ https://issues.apache.org/jira/browse/SPARK-32097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurangi Saxena updated SPARK-32097: Issue Type: Wish (was: Bug) > Allow reading history log files from multiple directories > - > > Key: SPARK-32097 > URL: https://issues.apache.org/jira/browse/SPARK-32097 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > > We would like to have a regex kind support in displaying log files on the > Spark history server. This way we will be able to see applications that were > run on multiple clusters instead of just one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189668#comment-17189668 ] Gaurangi Saxena commented on SPARK-32135: - [~tgraves] Apologies for the late response. We would like to see the driver host of an application on the history server page. We could see it in executors page, but we would need 2 clicks to get there. This feature is useful in a multi-cluster environment with a single JHS used to index history files for all clusters to understand on what cluster a job was executed. Driver-host will help identify the cluster. !image-2020-09-02-12-37-55-860.png! > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32135) Show Spark Driver name on Spark history web page
[ https://issues.apache.org/jira/browse/SPARK-32135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurangi Saxena updated SPARK-32135: Attachment: image-2020-09-02-12-37-55-860.png > Show Spark Driver name on Spark history web page > > > Key: SPARK-32135 > URL: https://issues.apache.org/jira/browse/SPARK-32135 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Gaurangi Saxena >Priority: Minor > Attachments: image-2020-09-02-12-37-55-860.png > > > We would like to see spark driver host on the history server web page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
Aman Rastogi created SPARK-32778: Summary: Accidental Data Deletion on calling saveAsTable Key: SPARK-32778 URL: https://issues.apache.org/jira/browse/SPARK-32778 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Aman Rastogi {code:java} df.write.option("path", "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) {code} The code above deleted the data present at path "/already/existing/path". This happened because the table was not yet in the Hive metastore, while the given path already had data. If the table is not present in the Hive metastore, the SaveMode is internally changed to SaveMode.Overwrite irrespective of what the user provided, which leads to data deletion. This change was introduced as part of https://issues.apache.org/jira/browse/SPARK-19583. Now, suppose the user is not using an external Hive metastore (the metastore is associated with a cluster) and the cluster goes down, or for some other reason the user has to migrate to a new cluster. When the user tries to save data using the code above on the new cluster, it will first delete the data. This could be production data, and the user is completely unaware of the deletion because they provided SaveMode.Append or ErrorIfExists. This amounts to accidental data deletion. Proposed Fix: Instead of modifying the SaveMode to Overwrite, we should modify it to ErrorIfExists in class CreateDataSourceTableAsSelectCommand. Change (line 154) {code:java} val result = saveDataIntoTable( sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false) {code} to {code:java} val result = saveDataIntoTable( sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false){code} This should not break CTAS. Even in the CTAS case, the user may not want to delete data that already exists, as the deletion could be accidental. 
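The mode override described in the issue is easy to state precisely. Below is a minimal, self-contained Python model of the decision (not Spark's actual Scala code in CreateDataSourceTableAsSelectCommand; the function and flag names are invented for illustration), contrasting the current behaviour with the proposed ErrorIfExists fallback:

```python
from enum import Enum

class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"

def effective_mode(user_mode, table_in_metastore, proposed_fix=False):
    """Model of the override: when the table is absent from the metastore,
    current behaviour forces Overwrite regardless of the user's choice; the
    proposed fix would force ErrorIfExists, failing fast instead of deleting
    whatever data already sits at the path."""
    if table_in_metastore:
        return user_mode
    return SaveMode.ERROR_IF_EXISTS if proposed_fix else SaveMode.OVERWRITE

# Current behaviour: Append silently becomes Overwrite for an unknown table.
assert effective_mode(SaveMode.APPEND, table_in_metastore=False) is SaveMode.OVERWRITE
# Proposed behaviour: the write errors out and the path data survives.
assert effective_mode(SaveMode.APPEND, table_in_metastore=False,
                      proposed_fix=True) is SaveMode.ERROR_IF_EXISTS
```

The sketch makes the hazard visible: the user's chosen mode is consulted only when the metastore already knows the table, which is exactly the condition that fails after a cluster migration.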
[jira] [Commented] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema
[ https://issues.apache.org/jira/browse/SPARK-28079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189634#comment-17189634 ] Jeff Harrison commented on SPARK-28079: --- I would also like something along these lines - it would greatly simplify our product. columnNameOfCorruptRecord would ideally be created when mode is permissive. columnNameOfCorruptRecord is not created when a schema is created via headers – I would have to load the data twice. This solution wouldn't work at all for some of our JSON data. > CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is > manually added to the schema > - > > Key: SPARK-28079 > URL: https://issues.apache.org/jira/browse/SPARK-28079 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.3 >Reporter: F Jimenez >Priority: Major > > When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged > as such and read in. Only way to get them flagged is to manually set > "columnNameOfCorruptRecord" AND manually setting the schema including this > column. 
Example: > {code:java} > // Second row has a 4th column that is not declared in the header/schema > val csvText = s""" > | FieldA, FieldB, FieldC > | a1,b1,c1 > | a2,b2,c2,d*""".stripMargin > val csvFile = new File("/tmp/file.csv") > FileUtils.write(csvFile, csvText) > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > This produces the correct result: > {code:java} > ++--+--+--+ > |corrupt |fieldA|fieldB|fieldC| > ++--+--+--+ > |null | a1 |b1 |c1 | > | a2,b2,c2,d*| a2 |b2 |c2 | > ++--+--+--+ > {code} > However removing the "schema" option and going: > {code:java} > val reader = sqlContext.read > .format("csv") > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "corrupt") > reader.load(csvFile.getAbsolutePath).show(truncate = false) > {code} > Yields: > {code:java} > +---+---+---+ > | FieldA| FieldB| FieldC| > +---+---+---+ > | a1 |b1 |c1 | > | a2 |b2 |c2 | > +---+---+---+ > {code} > The fourth value "d*" in the second row has been removed and the row not > marked as corrupt > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
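The permissive-mode behaviour the reporter expects can be sketched independently of Spark. This toy parser (illustrative only, not Spark's CSV code path; all names are invented) flags any row whose field count disagrees with the header by recording the raw line under a corrupt-record column:

```python
def parse_permissive(text, corrupt_col="corrupt"):
    """Toy permissive CSV parser: a row whose field count differs from the
    header keeps only its declared columns, and the raw line is stored under
    the corrupt-record column (None for well-formed rows)."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    header = [h.strip() for h in lines[0].split(",")]
    rows = []
    for raw in lines[1:]:
        fields = [f.strip() for f in raw.split(",")]
        row = dict(zip(header, fields))          # extra fields are dropped
        row[corrupt_col] = raw if len(fields) != len(header) else None
        rows.append(row)
    return rows

csv_text = """FieldA,FieldB,FieldC
a1,b1,c1
a2,b2,c2,d*"""
rows = parse_permissive(csv_text)
assert rows[0]["corrupt"] is None            # clean row: not flagged
assert rows[1]["corrupt"] == "a2,b2,c2,d*"   # extra 4th field: flagged
```

The point of the issue is that Spark only behaves like this when the corrupt column is declared in an explicit schema; with an inferred schema the extra field is silently discarded, as the second output above shows.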
[jira] [Commented] (SPARK-32746) Not able to run Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-32746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189589#comment-17189589 ] Rahul Bhatia commented on SPARK-32746: -- Yes, I can run other PySpark code without problems. I tried changing from PySpark 3.0.0 to PySpark 2.4.0, and now it works. However, looking at the Spark UI, I can see that it does not run in parallel: it runs on only one core of one executor while all other executors are idle. I have about 20 executors, each with 16 cores and 16 GB RAM. Can you suggest what might be wrong here? I am partitioning on the GroupBy key and have 200 partitions for now. Can I change something to achieve maximum possible parallelism and utilise all executor cores? > Not able to run Pandas UDF > --- > > Key: SPARK-32746 > URL: https://issues.apache.org/jira/browse/SPARK-32746 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: PySpark 3.0.0 > PyArrow - 1.0.1 (also tried with PyArrow 0.15.1, no progress there) > Pandas - 0.25.3 > >Reporter: Rahul Bhatia >Priority: Major > Attachments: Screenshot 2020-08-31 at 9.04.07 AM.png > > > Hi, > I am facing issues in running a Pandas UDF on a YARN cluster with multiple > nodes. I am trying to apply a simple DBSCAN algorithm to multiple groups in > my dataframe; to start with, I am just using a simple example to test things > out - > {code:python} > import pandas as pd > from pyspark.sql.types import StructType, StructField, DoubleType, > StringType, IntegerType > from sklearn.cluster import DBSCAN > from pyspark.sql.functions import pandas_udf, PandasUDFType > data = [(1, 11.6133, 48.1075), > (1, 11.6142, 48.1066), > (1, 11.6108, 48.1061), > (1, 11.6207, 48.1192), > (1, 11.6221, 48.1223), > (1, 11.5969, 48.1276), > (2, 11.5995, 48.1258), > (2, 11.6127, 48.1066), > (2, 11.6430, 48.1275), > (2, 11.6368, 48.1278), > (2, 11.5930, 48.1156)] > df = spark.createDataFrame(data, ["id", "X", "Y"]) > output_schema = StructType( > 
[ > StructField('id', IntegerType()), > StructField('X', DoubleType()), > StructField('Y', DoubleType()), > StructField('cluster', IntegerType()) > ] > ) > @pandas_udf(output_schema, PandasUDFType.GROUPED_MAP) > def dbscan(data): > data["cluster"] = DBSCAN(eps=5, min_samples=3).fit_predict(data[["X", > "Y"]]) > result = pd.DataFrame(data, columns=["id", "X", "Y", "cluster"]) > return result > res = df.groupby("id").apply(dbscan) > res.show() > {code} > > The code keeps running forever on the yarn cluster, I expect it to be > finished within seconds(this works fine on standalone mode and finishes in > 2-4 seconds), on checking the Spark UI, I can see that the Spark job is > stuck(99/580) and doesn't make any progress forever. > > Also it doesn't run in parallel, am I missing something? !Screenshot > 2020-08-31 at 9.04.07 AM.png! > > > I am new to Spark, and still trying to understand a lot of things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
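One plausible cause of the single-busy-core symptom described in this thread is group-key cardinality: a groupBy shuffle routes each distinct key to exactly one partition, so with only a handful of distinct ids, only that many tasks have any work, no matter how many partitions or executor cores exist. A simplified model (not Spark's actual partitioner):

```python
def busy_partitions(keys, num_partitions):
    """Hash-partition group keys the way a groupBy shuffle would: each
    distinct key lands in exactly one partition, so at most len(set(keys))
    tasks ever have work, however many partitions or cores exist."""
    parts = [set() for _ in range(num_partitions)]
    for key in set(keys):
        parts[hash(key) % num_partitions].add(key)
    return sum(1 for p in parts if p)

# Two distinct ids -> at most two of the 200 shuffle partitions are non-empty,
# so at most two cores can be busy during the grouped-map stage.
assert busy_partitions([1, 1, 1, 2, 2], 200) <= 2
# Many distinct keys -> every partition can receive work.
assert busy_partitions(list(range(10_000)), 8) == 8
```

The toy example in the issue has only ids 1 and 2, so even a healthy cluster would use at most two cores for the DBSCAN stage; real parallelism requires many distinct group keys.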
[jira] [Commented] (SPARK-32766) s3a: bucket names with dots cannot be used
[ https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189279#comment-17189279 ] Steve Loughran commented on SPARK-32766: not going to be fixed in the s3a code, even if there was an easy way. By the end of the month, it will be impossible to talk to any newly created S3 bucket containing a . in their name. Existing ones may work, but not in this case where the mixing of hostnames and numbers confuses the java URI parser https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/ > s3a: bucket names with dots cannot be used > -- > > Key: SPARK-32766 > URL: https://issues.apache.org/jira/browse/SPARK-32766 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: Ondrej Kokes >Priority: Minor > > Running vanilla spark with > {noformat} > --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat} > I cannot read from S3, if the bucket name contains a dot (a valid name). > A minimal reproducible example looks like this > {{from pyspark.sql import SparkSession}} > {{import pyspark.sql.functions as f}} > {{if __name__ == '__main__':}} > {{ spark = (SparkSession}} > {{ .builder}} > {{ .appName('my_app')}} > {{ .master("local[*]")}} > {{ .getOrCreate()}} > {{ )}} > {{ spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")}} > Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read > that CSV. I created the same bucket without the period and it worked fine. > *Now I'm not sure whether this is a thing of prepping the path names and > passing them to the aws-sdk, or whether the fault is within the SDK itself. I > am not Java savvy to investigate the issue further, but I tried to make the > repro as short as possible.* > > I get different errors depending on which Hadoop distributions I use. 
If I > use the default PySpark distribution (which includes Hadoop 2), I get the > following (using hadoop-aws:2.7.4) > {{scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()}} > {{java.lang.IllegalArgumentException: The bucketName parameter must be > specified.}} > {{ at > com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)}} > {{ at > com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)}} > {{ at > com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)}} > {{ at > org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)}} > {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)}} > {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)}} > {{ at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)}} > {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)}} > {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)}} > {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)}} > {{ at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}} > {{ at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}} > {{ at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}} > {{ at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}} > {{ at scala.Option.getOrElse(Option.scala:189)}} > {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}} > {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}} > {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}} > {{ ... 
47 elided}} > When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this > error (with hadoop-aws:3.2.0): > {{java.lang.NullPointerException: null uri host.}} > {{ at java.base/java.util.Objects.requireNonNull(Objects.java:246)}} > {{ at > org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)}} > {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)}} > {{ at > org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)}} > {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)}} > {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)}} > {{ at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)}} > {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)}} > {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)}} > {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)}} > {{ at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}} > {{ at >
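The underlying constraint behind the comment above is that virtual-hosted-style S3 addressing needs DNS-safe bucket names, which dots break (and, as the stack traces show, a numeric label such as "1.0" also confuses the Java URI parser). A rough compatibility check, approximating AWS's naming rules for illustration only:

```python
import re

# Virtual-hosted-style addressing effectively requires a DNS-safe label:
# lowercase letters, digits and hyphens, 3-63 characters, starting and
# ending alphanumeric, and crucially no dots. This regex approximates
# those rules and is not AWS's authoritative validator.
_VHOST_SAFE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")

def vhost_compatible(bucket):
    return bool(_VHOST_SAFE.match(bucket))

assert vhost_compatible("test-bucket-name-v1")        # fine without the dot
assert not vhost_compatible("test-bucket-name-v1.0")  # the dot breaks it
```

This matches the reporter's observation that the same bucket name without the period works fine.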
[jira] [Resolved] (SPARK-31670) Using complex type in Aggregation with cube failed Analysis error
[ https://issues.apache.org/jira/browse/SPARK-31670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31670. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28490 [https://github.com/apache/spark/pull/28490] > Using complex type in Aggregation with cube failed Analysis error > - > > Key: SPARK-31670 > URL: https://issues.apache.org/jira/browse/SPARK-31670 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0 > > > Will wrong with below SQL > {code:java} > test("TEST STRUCT FIELD WITH GROUP BY with CUBE") { > withTable("t1") { > sql( > """create table t1( > |a string, > |b int, > |c array>) > |using orc""".stripMargin) > sql( > """ > |select a, > coalesce(get_json_object(each.json_string,'$.iType'),'-127') as iType, sum(b) > |from t1 > |LATERAL VIEW explode(c) x AS each > |group by a, get_json_object(each.json_string,'$.iType') > |with cube > |""".stripMargin).explain(true) > } > } > {code} > Error > {code:java} > expression 'x.`each`' is neither present in the group by, nor is it an > aggregate function. 
Add to group by or wrap in first() (or first_value) if > you don't care which value you get.;; > Aggregate [a#230, get_json_object(each#222.json_string AS json_string#223, > $.iType)#231, spark_grouping_id#229L], [a#230, > coalesce(get_json_object(each#222.json_string, $.iType), -127) AS iType#218, > sum(cast(b#220 as bigint)) AS sum(b)#226L] > +- Expand [List(a#219, b#220, c#221, each#222, a#227, > get_json_object(each#222.json_string AS json_string#223, $.iType)#228, 0), > List(a#219, b#220, c#221, each#222, a#227, null, 1), List(a#219, b#220, > c#221, each#222, null, get_json_object(each#222.json_string AS > json_string#223, $.iType)#228, 2), List(a#219, b#220, c#221, each#222, null, > null, 3)], [a#219, b#220, c#221, each#222, a#230, > get_json_object(each#222.json_string AS json_string#223, $.iType)#231, > spark_grouping_id#229L] >+- Project [a#219, b#220, c#221, each#222, a#219 AS a#227, > get_json_object(each#222.json_string, $.iType) AS > get_json_object(each#222.json_string AS json_string#223, $.iType)#228] > +- Generate explode(c#221), false, x, [each#222] > +- SubqueryAlias spark_catalog.default.t1 > +- Relation[a#219,b#220,c#221] orc > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31670) Using complex type in Aggregation with cube failed Analysis error
[ https://issues.apache.org/jira/browse/SPARK-31670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31670: --- Assignee: angerszhu > Using complex type in Aggregation with cube failed Analysis error > - > > Key: SPARK-31670 > URL: https://issues.apache.org/jira/browse/SPARK-31670 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Will wrong with below SQL > {code:java} > test("TEST STRUCT FIELD WITH GROUP BY with CUBE") { > withTable("t1") { > sql( > """create table t1( > |a string, > |b int, > |c array>) > |using orc""".stripMargin) > sql( > """ > |select a, > coalesce(get_json_object(each.json_string,'$.iType'),'-127') as iType, sum(b) > |from t1 > |LATERAL VIEW explode(c) x AS each > |group by a, get_json_object(each.json_string,'$.iType') > |with cube > |""".stripMargin).explain(true) > } > } > {code} > Error > {code:java} > expression 'x.`each`' is neither present in the group by, nor is it an > aggregate function. 
Add to group by or wrap in first() (or first_value) if > you don't care which value you get.;; > Aggregate [a#230, get_json_object(each#222.json_string AS json_string#223, > $.iType)#231, spark_grouping_id#229L], [a#230, > coalesce(get_json_object(each#222.json_string, $.iType), -127) AS iType#218, > sum(cast(b#220 as bigint)) AS sum(b)#226L] > +- Expand [List(a#219, b#220, c#221, each#222, a#227, > get_json_object(each#222.json_string AS json_string#223, $.iType)#228, 0), > List(a#219, b#220, c#221, each#222, a#227, null, 1), List(a#219, b#220, > c#221, each#222, null, get_json_object(each#222.json_string AS > json_string#223, $.iType)#228, 2), List(a#219, b#220, c#221, each#222, null, > null, 3)], [a#219, b#220, c#221, each#222, a#230, > get_json_object(each#222.json_string AS json_string#223, $.iType)#231, > spark_grouping_id#229L] >+- Project [a#219, b#220, c#221, each#222, a#219 AS a#227, > get_json_object(each#222.json_string, $.iType) AS > get_json_object(each#222.json_string AS json_string#223, $.iType)#228] > +- Generate explode(c#221), false, x, [each#222] > +- SubqueryAlias spark_catalog.default.t1 > +- Relation[a#219,b#220,c#221] orc > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32739) support prune right for left semi join in DPP
[ https://issues.apache.org/jira/browse/SPARK-32739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-32739. - Fix Version/s: 3.1.0 Assignee: Zhenhua Wang Resolution: Fixed Issue resolved by pull request 29582 https://github.com/apache/spark/pull/29582 > support prune right for left semi join in DPP > - > > Key: SPARK-32739 > URL: https://issues.apache.org/jira/browse/SPARK-32739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-32756. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29584 [https://github.com/apache/spark/pull/29584] > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Assignee: Karol Chmist >Priority: Minor > Fix For: 3.1.0 > > > > "Spark SQL" module doesn't compile in Scala 2.13: > {code:java} > [info] Compiling 26 Scala sources to > /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+("path".$minus$greater(path)) > [error] Error occurred in an application involving default arguments. > [error] this.extraOptions += ("path" -> path) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] Error occurred in an application involving default arguments. > [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? 
> [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: DataFrameWriter.this.extraOptions = > DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) > [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: > type mismatch; > [error] found : > scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] CaseInsensitiveMap(dedupPrimitiveFields) > [error] ^ > [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] new CaseInsensitiveStringMap(withoutPath.asJava) > [error] ^ > [info] Iterable[(String, String)] <:
[jira] [Assigned] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-32756: Assignee: Karol Chmist > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Assignee: Karol Chmist >Priority: Minor > > > "Spark SQL" module doesn't compile in Scala 2.13: > {code:java} > [info] Compiling 26 Scala sources to > /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: > value += is not a member of > 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+("path".$minus$greater(path)) > [error] Error occurred in an application involving default arguments. > [error] this.extraOptions += ("path" -> path) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] Error occurred in an application involving default arguments. > [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: DataFrameWriter.this.extraOptions = > DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) > [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: > type mismatch; > [error] found : > scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] 
required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] CaseInsensitiveMap(dedupPrimitiveFields) > [error] ^ > [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] new CaseInsensitiveStringMap(withoutPath.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? > [error] 7 errors found{code} > > The {{+}} function in CaseInsensitiveStringMap is missing, resulting in {{+}}
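The invariant the Scala 2.13 build lost is that updating a CaseInsensitiveMap should yield another CaseInsensitiveMap rather than a plain Map. A minimal Python sketch of that invariant (not the Spark class; the method names are invented):

```python
class CaseInsensitiveMap(dict):
    """Minimal sketch of a case-insensitive string map whose update
    operation returns the same wrapper type -- the property at stake in the
    Scala 2.13 build, where `+` on the map yields a plain immutable Map."""

    def __init__(self, data=None):
        super().__init__()
        for key, value in (data or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)   # normalise on insert

    def __getitem__(self, key):
        return super().__getitem__(key.lower())   # normalise on lookup

    def updated(self, key, value):
        # Analogue of Scala's `m + (k -> v)`: build a new map of the SAME type.
        copy = CaseInsensitiveMap(self)
        copy[key] = value
        return copy

m = CaseInsensitiveMap({"Path": "/tmp/a"})
m2 = m.updated("PATH", "/tmp/b")
assert isinstance(m2, CaseInsensitiveMap)   # wrapper type preserved
assert m2["path"] == "/tmp/b"
```

When the update returns the plain base type instead, every `extraOptions += (key -> value)` site fails exactly as the compile errors above show.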
[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189128#comment-17189128 ] Hyukjin Kwon commented on SPARK-32776: -- Sure, thanks [~kabhwan]. > Limit in streaming should not be optimized away by PropagateEmptyRelation > - > > Key: SPARK-32776 > URL: https://issues.apache.org/jira/browse/SPARK-32776 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Liwen Sun >Assignee: Liwen Sun >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > Right now, the limit operator in a streaming query may get optimized away > when the relation is empty. This can be problematic for stateful streaming, > as this empty batch will not write any state store files, and the next batch > will fail when trying to read these state store files and throw a file not > found error. > We should not let PropagateEmptyRelation optimize away the Limit operator for > streaming queries. > This ticket is intended to apply a small and safe fix for > PropagateEmptyRelation. A fundamental fix that can prevent this from > happening again in the future and in other optimizer rules is more desirable, > but that's a much larger task. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
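The fix described in this ticket can be modelled as a guard inside the rule: fold operators over an empty relation as usual, but leave a Limit alone when the plan is streaming. A toy Python version (a simplified model, not Spark's Catalyst code; all names are invented):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    op: str                         # e.g. "LocalRelation", "Limit"
    child: Optional["Plan"] = None
    empty: bool = False             # relation is known to produce no rows
    streaming: bool = False

def propagate_empty_relation(plan):
    """Toy version of the rule: fold an operator over an empty relation into
    the empty relation itself -- except a Limit in a streaming plan, which
    must survive so the empty micro-batch still writes state-store files."""
    if plan.child is None:
        return plan
    child = propagate_empty_relation(plan.child)
    if child.empty and not (plan.op == "Limit" and plan.streaming):
        return child                # optimized away
    return Plan(plan.op, child, streaming=plan.streaming)

empty = Plan("LocalRelation", empty=True)
assert propagate_empty_relation(Plan("Limit", empty)).op == "LocalRelation"
assert propagate_empty_relation(Plan("Limit", empty, streaming=True)).op == "Limit"
```

The batch Limit collapses into the empty relation as before, while the streaming Limit survives, which is the narrow, safe behaviour change the ticket proposes.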
[jira] [Assigned] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32752: Assignee: (was: Apache Spark) > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32752: Assignee: Apache Spark > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189126#comment-17189126 ] Apache Spark commented on SPARK-32752: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29627 > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
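The parse failures above come from a grammar ambiguity: interval unit keywords and bare column aliases share the same identifier space, so a greedy interval parser swallows an alias that happens to be a unit name (`day`). The toy Python parser below is an illustration of that ambiguity only, not Spark's actual ANTLR grammar; all names are invented for the sketch.

```python
# Interval unit keywords double as plausible alias names.
UNITS = {"year", "month", "week", "day", "hour", "minute", "second"}

def parse_interval(tokens):
    """Greedily consume unit keywords after the quoted interval literal,
    mimicking a grammar in which units and aliases are both bare identifiers."""
    value, rest = tokens[0], list(tokens[1:])
    consumed_units = []
    while rest and rest[0] in UNITS:
        consumed_units.append(rest.pop(0))
    return value, consumed_units, rest

# select interval '1 day' day  -- the intended alias "day" is eaten as a
# unit, so nothing is left over to serve as the column alias
value, units, rest = parse_interval(["'1 day'", "day"])
assert units == ["day"] and rest == []

# An alias that is not a unit keyword survives as expected
value, units, rest = parse_interval(["'1 day'", "amount"])
assert units == [] and rest == ["amount"]
```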
[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189125#comment-17189125 ] Jungtaek Lim commented on SPARK-32776: -- It sounds safer to mark the 3.0.x fixes as 3.0.2 until the vote is open - it's relatively easy to find issues marked as 3.0.2 and correct them in case the vote doesn't go the expected way. > Limit in streaming should not be optimized away by PropagateEmptyRelation > - > > Key: SPARK-32776 > URL: https://issues.apache.org/jira/browse/SPARK-32776 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Liwen Sun >Assignee: Liwen Sun >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > Right now, the limit operator in a streaming query may get optimized away > when the relation is empty. This can be problematic for stateful streaming, > as this empty batch will not write any state store files, and the next batch > will fail when trying to read these state store files and throw a file not > found error. > We should not let PropagateEmptyRelation optimize away the Limit operator for > streaming queries. > This ticket is intended to apply a small and safe fix for > PropagateEmptyRelation. A fundamental fix that can prevent this from > happening again in the future and in other optimizer rules is more desirable, > but that's a much larger task. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
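The rule change discussed in this thread can be pictured as a guard inside the optimizer. The following is a toy Python model of a PropagateEmptyRelation-style rewrite; the class and function names are invented for illustration, and Spark's real rule operates on Catalyst logical plans in Scala.

```python
# Toy logical plan nodes standing in for Catalyst operators.
class EmptyRelation:
    def __init__(self, is_streaming=False):
        self.is_streaming = is_streaming

class Limit:
    def __init__(self, n, child):
        self.n = n
        self.child = child

def propagate_empty_relation(plan):
    # Collapsing Limit(EmptyRelation) is safe for batch plans. For a
    # streaming plan the Limit is stateful: eliding it for an empty
    # micro-batch skips writing state-store files, and the next batch
    # fails with a file-not-found error when it tries to read them.
    if isinstance(plan, Limit) and isinstance(plan.child, EmptyRelation):
        if not plan.child.is_streaming:
            return plan.child  # batch: optimize the Limit away
    return plan  # streaming (or no match): keep the operator

batch_plan = propagate_empty_relation(Limit(10, EmptyRelation(is_streaming=False)))
stream_plan = propagate_empty_relation(Limit(10, EmptyRelation(is_streaming=True)))
```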
[jira] [Updated] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32776: - Fix Version/s: (was: 3.0.1) 3.0.2 > Limit in streaming should not be optimized away by PropagateEmptyRelation > - > > Key: SPARK-32776 > URL: https://issues.apache.org/jira/browse/SPARK-32776 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Liwen Sun >Assignee: Liwen Sun >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > Right now, the limit operator in a streaming query may get optimized away > when the relation is empty. This can be problematic for stateful streaming, > as this empty batch will not write any state store files, and the next batch > will fail when trying to read these state store files and throw a file not > found error. > We should not let PropagateEmptyRelation optimize away the Limit operator for > streaming queries. > This ticket is intended to apply a small and safe fix for > PropagateEmptyRelation. A fundamental fix that can prevent this from > happening again in the future and in other optimizer rules is more desirable, > but that's a much larger task. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32768) Add Parquet Timestamp output configuration to docs
[ https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32768. -- Resolution: Not A Problem > Add Parquet Timestamp output configuration to docs > -- > > Key: SPARK-32768 > URL: https://issues.apache.org/jira/browse/SPARK-32768 > Project: Spark > Issue Type: Documentation > Components: docs >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, > 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0 >Reporter: Ron DeFreitas >Priority: Minor > Labels: docs-missing, parquet > > {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration > option for controlling the underlying datatype used when writing Timestamp > column types into parquet files. This option is helpful for compatibility > with external systems that need to read the output from Spark.}} > {{This was never exposed in the documentation. Fix should be applied to docs > for both the next 3.x release and 2.4.x release.}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32776: Assignee: Liwen Sun > Limit in streaming should not be optimized away by PropagateEmptyRelation > - > > Key: SPARK-32776 > URL: https://issues.apache.org/jira/browse/SPARK-32776 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Liwen Sun >Assignee: Liwen Sun >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > Right now, the limit operator in a streaming query may get optimized away > when the relation is empty. This can be problematic for stateful streaming, > as this empty batch will not write any state store files, and the next batch > will fail when trying to read these state store files and throw a file not > found error. > We should not let PropagateEmptyRelation optimize away the Limit operator for > streaming queries. > This ticket is intended to apply a small and safe fix for > PropagateEmptyRelation. A fundamental fix that can prevent this from > happening again in the future and in other optimizer rules is more desirable, > but that's a much larger task. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32776. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed > Limit in streaming should not be optimized away by PropagateEmptyRelation > - > > Key: SPARK-32776 > URL: https://issues.apache.org/jira/browse/SPARK-32776 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Liwen Sun >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > Right now, the limit operator in a streaming query may get optimized away > when the relation is empty. This can be problematic for stateful streaming, > as this empty batch will not write any state store files, and the next batch > will fail when trying to read these state store files and throw a file not > found error. > We should not let PropagateEmptyRelation optimize away the Limit operator for > streaming queries. > This ticket is intended to apply a small and safe fix for > PropagateEmptyRelation. A fundamental fix that can prevent this from > happening again in the future and in other optimizer rules is more desirable, > but that's a much larger task. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189091#comment-17189091 ] Hyukjin Kwon commented on SPARK-32776: -- Fixed in https://github.com/apache/spark/pull/29623 > Limit in streaming should not be optimized away by PropagateEmptyRelation > - > > Key: SPARK-32776 > URL: https://issues.apache.org/jira/browse/SPARK-32776 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Liwen Sun >Priority: Major > > Right now, the limit operator in a streaming query may get optimized away > when the relation is empty. This can be problematic for stateful streaming, > as this empty batch will not write any state store files, and the next batch > will fail when trying to read these state store files and throw a file not > found error. > We should not let PropagateEmptyRelation optimize away the Limit operator for > streaming queries. > This ticket is intended to apply a small and safe fix for > PropagateEmptyRelation. A fundamental fix that can prevent this from > happening again in the future and in other optimizer rules is more desirable, > but that's a much larger task. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated SPARK-32317: --- Labels: (was: easyfix) > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: It's failing in all environments that I tried. >Reporter: Krish >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, > > We generate parquet files which are partitioned on Date on a daily basis, and > we sometimes send updates to historical data; we noticed that, due to a > configuration error, the patch data schema is inconsistent with earlier files. > Assume we had files generated with a schema having ID and Amount as fields: > historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), and the > files we send as updates have a schema like AMOUNT DECIMAL(15,2). > > With two different schemas in a Date partition, when we load the data of > a Date into Spark, it loads the data but the amount gets > manipulated. > > file1.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,6) > Content: > 1,19500.00 > 2,198.34 > file2.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,2) > Content: > 1,19500.00 > 3,198.34 > Load these two files together > df3 = spark.read.parquet("output/") > df3.show() # we can see the amount getting manipulated here > +-+---+ > |ID| AMOUNT| > +-+---+ > |1|1.95| > |3|0.019834| > |1|19500.00| > |2|198.34| > +-+---+ > > Options Tried: > We tried to give the schema as String for all fields, but that didn't work > df3 = spark.read.format("parquet").schema(schema).load("output/") > Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet > column cannot be converted in file file*.snappy.parquet. 
Column: > [AMOUNT], Expected: string, Found: INT64" > > I know schema merging works if it finds a few extra columns in one file, but the > fields which are in common need to have the same schema. That might not work > here. > > Looking for a workaround here, or if there is an option which I > haven't tried you can point me to it. > > With schema merging I got the error below: > An error occurred while calling o2272.parquet. : > org.apache.spark.SparkException: Failed merging schema: root |-- ID: string > (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107) > at > org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69) > at scala.Option.orElse(Option.scala:447) at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82) > at > 
org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
[jira] [Updated] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated SPARK-32317: --- Component/s: (was: PySpark) SQL > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: It's failing in all environments that I tried. >Reporter: Krish >Priority: Major > Labels: easyfix > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, > > We generate parquet files which are partitioned on Date on a daily basis, and > we sometimes send updates to historical data; we noticed that, due to a > configuration error, the patch data schema is inconsistent with earlier files. > Assume we had files generated with a schema having ID and Amount as fields: > historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), and the > files we send as updates have a schema like AMOUNT DECIMAL(15,2). > > With two different schemas in a Date partition, when we load the data of > a Date into Spark, it loads the data but the amount gets > manipulated. > > file1.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,6) > Content: > 1,19500.00 > 2,198.34 > file2.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,2) > Content: > 1,19500.00 > 3,198.34 > Load these two files together > df3 = spark.read.parquet("output/") > df3.show() # we can see the amount getting manipulated here > +-+---+ > |ID| AMOUNT| > +-+---+ > |1|1.95| > |3|0.019834| > |1|19500.00| > |2|198.34| > +-+---+ > > Options Tried: > We tried to give the schema as String for all fields, but that didn't work > df3 = spark.read.format("parquet").schema(schema).load("output/") > Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet > column cannot be converted in file file*.snappy.parquet. 
Column: > [AMOUNT], Expected: string, Found: INT64" > > I know schema merging works if it finds a few extra columns in one file, but the > fields which are in common need to have the same schema. That might not work > here. > > Looking for a workaround here, or if there is an option which I > haven't tried you can point me to it. > > With schema merging I got the error below: > An error occurred while calling o2272.parquet. : > org.apache.spark.SparkException: Failed merging schema: root |-- ID: string > (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107) > at > org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69) > at scala.Option.orElse(Option.scala:447) at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82) > at > 
org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
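The corrupted values in the `df3.show()` output above are consistent with the reader decoding one file's stored numbers using the other file's decimal scale. The stdlib-Python sketch below illustrates this plausible mechanism (it is an illustration of the arithmetic, not Spark's reader code): Parquet stores a decimal column as unscaled integers, and decoding file2's DECIMAL(15,2) values with file1's scale of 6 yields exactly 1.95 and 0.019834.

```python
from decimal import Decimal

def decode_parquet_decimal(unscaled: int, scale: int) -> Decimal:
    # Parquet stores a decimal column as unscaled integers; a reader
    # reconstructs the value by shifting the decimal point `scale` digits.
    return Decimal(unscaled).scaleb(-scale)

# file2 was written as DECIMAL(15,2):
#   19500.00 -> unscaled 1950000, 198.34 -> unscaled 19834
file2_unscaled = [1950000, 19834]

# Decoded with the file's own scale (2), the values round-trip.
correct = [decode_parquet_decimal(u, 2) for u in file2_unscaled]

# Decoded with file1's scale (6), they silently become 1.95 and 0.019834 --
# exactly the corrupted amounts shown in the report.
wrong = [decode_parquet_decimal(u, 6) for u in file2_unscaled]
```

This is why the corruption is silent: the unscaled longs are valid under either schema, so nothing fails at read time.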
[jira] [Assigned] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.
[ https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32777: Assignee: (was: Apache Spark) > Aggregation support aggregate function with multiple foldable expressions. > -- > > Key: SPARK-32777 > URL: https://issues.apache.org/jira/browse/SPARK-32777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL has a bug, shown below: > {code:java} > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 2, 3)| > +-++ > |1| 1| > +-++ > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 3, 2)| > +-++ > |1| 0| > +-++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.
[ https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189044#comment-17189044 ] Apache Spark commented on SPARK-32777: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/29626 > Aggregation support aggregate function with multiple foldable expressions. > -- > > Key: SPARK-32777 > URL: https://issues.apache.org/jira/browse/SPARK-32777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL has a bug, shown below: > {code:java} > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 2, 3)| > +-++ > |1| 1| > +-++ > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 3, 2)| > +-++ > |1| 0| > +-++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.
[ https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32777: Assignee: Apache Spark > Aggregation support aggregate function with multiple foldable expressions. > -- > > Key: SPARK-32777 > URL: https://issues.apache.org/jira/browse/SPARK-32777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Spark SQL has a bug, shown below: > {code:java} > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 2, 3)| > +-++ > |1| 1| > +-++ > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 3, 2)| > +-++ > |1| 0| > +-++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.
[ https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-32777: --- Description: Spark SQL has a bug, shown below: {code:java} spark.sql( " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") .show() +-++ |count(DISTINCT 2)|count(DISTINCT 3, 2)| +-++ |1| 0| +-++ {code} > Aggregation support aggregate function with multiple foldable expressions. > -- > > Key: SPARK-32777 > URL: https://issues.apache.org/jira/browse/SPARK-32777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL has a bug, shown below: > {code:java} > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 3, 2)| > +-++ > |1| 0| > +-++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32770) Add missing imports
[ https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong resolved SPARK-32770. -- Resolution: Won't Fix > Add missing imports > --- > > Key: SPARK-32770 > URL: https://issues.apache.org/jira/browse/SPARK-32770 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32770) Add missing imports
[ https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189030#comment-17189030 ] Rohit Mishra commented on SPARK-32770: -- [~fokko], While you are working on the PR, can you please add a description? > Add missing imports > --- > > Key: SPARK-32770 > URL: https://issues.apache.org/jira/browse/SPARK-32770 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.
jiaan.geng created SPARK-32777: -- Summary: Aggregation support aggregate function with multiple foldable expressions. Key: SPARK-32777 URL: https://issues.apache.org/jira/browse/SPARK-32777 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32777) Aggregation support aggregate function with multiple foldable expressions.
[ https://issues.apache.org/jira/browse/SPARK-32777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-32777: --- Description: Spark SQL has a bug, shown below: {code:java} spark.sql( " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)") .show() +-++ |count(DISTINCT 2)|count(DISTINCT 2, 3)| +-++ |1| 1| +-++ spark.sql( " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") .show() +-++ |count(DISTINCT 2)|count(DISTINCT 3, 2)| +-++ |1| 0| +-++ {code} was: Spark SQL has a bug, shown below: {code:java} spark.sql( " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") .show() +-++ |count(DISTINCT 2)|count(DISTINCT 3, 2)| +-++ |1| 0| +-++ {code} > Aggregation support aggregate function with multiple foldable expressions. > -- > > Key: SPARK-32777 > URL: https://issues.apache.org/jira/browse/SPARK-32777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark SQL has a bug, shown below: > {code:java} > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 2, 3)| > +-++ > |1| 1| > +-++ > spark.sql( > " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)") > .show() > +-++ > |count(DISTINCT 2)|count(DISTINCT 3, 2)| > +-++ > |1| 0| > +-++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
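For reference, the expected semantics of the queries above: COUNT(DISTINCT ...) over constant (foldable) expressions should return 1 for any non-empty input, regardless of argument order, so the second query's result of 0 is the bug. A small stdlib-Python sketch of those expected semantics (helper names are invented for illustration):

```python
def count_distinct(rows, exprs):
    # COUNT(DISTINCT e1, e2, ...) counts the distinct tuples produced by
    # evaluating the expressions over the rows; constant expressions
    # yield the same tuple for every row, so the count is 1 when any
    # rows exist and 0 only for empty input.
    return len({tuple(e(row) for e in exprs) for row in rows})

rows = [("a",), ("b",), ("c",)]  # any non-empty relation

r1 = count_distinct(rows, [lambda r: 2])               # COUNT(DISTINCT 2)
r2 = count_distinct(rows, [lambda r: 2, lambda r: 3])  # COUNT(DISTINCT 2, 3)
r3 = count_distinct(rows, [lambda r: 3, lambda r: 2])  # COUNT(DISTINCT 3, 2)
```

All three counts should be 1 here; Spark returning 0 for the `(3, 2)` ordering is what this ticket fixes.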
[jira] [Commented] (SPARK-24528) Add support to read multiple sorted bucket files for data source v1
[ https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189007#comment-17189007 ] Cheng Su commented on SPARK-24528: -- Changed the title of this Jira to `Add support to read multiple sorted bucket files for data source v1`, to describe the actual issue more accurately. > Add support to read multiple sorted bucket files for data source v1 > --- > > Key: SPARK-24528 > URL: https://issues.apache.org/jira/browse/SPARK-24528 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ohad Raviv >Priority: Major > > Closely related to > SPARK-24410, we're trying to optimize a very common use case we have of > getting the most updated row by id from a fact table. > We're saving the table bucketed to skip the shuffle stage, but we still > "waste" time on the Sort operator even though the data is already sorted. > Here's a good example: > {code:java} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("key", "t1") > .saveAsTable("a1"){code} > {code:java} > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, > key#24L, t1, t1#25L, t2, t2#26L))]) > +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, > t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))]) > +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, > Format: Parquet, Location: ...{code} > > and here's a bad example, but more realistic: > {code:java} > sparkSession.sql("set spark.sql.shuffle.partitions=2") > sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain > == Physical Plan == > SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, > 
key#32L, t1, t1#33L, t2, t2#34L))]) > +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, > t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))]) > +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0 > +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, > Format: Parquet, Location: ... > {code} > > I've traced the problem to DataSourceScanExec#235: > {code:java} > val sortOrder = if (sortColumns.nonEmpty) { > // In case of bucketing, its possible to have multiple files belonging to > the > // same bucket in a given relation. Each of these files are locally sorted > // but those files combined together are not globally sorted. Given that, > // the RDD partition will not be sorted even if the relation has sort > columns set > // Current solution is to check if all the buckets have a single file in it > val files = selectedPartitions.flatMap(partition => partition.files) > val bucketToFilesGrouping = > files.map(_.getPath.getName).groupBy(file => > BucketingUtils.getBucketId(file)) > val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= > 1){code} > so obviously the code avoids dealing with this situation now.. > could you think of a way to solve this or bypass it? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
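The "single file per bucket" check quoted above exists because concatenating several locally sorted files does not yield a globally sorted partition; supporting multiple sorted bucket files amounts to a k-way merge per bucket. A stdlib-Python sketch of that idea (the data is illustrative, and this is a model of the concept rather than Spark's scan implementation):

```python
import heapq

# Two files that landed in the same bucket, each locally sorted by (key, t1)
file_a = [(1, 0), (3, 1), (5, 0)]
file_b = [(2, 1), (3, 0), (8, 0)]

# Concatenation is NOT globally sorted, which is why the current check
# bails out when a bucket holds more than one file...
concatenated = file_a + file_b

# ...but a streaming k-way merge restores a global order per bucket, so the
# scan could keep reporting its sort columns and skip the downstream Sort.
merged = list(heapq.merge(file_a, file_b))
```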
[jira] [Updated] (SPARK-24528) Add support to read multiple sorted bucket files for data source v1
[ https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-24528: - Summary: Add support to read multiple sorted bucket files for data source v1 (was: Missing optimization for Aggregations/Windowing on a bucketed table)