[jira] [Created] (SPARK-34075) Hidden directories are being listed for partition inference
Burak Yavuz created SPARK-34075: --- Summary: Hidden directories are being listed for partition inference Key: SPARK-34075 URL: https://issues.apache.org/jira/browse/SPARK-34075 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Burak Yavuz Marking this as a blocker since it seems to be a regression. We are running Delta's tests against Spark 3.1 as part of QA here: [https://github.com/delta-io/delta/pull/579] We have noticed that one of our tests regressed with:
{code:java}
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
[info] file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551
[info] file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/_delta_log
[info]
[info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
[info] at scala.Predef$.assert(Predef.scala:223)
[info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:172)
[info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:104)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:158)
[info] at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:73)
[info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
[info] at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:167)
[info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:418)
[info] at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:62)
[info] at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:45)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
[info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
[info] at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:45)
[info] at org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:40)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
[info] at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info] at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info] at scala.collection.immutable.List.foldLeft(List.scala:89)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
[info] at scala.collection.immutable.List.foreach(List.scala:392)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:195)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:189)
{code}
It seems like a hidden directory is not being filtered out, when it actually should be. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
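For context, partition inference is supposed to skip hidden files and directories — names starting with "_" or ".", such as _delta_log above. A minimal sketch of that filtering rule, with illustrative helper names rather than Spark's actual private API:
{code:java}
object HiddenPathFilter {
  // A path segment is "hidden" if it starts with "_" or "." (e.g. _delta_log, .DS_Store).
  def isHidden(segment: String): Boolean =
    segment.startsWith("_") || segment.startsWith(".")

  // Keep only leaf directories with no hidden segment under the root, so
  // directories like _delta_log never participate in partition inference.
  def visibleLeafDirs(root: String, leafDirs: Seq[String]): Seq[String] =
    leafDirs.filterNot { dir =>
      dir.stripPrefix(root).split("/").filter(_.nonEmpty).exists(isHidden)
    }
}

// HiddenPathFilter.visibleLeafDirs("file:/t/x", Seq("file:/t/x", "file:/t/x/_delta_log"))
// => Seq("file:/t/x")  -- the assertion above fires when _delta_log leaks through
{code}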
[jira] [Resolved] (SPARK-31255) DataSourceV2: Add metadata columns
[ https://issues.apache.org/jira/browse/SPARK-31255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-31255. - Fix Version/s: 3.1.0 Assignee: Ryan Blue Resolution: Done Resolved by https://github.com/apache/spark/pull/28027 > DataSourceV2: Add metadata columns > -- > > Key: SPARK-31255 > URL: https://issues.apache.org/jira/browse/SPARK-31255 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.1.0 > > > DSv2 should support reading additional metadata columns that are not in a > table's schema. This allows users to project metadata like Kafka's offset, > timestamp, and partition. It also allows other sources to expose metadata > like file and row position. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
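Connector-side, this feature lets a table declare extra columns outside its schema. A rough sketch assuming the SupportsMetadataColumns/MetadataColumn interfaces introduced by the linked PR — verify the exact signatures against the Spark version you build on:
{code:java}
import java.util
import org.apache.spark.sql.connector.catalog.{MetadataColumn, SupportsMetadataColumns, TableCapability}
import org.apache.spark.sql.types.{DataType, LongType, StringType, StructType}

// Illustrative table exposing a "_row_offset" metadata column that is not
// part of its regular schema; queries can project it explicitly.
class OffsetAwareTable extends SupportsMetadataColumns {
  override def name(): String = "offset_aware"
  override def schema(): StructType =
    new StructType().add("id", LongType).add("data", StringType)
  override def capabilities(): util.Set[TableCapability] =
    util.Collections.emptySet[TableCapability]()

  override def metadataColumns(): Array[MetadataColumn] = Array(
    new MetadataColumn {
      override def name(): String = "_row_offset"
      override def dataType(): DataType = LongType
      override def isNullable(): Boolean = false
    })
}
{code}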
[jira] [Commented] (SPARK-31640) Support SHOW PARTITIONS for DataSource V2 tables
[ https://issues.apache.org/jira/browse/SPARK-31640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102746#comment-17102746 ] Burak Yavuz commented on SPARK-31640: - Hi [~younggyuchun], I'd take a look at how SHOW PARTITIONS works today: - [https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L1023] which returns a list of string paths. It would be great if, with [DataSourceV2 tables|https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java#L43], we could return the list of partitions ([https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java#L60]) where each Transform is a separate column. > Support SHOW PARTITIONS for DataSource V2 tables > > > Key: SPARK-31640 > URL: https://issues.apache.org/jira/browse/SPARK-31640 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > SHOW PARTITIONS is supported for V1 Hive tables. We can also support it for > V2 tables, where they return the transforms and the values of those > transforms as separate columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
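A sketch of the suggested direction — deriving the output columns of SHOW PARTITIONS from the table's partition transforms rather than from path strings. This is illustrative only; the exact output semantics are the open design question:
{code:java}
import org.apache.spark.sql.connector.catalog.Table

// One output column per partition transform, e.g. identity(col) and days(ts)
// would become two columns rather than one "col=.../ts=..." path string.
def partitionColumnNames(table: Table): Seq[String] =
  table.partitioning().map(_.describe()).toSeq
{code}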
[jira] [Created] (SPARK-31640) Support SHOW PARTITIONS for DataSource V2 tables
Burak Yavuz created SPARK-31640: --- Summary: Support SHOW PARTITIONS for DataSource V2 tables Key: SPARK-31640 URL: https://issues.apache.org/jira/browse/SPARK-31640 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz SHOW PARTITIONS is supported for V1 Hive tables. We can also support it for V2 tables, where they return the transforms and the values of those transforms as separate columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
[ https://issues.apache.org/jira/browse/SPARK-31624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-31624: --- Assignee: Burak Yavuz > SHOW TBLPROPERTIES doesn't handle Session Catalog correctly > --- > > Key: SPARK-31624 > URL: https://issues.apache.org/jira/browse/SPARK-31624 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session > catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
Burak Yavuz created SPARK-31624: --- Summary: SHOW TBLPROPERTIES doesn't handle Session Catalog correctly Key: SPARK-31624 URL: https://issues.apache.org/jira/browse/SPARK-31624 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data
[ https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29314. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/25987] > ProgressReporter.extractStateOperatorMetrics should not overwrite updated as > 0 when it actually runs a batch even with no data > -- > > Key: SPARK-29314 > URL: https://issues.apache.org/jira/browse/SPARK-29314 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > SPARK-24156 brought the ability to run a batch without actual data to enable > fast state cleanup as well as emit evicted outputs without waiting for actual > data to arrive. > This breaks an assumption in > `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source code: > {code:java} > // lastExecution could belong to one of the previous triggers if > `!hasNewData`. > // Walking the plan again should be inexpensive. > {code} > and newNumRowsUpdated is replaced with 0 if hasNewData is false. That makes sense > if we copy progress from the previous execution (which means no batch was run > this time), but after SPARK-24156 that precondition is broken. > Spark should still replace the value of newNumRowsUpdated with 0 if there's > no batch being run and it needs to copy the old value from the previous > execution, but it shouldn't touch the value if it runs a batch with no data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
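Put differently, the metric should only be zeroed when the progress is genuinely copied from a previous trigger; if a batch ran — even with no input rows — the number walked from the plan should stand. A schematic of that decision, with hypothetical names rather than Spark's:
{code:java}
// Schematic of the intended behavior (illustrative names, not Spark's):
//  - no batch ran this trigger            -> report 0 (plan is from an old trigger)
//  - a batch ran, even with no new input  -> keep the value walked from the plan
def reportedNumRowsUpdated(ranBatch: Boolean, fromPlan: Long): Long =
  if (ranBatch) fromPlan else 0L
{code}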
[jira] [Assigned] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data
[ https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-29314: --- Assignee: Jungtaek Lim > ProgressReporter.extractStateOperatorMetrics should not overwrite updated as > 0 when it actually runs a batch even with no data > -- > > Key: SPARK-29314 > URL: https://issues.apache.org/jira/browse/SPARK-29314 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > SPARK-24156 brought the ability to run a batch without actual data to enable > fast state cleanup as well as emit evicted outputs without waiting for actual > data to arrive. > This breaks an assumption in > `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source code: > {code:java} > // lastExecution could belong to one of the previous triggers if > `!hasNewData`. > // Walking the plan again should be inexpensive. > {code} > and newNumRowsUpdated is replaced with 0 if hasNewData is false. That makes sense > if we copy progress from the previous execution (which means no batch was run > this time), but after SPARK-24156 that precondition is broken. > Spark should still replace the value of newNumRowsUpdated with 0 if there's > no batch being run and it needs to copy the old value from the previous > execution, but it shouldn't touch the value if it runs a batch with no data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data
[ https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-31278. - Resolution: Fixed > numOutputRows shows value from last micro batch when there is no new data > - > > Key: SPARK-31278 > URL: https://issues.apache.org/jira/browse/SPARK-31278 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > In Structured Streaming, we provide progress updates every 10 seconds when a > stream doesn't have any new data upstream. When providing this progress > though, we zero out the input information but not the output information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data
[ https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-31278: --- Assignee: Burak Yavuz > numOutputRows shows value from last micro batch when there is no new data > - > > Key: SPARK-31278 > URL: https://issues.apache.org/jira/browse/SPARK-31278 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > In Structured Streaming, we provide progress updates every 10 seconds when a > stream doesn't have any new data upstream. When providing this progress > though, we zero out the input information but not the output information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data
[ https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077695#comment-17077695 ] Burak Yavuz commented on SPARK-31278: - Resolved by [https://github.com/apache/spark/pull/28040] > numOutputRows shows value from last micro batch when there is no new data > - > > Key: SPARK-31278 > URL: https://issues.apache.org/jira/browse/SPARK-31278 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > In Structured Streaming, we provide progress updates every 10 seconds when a > stream doesn't have any new data upstream. When providing this progress > though, we zero out the input information but not the output information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data
[ https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-31278: Fix Version/s: 3.0.0 > numOutputRows shows value from last micro batch when there is no new data > - > > Key: SPARK-31278 > URL: https://issues.apache.org/jira/browse/SPARK-31278 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > In Structured Streaming, we provide progress updates every 10 seconds when a > stream doesn't have any new data upstream. When providing this progress > though, we zero out the input information but not the output information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data
Burak Yavuz created SPARK-31278: --- Summary: numOutputRows shows value from last micro batch when there is no new data Key: SPARK-31278 URL: https://issues.apache.org/jira/browse/SPARK-31278 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Burak Yavuz In Structured Streaming, we provide progress updates every 10 seconds when a stream doesn't have any new data upstream. When providing this progress though, we zero out the input information but not the output information. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts
[ https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-31178. - Fix Version/s: 3.0.0 Assignee: Burak Yavuz Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/27941] > sql("INSERT INTO v2DataSource ...").collect() double inserts > > > Key: SPARK-31178 > URL: https://issues.apache.org/jira/browse/SPARK-31178 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Blocker > Fix For: 3.0.0 > > > The following unit test fails in DataSourceV2SQLSuite: > {code:java} > test("do not double insert on INSERT INTO collect()") { > import testImplicits._ > val t1 = s"${catalogAndNamespace}tbl" > sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format") > val tmpView = "test_data" > val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data") > df.createOrReplaceTempView(tmpView) > sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect() > verifyTable(t1, df) > } {code} > The INSERT INTO is double inserting when ".collect()" is called. I think this > is because the V2 SparkPlans are not commands, and doExecute on a Spark plan > can be called multiple times causing data to be inserted multiple times. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
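The linked fix turns such writes into command-style plans whose side effect runs at most once. The underlying pattern, sketched independently of Spark's internal class names:
{code:java}
// Execute-once pattern: the side-effecting write runs the first time the
// node is evaluated; a second action (e.g. another collect()) reuses the
// cached result instead of re-running the insert.
abstract class RunOnceCommand[T] {
  protected def run(): T          // performs the write, returns its result
  private lazy val cached: T = run()
  final def execute(): T = cached // idempotent from the caller's perspective
}
{code}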
[jira] [Commented] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts
[ https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061292#comment-17061292 ] Burak Yavuz commented on SPARK-31178: - cc [~wenchen] [~rdblue] > sql("INSERT INTO v2DataSource ...").collect() double inserts > > > Key: SPARK-31178 > URL: https://issues.apache.org/jira/browse/SPARK-31178 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > The following unit test fails in DataSourceV2SQLSuite: > {code:java} > test("do not double insert on INSERT INTO collect()") { > import testImplicits._ > val t1 = s"${catalogAndNamespace}tbl" > sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format") > val tmpView = "test_data" > val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data") > df.createOrReplaceTempView(tmpView) > sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect() > verifyTable(t1, df) > } {code} > The INSERT INTO is double inserting when ".collect()" is called. I think this > is because the V2 SparkPlans are not commands, and doExecute on a Spark plan > can be called multiple times causing data to be inserted multiple times. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts
Burak Yavuz created SPARK-31178: --- Summary: sql("INSERT INTO v2DataSource ...").collect() double inserts Key: SPARK-31178 URL: https://issues.apache.org/jira/browse/SPARK-31178 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz The following unit test fails in DataSourceV2SQLSuite: {code:java} test("do not double insert on INSERT INTO collect()") { import testImplicits._ val t1 = s"${catalogAndNamespace}tbl" sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format") val tmpView = "test_data" val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data") df.createOrReplaceTempView(tmpView) sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect() verifyTable(t1, df) } {code} The INSERT INTO is double inserting when ".collect()" is called. I think this is because the V2 SparkPlans are not commands, and doExecute on a Spark plan can be called multiple times causing data to be inserted multiple times. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
Burak Yavuz created SPARK-31061: --- Summary: Impossible to change the provider of a table in the HiveMetaStore Key: SPARK-31061 URL: https://issues.apache.org/jira/browse/SPARK-31061 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Currently, it's impossible to alter the datasource of a table in the HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change the provider table property during an alterTable command. This is required to support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
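For reference, the provider is persisted as a table property in the metastore alongside CatalogTable.provider; a sketch of what an alter path would need to propagate, assuming the "spark.sql.sources.provider" property key used by HiveExternalCatalog:
{code:java}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Sketch: changing a table's data source means updating both the in-memory
// provider field and the persisted table property, which alterTable should
// then write through to the metastore.
def withProvider(table: CatalogTable, newProvider: String): CatalogTable =
  table.copy(
    provider = Some(newProvider),
    properties = table.properties + ("spark.sql.sources.provider" -> newProvider))
{code}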
[jira] [Created] (SPARK-30924) Add additional validation into Merge Into
Burak Yavuz created SPARK-30924: --- Summary: Add additional validation into Merge Into Key: SPARK-30924 URL: https://issues.apache.org/jira/browse/SPARK-30924 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Merge Into is currently missing additional validation around:
1. The lack of any WHEN statements
2. Single use of UPDATE/DELETE
3. The first WHEN MATCHED statement needs to have a condition if there are two WHEN MATCHED statements.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
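Concretely, the new checks should reject statements like these — sketched through the SQL API with placeholder table names:
{code:java}
import org.apache.spark.sql.SparkSession

// Each of these MERGE statements should fail analysis once the validation
// above is in place (target/source are placeholders).
def invalidMergeExamples(spark: SparkSession): Unit = {
  // 1. No WHEN clauses at all:
  spark.sql("MERGE INTO target t USING source s ON t.id = s.id")

  // 2. UPDATE/DELETE used more than once:
  spark.sql("""MERGE INTO target t USING source s ON t.id = s.id
               WHEN MATCHED THEN DELETE
               WHEN MATCHED THEN DELETE""")

  // 3. First WHEN MATCHED has no condition, making the second unreachable:
  spark.sql("""MERGE INTO target t USING source s ON t.id = s.id
               WHEN MATCHED THEN UPDATE SET *
               WHEN MATCHED AND s.flag = 1 THEN DELETE""")
}
{code}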
[jira] [Resolved] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29908. - Resolution: Duplicate > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30814) Add Columns references should be able to resolve each other
[ https://issues.apache.org/jira/browse/SPARK-30814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036418#comment-17036418 ] Burak Yavuz commented on SPARK-30814: - cc [~cloud_fan] [~imback82], can we prioritize this over REPLACE COLUMNS if possible? > Add Columns references should be able to resolve each other > --- > > Key: SPARK-30814 > URL: https://issues.apache.org/jira/browse/SPARK-30814 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > In ResolveAlterTableChanges, we have checks that make sure that positional > arguments exist and are normalized around case sensitivity for ALTER TABLE > ADD COLUMNS. However, we missed the case where a column in ADD COLUMNS can > depend on the position of a column that is just being added. > For example, for the schema: > {code:java} > root: > - a: string > - b: long > {code} > > The following should work: > {code:java} > ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x) {code} > Currently, the above statement will throw an error saying that AFTER x cannot > be resolved, because x doesn't exist yet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30814) Add Columns references should be able to resolve each other
Burak Yavuz created SPARK-30814: --- Summary: Add Columns references should be able to resolve each other Key: SPARK-30814 URL: https://issues.apache.org/jira/browse/SPARK-30814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz In ResolveAlterTableChanges, we have checks that make sure that positional arguments exist and are normalized around case sensitivity for ALTER TABLE ADD COLUMNS. However, we missed the case where a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for the schema:
{code:java}
root:
 - a: string
 - b: long
{code}
The following should work:
{code:java}
ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x)
{code}
Currently, the above statement will throw an error saying that AFTER x cannot be resolved, because x doesn't exist yet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
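A small sketch of the incremental resolution this calls for: validate each AFTER reference against the existing schema plus the columns added earlier in the same statement. The helper below is illustrative, not Spark's actual rule:
{code:java}
// adds: (columnName, optional AFTER target), in statement order.
def validatePositions(existing: Seq[String], adds: Seq[(String, Option[String])]): Unit = {
  val visible = scala.collection.mutable.Set(existing: _*)
  adds.foreach { case (name, after) =>
    after.foreach { ref =>
      require(visible.contains(ref), s"Couldn't resolve AFTER $ref for column $name")
    }
    visible += name // later ADD COLUMNS entries may now refer to this column
  }
}

// validatePositions(Seq("a", "b"), Seq("x" -> Some("a"), "y" -> Some("x"))) // ok
// validatePositions(Seq("a", "b"), Seq("y" -> Some("x")))                   // fails
{code}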
[jira] [Assigned] (SPARK-30697) Handle database and namespace exceptions in catalog.isView
[ https://issues.apache.org/jira/browse/SPARK-30697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-30697: --- Assignee: Burak Yavuz > Handle database and namespace exceptions in catalog.isView > -- > > Key: SPARK-30697 > URL: https://issues.apache.org/jira/browse/SPARK-30697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > The non-existence of a database shouldn't throw a NoSuchDatabaseException > from catalog.isView -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30697) Handle database and namespace exceptions in catalog.isView
[ https://issues.apache.org/jira/browse/SPARK-30697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30697. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [#27423|https://github.com/apache/spark/pull/27423] > Handle database and namespace exceptions in catalog.isView > -- > > Key: SPARK-30697 > URL: https://issues.apache.org/jira/browse/SPARK-30697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > The non-existence of a database shouldn't throw a NoSuchDatabaseException > from catalog.isView -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30697) Handle database and namespace exceptions in catalog.isView
Burak Yavuz created SPARK-30697: --- Summary: Handle database and namespace exceptions in catalog.isView Key: SPARK-30697 URL: https://issues.apache.org/jira/browse/SPARK-30697 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz The non-existence of a database shouldn't throw a NoSuchDatabaseException from catalog.isView -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-30669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30669. - Fix Version/s: 3.0.0 Assignee: Burak Yavuz Resolution: Done Resolved by [https://github.com/apache/spark/pull/27380] > Introduce AdmissionControl API to Structured Streaming > -- > > Key: SPARK-30669 > URL: https://issues.apache.org/jira/browse/SPARK-30669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > In Structured Streaming, we have the concept of Triggers. With a trigger like > Trigger.Once(), the semantics are to process all the data available to the > datasource in a single micro-batch. However, this semantic can be broken when > data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate > limit the amount of data read for that micro-batch. > We propose to add new interfaces `SupportsAdmissionControl` and `ReadLimit`. > A ReadLimit defines how much data should be read in the next micro-batch. > `SupportsAdmissionControl` specifies that a source can rate limit its ingest > into the system. The source can tell the system what the user specified as a > read limit, and the system can enforce this limit within each micro-batch or > impose its own limit if the Trigger is Trigger.Once(), for example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
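A sketch of a source plugging into the new API, assuming the SupportsAdmissionControl/ReadLimit shapes described above — check the exact signatures against the merged PR:
{code:java}
import org.apache.spark.sql.connector.read.streaming.{Offset, ReadLimit, SupportsAdmissionControl}

// Illustrative source that defaults to at most 1000 rows per micro-batch.
// The engine may pass a different limit, e.g. all-available for Trigger.Once().
abstract class RateLimitedSource extends SupportsAdmissionControl {
  override def getDefaultReadLimit(): ReadLimit = ReadLimit.maxRows(1000L)

  override def latestOffset(start: Offset, limit: ReadLimit): Offset =
    advance(start, limit) // end offset for the next micro-batch, honoring `limit`

  protected def advance(start: Offset, limit: ReadLimit): Offset
}
{code}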
[jira] [Created] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming
Burak Yavuz created SPARK-30669: --- Summary: Introduce AdmissionControl API to Structured Streaming Key: SPARK-30669 URL: https://issues.apache.org/jira/browse/SPARK-30669 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.4 Reporter: Burak Yavuz In Structured Streaming, we have the concept of Triggers. With a trigger like Trigger.Once(), the semantics are to process all the data available to the datasource in a single micro-batch. However, this semantic can be broken when data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate limit the amount of data read for that micro-batch. We propose to add new interfaces `SupportsAdmissionControl` and `ReadLimit`. A ReadLimit defines how much data should be read in the next micro-batch. `SupportsAdmissionControl` specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch or impose its own limit if the Trigger is Trigger.Once(), for example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation
[ https://issues.apache.org/jira/browse/SPARK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30314. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26957] > Add identifier and catalog information to DataSourceV2Relation > -- > > Key: SPARK-30314 > URL: https://issues.apache.org/jira/browse/SPARK-30314 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuchen Huo >Assignee: Yuchen Huo >Priority: Major > Fix For: 3.0.0 > > > Add identifier and catalog information to DataSourceV2Relation so it would be > possible to do richer checks in the *checkAnalysis* step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation
[ https://issues.apache.org/jira/browse/SPARK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-30314: --- Assignee: Yuchen Huo > Add identifier and catalog information to DataSourceV2Relation > -- > > Key: SPARK-30314 > URL: https://issues.apache.org/jira/browse/SPARK-30314 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuchen Huo >Assignee: Yuchen Huo >Priority: Major > > Add identifier and catalog information to DataSourceV2Relation so it would be > possible to do richer checks in the *checkAnalysis* step. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables
[ https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022678#comment-17022678 ] Burak Yavuz commented on SPARK-30612: - I prefer SubqueryAlias. We need to support all degrees of the user-provided identifier, I believe:
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
should all work. However, I'm not sure if SELECT spark_catalog.default.tbl.foo FROM tbl should work. Are my assumptions correct? > can't resolve qualified column name with v2 tables > -- > > Key: SPARK-30612 > URL: https://issues.apache.org/jira/browse/SPARK-30612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > When running queries with qualified columns like `SELECT t.a FROM t`, it > fails to resolve for v2 tables. > v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We > should do the same for v2 tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable
[ https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022577#comment-17022577 ] Burak Yavuz commented on SPARK-30615: - I actually had a PR in progress on this. Let me push that. > normalize the column name in AlterTable > --- > > Key: SPARK-30615 > URL: https://issues.apache.org/jira/browse/SPARK-30615 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Because of case insensitive resolution, the column names in AlterTable may > match the table schema but not be exactly the same. To ease DS v2 > implementations, Spark should normalize the column names before passing them > to v2 catalogs, so that users don't need to care about the case sensitivity > config. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables
[ https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022343#comment-17022343 ] Burak Yavuz commented on SPARK-30612: - SPARK-30314 should help make this work easier > can't resolve qualified column name with v2 tables > -- > > Key: SPARK-30612 > URL: https://issues.apache.org/jira/browse/SPARK-30612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > When running queries with qualified columns like `SELECT t.a FROM t`, it > fails to resolve for v2 tables. > v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We > should do the same for v2 tables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29219. - Fix Version/s: 3.0.0 Resolution: Done Resolved by [https://github.com/apache/spark/pull/26913] > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We currently don't support all save modes in DataFrameWriter.save as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringMap options. This information can then be used to check the > existence of a table, and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: CaseInsensitiveStringMap): String
>   def extractIdentifier(options: CaseInsensitiveStringMap): Identifier
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
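An illustrative provider-side implementation of the proposed trait — hypothetical, since the trait itself is only sketched above:
{code:java}
import org.apache.spark.sql.connector.catalog.Identifier
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical: resolve an explicit "catalog" option, else fall back to the
// session catalog; derive the identifier from the "path" option.
object PathBasedCatalogOptions {
  def extractCatalog(options: CaseInsensitiveStringMap): String =
    Option(options.get("catalog")).getOrElse("spark_catalog")

  def extractIdentifier(options: CaseInsensitiveStringMap): Identifier = {
    val path = options.get("path")
    require(path != null, "'path' must be set for path-based tables")
    Identifier.of(Array("default"), path)
  }
}
{code}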
[jira] [Assigned] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-29219: --- Assignee: Burak Yavuz > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > We currently don't support all save modes in DataFrameWriter.save as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringMap options. This information can then be used to check the > existence of a table, and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: CaseInsensitiveStringMap): String
>   def extractIdentifier(options: CaseInsensitiveStringMap): Identifier
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark
[ https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-30334: Description: Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml. The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields: - format: The format of the semi-structured column, e.g. json, xml, avro - options: Options for parsing these columns Then imagine having the following data:
{code:java}
+------------+-------+-------------------+
| ts         | event | raw               |
+------------+-------+-------------------+
| 2019-10-12 | click | {"field":"value"} |
+------------+-------+-------------------+
{code}
SELECT raw.field FROM data will return "value" or the following data
{code:java}
+------------+-------+---------------------+
| ts         | event | raw                 |
+------------+-------+---------------------+
| 2019-10-12 | click | field1=v1|field2=v2 |
+------------+-------+---------------------+
{code}
SELECT raw.field1 FROM data will return v1. As a first step, we will introduce the function "as_json", which accomplishes this for JSON columns.
was: Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml. The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields: - format: The format of the semi-structured column, e.g. json, xml, avro - options: Options for parsing these columns Then imagine having the following data:
{code:java}
+------------+-------+-------------------+
| ts         | event | raw               |
+------------+-------+-------------------+
| 2019-10-12 | click | {"field":"value"} |
+------------+-------+-------------------+
{code}
SELECT raw.field FROM data will return "value" or the following data
{code:java}
+------------+-------+---------------------+
| ts         | event | raw                 |
+------------+-------+---------------------+
| 2019-10-12 | click | field1=v1|field2=v2 |
+------------+-------+---------------------+
{code}
SELECT raw.field1 FROM data will return v1.
> Add metadata around semi-structured columns to Spark > > > Key: SPARK-30334 > URL: https://issues.apache.org/jira/browse/SPARK-30334 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Priority: Major > > Semi-structured data is used widely in the data industry for reporting events > in a wide variety of formats. Click events in product analytics can be stored > as json. Some application logs can be in the form of delimited key=value > text. Some data may be in xml. > The goal of this project is to be able to signal Spark that such a column > exists. This will then enable Spark to "auto-parse" these columns on the fly. > The proposal is to store this information as part of the column metadata, in > the fields: > - format: The format of the semi-structured column, e.g. json, xml, avro > - options: Options for parsing these columns > Then imagine having the following data:
> {code:java}
> +------------+-------+-------------------+
> | ts         | event | raw               |
> +------------+-------+-------------------+
> | 2019-10-12 | click | {"field":"value"} |
> +------------+-------+-------------------+
> {code}
> SELECT raw.field FROM data > will return "value" > or the following data
> {code:java}
> +------------+-------+---------------------+
> | ts         | event | raw                 |
> +------------+-------+---------------------+
> | 2019-10-12 | click | field1=v1|field2=v2 |
> +------------+-------+---------------------+
> {code}
> SELECT raw.field1 FROM data > will return v1. > > As a first step, we will introduce the function "as_json", which accomplishes > this for JSON columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
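Spark's existing column-metadata machinery could carry the proposed fields; a sketch using the public MetadataBuilder API. The "format"/"options" keys below are the proposal's, not an existing convention — nothing in Spark reads them today:
{code:java}
import org.apache.spark.sql.types._

// Tag the raw column as JSON via column metadata.
val rawMeta = new MetadataBuilder()
  .putString("format", "json")
  .putString("options", """{"allowComments":"true"}""")
  .build()

val schema = StructType(Seq(
  StructField("ts", TimestampType),
  StructField("event", StringType),
  StructField("raw", StringType, nullable = true, metadata = rawMeta)))
{code}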
[jira] [Created] (SPARK-30334) Add metadata around semi-structured columns to Spark
Burak Yavuz created SPARK-30334: --- Summary: Add metadata around semi-structured columns to Spark Key: SPARK-30334 URL: https://issues.apache.org/jira/browse/SPARK-30334 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.4 Reporter: Burak Yavuz Semi-structured data is used widely in the data industry for reporting events in a wide variety of formats. Click events in product analytics can be stored as json. Some application logs can be in the form of delimited key=value text. Some data may be in xml. The goal of this project is to be able to signal Spark that such a column exists. This will then enable Spark to "auto-parse" these columns on the fly. The proposal is to store this information as part of the column metadata, in the fields: - format: The format of the semi-structured column, e.g. json, xml, avro - options: Options for parsing these columns Then imagine having the following data:
{code:java}
+------------+-------+-------------------+
| ts         | event | raw               |
+------------+-------+-------------------+
| 2019-10-12 | click | {"field":"value"} |
+------------+-------+-------------------+
{code}
SELECT raw.field FROM data will return "value" or the following data
{code:java}
+------------+-------+---------------------+
| ts         | event | raw                 |
+------------+-------+---------------------+
| 2019-10-12 | click | field1=v1|field2=v2 |
+------------+-------+---------------------+
{code}
SELECT raw.field1 FROM data will return v1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL
Burak Yavuz created SPARK-30324: --- Summary: Simplify API for JSON access in DataFrames/SQL Key: SPARK-30324 URL: https://issues.apache.org/jira/browse/SPARK-30324 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.4 Reporter: Burak Yavuz get_json_object() is a function to extract JSON fields. It is verbose and hard to use; e.g. I wasn't expecting the path to a field to have to start with "$.". We can simplify all of this when a column is of StringType and a nested field is requested. This API sugar will be rewritten by the query planner into get_json_object. This nested access can then be extended in the future to other semi-structured formats. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
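Today's verbose form next to the proposed sugar — the sugared line is hypothetical syntax, shown as a comment:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

def example(spark: SparkSession): Unit = {
  import spark.implicits._
  val df = Seq("""{"field":"value"}""").toDF("raw")

  // Today: explicit function call with a "$."-prefixed JSONPath.
  df.select(get_json_object(col("raw"), "$.field")).show()

  // Proposed sugar (hypothetical), rewritten by the planner into the above:
  // df.select(col("raw.field")).show()
}
{code}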
[jira] [Resolved] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
[ https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30143. - Fix Version/s: 3.0.0 Resolution: Done Resolved as part of [https://github.com/apache/spark/pull/26771] > StreamingQuery.stop() should not block indefinitely > --- > > Key: SPARK-30143 > URL: https://issues.apache.org/jira/browse/SPARK-30143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > The stop() method on a Streaming Query awaits the termination of the stream > execution thread. However, the stream execution thread may block forever > depending on the streaming source implementation (like in Kafka, which runs > UninterruptibleThreads). > This causes control flow applications to hang indefinitely as well. We'd like > to introduce a timeout to stop the execution thread, so that the control flow > thread can decide to do an action if a timeout is hit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
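With the linked change, the wait is bounded; a sketch of the resulting control-flow pattern, assuming the timeout is exposed through the spark.sql.streaming.stopTimeout conf and that stop() throws TimeoutException when it is exceeded — verify both against your Spark version:
{code:java}
import java.util.concurrent.TimeoutException
import org.apache.spark.sql.streaming.StreamingQuery

// Bound how long stop() blocks on the stream-execution thread; on timeout the
// control-flow thread can escalate instead of hanging forever.
def stopWithDeadline(query: StreamingQuery): Unit = {
  query.sparkSession.conf.set("spark.sql.streaming.stopTimeout", "10s")
  try query.stop()
  catch {
    case _: TimeoutException =>
      System.err.println(s"Query ${query.id} did not stop in time; escalating")
  }
}
{code}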
[jira] [Assigned] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
[ https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-30143: --- Assignee: Burak Yavuz > StreamingQuery.stop() should not block indefinitely > --- > > Key: SPARK-30143 > URL: https://issues.apache.org/jira/browse/SPARK-30143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > The stop() method on a Streaming Query awaits the termination of the stream > execution thread. However, the stream execution thread may block forever > depending on the streaming source implementation (like in Kafka, which runs > UninterruptibleThreads). > This causes control flow applications to hang indefinitely as well. We'd like > to introduce a timeout to stop the execution thread, so that the control flow > thread can decide to do an action if a timeout is hit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30241) Make the target table relations not children of DELETE/UPDATE/MERGE
[ https://issues.apache.org/jira/browse/SPARK-30241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30241. - Resolution: Not A Problem > Make the target table relations not children of DELETE/UPDATE/MERGE > --- > > Key: SPARK-30241 > URL: https://issues.apache.org/jira/browse/SPARK-30241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > Having the targets of DELETE/UPDATE/MERGE as children can cause bad > interactions with other catalyst rules that may want to perform actions > assuming a child is a "read-only" relation. This is precisely why AlterTable > and DescribeTable have the targets as variables only and not children. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30241) Make the target table relations not children of DELETE/UPDATE/MERGE
Burak Yavuz created SPARK-30241: --- Summary: Make the target table relations not children of DELETE/UPDATE/MERGE Key: SPARK-30241 URL: https://issues.apache.org/jira/browse/SPARK-30241 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Assignee: Burak Yavuz Having the targets of DELETE/UPDATE/MERGE as children can cause bad interactions with other catalyst rules that may want to perform actions assuming a child is a "read-only" relation. This is precisely why AlterTable and DescribeTable have the targets as variables only and not children. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30143) StreamingQuery.stop() should not block indefinitely
Burak Yavuz created SPARK-30143: --- Summary: StreamingQuery.stop() should not block indefinitely Key: SPARK-30143 URL: https://issues.apache.org/jira/browse/SPARK-30143 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.4 Reporter: Burak Yavuz The stop() method on a Streaming Query awaits the termination of the stream execution thread. However, the stream execution thread may block forever depending on the streaming source implementation (like in Kafka, which runs UninterruptibleThreads). This causes control flow applications to hang indefinitely as well. We'd like to introduce a timeout to stop the execution thread, so that the control flow thread can decide to do an action if a timeout is hit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974436#comment-16974436 ] Burak Yavuz commented on SPARK-29900: - I definitely agree the behavior is very confusing here. (For example, you can saveAsTable into a table, while a temp table with the same name exists... Once you query the table, you get the temp table back). Can we post here the proposed behavior? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up the temp view first, then try table/persistent view. > 2. try to look up the table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are a temp view and a table > with the same name, but users can easily use a qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent:
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR: "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR: cannot delete from view "t"
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
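The saveAsTable case from the comment above, spelled out. The APIs are real; the point is how confusing the status-quo resolution is:
{code:java}
import org.apache.spark.sql.SparkSession

def shadowingExample(spark: SparkSession): Unit = {
  import spark.implicits._

  Seq(1).toDF("i").createOrReplaceTempView("t")   // temp view named t
  Seq(2, 3).toDF("i").write.saveAsTable("t")      // writes the persistent table t

  spark.table("t").show()          // resolves the temp view first: one row, 1
  spark.table("default.t").show()  // the qualified name reaches the table
}
{code}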
[jira] [Created] (SPARK-29778) saveAsTable append mode is not passing writer options
Burak Yavuz created SPARK-29778: --- Summary: saveAsTable append mode is not passing writer options Key: SPARK-29778 URL: https://issues.apache.org/jira/browse/SPARK-29778 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz There is an oversight where AppendData does not get the WriterOptions in saveAsTable: [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29568) Add flag to stop existing stream when new copy starts
Burak Yavuz created SPARK-29568: --- Summary: Add flag to stop existing stream when new copy starts Key: SPARK-29568 URL: https://issues.apache.org/jira/browse/SPARK-29568 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Burak Yavuz In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This causes all new instantiations of the stream to fail. However, sometimes you may want to turn off the old stream, as the old stream may have turned into a zombie (you no longer have access to the query handle or SparkSession). It would be nice to have a SQL flag that allows stopping the old stream in such zombie cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
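A sketch of how such a flag could be used; it assumes the `spark.sql.streaming.stopActiveRunOnRestart` configuration that this issue ultimately introduced (paths and query shape are illustrative):
{code:scala}
// Assumed flag from this issue: when a new run starts with the same
// checkpoint location, stop the existing (possibly zombie) run instead
// of failing the new one.
spark.conf.set("spark.sql.streaming.stopActiveRunOnRestart", "true")

val restarted = df.writeStream                       // df: a streaming DataFrame
  .format("parquet")
  .option("path", "/tmp/out/events")
  .option("checkpointLocation", "/tmp/ckpt/events")  // same checkpoint as the zombie run
  .start()
{code}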
[jira] [Resolved] (SPARK-29352) Move active streaming query state to the SharedState
[ https://issues.apache.org/jira/browse/SPARK-29352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29352. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26018] > Move active streaming query state to the SharedState > > > Key: SPARK-29352 > URL: https://issues.apache.org/jira/browse/SPARK-29352 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We have checks to prevent the restarting of the same stream on the same Spark > session, but we can make that better in multi-tenant environments by > putting that state in the SharedState instead of the SessionState. This > would allow a more comprehensive check for multi-tenant clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29352) Move active streaming query state to the SharedState
[ https://issues.apache.org/jira/browse/SPARK-29352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-29352: --- Assignee: Burak Yavuz > Move active streaming query state to the SharedState > > > Key: SPARK-29352 > URL: https://issues.apache.org/jira/browse/SPARK-29352 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > We have checks to prevent the restarting of the same stream on the same Spark > session, but we can make that better in multi-tenant environments by > putting that state in the SharedState instead of the SessionState. This > would allow a more comprehensive check for multi-tenant clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29352) Move active streaming query state to the SharedState
Burak Yavuz created SPARK-29352: --- Summary: Move active streaming query state to the SharedState Key: SPARK-29352 URL: https://issues.apache.org/jira/browse/SPARK-29352 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.4, 3.0.0 Reporter: Burak Yavuz We have checks to prevent the restarting of the same stream on the same Spark session, but we can make that better in multi-tenant environments by putting that state in the SharedState instead of the SessionState. This would allow a more comprehensive check for multi-tenant clusters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-29219: Description: We currently don't support all save modes in DataFrameWriter.save as the TableProvider interface allows for the reading/writing of data, but not for the creation of tables. We created a catalog API to support the creation/dropping/checking existence of tables, but DataFrameWriter.save doesn't necessarily use a catalog, for example, when writing to a path-based table. For this case, we propose a new interface that will allow TableProviders to extract an Identifier and a Catalog from a bundle of CaseInsensitiveStringOptions. This information can then be used to check the existence of a table, and support all save modes. If a Catalog is not defined, then the behavior is to use the spark_catalog (or configured session catalog) to perform the check. The interface can look like:
{code:java}
trait CatalogOptions {
  def extractCatalog(options: CaseInsensitiveStringMap): String
  def extractIdentifier(options: CaseInsensitiveStringMap): Identifier
}
{code}
was: We currently don't support all save modes in DataFrameWriter.save as the TableProvider interface allows for the reading/writing of data, but not for the creation of tables. We created a catalog API to support the creation/dropping/checking existence of tables, but DataFrameWriter.save doesn't necessarily use a catalog, for example, when writing to a path-based table. For this case, we propose a new interface that will allow TableProviders to extract an Identifier and a Catalog from a bundle of CaseInsensitiveStringOptions. This information can then be used to check the existence of a table, and support all save modes. If a Catalog is not defined, then the behavior is to use the spark_catalog (or configured session catalog) to perform the check. > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Major > > We currently don't support all save modes in DataFrameWriter.save as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example, when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringOptions. This information can then be used to check the > existence of a table, and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: CaseInsensitiveStringMap): String
>   def extractIdentifier(options: CaseInsensitiveStringMap): Identifier
> }
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
Burak Yavuz created SPARK-29219: --- Summary: DataSourceV2: Support all SaveModes in DataFrameWriter.save Key: SPARK-29219 URL: https://issues.apache.org/jira/browse/SPARK-29219 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz We currently don't support all save modes in DataFrameWriter.save as the TableProvider interface allows for the reading/writing of data, but not for the creation of tables. We created a catalog API to support the creation/dropping/checking existence of tables, but DataFrameWriter.save doesn't necessarily use a catalog, for example, when writing to a path-based table. For this case, we propose a new interface that will allow TableProviders to extract an Identifier and a Catalog from a bundle of CaseInsensitiveStringOptions. This information can then be used to check the existence of a table, and support all save modes. If a Catalog is not defined, then the behavior is to use the spark_catalog (or configured session catalog) to perform the check. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935498#comment-16935498 ] Burak Yavuz commented on SPARK-29197: - Hi Ido, Thanks for your interest. I already have a PR on it right now. It's not going to be a straightforward task, and requires some context and future plans around DataSource V2 as well. > Remove saveModeForDSV2 in DataFrameWriter > - > > Key: SPARK-29197 > URL: https://issues.apache.org/jira/browse/SPARK-29197 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > It is very confusing that the default save mode differs depending on the > internal implementation of a data source. The reason that we had to have > saveModeForDSV2 was that there was no easy way to check the existence of a > Table in DataSource v2. Now, we have catalogs for that. Therefore we should > be able to remove the different save modes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter
Burak Yavuz created SPARK-29197: --- Summary: Remove saveModeForDSV2 in DataFrameWriter Key: SPARK-29197 URL: https://issues.apache.org/jira/browse/SPARK-29197 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz It is very confusing that the default save mode differs depending on the internal implementation of a data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
[ https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-28612. - Fix Version/s: 3.0.0 Assignee: Ryan Blue Resolution: Done Remerged in [https://github.com/apache/spark/pull/25681] > DataSourceV2: Add new DataFrameWriter API for v2 > > > Key: SPARK-28612 > URL: https://issues.apache.org/jira/browse/SPARK-28612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartitionsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === "2019-01-01") // OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29030) Simplify lookupV2Relation
[ https://issues.apache.org/jira/browse/SPARK-29030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29030. - Fix Version/s: 3.0.0 Assignee: John Zhuge Resolution: Done Resolved by [https://github.com/apache/spark/pull/25735] > Simplify lookupV2Relation > - > > Key: SPARK-29030 > URL: https://issues.apache.org/jira/browse/SPARK-29030 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Minor > Fix For: 3.0.0 > > > Simplify the return type for {{lookupV2Relation}} which makes the 3 callers > more straightforward. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save
Burak Yavuz created SPARK-29127: --- Summary: Support partitioning for DataSource V2 tables in DataFrameWriter.save Key: SPARK-29127 URL: https://issues.apache.org/jira/browse/SPARK-29127 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Currently, any data source that upgrades to DataSource V2 loses the partition transform information when using DataFrameWriter.save. The main reason is the lack of an API for "creating" a table with partitioning and schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
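A sketch of the symptom (the source name and path are illustrative): for a source whose provider has moved to DataSource V2, the partitioning below is silently dropped by save because there is no catalog-free table-creation API to carry it:
{code:scala}
// df: an existing DataFrame
df.write
  .format("somev2source")       // hypothetical V2 data source
  .partitionBy("date", "type")  // transform info lost on the V2 save path
  .save("/tmp/path-based-table")
{code}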
[jira] [Created] (SPARK-29062) Add V1_BATCH_WRITE to the TableCapabilityChecks in the Analyzer
Burak Yavuz created SPARK-29062: --- Summary: Add V1_BATCH_WRITE to the TableCapabilityChecks in the Analyzer Key: SPARK-29062 URL: https://issues.apache.org/jira/browse/SPARK-29062 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Currently the checks in the Analyzer require that V2 Tables have BATCH_WRITE defined for all tables that have V1 Write fallbacks. This is confusing as these tables may not have the V2 writer interface implemented yet. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
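For context, a sketch of a V2 table that only writes through the V1 fallback, assuming Spark 3.0's connector API (`Table`, `TableCapability`); before this fix the analyzer also demanded BATCH_WRITE from such tables:
{code:scala}
import java.util
import scala.collection.JavaConverters._
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}

// name()/schema() and the rest of Table are omitted for brevity;
// the capability set is the point: no BATCH_WRITE, only the V1 fallback.
abstract class V1FallbackTable extends Table {
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.V1_BATCH_WRITE).asJava
}
{code}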
[jira] [Created] (SPARK-28964) saveAsTable doesn't pass in the data source information for V2 nodes
Burak Yavuz created SPARK-28964: --- Summary: saveAsTable doesn't pass in the data source information for V2 nodes Key: SPARK-28964 URL: https://issues.apache.org/jira/browse/SPARK-28964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz We need to pass in the "provider" property in saveAsTable for V2 nodes in DataFrameWriter. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28628) Support namespaces in V2SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-28628. - Fix Version/s: 3.0.0 Resolution: Fixed > Support namespaces in V2SessionCatalog > -- > > Key: SPARK-28628 > URL: https://issues.apache.org/jira/browse/SPARK-28628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > V2SessionCatalog should implement SupportsNamespaces. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28628) Support namespaces in V2SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921705#comment-16921705 ] Burak Yavuz commented on SPARK-28628: - Resolved by [https://github.com/apache/spark/pull/25363] > Support namespaces in V2SessionCatalog > -- > > Key: SPARK-28628 > URL: https://issues.apache.org/jira/browse/SPARK-28628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > V2SessionCatalog should implement SupportsNamespaces. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28628) Support namespaces in V2SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-28628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-28628: --- Assignee: Ryan Blue > Support namespaces in V2SessionCatalog > -- > > Key: SPARK-28628 > URL: https://issues.apache.org/jira/browse/SPARK-28628 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > V2SessionCatalog should implement SupportsNamespaces. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
[ https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-28612. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/25354] > DataSourceV2: Add new DataFrameWriter API for v2 > > > Key: SPARK-28612 > URL: https://issues.apache.org/jira/browse/SPARK-28612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartitionsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === "2019-01-01") // OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
[ https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-28612: --- Assignee: Ryan Blue > DataSourceV2: Add new DataFrameWriter API for v2 > > > Key: SPARK-28612 > URL: https://issues.apache.org/jira/browse/SPARK-28612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartitionsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === "2019-01-01") // OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28863) Add an AlreadyPlanned logical node that skips query planning
Burak Yavuz created SPARK-28863: --- Summary: Add an AlreadyPlanned logical node that skips query planning Key: SPARK-28863 URL: https://issues.apache.org/jira/browse/SPARK-28863 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz With the DataSourceV2 write operations, we have a way to fallback to the V1 writer APIs using InsertableRelation. The gross part is that we're in physical land, but the InsertableRelation takes a logical plan, so we have to pass the logical plans to these physical nodes, and then potentially go through re-planning. A useful primitive could be specifying that a plan is ready for execution through a logical node AlreadyPlanned. This would wrap a physical plan, and then we can go straight to execution. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
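A minimal sketch of the proposed node (names are assumed from the description above, not necessarily the merged implementation): a logical leaf that carries a finished physical plan so planning can pass it straight through to execution:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.SparkPlan

// Wraps an already-built physical plan; the planner's only job is to
// unwrap it, so no re-planning of the child can occur.
case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
  override def output: Seq[Attribute] = physicalPlan.output
}
{code}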
[jira] [Commented] (SPARK-28635) create CatalogManager to track registered v2 catalogs
[ https://issues.apache.org/jira/browse/SPARK-28635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912621#comment-16912621 ] Burak Yavuz commented on SPARK-28635: - Also followed up by [https://github.com/apache/spark/pull/25521] > create CatalogManager to track registered v2 catalogs > - > > Key: SPARK-28635 > URL: https://issues.apache.org/jira/browse/SPARK-28635 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-28565: --- Assignee: Burak Yavuz (was: John Zhuge) > DataSourceV2: DataFrameWriter.saveAsTable > - > > Key: SPARK-28565 > URL: https://issues.apache.org/jira/browse/SPARK-28565 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Support multiple catalogs in the following use cases: > * DataFrameWriter.saveAsTable("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-28565. - Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/25330] > DataSourceV2: DataFrameWriter.saveAsTable > - > > Key: SPARK-28565 > URL: https://issues.apache.org/jira/browse/SPARK-28565 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Support multiple catalogs in the following use cases: > * DataFrameWriter.saveAsTable("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28668) Support the V2SessionCatalog with AlterTable commands
Burak Yavuz created SPARK-28668: --- Summary: Support the V2SessionCatalog with AlterTable commands Key: SPARK-28668 URL: https://issues.apache.org/jira/browse/SPARK-28668 Project: Spark Issue Type: Planned Work Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz We need to support the V2SessionCatalog with AlterTable commands so that V2 DataSources can leverage DDL through SQL ALTER TABLE commands. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28667) Support the V2SessionCatalog in insertInto
Burak Yavuz created SPARK-28667: --- Summary: Support the V2SessionCatalog in insertInto Key: SPARK-28667 URL: https://issues.apache.org/jira/browse/SPARK-28667 Project: Spark Issue Type: Planned Work Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz We need to support the V2SessionCatalog in the INSERT INTO SQL code paths as well as the DataFrameWriter code paths. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28666) Support the V2SessionCatalog in saveAsTable
Burak Yavuz created SPARK-28666: --- Summary: Support the V2SessionCatalog in saveAsTable Key: SPARK-28666 URL: https://issues.apache.org/jira/browse/SPARK-28666 Project: Spark Issue Type: Planned Work Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz We need to support the V2SessionCatalog in the old saveAsTable code paths so that V2 DataSources can leverage the old DataFrameWriter code path. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28331) Catalogs.load always throws CatalogNotFoundException on loading built-in catalogs
[ https://issues.apache.org/jira/browse/SPARK-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-28331. - Resolution: Fixed Fix Version/s: 3.0.0 Resolved with [https://github.com/apache/spark/pull/25348] > Catalogs.load always throws CatalogNotFoundException on loading built-in > catalogs > - > > Key: SPARK-28331 > URL: https://issues.apache.org/jira/browse/SPARK-28331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > In `Catalogs.load`, the `pluginClassName` in the following code
> ```
> String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null);
> ```
> is always null for built-in catalogs, e.g. there is a SQLConf entry for > `spark.sql.catalog.session`. > This is because of https://github.com/apache/spark/pull/18852: > SQLConf.conf.getConfString(key, null) always returns null. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28331) Catalogs.load always throws CatalogNotFoundException on loading built-in catalogs
[ https://issues.apache.org/jira/browse/SPARK-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-28331: --- Assignee: Gengliang Wang > Catalogs.load always throws CatalogNotFoundException on loading built-in > catalogs > - > > Key: SPARK-28331 > URL: https://issues.apache.org/jira/browse/SPARK-28331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In `Catalogs.load`, the `pluginClassName` in the following code
> ```
> String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null);
> ```
> is always null for built-in catalogs, e.g. there is a SQLConf entry for > `spark.sql.catalog.session`. > This is because of https://github.com/apache/spark/pull/18852: > SQLConf.conf.getConfString(key, null) always returns null. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs
[ https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-27661. - Resolution: Fixed Fix Version/s: 3.0.0 Resolved by [https://github.com/apache/spark/pull/24560] > Add SupportsNamespaces interface for v2 catalogs > > > Key: SPARK-27661 > URL: https://issues.apache.org/jira/browse/SPARK-27661 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Some catalogs support namespace operations, like creating or dropping > namespaces. The v2 API should have a way to expose these operations to Spark. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs
[ https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-27661: --- Assignee: Ryan Blue > Add SupportsNamespaces interface for v2 catalogs > > > Key: SPARK-27661 > URL: https://issues.apache.org/jira/browse/SPARK-27661 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > Some catalogs support namespace operations, like creating or dropping > namespaces. The v2 API should have a way to expose these operations to Spark. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28572) Add simple analysis checks to the V2 Create Table code paths
Burak Yavuz created SPARK-28572: --- Summary: Add simple analysis checks to the V2 Create Table code paths Key: SPARK-28572 URL: https://issues.apache.org/jira/browse/SPARK-28572 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Currently, the V2 Create Table code paths don't have any checks around:
# The existence of transforms in the table schema
# Duplications of transforms
# Case sensitivity checks around column names
Having these rudimentary checks would simplify V2 Catalog development. Note that the goal of this JIRA is not to make V2 Create Table Hive-compatible. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-27845. - Resolution: Done Fix Version/s: 3.0.0 > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 3.0.0 > > > Support multiple catalogs in the following use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-27845: --- Assignee: John Zhuge > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > > Support multiple catalogs in the following use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893162#comment-16893162 ] Burak Yavuz commented on SPARK-27845: - Resolved with [https://github.com/apache/spark/pull/24832] > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > > Support multiple catalogs in the following use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622828#comment-16622828 ] Burak Yavuz commented on SPARK-25472: - Resolved by https://github.com/apache/spark/pull/22478 > Structured Streaming query.stop() doesn't always stop gracefully > > > Key: SPARK-25472 > URL: https://issues.apache.org/jira/browse/SPARK-25472 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.5.0 > > > We can have race conditions where the cancelling of Spark jobs will throw a > SparkException when stopping a streaming query. This SparkException specifies > that the job was cancelled. We can use this error message to swallow the > error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-25472. - Resolution: Fixed Fix Version/s: 2.5.0 > Structured Streaming query.stop() doesn't always stop gracefully > > > Key: SPARK-25472 > URL: https://issues.apache.org/jira/browse/SPARK-25472 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.5.0 > > > We can have race conditions where the cancelling of Spark jobs will throw a > SparkException when stopping a streaming query. This SparkException specifies > that the job was cancelled. We can use this error message to swallow the > error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-25472: --- Assignee: Burak Yavuz > Structured Streaming query.stop() doesn't always stop gracefully > > > Key: SPARK-25472 > URL: https://issues.apache.org/jira/browse/SPARK-25472 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > We can have race conditions where the cancelling of Spark jobs will throw a > SparkException when stopping a streaming query. This SparkException specifies > that the job was cancelled. We can use this error message to swallow the > error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully
Burak Yavuz created SPARK-25472: --- Summary: Structured Streaming query.stop() doesn't always stop gracefully Key: SPARK-25472 URL: https://issues.apache.org/jira/browse/SPARK-25472 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Burak Yavuz We can have race conditions where the cancelling of Spark jobs will throw a SparkException when stopping a streaming query. This SparkException specifies that the job was cancelled. We can use this error message to swallow the error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
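The swallowing described in the report can be sketched as follows (the message check is an assumption based on the description, not the merged patch):
{code:scala}
import org.apache.spark.SparkException

// query: an org.apache.spark.sql.streaming.StreamingQuery being stopped
try {
  query.stop()
} catch {
  case e: SparkException if e.getMessage != null && e.getMessage.contains("cancelled") =>
    // Our own stop() cancelled the in-flight streaming jobs; this is a
    // graceful shutdown, not a failure, so swallow the exception.
}
{code}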
[jira] [Resolved] (SPARK-24525) Provide an option to limit MemorySink memory usage
[ https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-24525. - Resolution: Fixed Fix Version/s: 2.4.0 Resolved by [https://github.com/apache/spark/pull/21559] > Provide an option to limit MemorySink memory usage > -- > > Key: SPARK-24525 > URL: https://issues.apache.org/jira/browse/SPARK-24525 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > Fix For: 2.4.0 > > > MemorySink stores stream results in memory and is mostly used for testing and > displaying streams, but for large streams, this can OOM the driver. We should > add an option to limit the number of rows and the total size of a memory sink > and not add any new data once either limit is hit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24525) Provide an option to limit MemorySink memory usage
[ https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-24525: --- Assignee: Mukul Murthy > Provide an option to limit MemorySink memory usage > -- > > Key: SPARK-24525 > URL: https://issues.apache.org/jira/browse/SPARK-24525 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > > MemorySink stores stream results in memory and is mostly used for testing and > displaying streams, but for large streams, this can OOM the driver. We should > add an option to limit the number of rows and the total size of a memory sink > and not add any new data once either limit is hit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23515) JsonProtocol.sparkEventToJson can OOM when jsonifying an event
Burak Yavuz created SPARK-23515: --- Summary: JsonProtocol.sparkEventToJson can OOM when jsonifying an event Key: SPARK-23515 URL: https://issues.apache.org/jira/browse/SPARK-23515 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Burak Yavuz Assignee: Burak Yavuz {code} def sparkEventToJson(event: SparkListenerEvent) {code} has a fallback method which creates a JSON object by turning an unrecognized event into JSON and then parsing it again. This method materializes the whole string to parse the JSON record, which is unnecessary and can cause OOMs, as seen in the stacktrace below:
{code:java}
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:3664)
  at java.lang.String.<init>(String.java:207)
  at java.lang.StringBuilder.toString(StringBuilder.java:407)
  at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:356)
  at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.getText(ReaderBasedJsonParser.java:235)
  at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:20)
  at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
  at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
  at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
  at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
  at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
  at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
  at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:103)
{code} We should just use stream parsing to avoid such OOMs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
[ https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336183#comment-16336183 ] Burak Yavuz commented on SPARK-23194: - cc [~cloud_fan] Seems like you touched that code last in [https://github.com/apache/spark/commit/68d65fae71e475ad811a9716098aca03a2af9532]. Thoughts? > from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls > --- > > Key: SPARK-23194 > URL: https://issues.apache.org/jira/browse/SPARK-23194 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Priority: Major > > from_json accepts Json parsing options such as being PERMISSIVE to parsing > errors or failing fast. It seems from the code that even though the default > option is to fail fast, we catch that exception and return nulls. > > In order to not change behavior, we should remove that try-catch block and > change the default to permissive, but allow failfast mode to indeed fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
Burak Yavuz created SPARK-23194: --- Summary: from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls Key: SPARK-23194 URL: https://issues.apache.org/jira/browse/SPARK-23194 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Burak Yavuz from_json accepts Json parsing options such as being PERMISSIVE to parsing errors or failing fast. It seems from the code that even though the default option is to fail fast, we catch that exception and return nulls. In order to not change behavior, we should remove that try-catch block and change the default to permissive, but allow failfast mode to indeed fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
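A repro sketch of the described behavior, assuming a SparkSession `spark` is in scope (`from_json` does accept an options map, so FAILFAST can be requested):
{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("a", IntegerType)

Seq("""{"a": bad}""").toDF("json")
  .select(from_json($"json", schema, Map("mode" -> "FAILFAST")).as("parsed"))
  .show()   // observed: a null row; expected under FAILFAST: a parse failure
{code}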
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334226#comment-16334226 ] Burak Yavuz commented on SPARK-23173: - In terms of usability, I prefer 1. In terms of the viewpoint of a data engineer, I would like 2 as well if that's not too hard. Basically, if I expect that my data doesn't have nulls, but is suddenly outputting them, I would rather have it fail initially (or get written out to the \_corrupt\_record column). In an ideal world, I should be able to either permit nullable fields (Option 1), or have the record be written out as corrupt. > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-23173 > URL: https://issues.apache.org/jira/browse/SPARK-23173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Herman van Hovell >Priority: Major > > The {{from_json}} function uses a schema to convert a string into a Spark SQL > struct. This schema can contain non-nullable fields. The underlying > {{JsonToStructs}} expression does not check if a resulting struct respects > the nullability of the schema. This leads to very weird problems in consuming > expressions. In our case, parquet writing would produce an illegal parquet > file. > There are roughly two solutions here: > # Assume that each field in schema passed to {{from_json}} is nullable, and > ignore the nullability information set in the passed schema. > # Validate the object during runtime, and fail execution if the data is null > where we are not expecting this. > I currently am slightly in favor of option 1, since this is the more > performant option and a lot easier to do. > WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
[ https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329293#comment-16329293 ] Burak Yavuz commented on SPARK-23135: - cc [~vanzin] > Accumulators don't show up properly in the Stages page anymore > -- > > Key: SPARK-23135 > URL: https://issues.apache.org/jira/browse/SPARK-23135 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 > Environment: > > >Reporter: Burak Yavuz >Priority: Blocker > Attachments: webUIAccumulatorRegression.png > > > Didn't do a lot of digging but may be caused by: > [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] > > !webUIAccumulatorRegression.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
[ https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-23135: Environment: was: Didn't do a lot of digging but may be caused by: [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] > Accumulators don't show up properly in the Stages page anymore > -- > > Key: SPARK-23135 > URL: https://issues.apache.org/jira/browse/SPARK-23135 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 > Environment: > > >Reporter: Burak Yavuz >Priority: Blocker > Attachments: webUIAccumulatorRegression.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
[ https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-23135: Attachment: webUIAccumulatorRegression.png > Accumulators don't show up properly in the Stages page anymore > -- > > Key: SPARK-23135 > URL: https://issues.apache.org/jira/browse/SPARK-23135 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 > Environment: > > >Reporter: Burak Yavuz >Priority: Blocker > Attachments: webUIAccumulatorRegression.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
[ https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-23135: Description: Didn't do a lot of digging but may be caused by: [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] !webUIAccumulatorRegression.png! > Accumulators don't show up properly in the Stages page anymore > -- > > Key: SPARK-23135 > URL: https://issues.apache.org/jira/browse/SPARK-23135 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 > Environment: > > >Reporter: Burak Yavuz >Priority: Blocker > Attachments: webUIAccumulatorRegression.png > > > Didn't do a lot of digging but may be caused by: > [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] > > !webUIAccumulatorRegression.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore
Burak Yavuz created SPARK-23135: --- Summary: Accumulators don't show up properly in the Stages page anymore Key: SPARK-23135 URL: https://issues.apache.org/jira/browse/SPARK-23135 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Environment: Didn't do a lot of digging but may be caused by: [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932] Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
Burak Yavuz created SPARK-23094: --- Summary: Json Readers choose wrong encoding when bad records are present and fail Key: SPARK-23094 URL: https://issues.apache.org/jira/browse/SPARK-23094 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Burak Yavuz The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser code paths for expressions but not the readers. We should also cover reader code paths reading files with bad characters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23092) Migrate MemoryStream to DataSource V2
Burak Yavuz created SPARK-23092: --- Summary: Migrate MemoryStream to DataSource V2 Key: SPARK-23092 URL: https://issues.apache.org/jira/browse/SPARK-23092 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Burak Yavuz We should migrate the MemoryStream for Structured Streaming to DataSourceV2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
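For context, MemoryStream is the in-memory source that the Structured Streaming test suites drive programmatically; typical usage looks roughly like this (assuming a SparkSession `spark`):
{code:scala}
import org.apache.spark.sql.execution.streaming.MemoryStream

implicit val sqlContext = spark.sqlContext
import spark.implicits._

val input = MemoryStream[Int]
val query = input.toDF().writeStream.format("console").start()
input.addData(1, 2, 3)        // becomes the next micro-batch
query.processAllAvailable()   // block until the batch is handled
{code}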
[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-20168: --- Assignee: Yash Sharma > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, streaming > Fix For: 2.3.0 > > > Kinesis client can resume from a specified timestamp while creating a stream. > We should have an option to pass a timestamp in config to allow Kinesis to > resume from the given timestamp. > Have started initial work and will be posting a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-20168: Fix Version/s: 2.3.0 > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, streaming > Fix For: 2.3.0 > > > Kinesis client can resume from a specified timestamp while creating a stream. > We should have an option to pass a timestamp in config to allow Kinesis to > resume from the given timestamp. > Have started initial work and will be posting a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-20168. - Resolution: Done Target Version/s: 2.3.0 Resolved with https://github.com/apache/spark/pull/18029 > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, streaming > > Kinesis client can resume from a specified timestamp while creating a stream. > We should have an option to pass a timestamp in config to allow Kinesis to > resume from the given timestamp. > Have started initial work and will be posting a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
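As merged for 2.3, the feature is exposed through the Kinesis DStream builder; a sketch (the exact builder method names are assumptions based on the 2.3-era connector, and the surrounding setup is illustrative):
{code:scala}
import java.util.Date
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

// ssc: an existing org.apache.spark.streaming.StreamingContext
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("events")
  .checkpointAppName("events-app")
  .initialPosition(new KinesisInitialPositions.AtTimestamp(new Date(1506029418000L)))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
{code}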
[jira] [Created] (SPARK-22238) EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed
Burak Yavuz created SPARK-22238: --- Summary: EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed Key: SPARK-22238 URL: https://issues.apache.org/jira/browse/SPARK-22238 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Burak Yavuz Assignee: Burak Yavuz In EnsureStatefulOpPartitioning, we check that the inputRDD to a SparkPlan has the expected partitioning for Streaming Stateful Operators. The problem is that we are not allowed to access this information during planning. We added that check because CoalesceExec could create RDDs with 0 partitions. We should fix it such that when CoalesceExec says that there is a SinglePartition, there is in fact an inputRDD of 1 partition instead of 0 partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements
[ https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-21977. - Resolution: Fixed Fix Version/s: 2.3.0 > SinglePartition optimizations break certain Streaming Stateful Aggregation > requirements > --- > > Key: SPARK-21977 > URL: https://issues.apache.org/jira/browse/SPARK-21977 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.3.0 > > > This is a bit hard to explain as there are several issues here, I'll try my > best. Here are the requirements: > 1. A StructuredStreaming Source that can generate empty RDDs with 0 > partitions > 2. A StructuredStreaming query that uses the above source, performs a > stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesces > down to 1 partition > The crux of the problem is that when a dataset has a `coalesce(1)` call, it > receives a `SinglePartition` partitioning scheme. This scheme satisfies most > required distributions used for aggregations such as HashAggregateExec. This > causes a world of problems: > Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive > 0 partitions, nothing will be executed, the state store will not create any > delta files. When this happens, the next trigger fails, because the > StateStore fails to load the delta file for the previous trigger. > Symptom 2. Let's say that there was data. Then in this case, if you stop > your stream, and replace `coalesce(1)` with `coalesce(2)`, then restart your > stream, your stream will fail, because `spark.sql.shuffle.partitions - 1` > number of StateStores will fail to find their delta files. > To fix the issues above, we must check that the partitioning of the child of > a `StatefulOperator` satisfies: > If the grouping expressions are empty: > a) AllTuples distribution > b) Single physical partition > If the grouping expressions are non-empty: > a) Clustered distribution > b) spark.sql.shuffle.partitions # of partitions > whether or not coalesce(1) exists in the plan, and whether or not the input > RDD for the trigger has any data. > Once you fix the above problem by adding an Exchange to the plan, you come > across the following bug: > If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and if > you have a trigger with no data, `StateStoreRestoreExec` doesn't return the > prior state. However, for this specific aggregation, `HashAggregateExec` > after the restore returns a (0, 0) row, since we're performing a count, and > there is no data. Then this data gets stored in `StateStoreSaveExec` causing > the previous counts to be overwritten and lost. > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
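The problematic query shape, as a sketch (the source is illustrative; the key point is that coalesce(1) yields a SinglePartition that satisfies the aggregation's required distribution, so no exchange is inserted):
{code:scala}
// streamingDF: a streaming DataFrame from a source that can produce
// empty, 0-partition micro-batches
val counts = streamingDF
  .coalesce(1)   // SinglePartition: satisfies HashAggregateExec's requirement
  .groupBy()     // global stateful aggregation
  .count()
{code}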
[jira] [Updated] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements
[ https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-21977: Description: This is a bit hard to explain as there are several issues here; I'll try my best. Here are the requirements: 1. A StructuredStreaming Source that can generate empty RDDs with 0 partitions 2. A StructuredStreaming query that uses the above source, performs a stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesces to 1 partition The crux of the problem is that when a dataset has a `coalesce(1)` call, it receives a `SinglePartition` partitioning scheme. This scheme satisfies most required distributions used for aggregations such as HashAggregateExec. This causes a world of problems: Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive 0 partitions, nothing will be executed, and the state store will not create any delta files. When this happens, the next trigger fails, because the StateStore fails to load the delta file for the previous trigger. Symptom 2. Let's say that there was data. In that case, if you stop your stream, replace `coalesce(1)` with `coalesce(2)`, and then restart your stream, the stream will fail, because `spark.sql.shuffle.partitions - 1` StateStores will fail to find their delta files. To fix the issues above, we must check that the partitioning of the child of a `StatefulOperator` satisfies: If the grouping expressions are empty: a) AllTuples distribution b) a single physical partition If the grouping expressions are non-empty: a) ClusteredDistribution b) `spark.sql.shuffle.partitions` # of partitions whether or not coalesce(1) exists in the plan, and whether or not the input RDD for the trigger has any data. Once you fix the above problem by adding an Exchange to the plan, you come across the following bug: If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and you have a trigger with no data, `StateStoreRestoreExec` doesn't return the prior state. However, for this specific aggregation, `HashAggregateExec` after the restore returns a (0, 0) row, since we're performing a count and there is no data. That row then gets stored in `StateStoreSaveExec`, causing the previous counts to be overwritten and lost. was:This is a bit hard to explain as there are several issues here > SinglePartition optimizations break certain Streaming Stateful Aggregation > requirements > --- > > Key: SPARK-21977 > URL: https://issues.apache.org/jira/browse/SPARK-21977 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > This is a bit hard to explain as there are several issues here; I'll try my > best. Here are the requirements: > 1. A StructuredStreaming Source that can generate empty RDDs with 0 > partitions > 2. A StructuredStreaming query that uses the above source, performs a > stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesces > to 1 partition > The crux of the problem is that when a dataset has a `coalesce(1)` call, it > receives a `SinglePartition` partitioning scheme. This scheme satisfies most > required distributions used for aggregations such as HashAggregateExec. This > causes a world of problems: > Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive > 0 partitions, nothing will be executed, and the state store will not create > any delta files. When this happens, the next trigger fails, because the > StateStore fails to load the delta file for the previous trigger. > Symptom 2. Let's say that there was data. In that case, if you stop your > stream, replace `coalesce(1)` with `coalesce(2)`, and then restart your > stream, the stream will fail, because `spark.sql.shuffle.partitions - 1` > StateStores will fail to find their delta files. > To fix the issues above, we must check that the partitioning of the child of > a `StatefulOperator` satisfies: > If the grouping expressions are empty: > a) AllTuples distribution > b) a single physical partition > If the grouping expressions are non-empty: > a) ClusteredDistribution > b) `spark.sql.shuffle.partitions` # of partitions > whether or not coalesce(1) exists in the plan, and whether or not the input > RDD for the trigger has any data. > Once you fix the above problem by adding an Exchange to the plan, you come > across the following bug: > If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and > you have a trigger with no data, `StateStoreRestoreExec` doesn't return the > prior state. However, for this specific aggregation, `HashAggregateExec` > after the restore returns a (0, 0) row, since we're performing a count and > there is no data. That row then gets stored in `StateStoreSaveExec`, causing > the previous counts to be overwritten and lost.
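The distribution requirement spelled out above maps directly onto Spark's physical Distribution classes. Here is a minimal hedged sketch of the check (illustrative object and method names, not the actual EnsureStatefulOpPartitioning rule):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution}

// Sketch of the requirement above: with no grouping expressions the child
// must satisfy AllTuples (a single physical partition); otherwise it must
// be clustered on the grouping keys. This must hold whether or not a
// coalesce(1) appears in the plan and whether or not the trigger has data.
object StatefulDistributionSketch {
  def requiredChildDistribution(groupingExpressions: Seq[Expression]): Distribution =
    if (groupingExpressions.isEmpty) AllTuples
    else ClusteredDistribution(groupingExpressions)
}
{code}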
[jira] [Updated] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements
[ https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-21977: Summary: SinglePartition optimizations break certain Streaming Stateful Aggregation requirements (was: SinglePartition optimizations break certain StateStore requirements) > SinglePartition optimizations break certain Streaming Stateful Aggregation > requirements > --- > > Key: SPARK-21977 > URL: https://issues.apache.org/jira/browse/SPARK-21977 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > This is a bit hard to explain as there are several issues here