[jira] [Created] (SPARK-34075) Hidden directories are being listed for partition inference

2021-01-11 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-34075:
---

 Summary: Hidden directories are being listed for partition 
inference
 Key: SPARK-34075
 URL: https://issues.apache.org/jira/browse/SPARK-34075
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Burak Yavuz


Marking this as a blocker since it seems to be a regression. We are running 
Delta's tests against Spark 3.1 as part of QA here: 
[https://github.com/delta-io/delta/pull/579]

 

We have noticed that one of our tests regressed with:
{code:java}
java.lang.AssertionError: assertion failed: Conflicting directory structures 
detected. Suspicious paths:
[info]  
file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551
[info]  
file:/private/var/folders/_2/xn1c9yr11_93wjdk2vkvmwm0gp/t/spark-18706bcc-23ea-4853-b8bc-c4cc2a5ed551/_delta_log
[info] 
[info] If provided paths are partition directories, please set "basePath" in 
the options of the data source to specify the root directory of the table. If 
there are multiple root directories, please load them separately and then union 
them.
[info]   at scala.Predef$.assert(Predef.scala:223)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:172)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:104)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:158)
[info]   at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:73)
[info]   at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:167)
[info]   at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:418)
[info]   at 
org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:62)
[info]   at 
org.apache.spark.sql.execution.datasources.ResolveSQLOnFile$$anonfun$apply$1.applyOrElse(rules.scala:45)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
[info]   at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
[info]   at 
org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:45)
[info]   at 
org.apache.spark.sql.execution.datasources.ResolveSQLOnFile.apply(rules.scala:40)
[info]   at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
[info]   at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:89)
[info]   at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
[info]   at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
[info]   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:195)
[info]   at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:189) 
{code}
It seems like a hidden directory is not being filtered out when it actually 
should be.
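
A minimal sketch of the filtering we'd expect during listing (a hypothetical 
helper, not the actual InMemoryFileIndex code): paths whose final component 
starts with "_" or "." should be excluded from partition inference.
{code:java}
import org.apache.hadoop.fs.Path

// Hypothetical helper: directories such as _delta_log or .staging should be
// treated as hidden and never participate in partition inference.
def isHiddenPath(p: Path): Boolean = {
  val name = p.getName
  name.startsWith("_") || name.startsWith(".")
}

assert(isHiddenPath(new Path("/tmp/spark-18706bcc/_delta_log")))
assert(!isHiddenPath(new Path("/tmp/spark-18706bcc/part=1")))
{code}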



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-31255) DataSourceV2: Add metadata columns

2020-11-18 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-31255.
-
Fix Version/s: 3.1.0
 Assignee: Ryan Blue
   Resolution: Done

Resolved by https://github.com/apache/spark/pull/28027

> DataSourceV2: Add metadata columns
> --
>
> Key: SPARK-31255
> URL: https://issues.apache.org/jira/browse/SPARK-31255
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.1.0
>
>
> DSv2 should support reading additional metadata columns that are not in a 
> table's schema. This allows users to project metadata like Kafka's offset, 
> timestamp, and partition. It also allows other sources to expose metadata 
> like file and row position.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31640) Support SHOW PARTITIONS for DataSource V2 tables

2020-05-08 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102746#comment-17102746
 ] 

Burak Yavuz commented on SPARK-31640:
-

Hi [~younggyuchun],

 

I'd take a look at how SHOW PARTITIONS works today:

 - 
[https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala#L1023]

which returns a list of string paths.

It would be great if, with [DataSourceV2 
tables|https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java#L43],
 we could return the list of partitions 
([https://github.com/apache/spark/blob/36803031e850b08d689df90d15c75e1a1eeb28a8/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java#L60])
 where each Transform is a separate column.
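
A rough sketch of the shape that output could take (hypothetical catalog, 
table, and partitioning, assuming a SparkSession named spark; not a spec):
{code:java}
// For a v2 table partitioned by (years(ts), bucket(16, id)), SHOW PARTITIONS
// could return one column per transform instead of a single path string:
spark.sql("SHOW PARTITIONS testcat.ns1.ns2.tbl").show()
// +---------+--------------+
// |years(ts)|bucket(16, id)|
// +---------+--------------+
// |     2019|             3|
// |     2020|             7|
// +---------+--------------+
{code}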

> Support SHOW PARTITIONS for DataSource V2 tables
> 
>
> Key: SPARK-31640
> URL: https://issues.apache.org/jira/browse/SPARK-31640
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> SHOW PARTITIONS is supported for V1 Hive tables. We can also support it for 
> V2 tables, returning the transforms and the values of those transforms as 
> separate columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31640) Support SHOW PARTITIONS for DataSource V2 tables

2020-05-04 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31640:
---

 Summary: Support SHOW PARTITIONS for DataSource V2 tables
 Key: SPARK-31640
 URL: https://issues.apache.org/jira/browse/SPARK-31640
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


SHOW PARTITIONS is supported for V1 Hive tables. We can also support it for V2 
tables, returning the transforms and the values of those transforms as separate 
columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-31624:
---

Assignee: Burak Yavuz

> SHOW TBLPROPERTIES doesn't handle Session Catalog correctly
> ---
>
> Key: SPARK-31624
> URL: https://issues.apache.org/jira/browse/SPARK-31624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
> catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31624) SHOW TBLPROPERTIES doesn't handle Session Catalog correctly

2020-05-01 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31624:
---

 Summary: SHOW TBLPROPERTIES doesn't handle Session Catalog 
correctly
 Key: SPARK-31624
 URL: https://issues.apache.org/jira/browse/SPARK-31624
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


SHOW TBLPROPERTIES doesn't handle DataSource V2 tables that use the session 
catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data

2020-04-08 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29314.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/25987]

> ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 
> 0 when it actually runs a batch even with no data
> --
>
> Key: SPARK-29314
> URL: https://issues.apache.org/jira/browse/SPARK-29314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-24156 brought the ability to run a batch without actual data, to enable 
> fast state cleanup as well as to emit evicted outputs without waiting for new 
> data to arrive.
> This breaks an assumption in 
> `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source 
> code:
> {code:java}
> // lastExecution could belong to one of the previous triggers if `!hasNewData`.
> // Walking the plan again should be inexpensive.
> {code}
> newNumRowsUpdated is replaced with 0 if hasNewData is false. This makes sense 
> when we copy the progress from the previous execution (meaning no batch was 
> run this time), but after SPARK-24156 that precondition no longer holds.
> Spark should still replace the value of newNumRowsUpdated with 0 if no batch 
> was run and it needs to copy the old value from the previous execution, but it 
> shouldn't touch the value if it runs a batch with no data.
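
A minimal sketch of the intended behavior described above (hypothetical names; 
the real logic lives in ProgressReporter.extractStateOperatorMetrics):
{code:java}
// Only reset numRowsUpdated to 0 when the progress is copied from a previous
// execution, i.e. when no batch ran at all for this trigger. If a no-data
// batch actually ran, keep the value extracted from the current plan.
def newNumRowsUpdated(ranBatchThisTrigger: Boolean, valueFromPlan: Long): Long =
  if (ranBatchThisTrigger) valueFromPlan else 0L

assert(newNumRowsUpdated(ranBatchThisTrigger = false, valueFromPlan = 42L) == 0L)
assert(newNumRowsUpdated(ranBatchThisTrigger = true, valueFromPlan = 42L) == 42L)
{code}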



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data

2020-04-08 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-29314:
---

Assignee: Jungtaek Lim

> ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 
> 0 when it actually runs a batch even with no data
> --
>
> Key: SPARK-29314
> URL: https://issues.apache.org/jira/browse/SPARK-29314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> SPARK-24156 brought the ability to run a batch without actual data, to enable 
> fast state cleanup as well as to emit evicted outputs without waiting for new 
> data to arrive.
> This breaks an assumption in 
> `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source 
> code:
> {code:java}
> // lastExecution could belong to one of the previous triggers if `!hasNewData`.
> // Walking the plan again should be inexpensive.
> {code}
> newNumRowsUpdated is replaced with 0 if hasNewData is false. This makes sense 
> when we copy the progress from the previous execution (meaning no batch was 
> run this time), but after SPARK-24156 that precondition no longer holds.
> Spark should still replace the value of newNumRowsUpdated with 0 if no batch 
> was run and it needs to copy the old value from the previous execution, but it 
> shouldn't touch the value if it runs a batch with no data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data

2020-04-07 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-31278.
-
Resolution: Fixed

> numOutputRows shows value from last micro batch when there is no new data
> -
>
> Key: SPARK-31278
> URL: https://issues.apache.org/jira/browse/SPARK-31278
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> In Structured Streaming, we provide progress updates every 10 seconds when a 
> stream doesn't have any new data upstream. When providing this progress 
> though, we zero out the input information but not the output information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data

2020-04-07 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-31278:
---

Assignee: Burak Yavuz

> numOutputRows shows value from last micro batch when there is no new data
> -
>
> Key: SPARK-31278
> URL: https://issues.apache.org/jira/browse/SPARK-31278
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> In Structured Streaming, we provide progress updates every 10 seconds when a 
> stream doesn't have any new data upstream. When providing this progress 
> though, we zero out the input information but not the output information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data

2020-04-07 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077695#comment-17077695
 ] 

Burak Yavuz commented on SPARK-31278:
-

Resolved by [https://github.com/apache/spark/pull/28040]

> numOutputRows shows value from last micro batch when there is no new data
> -
>
> Key: SPARK-31278
> URL: https://issues.apache.org/jira/browse/SPARK-31278
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> In Structured Streaming, we provide progress updates every 10 seconds when a 
> stream doesn't have any new data upstream. When providing this progress 
> though, we zero out the input information but not the output information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data

2020-04-07 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-31278:

Fix Version/s: 3.0.0

> numOutputRows shows value from last micro batch when there is no new data
> -
>
> Key: SPARK-31278
> URL: https://issues.apache.org/jira/browse/SPARK-31278
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> In Structured Streaming, we provide progress updates every 10 seconds when a 
> stream doesn't have any new data upstream. When providing this progress 
> though, we zero out the input information but not the output information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31278) numOutputRows shows value from last micro batch when there is no new data

2020-03-26 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31278:
---

 Summary: numOutputRows shows value from last micro batch when 
there is no new data
 Key: SPARK-31278
 URL: https://issues.apache.org/jira/browse/SPARK-31278
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Burak Yavuz


In Structured Streaming, we provide progress updates every 10 seconds when a 
stream doesn't have any new data upstream. When providing this progress though, 
we zero out the input information but not the output information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts

2020-03-18 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-31178.
-
Fix Version/s: 3.0.0
 Assignee: Burak Yavuz
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/27941]

> sql("INSERT INTO v2DataSource ...").collect() double inserts
> 
>
> Key: SPARK-31178
> URL: https://issues.apache.org/jira/browse/SPARK-31178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The following unit test fails in DataSourceV2SQLSuite:
> {code:java}
> test("do not double insert on INSERT INTO collect()") {
>   import testImplicits._
>   val t1 = s"${catalogAndNamespace}tbl"
>   sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
>   val tmpView = "test_data"
>   val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
>   df.createOrReplaceTempView(tmpView)
>   sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()
>   verifyTable(t1, df)
> } {code}
> The INSERT INTO is double inserting when ".collect()" is called. I think this 
> is because the V2 SparkPlans are not commands, and doExecute on a Spark plan 
> can be called multiple times causing data to be inserted multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts

2020-03-17 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061292#comment-17061292
 ] 

Burak Yavuz commented on SPARK-31178:
-

cc [~wenchen] [~rdblue]

> sql("INSERT INTO v2DataSource ...").collect() double inserts
> 
>
> Key: SPARK-31178
> URL: https://issues.apache.org/jira/browse/SPARK-31178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> The following unit test fails in DataSourceV2SQLSuite:
> {code:java}
> test("do not double insert on INSERT INTO collect()") {
>   import testImplicits._
>   val t1 = s"${catalogAndNamespace}tbl"
>   sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
>   val tmpView = "test_data"
>   val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
>   df.createOrReplaceTempView(tmpView)
>   sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()
>   verifyTable(t1, df)
> } {code}
> The INSERT INTO is double inserting when ".collect()" is called. I think this 
> is because the V2 SparkPlans are not commands, and doExecute on a Spark plan 
> can be called multiple times causing data to be inserted multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts

2020-03-17 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31178:
---

 Summary: sql("INSERT INTO v2DataSource ...").collect() double 
inserts
 Key: SPARK-31178
 URL: https://issues.apache.org/jira/browse/SPARK-31178
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


The following unit test fails in DataSourceV2SQLSuite:
{code:java}
test("do not double insert on INSERT INTO collect()") {
  import testImplicits._
  val t1 = s"${catalogAndNamespace}tbl"
  sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
  val tmpView = "test_data"
  val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
  df.createOrReplaceTempView(tmpView)
  sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()

  verifyTable(t1, df)
} {code}
The INSERT INTO is double inserting when ".collect()" is called. I think this 
is because the V2 SparkPlans are not commands, and doExecute on a Spark plan 
can be called multiple times causing data to be inserted multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore

2020-03-05 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31061:
---

 Summary: Impossible to change the provider of a table in the 
HiveMetaStore
 Key: SPARK-31061
 URL: https://issues.apache.org/jira/browse/SPARK-31061
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


Currently, it's impossible to alter the datasource of a table in the 
HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change 
the provider table property during an alterTable command. This is required to 
support changing table formats when using commands like REPLACE TABLE.
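
A sketch of the scenario this blocks (hypothetical table and formats, assuming 
a SparkSession named spark): a REPLACE TABLE that switches formats needs the 
provider stored in the metastore to be updated.
{code:java}
// Hypothetical illustration: replacing a parquet table with a delta table
// requires the metastore entry's "provider" property to change from parquet
// to delta, which the HiveExternalCatalog alterTable path currently never does.
spark.sql("CREATE TABLE db.t (id BIGINT) USING parquet")
spark.sql("REPLACE TABLE db.t (id BIGINT, data STRING) USING delta")
{code}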



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30924) Add additional validation into Merge Into

2020-02-22 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30924:
---

 Summary: Add additional validation into Merge Into
 Key: SPARK-30924
 URL: https://issues.apache.org/jira/browse/SPARK-30924
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


Merge Into is currently missing additional validation around:

 1. The lack of any WHEN clauses

 2. UPDATE/DELETE each being used only once

 3. The first WHEN MATCHED clause having a condition when there are two WHEN 
MATCHED clauses

Illustrative statements that this validation should reject are sketched below.
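
A few MERGE INTO statements the added validation should reject (a sketch; table 
and column names are hypothetical, assuming a SparkSession named spark):
{code:java}
// 1. No WHEN clauses at all
spark.sql("""
  MERGE INTO target t USING source s ON t.id = s.id
""")

// 2. UPDATE used more than once
spark.sql("""
  MERGE INTO target t USING source s ON t.id = s.id
  WHEN MATCHED AND s.flag = 1 THEN UPDATE SET t.data = s.data
  WHEN MATCHED AND s.flag = 2 THEN UPDATE SET t.data = s.data
""")

// 3. Two WHEN MATCHED clauses where the first one has no condition
spark.sql("""
  MERGE INTO target t USING source s ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.data = s.data
  WHEN MATCHED THEN DELETE
""")
{code}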



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save

2020-02-19 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29908.
-
Resolution: Duplicate

> Support partitioning for DataSource V2 tables in DataFrameWriter.save
> -
>
> Key: SPARK-29908
> URL: https://issues.apache.org/jira/browse/SPARK-29908
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Currently, any data source that upgrades to DataSource V2 loses the 
> partition transform information when using DataFrameWriter.save. The main 
> reason is the lack of an API for "creating" a table with partitioning and 
> schema information for V2 tables without a catalog.
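
A sketch of the described loss (illustrative source name, assuming an existing 
DataFrame df):
{code:java}
// The partitioning passed to partitionBy has nowhere to go: there is no
// "create table" API carrying partition transforms for a path-based v2 table
// that is not backed by a catalog.
df.write
  .format("com.example.v2source")   // illustrative v2 TableProvider
  .partitionBy("date")
  .save("/path/to/table")
{code}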



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30814) Add Columns references should be able to resolve each other

2020-02-13 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036418#comment-17036418
 ] 

Burak Yavuz commented on SPARK-30814:
-

cc [~cloud_fan] [~imback82], can we prioritize this over REPLACE COLUMNS if 
possible?

> Add Columns references should be able to resolve each other
> ---
>
> Key: SPARK-30814
> URL: https://issues.apache.org/jira/browse/SPARK-30814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> In ResolveAlterTableChanges, we have checks that make sure that positional 
> arguments exist and are normalized around case sensitivity for ALTER TABLE 
> ADD COLUMNS. However, we missed the case where a column in ADD COLUMNS can 
> depend on the position of a column that is just being added.
> For example for the schema:
> {code:java}
> root:
>   - a: string
>   - b: long
>  {code}
>  
> The following should work:
> {code:java}
> ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x) {code}
> Currently, the above statement will throw an error saying that AFTER x cannot 
> be resolved, because x doesn't exist yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30814) Add Columns references should be able to resolve each other

2020-02-13 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30814:
---

 Summary: Add Columns references should be able to resolve each 
other
 Key: SPARK-30814
 URL: https://issues.apache.org/jira/browse/SPARK-30814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


In ResolveAlterTableChanges, we have checks that make sure that positional 
arguments exist and are normalized around case sensitivity for ALTER TABLE ADD 
COLUMNS. However, we missed the case where a column in ADD COLUMNS can depend 
on the position of a column that is just being added.

For example for the schema:
{code:java}
root:
  - a: string
  - b: long
 {code}
 

The following should work:
{code:java}
ALTER TABLE ... ADD COLUMNS (x int AFTER a, y int AFTER x) {code}
Currently, the above statement will throw an error saying that AFTER x cannot 
be resolved, because x doesn't exist yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30697) Handle database and namespace exceptions in catalog.isView

2020-02-02 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-30697:
---

Assignee: Burak Yavuz

> Handle database and namespace exceptions in catalog.isView
> --
>
> Key: SPARK-30697
> URL: https://issues.apache.org/jira/browse/SPARK-30697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> The non-existence of a database shouldn't throw a NoSuchDatabaseException 
> from catalog.isView



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30697) Handle database and namespace exceptions in catalog.isView

2020-02-02 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-30697.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [#27423|https://github.com/apache/spark/pull/27423]

> Handle database and namespace exceptions in catalog.isView
> --
>
> Key: SPARK-30697
> URL: https://issues.apache.org/jira/browse/SPARK-30697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> The non-existence of a database shouldn't throw a NoSuchDatabaseException 
> from catalog.isView



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30697) Handle database and namespace exceptions in catalog.isView

2020-01-31 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30697:
---

 Summary: Handle database and namespace exceptions in catalog.isView
 Key: SPARK-30697
 URL: https://issues.apache.org/jira/browse/SPARK-30697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


The non-existence of a database shouldn't throw a NoSuchDatabaseException from 
catalog.isView
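
A minimal sketch of the expected behavior (hypothetical helper, not the actual 
implementation): a missing database or namespace should simply mean "not a 
view" rather than an exception.
{code:java}
import org.apache.spark.sql.catalyst.analysis.{NoSuchDatabaseException, NoSuchNamespaceException}

// Hypothetical wrapper: swallow "missing database/namespace" during the lookup
// that backs catalog.isView and report "not a view" instead of failing.
def isViewSafely(lookup: => Boolean): Boolean =
  try lookup
  catch {
    case _: NoSuchDatabaseException | _: NoSuchNamespaceException => false
  }
{code}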



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming

2020-01-30 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-30669.
-
Fix Version/s: 3.0.0
 Assignee: Burak Yavuz
   Resolution: Done

Resolved by [https://github.com/apache/spark/pull/27380]

> Introduce AdmissionControl API to Structured Streaming
> --
>
> Key: SPARK-30669
> URL: https://issues.apache.org/jira/browse/SPARK-30669
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> In Structured Streaming, we have the concept of Triggers. With a trigger like 
> Trigger.Once(), the semantics are to process all the data available to the 
> datasource in a single micro-batch. However, this semantic can be broken when 
> data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate 
> limit the amount of data read for that micro-batch.
> We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. 
> A ReadLimit defines how much data should be read in the next micro-batch. 
> `SupportsAdmissionControl` specifies that a source can rate limit its ingest 
> into the system. The source can tell the system what the user specified as a 
> read limit, and the system can enforce this limit within each micro-batch or 
> impose its own limit if, for example, the Trigger is Trigger.Once().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming

2020-01-28 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30669:
---

 Summary: Introduce AdmissionControl API to Structured Streaming
 Key: SPARK-30669
 URL: https://issues.apache.org/jira/browse/SPARK-30669
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4
Reporter: Burak Yavuz


In Structured Streaming, we have the concept of Triggers. With a trigger like 
Trigger.Once(), the semantics are to process all the data available to the 
datasource in a single micro-batch. However, this semantic can be broken when 
data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate 
limit the amount of data read for that micro-batch.

We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. A 
ReadLimit defines how much data should be read in the next micro-batch. 
`SupportsAdmissionControl` specifies that a source can rate limit its ingest 
into the system. The source can tell the system what the user specified as a 
read limit, and the system can enforce this limit within each micro-batch or 
impose its own limit if, for example, the Trigger is Trigger.Once().
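
A rough sketch of what these interfaces could look like (only the names 
`SupportsAdmissionControl` and `ReadLimit` come from the proposal above; every 
other name and signature here is an assumption, not the final API):
{code:java}
// Rough sketch only.
trait ReadLimit
case class MaxRows(maxRows: Long) extends ReadLimit   // e.g. from maxOffsetsPerTrigger
case object ReadAllAvailable extends ReadLimit        // e.g. for Trigger.Once()

trait SupportsAdmissionControl {
  // What the user configured as a rate limit for this source, if anything.
  def getDefaultReadLimit: ReadLimit

  // The engine tells the source how much it may read for the next micro-batch;
  // the source returns the end offset it will read up to. AnyRef stands in for
  // the source's Offset type to keep the sketch self-contained.
  def latestOffset(startOffset: AnyRef, limit: ReadLimit): AnyRef
}
{code}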



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation

2020-01-26 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-30314.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26957]

> Add identifier and catalog information to DataSourceV2Relation
> --
>
> Key: SPARK-30314
> URL: https://issues.apache.org/jira/browse/SPARK-30314
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuchen Huo
>Assignee: Yuchen Huo
>Priority: Major
> Fix For: 3.0.0
>
>
> Add identifier and catalog information in DataSourceV2Relation so it would be 
> possible to do richer checks in *checkAnalysis* step.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30314) Add identifier and catalog information to DataSourceV2Relation

2020-01-26 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-30314:
---

Assignee: Yuchen Huo

> Add identifier and catalog information to DataSourceV2Relation
> --
>
> Key: SPARK-30314
> URL: https://issues.apache.org/jira/browse/SPARK-30314
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuchen Huo
>Assignee: Yuchen Huo
>Priority: Major
>
> Add identifier and catalog information in DataSourceV2Relation so it would be 
> possible to do richer checks in *checkAnalysis* step.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-23 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022678#comment-17022678
 ] 

Burak Yavuz commented on SPARK-30612:
-

I prefer SubqueryAlias. We need to support all degrees of the user-provided
identifier, I believe:

SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl

SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl

SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl

SELECT tbl.foo FROM testcat.ns1.ns2.tbl

should all work.

However I'm not sure if

SELECT spark_catalog.default.tbl.foo FROM tbl

should work. Are my assumptions correct?




> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, it 
> fails to resolve for v2 tables.
> v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable

2020-01-23 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022577#comment-17022577
 ] 

Burak Yavuz commented on SPARK-30615:
-

I actually had a PR in progress on this. Let me push that

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Because of case insensitive resolution, the column name in AlterTable may 
> match the table schema but not exactly the same. To ease DS v2 
> implementations, Spark should normalize the column name before passing them 
> to v2 catalogs, so that users don't need to care about the case sensitive 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-23 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022343#comment-17022343
 ] 

Burak Yavuz commented on SPARK-30612:
-

SPARK-30314 should help make this work easier

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, it 
> fails to resolve for v2 tables.
> v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save

2020-01-09 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29219.
-
Fix Version/s: 3.0.0
   Resolution: Done

Resolved by [https://github.com/apache/spark/pull/26913]

> DataSourceV2: Support all SaveModes in DataFrameWriter.save
> ---
>
> Key: SPARK-29219
> URL: https://issues.apache.org/jira/browse/SPARK-29219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> We currently don't support all save modes in DataFrameWriter.save as the 
> TableProvider interface allows for the reading/writing of data, but not for 
> the creation of tables. We created a catalog API to support the 
> creation/dropping/checking existence of tables, but DataFrameWriter.save 
> doesn't necessarily use a catalog, for example when writing to a path-based 
> table.
> For this case, we propose a new interface that will allow TableProviders to 
> extract an Identifier and a Catalog from a bundle of 
> CaseInsensitiveStringOptions. This information can then be used to check the 
> existence of a table, and support all save modes. If a Catalog is not 
> defined, then the behavior is to use the spark_catalog (or configured session 
> catalog) to perform the check.
>  
> The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(StringMap): String
>   def extractIdentifier(StringMap): Identifier
> } {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save

2020-01-09 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-29219:
---

Assignee: Burak Yavuz

> DataSourceV2: Support all SaveModes in DataFrameWriter.save
> ---
>
> Key: SPARK-29219
> URL: https://issues.apache.org/jira/browse/SPARK-29219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> We currently don't support all save modes in DataFrameWriter.save as the 
> TableProvider interface allows for the reading/writing of data, but not for 
> the creation of tables. We created a catalog API to support the 
> creation/dropping/checking existence of tables, but DataFrameWriter.save 
> doesn't necessarily use a catalog, for example when writing to a path-based 
> table.
> For this case, we propose a new interface that will allow TableProviders to 
> extract an Identifier and a Catalog from a bundle of 
> CaseInsensitiveStringOptions. This information can then be used to check the 
> existence of a table, and support all save modes. If a Catalog is not 
> defined, then the behavior is to use the spark_catalog (or configured session 
> catalog) to perform the check.
>  
> The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(StringMap): String
>   def extractIdentifier(StringMap): Identifier
> } {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30334) Add metadata around semi-structured columns to Spark

2019-12-23 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-30334:

Description: 
Semi-structured data is used widely in the data industry for reporting events 
in a wide variety of formats. Click events in product analytics can be stored 
as json. Some application logs can be in the form of delimited key=value text. 
Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column 
exists. This will then enable Spark to "auto-parse" these columns on the fly. 
The proposal is to store this information as part of the column metadata, in 
the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+---------------------+
| ts         | event | raw                 |
+------------+-------+---------------------+
| 2019-10-12 | click | field1=v1|field2=v2 |
+------------+-------+---------------------+ {code}
SELECT raw.field1 FROM data

will return v1.

 

As a first step, we will introduce the function "as_json", which accomplishes 
this for JSON columns.

  was:
Semi-structured data is used widely in the data industry for reporting events 
in a wide variety of formats. Click events in product analytics can be stored 
as json. Some application logs can be in the form of delimited key=value text. 
Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column 
exists. This will then enable Spark to "auto-parse" these columns on the fly. 
The proposal is to store this information as part of the column metadata, in 
the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
++---++
| ts | event |raw |
++---++
| 2019-10-12 | click | {"field":"value"}  |
++---++ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
++---+--+
| ts | event | raw  |
++---+--+
| 2019-10-12 | click | field1=v1|field2=v2  |
++---+--+ {code}
SELECT raw.field1 FROM data

will return v1.


> Add metadata around semi-structured columns to Spark
> 
>
> Key: SPARK-30334
> URL: https://issues.apache.org/jira/browse/SPARK-30334
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Priority: Major
>
> Semi-structured data is used widely in the data industry for reporting events 
> in a wide variety of formats. Click events in product analytics can be stored 
> as json. Some application logs can be in the form of delimited key=value 
> text. Some data may be in xml.
> The goal of this project is to be able to signal Spark that such a column 
> exists. This will then enable Spark to "auto-parse" these columns on the fly. 
> The proposal is to store this information as part of the column metadata, in 
> the fields:
>  - format: The format of the semi-structured column, e.g. json, xml, avro
>  - options: Options for parsing these columns
> Then imagine having the following data:
> {code:java}
> ++---++
> | ts | event |raw |
> ++---++
> | 2019-10-12 | click | {"field":"value"}  |
> ++---++ {code}
> SELECT raw.field FROM data
> will return "value"
> or the following data
> {code:java}
> ++---+--+
> | ts | event | raw  |
> ++---+--+
> | 2019-10-12 | click | field1=v1|field2=v2  |
> ++---+--+ {code}
> SELECT raw.field1 FROM data
> will return v1.
>  
> As a first step, we will introduce the function "as_json", which accomplishes 
> this for JSON columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30334) Add metadata around semi-structured columns to Spark

2019-12-23 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30334:
---

 Summary: Add metadata around semi-structured columns to Spark
 Key: SPARK-30334
 URL: https://issues.apache.org/jira/browse/SPARK-30334
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: Burak Yavuz


Semi-structured data is used widely in the data industry for reporting events 
in a wide variety of formats. Click events in product analytics can be stored 
as json. Some application logs can be in the form of delimited key=value text. 
Some data may be in xml.

The goal of this project is to be able to signal Spark that such a column 
exists. This will then enable Spark to "auto-parse" these columns on the fly. 
The proposal is to store this information as part of the column metadata, in 
the fields:

 - format: The format of the semi-structured column, e.g. json, xml, avro

 - options: Options for parsing these columns

Then imagine having the following data:
{code:java}
+------------+-------+--------------------+
| ts         | event | raw                |
+------------+-------+--------------------+
| 2019-10-12 | click | {"field":"value"}  |
+------------+-------+--------------------+ {code}
SELECT raw.field FROM data

will return "value"

or the following data
{code:java}
+------------+-------+---------------------+
| ts         | event | raw                 |
+------------+-------+---------------------+
| 2019-10-12 | click | field1=v1|field2=v2 |
+------------+-------+---------------------+ {code}
SELECT raw.field1 FROM data

will return v1.
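
A sketch of how the proposed "format" and "options" fields could be attached to 
a column's metadata (the field names follow the proposal above; the schema and 
option values are illustrative):
{code:java}
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

// Attach the proposed metadata to the semi-structured "raw" column.
val rawMetadata = new MetadataBuilder()
  .putString("format", "json")
  .putMetadata("options",
    new MetadataBuilder().putString("allowComments", "true").build())
  .build()

val schema = StructType(Seq(
  StructField("ts", StringType),
  StructField("event", StringType),
  StructField("raw", StringType, nullable = true, metadata = rawMetadata)
))
{code}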



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30324) Simplify API for JSON access in DataFrames/SQL

2019-12-20 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30324:
---

 Summary: Simplify API for JSON access in DataFrames/SQL
 Key: SPARK-30324
 URL: https://issues.apache.org/jira/browse/SPARK-30324
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.4
Reporter: Burak Yavuz


get_json_object() is a UDF to parse JSON fields. It is verbose and hard to use, 
e.g. I wasn't expecting the path to a field to have to start with "$.". 

We can simplify all of this when a column is of StringType and a nested field 
is requested. In the query planner, this API sugar will be rewritten into 
get_json_object.

This nested access can then be extended in the future to other semi-structured 
formats.
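
A sketch of the proposed sugar next to today's API (assuming a DataFrame df 
with a StringType column "raw" and a SparkSession named spark; the sugared form 
is hypothetical):
{code:java}
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._   // assumes a SparkSession named spark

// Today: verbose, and the path has to start with "$."
val today = df.select(get_json_object($"raw", "$.field"))

// Proposed sugar (hypothetical): nested access on a StringType column,
// rewritten by the query planner into get_json_object.
val proposed = df.select($"raw.field")
{code}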



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30143) StreamingQuery.stop() should not block indefinitely

2019-12-13 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-30143.
-
Fix Version/s: 3.0.0
   Resolution: Done

Resolved as part of [https://github.com/apache/spark/pull/26771]

> StreamingQuery.stop() should not block indefinitely
> ---
>
> Key: SPARK-30143
> URL: https://issues.apache.org/jira/browse/SPARK-30143
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> The stop() method on a Streaming Query awaits the termination of the stream 
> execution thread. However, the stream execution thread may block forever 
> depending on the streaming source implementation (like in Kafka, which runs 
> UninterruptibleThreads).
> This causes control flow applications to hang indefinitely as well. We'd like 
> to introduce a timeout for stopping the execution thread, so that the control 
> flow thread can decide what action to take if the timeout is hit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30143) StreamingQuery.stop() should not block indefinitely

2019-12-13 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-30143:
---

Assignee: Burak Yavuz

> StreamingQuery.stop() should not block indefinitely
> ---
>
> Key: SPARK-30143
> URL: https://issues.apache.org/jira/browse/SPARK-30143
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> The stop() method on a Streaming Query awaits the termination of the stream 
> execution thread. However, the stream execution thread may block forever 
> depending on the streaming source implementation (like in Kafka, which runs 
> UninterruptibleThreads).
> This causes control flow applications to hang indefinitely as well. We'd like 
> to introduce a timeout for stopping the execution thread, so that the control 
> flow thread can decide what action to take if the timeout is hit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30241) Make the target table relations not children of DELETE/UPDATE/MERGE

2019-12-12 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-30241.
-
Resolution: Not A Problem

> Make the target table relations not children of DELETE/UPDATE/MERGE
> ---
>
> Key: SPARK-30241
> URL: https://issues.apache.org/jira/browse/SPARK-30241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> Having the targets of DELETE/UPDATE/MERGE as children can cause bad 
> interactions with other catalyst rules that may want to perform actions 
> assuming a child is a "read-only" relation. This is precisely why AlterTable 
> and DescribeTable have the targets as variables only and not children.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30241) Make the target table relations not children of DELETE/UPDATE/MERGE

2019-12-12 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30241:
---

 Summary: Make the target table relations not children of 
DELETE/UPDATE/MERGE
 Key: SPARK-30241
 URL: https://issues.apache.org/jira/browse/SPARK-30241
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


Having the targets of DELETE/UPDATE/MERGE as children can cause bad 
interactions with other catalyst rules that may want to perform actions 
assuming a child is a "read-only" relation. This is precisely why AlterTable 
and DescribeTable have the targets as variables only and not children.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30143) StreamingQuery.stop() should not block indefinitely

2019-12-05 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-30143:
---

 Summary: StreamingQuery.stop() should not block indefinitely
 Key: SPARK-30143
 URL: https://issues.apache.org/jira/browse/SPARK-30143
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4
Reporter: Burak Yavuz


The stop() method on a Streaming Query awaits the termination of the stream 
execution thread. However, the stream execution thread may block forever 
depending on the streaming source implementation (like in Kafka, which runs 
UninterruptibleThreads).

This causes control flow applications to hang indefinitely as well. We'd like 
to introduce a timeout for stopping the execution thread, so that the control 
flow thread can decide what action to take if the timeout is hit.
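
A sketch of the control flow this is about (assuming a streaming DataFrame df; 
the timeout overload shown in the comment is hypothetical, not an existing API):
{code:java}
val query = df.writeStream.format("console").start()

// Today: stop() joins the stream execution thread, which may never terminate
// (e.g. Kafka's UninterruptibleThreads), so the caller can hang here forever.
query.stop()

// Desired (hypothetical shape): bound the wait, then let the caller decide
// what to do, e.g. log, alert, or tear down the application.
// query.stop(timeoutMs = 30000)
{code}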



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2019-11-14 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974436#comment-16974436
 ] 

Burak Yavuz commented on SPARK-29900:
-

I definitely agree the behavior is very confusing here. (For example, you can 
saveAsTable into a table while a temp table with the same name exists... Once 
you query the table, you get the temp table back.) Can we post the proposed 
behavior here?
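
For illustration, a minimal sketch of the behavior described above, assuming a 
plain SparkSession named spark:
{code:lang=scala}
// persistent table `t` in the current database
spark.range(5).write.saveAsTable("t")
// temp view with the same name
spark.range(100).createOrReplaceTempView("t")

// reads resolve the temp view first
spark.table("t").count()   // 100, not 5

// but saveAsTable keeps writing to the persistent table
spark.range(10).write.mode("append").saveAsTable("t")
{code}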

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. try to look up temp view first, then try table/persistent view.
> 2. try to look up table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when there are temp view and table 
> with the same name, but users can easily use qualified table name to 
> disambiguate.
> In postgres, the relation resolution behavior is consistent:
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 rows)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29778) saveAsTable append mode is not passing writer options

2019-11-06 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29778:
---

 Summary: saveAsTable append mode is not passing writer options
 Key: SPARK-29778
 URL: https://issues.apache.org/jira/browse/SPARK-29778
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


There was an oversight where AppendData is not getting the WriterOptions in 
saveAsTable. 
[https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530]
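
A sketch of the affected call path, with a hypothetical data source name and 
option key; the point is only that options set on the writer should reach the 
AppendData node in append mode:
{code:lang=scala}
df.write
  .format("com.example.v2source")   // hypothetical V2 data source
  .option("compression", "zstd")    // writer option that should be passed to AppendData
  .mode("append")
  .saveAsTable("db.tbl")
{code}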



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29568) Add flag to stop existing stream when new copy starts

2019-10-23 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29568:
---

 Summary: Add flag to stop existing stream when new copy starts
 Key: SPARK-29568
 URL: https://issues.apache.org/jira/browse/SPARK-29568
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Burak Yavuz


In multi-tenant environments where you have multiple SparkSessions, you can 
accidentally start multiple copies of the same stream (i.e. streams using the 
same checkpoint location). This will cause all new instantiations of the 
stream to fail. However, sometimes you may want to turn off the old stream, as 
it may have turned into a zombie (you no longer have access to its query 
handle or SparkSession).

It would be nice to have a SQL flag that allows the stopping of the old stream 
for such zombie cases.
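
Roughly what usage could look like; the config name below is illustrative, not 
a committed API:
{code:lang=scala}
// Illustrative flag: stop a zombie run holding the same checkpoint before starting this one.
spark.conf.set("spark.sql.streaming.stopActiveRunOnRestart", "true")

val query = df.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/ckpt/stream-1")  // same checkpoint as the zombie run
  .option("path", "/tmp/out/stream-1")
  .start()   // with the flag on, the old run would be stopped instead of failing this start
{code}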



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29352) Move active streaming query state to the SharedState

2019-10-23 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29352.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26018]

> Move active streaming query state to the SharedState
> 
>
> Key: SPARK-29352
> URL: https://issues.apache.org/jira/browse/SPARK-29352
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> We have checks to prevent the restarting of the same stream on the same spark 
> session, but we can make that better in multi-tenant environments by putting 
> that state in the SharedState instead of the SessionState. This would allow a 
> more comprehensive check for multi-tenant clusters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29352) Move active streaming query state to the SharedState

2019-10-09 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-29352:
---

Assignee: Burak Yavuz

> Move active streaming query state to the SharedState
> 
>
> Key: SPARK-29352
> URL: https://issues.apache.org/jira/browse/SPARK-29352
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> We have checks to prevent the restarting of the same stream on the same spark 
> session, but we can make that better in multi-tenant environments by putting 
> that state in the SharedState instead of the SessionState. This would allow a 
> more comprehensive check for multi-tenant clusters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29352) Move active streaming query state to the SharedState

2019-10-03 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29352:
---

 Summary: Move active streaming query state to the SharedState
 Key: SPARK-29352
 URL: https://issues.apache.org/jira/browse/SPARK-29352
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4, 3.0.0
Reporter: Burak Yavuz


We have checks to prevent the restarting of the same stream on the same spark 
session, but we can make that better in multi-tenant environments by putting 
that state in the SharedState instead of the SessionState. This would allow a 
more comprehensive check for multi-tenant clusters.
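
A conceptual sketch of the idea (not Spark's actual classes): keep the registry 
of active runs in a SharedState-like object visible to all sessions, instead of 
per-session state.
{code:lang=scala}
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

// Hypothetical cluster-wide registry: checkpoint location -> run id of the active query.
class SharedActiveQueryRegistry {
  private val activeRuns = new ConcurrentHashMap[String, UUID]()

  def register(checkpointLocation: String, runId: UUID): Unit = {
    if (activeRuns.putIfAbsent(checkpointLocation, runId) != null) {
      throw new IllegalStateException(
        s"A streaming query is already active for checkpoint $checkpointLocation")
    }
  }

  def deregister(checkpointLocation: String): Unit = activeRuns.remove(checkpointLocation)
}
{code}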



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save

2019-09-23 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-29219:

Description: 
We currently don't support all save modes in DataFrameWriter.save, as the 
TableProvider interface allows for the reading/writing of data, but not for the 
creation of tables. We created a catalog API to support the 
creation/dropping/checking existence of tables, but DataFrameWriter.save 
doesn't necessarily use a catalog, for example when writing to a path-based 
table.

For this case, we propose a new interface that will allow TableProviders to 
extract an Identifier and a Catalog from a bundle of 
CaseInsensitiveStringOptions. This information can then be used to check the 
existence of a table, and support all save modes. If a Catalog is not defined, 
then the behavior is to use the spark_catalog (or configured session catalog) 
to perform the check.

 

The interface can look like:
{code:java}
trait CatalogOptions {
  def extractCatalog(options: StringMap): String
  def extractIdentifier(options: StringMap): Identifier
} {code}

  was:
We currently don't support all save modes in DataFrameWriter.save, as the 
TableProvider interface allows for the reading/writing of data, but not for the 
creation of tables. We created a catalog API to support the 
creation/dropping/checking existence of tables, but DataFrameWriter.save 
doesn't necessarily use a catalog, for example when writing to a path-based 
table.

For this case, we propose a new interface that will allow TableProviders to 
extract an Identifier and a Catalog from a bundle of 
CaseInsensitiveStringOptions. This information can then be used to check the 
existence of a table, and support all save modes. If a Catalog is not defined, 
then the behavior is to use the spark_catalog (or configured session catalog) 
to perform the check.


> DataSourceV2: Support all SaveModes in DataFrameWriter.save
> ---
>
> Key: SPARK-29219
> URL: https://issues.apache.org/jira/browse/SPARK-29219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Major
>
> We currently don't support all save modes in DataFrameWriter.save, as the 
> TableProvider interface allows for the reading/writing of data, but not for 
> the creation of tables. We created a catalog API to support the 
> creation/dropping/checking existence of tables, but DataFrameWriter.save 
> doesn't necessarily use a catalog, for example when writing to a path-based 
> table.
> For this case, we propose a new interface that will allow TableProviders to 
> extract an Identifier and a Catalog from a bundle of 
> CaseInsensitiveStringOptions. This information can then be used to check the 
> existence of a table, and support all save modes. If a Catalog is not 
> defined, then the behavior is to use the spark_catalog (or configured session 
> catalog) to perform the check.
>  
> The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: StringMap): String
>   def extractIdentifier(options: StringMap): Identifier
> } {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save

2019-09-23 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29219:
---

 Summary: DataSourceV2: Support all SaveModes in 
DataFrameWriter.save
 Key: SPARK-29219
 URL: https://issues.apache.org/jira/browse/SPARK-29219
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


We currently don't support all save modes in DataFrameWriter.save, as the 
TableProvider interface allows for the reading/writing of data, but not for the 
creation of tables. We created a catalog API to support the 
creation/dropping/checking existence of tables, but DataFrameWriter.save 
doesn't necessarily use a catalog, for example when writing to a path-based 
table.

For this case, we propose a new interface that will allow TableProviders to 
extract an Identifier and a Catalog from a bundle of 
CaseInsensitiveStringOptions. This information can then be used to check the 
existence of a table, and support all save modes. If a Catalog is not defined, 
then the behavior is to use the spark_catalog (or configured session catalog) 
to perform the check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-22 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935498#comment-16935498
 ] 

Burak Yavuz commented on SPARK-29197:
-

Hi Ido,

Thanks for your interest. I already have a PR open for it. It's not going to 
be a straightforward task, and it requires some context and future plans 
around DataSource V2 as well.

> Remove saveModeForDSV2 in DataFrameWriter
> -
>
> Key: SPARK-29197
> URL: https://issues.apache.org/jira/browse/SPARK-29197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> It is very confusing that the default save mode differs depending on the 
> internal implementation of a data source. The reason that we had to have 
> saveModeForDSV2 was that there was no easy way to check the existence of a 
> table in DataSource V2. Now, we have catalogs for that. Therefore we should 
> be able to remove the different save modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-20 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29197:
---

 Summary: Remove saveModeForDSV2 in DataFrameWriter
 Key: SPARK-29197
 URL: https://issues.apache.org/jira/browse/SPARK-29197
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


It is very confusing that the default save mode differs depending on the 
internal implementation of a data source. The reason that we had to have 
saveModeForDSV2 was that there was no easy way to check the existence of a 
table in DataSource V2. Now, we have catalogs for that. Therefore we should be 
able to remove the different save modes.
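
A conceptual sketch of why the separate DSv2 default becomes unnecessary once 
the existence check is available; the helper below is illustrative, not the 
DataFrameWriter code:
{code:lang=scala}
import org.apache.spark.sql.SaveMode

// Once a catalog can answer "does this table exist?", the classic SaveMode
// semantics can be applied uniformly for V1 and V2 sources.
def planForMode(tableExists: Boolean, mode: SaveMode): String = mode match {
  case SaveMode.ErrorIfExists if tableExists => "fail the write"
  case SaveMode.Ignore        if tableExists => "no-op"
  case SaveMode.Overwrite     if tableExists => "replace the table, then write"
  case SaveMode.Append        if tableExists => "append to the table"
  case _                                     => "create the table, then write"
}
{code}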



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-09-19 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-28612.
-
Fix Version/s: 3.0.0
 Assignee: Ryan Blue
   Resolution: Done

Remerged in [https://github.com/apache/spark/pull/25681]

> DataSourceV2: Add new DataFrameWriter API for v2
> 
>
> Key: SPARK-28612
> URL: https://issues.apache.org/jira/browse/SPARK-28612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // 
> OverwritePartitionsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // 
> OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29030) Simplify lookupV2Relation

2019-09-18 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29030.
-
Fix Version/s: 3.0.0
 Assignee: John Zhuge
   Resolution: Done

Resolved by [https://github.com/apache/spark/pull/25735]

> Simplify lookupV2Relation
> -
>
> Key: SPARK-29030
> URL: https://issues.apache.org/jira/browse/SPARK-29030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
> Fix For: 3.0.0
>
>
> Simplify the return type for {{lookupV2Relation}} which makes the 3 callers 
> more straightforward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29127) Support partitioning for DataSource V2 tables in DataFrameWriter.save

2019-09-17 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29127:
---

 Summary: Support partitioning for DataSource V2 tables in 
DataFrameWriter.save
 Key: SPARK-29127
 URL: https://issues.apache.org/jira/browse/SPARK-29127
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


Currently, any data source that upgrades to DataSource V2 loses the 
partition transform information when using DataFrameWriter.save. The main 
reason is the lack of an API for "creating" a table with partitioning and 
schema information for V2 tables without a catalog.
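
A sketch of the affected pattern, with a hypothetical source and path; the 
partitionBy information is what currently gets dropped when no catalog is 
involved:
{code:lang=scala}
df.write
  .format("com.example.v2source")   // hypothetical V2 data source
  .partitionBy("date", "country")   // partition transforms that should survive the write
  .mode("overwrite")
  .save("/data/events")             // path-based table, no catalog in the loop
{code}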



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29062) Add V1_BATCH_WRITE to the TableCapabilityChecks in the Analyzer

2019-09-11 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-29062:
---

 Summary: Add V1_BATCH_WRITE to the TableCapabilityChecks in the 
Analyzer
 Key: SPARK-29062
 URL: https://issues.apache.org/jira/browse/SPARK-29062
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


Currently, the checks in the Analyzer require that V2 tables have BATCH_WRITE 
defined, even for tables that only have V1 write fallbacks. This is confusing, 
as these tables may not have the V2 writer interface implemented yet.
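
A rough sketch of the kind of table this refers to, assuming the Spark 3.0 
connector API; a real implementation would also implement SupportsWrite with a 
V1 fallback WriteBuilder, which is omitted here:
{code:lang=scala}
import java.util
import scala.collection.JavaConverters._

import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Table that only advertises a V1 write fallback; the Analyzer check should accept
// V1_BATCH_WRITE instead of insisting on BATCH_WRITE.
class V1FallbackOnlyTable extends Table {
  override def name(): String = "v1_fallback_only"
  override def schema(): StructType = new StructType().add("id", "long")
  override def capabilities(): util.Set[TableCapability] =
    Set(TableCapability.V1_BATCH_WRITE).asJava
}
{code}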



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28964) saveAsTable doesn't pass in the data source information for V2 nodes

2019-09-03 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-28964:
---

 Summary: saveAsTable doesn't pass in the data source information 
for V2 nodes
 Key: SPARK-28964
 URL: https://issues.apache.org/jira/browse/SPARK-28964
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


We need to pass in the "provider" property in saveAsTable for V2 nodes in 
DataFrameWriter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28628) Support namespaces in V2SessionCatalog

2019-09-03 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-28628.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Support namespaces in V2SessionCatalog
> --
>
> Key: SPARK-28628
> URL: https://issues.apache.org/jira/browse/SPARK-28628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> V2SessionCatalog should implement SupportsNamespaces.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28628) Support namespaces in V2SessionCatalog

2019-09-03 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921705#comment-16921705
 ] 

Burak Yavuz commented on SPARK-28628:
-

Resolved by [https://github.com/apache/spark/pull/25363]

> Support namespaces in V2SessionCatalog
> --
>
> Key: SPARK-28628
> URL: https://issues.apache.org/jira/browse/SPARK-28628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> V2SessionCatalog should implement SupportsNamespaces.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28628) Support namespaces in V2SessionCatalog

2019-09-03 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-28628:
---

Assignee: Ryan Blue

> Support namespaces in V2SessionCatalog
> --
>
> Key: SPARK-28628
> URL: https://issues.apache.org/jira/browse/SPARK-28628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> V2SessionCatalog should implement SupportsNamespaces.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-08-31 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-28612.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/25354]

> DataSourceV2: Add new DataFrameWriter API for v2
> 
>
> Key: SPARK-28612
> URL: https://issues.apache.org/jira/browse/SPARK-28612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // 
> OverwritePartitionsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // 
> OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-08-31 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-28612:
---

Assignee: Ryan Blue

> DataSourceV2: Add new DataFrameWriter API for v2
> 
>
> Key: SPARK-28612
> URL: https://issues.apache.org/jira/browse/SPARK-28612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // 
> OverwritePartitionsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // 
> OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28863) Add an AlreadyPlanned logical node that skips query planning

2019-08-23 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-28863:
---

 Summary: Add an AlreadyPlanned logical node that skips query 
planning
 Key: SPARK-28863
 URL: https://issues.apache.org/jira/browse/SPARK-28863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


With the DataSourceV2 write operations, we have a way to fall back to the V1 
writer APIs using InsertableRelation.

The gross part is that we're in physical land, but the InsertableRelation takes 
a logical plan, so we have to pass the logical plans to these physical nodes, 
and then potentially go through re-planning.

A useful primitive could be specifying that a plan is ready for execution 
through a logical node AlreadyPlanned. This would wrap a physical plan, and 
then we can go straight to execution.
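
A conceptual sketch of the proposed node (the name comes from this proposal; it 
is not an existing API):
{code:lang=scala}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.SparkPlan

// A logical leaf that carries an already-built physical plan. A matching planner
// strategy (omitted here) would simply return `physicalPlan` instead of re-planning.
case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
  override def output: Seq[Attribute] = physicalPlan.output
}
{code}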

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28635) create CatalogManager to track registered v2 catalogs

2019-08-21 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912621#comment-16912621
 ] 

Burak Yavuz commented on SPARK-28635:
-

Also followed up by [https://github.com/apache/spark/pull/25521]

> create CatalogManager to track registered v2 catalogs
> -
>
> Key: SPARK-28635
> URL: https://issues.apache.org/jira/browse/SPARK-28635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable

2019-08-08 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-28565:
---

Assignee: Burak Yavuz  (was: John Zhuge)

> DataSourceV2: DataFrameWriter.saveAsTable
> -
>
> Key: SPARK-28565
> URL: https://issues.apache.org/jira/browse/SPARK-28565
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multiple catalogs in the following use cases:
>  * DataFrameWriter.saveAsTable("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable

2019-08-08 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-28565.
-
Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/25330]

> DataSourceV2: DataFrameWriter.saveAsTable
> -
>
> Key: SPARK-28565
> URL: https://issues.apache.org/jira/browse/SPARK-28565
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multiple catalogs in the following use cases:
>  * DataFrameWriter.saveAsTable("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28668) Support the V2SessionCatalog with AlterTable commands

2019-08-08 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-28668:
---

 Summary: Support the V2SessionCatalog with AlterTable commands
 Key: SPARK-28668
 URL: https://issues.apache.org/jira/browse/SPARK-28668
 Project: Spark
  Issue Type: Planned Work
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


We need to support the V2SessionCatalog with AlterTable commands so that V2 
DataSources can leverage DDL through SQL ALTER TABLE commands.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28667) Support the V2SessionCatalog in insertInto

2019-08-08 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-28667:
---

 Summary: Support the V2SessionCatalog in insertInto
 Key: SPARK-28667
 URL: https://issues.apache.org/jira/browse/SPARK-28667
 Project: Spark
  Issue Type: Planned Work
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


We need to support the V2SessionCatalog in the INSERT INTO SQL code paths as 
well as the DataFrameWriter code paths.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28666) Support the V2SessionCatalog in saveAsTable

2019-08-08 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-28666:
---

 Summary: Support the V2SessionCatalog in saveAsTable
 Key: SPARK-28666
 URL: https://issues.apache.org/jira/browse/SPARK-28666
 Project: Spark
  Issue Type: Planned Work
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


We need to support the V2SessionCatalog in the old saveAsTable code paths so 
that V2 DataSources can leverage the old DataFrameWriter code path.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28331) Catalogs.load always throws CatalogNotFoundException on loading built-in catalogs

2019-08-07 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-28331.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Resolved with [https://github.com/apache/spark/pull/25348]

> Catalogs.load always throws CatalogNotFoundException on loading built-in 
> catalogs
> -
>
> Key: SPARK-28331
> URL: https://issues.apache.org/jira/browse/SPARK-28331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In `Catalogs.load`, the `pluginClassName` in the following code 
> ```
> String pluginClassName = conf.getConfString("spark.sql.catalog." + name, 
> null);
> ```
> is always null for built-in catalogs, e.g. there is a SQLConf entry for 
> `spark.sql.catalog.session`.
> This is because of https://github.com/apache/spark/pull/18852: 
> SQLConf.conf.getConfString(key, null)  always returns null.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28331) Catalogs.load always throws CatalogNotFoundException on loading built-in catalogs

2019-08-07 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-28331:
---

Assignee: Gengliang Wang

> Catalogs.load always throws CatalogNotFoundException on loading built-in 
> catalogs
> -
>
> Key: SPARK-28331
> URL: https://issues.apache.org/jira/browse/SPARK-28331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> In `Catalogs.load`, the `pluginClassName` in the following code 
> ```
> String pluginClassName = conf.getConfString("spark.sql.catalog." + name, 
> null);
> ```
> is always null for built-in catalogs, e.g. there is a SQLConf entry for 
> `spark.sql.catalog.session`.
> This is because of https://github.com/apache/spark/pull/18852: 
> SQLConf.conf.getConfString(key, null)  always returns null.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-08-04 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-27661.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Resolved by [https://github.com/apache/spark/pull/24560]

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-08-04 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-27661:
---

Assignee: Ryan Blue

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28572) Add simple analysis checks to the V2 Create Table code paths

2019-07-30 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-28572:
---

 Summary: Add simple analysis checks to the V2 Create Table code 
paths
 Key: SPARK-28572
 URL: https://issues.apache.org/jira/browse/SPARK-28572
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


Currently, the V2 Create Table code paths don't have any checks around:
 # The existence of transforms in the table schema
 # Duplications of transforms
 # Case sensitivity checks around column names

Having these rudimentary checks would simplify V2 Catalog development.

Note that the goal of this JIRA is not to make the V2 Create Table path Hive-compatible.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27845) DataSourceV2: InsertTable

2019-07-25 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-27845.
-
   Resolution: Done
Fix Version/s: 3.0.0

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27845) DataSourceV2: InsertTable

2019-07-25 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-27845:
---

Assignee: John Zhuge

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27845) DataSourceV2: InsertTable

2019-07-25 Thread Burak Yavuz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893162#comment-16893162
 ] 

Burak Yavuz commented on SPARK-27845:
-

Resolved with [https://github.com/apache/spark/pull/24832]

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully

2018-09-20 Thread Burak Yavuz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622828#comment-16622828
 ] 

Burak Yavuz commented on SPARK-25472:
-

Resolved by https://github.com/apache/spark/pull/22478

> Structured Streaming query.stop() doesn't always stop gracefully
> 
>
> Key: SPARK-25472
> URL: https://issues.apache.org/jira/browse/SPARK-25472
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.5.0
>
>
> We can have race conditions where the cancelling of Spark jobs will throw a 
> SparkException when stopping a streaming query. This SparkException specifies 
> that the job was cancelled. We can use this error message to swallow the 
> error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully

2018-09-20 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-25472.
-
   Resolution: Fixed
Fix Version/s: 2.5.0

> Structured Streaming query.stop() doesn't always stop gracefully
> 
>
> Key: SPARK-25472
> URL: https://issues.apache.org/jira/browse/SPARK-25472
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 2.5.0
>
>
> We can have race conditions where the cancelling of Spark jobs will throw a 
> SparkException when stopping a streaming query. This SparkException specifies 
> that the job was cancelled. We can use this error message to swallow the 
> error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully

2018-09-19 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-25472:
---

Assignee: Burak Yavuz

> Structured Streaming query.stop() doesn't always stop gracefully
> 
>
> Key: SPARK-25472
> URL: https://issues.apache.org/jira/browse/SPARK-25472
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> We can have race conditions where the cancelling of Spark jobs will throw a 
> SparkException when stopping a streaming query. This SparkException specifies 
> that the job was cancelled. We can use this error message to swallow the 
> error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully

2018-09-19 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-25472:
---

 Summary: Structured Streaming query.stop() doesn't always stop 
gracefully
 Key: SPARK-25472
 URL: https://issues.apache.org/jira/browse/SPARK-25472
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Burak Yavuz


We can have race conditions where the cancelling of Spark jobs will throw a 
SparkException when stopping a streaming query. This SparkException specifies 
that the job was cancelled. We can use this error message to swallow the error.
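
A minimal sketch of the approach, assuming the cancellation surfaces as a 
SparkException whose message mentions the cancellation; the exact message text 
is illustrative:
{code:lang=scala}
import org.apache.spark.SparkException

// Swallow only the expected "job cancelled" error raised while stopping the query;
// anything else propagates as before.
def swallowCancellation(body: => Unit): Unit = {
  try body catch {
    case e: SparkException if e.getMessage != null && e.getMessage.contains("cancelled") =>
      () // expected: the query's jobs were cancelled as part of stop()
  }
}
{code}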



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-15 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-24525.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Resolved by [https://github.com/apache/spark/pull/21559]

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.0
>
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-15 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-24525:
---

Assignee: Mukul Murthy

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23515) JsonProtocol.sparkEventToJson can OOM when jsonifying an event

2018-02-25 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-23515:
---

 Summary: JsonProtocol.sparkEventToJson can OOM when jsonifying an 
event
 Key: SPARK-23515
 URL: https://issues.apache.org/jira/browse/SPARK-23515
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Burak Yavuz
Assignee: Burak Yavuz


{code}

def sparkEventToJson(event: SparkListenerEvent): JValue

{code}

has a fallback path that creates a JSON object by rendering an unrecognized 
event to a JSON string and then parsing it again. This method materializes the 
whole string in order to parse the JSON record, which is unnecessary and can 
cause OOMs, as seen in the stacktrace below:
{code:java}
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at 
com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:356)
at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.getText(ReaderBasedJsonParser.java:235)
at 
org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:20)
at 
org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at 
org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at 
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
at 
org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:103){code}
 

We should just use stream parsing to avoid such OOMs.
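
One way to express the idea (not necessarily the patch that was merged): 
convert the unrecognized event to a Jackson tree in memory and translate that 
to a json4s value, instead of rendering one large string and re-parsing it.
{code:lang=scala}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import org.json4s.JValue
import org.json4s.jackson.JsonMethods

// Avoids materializing and re-parsing the full JSON string.
def eventToJValue(event: AnyRef, mapper: ObjectMapper): JValue = {
  val tree: JsonNode = mapper.valueToTree[JsonNode](event)
  JsonMethods.fromJsonNode(tree)
}
{code}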

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls

2018-01-23 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336183#comment-16336183
 ] 

Burak Yavuz commented on SPARK-23194:
-

cc [~cloud_fan] seems like you touched that code last in 
[https://github.com/apache/spark/commit/68d65fae71e475ad811a9716098aca03a2af9532.]

 

Thoughts?

> from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
> ---
>
> Key: SPARK-23194
> URL: https://issues.apache.org/jira/browse/SPARK-23194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Burak Yavuz
>Priority: Major
>
> from_json accepts Json parsing options such as being PERMISSIVE to parsing 
> errors or failing fast. It seems from the code that even though the default 
> option is to fail fast, we catch that exception and return nulls.
>  
> In order to not change behavior, we should remove that try-catch block and 
> change the default to permissive, but allow failfast mode to indeed fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23194) from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls

2018-01-23 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-23194:
---

 Summary: from_json in FAILFAST mode doesn't fail fast, instead it 
just returns nulls
 Key: SPARK-23194
 URL: https://issues.apache.org/jira/browse/SPARK-23194
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Burak Yavuz


from_json accepts JSON parsing options, such as being PERMISSIVE to parsing 
errors or failing fast. It seems from the code that even though the default 
option is to fail fast, we catch that exception and return nulls.

In order not to change behavior, we should remove that try-catch block and 
change the default to permissive, but allow FAILFAST mode to indeed fail.
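
A sketch of the call in question, assuming a DataFrame df with a string column 
named json:
{code:lang=scala}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("a", IntegerType)

// With the behavior described above, a malformed record produces a null struct
// even though FAILFAST was requested.
val parsed = df.select(from_json(col("json"), schema, Map("mode" -> "FAILFAST")).as("parsed"))
{code}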



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-01-22 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334226#comment-16334226
 ] 

Burak Yavuz commented on SPARK-23173:
-

In terms of usability, I prefer 1. From the viewpoint of a data engineer, I 
would like 2 as well, if that's not too hard.

 

Basically, if I expect that my data doesn't have nulls, but it is suddenly 
outputting them, I would rather have it fail initially (or have the record 
written out to the _corrupt_record column).

In an ideal world, I should be able to either permit nullable fields (Option 
1), or have the record be written out as corrupt.

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-17 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329293#comment-16329293
 ] 

Burak Yavuz commented on SPARK-23135:
-

cc [~vanzin]

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Priority: Blocker
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-17 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-23135:

Environment: 
 

 

 

  was:
Didn't do a lot of digging but may be caused by:

[https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]

 

 


> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Priority: Blocker
> Attachments: webUIAccumulatorRegression.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-17 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-23135:

Attachment: webUIAccumulatorRegression.png

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Priority: Blocker
> Attachments: webUIAccumulatorRegression.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-17 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-23135:

Description: 
Didn't do a lot of digging but may be caused by:

[https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]

 

!webUIAccumulatorRegression.png!

> Accumulators don't show up properly in the Stages page anymore
> --
>
> Key: SPARK-23135
> URL: https://issues.apache.org/jira/browse/SPARK-23135
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
> Environment:  
>  
>  
>Reporter: Burak Yavuz
>Priority: Blocker
> Attachments: webUIAccumulatorRegression.png
>
>
> Didn't do a lot of digging but may be caused by:
> [https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]
>  
> !webUIAccumulatorRegression.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23135) Accumulators don't show up properly in the Stages page anymore

2018-01-17 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-23135:
---

 Summary: Accumulators don't show up properly in the Stages page 
anymore
 Key: SPARK-23135
 URL: https://issues.apache.org/jira/browse/SPARK-23135
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
 Environment: Didn't do a lot of digging but may be caused by:

[https://github.com/apache/spark/commit/1c70da3bfbb4016e394de2c73eb0db7cdd9a6968#diff-0d37752c6ec3d902aeff701771b4e932]

 

 
Reporter: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-01-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-23094:
---

 Summary: Json Readers choose wrong encoding when bad records are 
present and fail
 Key: SPARK-23094
 URL: https://issues.apache.org/jira/browse/SPARK-23094
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Burak Yavuz


The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser code 
paths for expressions, but not for the readers. We should also cover the reader 
code paths when reading files that contain bad characters.
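As a rough illustration of the reader path in question (the file path is illustrative; assumes an active SparkSession `spark`), a hedged sketch:

{code:scala}
// Sketch only: a JSON file that contains malformed bytes/records.
// In PERMISSIVE mode the reader should keep bad lines in a corrupt-record
// column rather than mis-detecting the encoding and failing the whole read.
val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/records-with-bad-chars.json")

df.show(false)
{code}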






[jira] [Created] (SPARK-23092) Migrate MemoryStream to DataSource V2

2018-01-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-23092:
---

 Summary: Migrate MemoryStream to DataSource V2
 Key: SPARK-23092
 URL: https://issues.apache.org/jira/browse/SPARK-23092
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Burak Yavuz


We should migrate MemoryStream for Structured Streaming to the DataSource V2 API.
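For reference, a minimal sketch of how the existing (V1) MemoryStream test source is used today (assumes an active SparkSession `spark`; the sink and data are illustrative):

{code:scala}
// Sketch of the V1 test source this ticket proposes moving to DataSource V2.
import org.apache.spark.sql.execution.streaming.MemoryStream

import spark.implicits._
implicit val sqlCtx = spark.sqlContext

val input = MemoryStream[Int]
val query = input.toDS()
  .writeStream
  .format("console")
  .start()

input.addData(1, 2, 3)        // push a micro-batch into the stream
query.processAllAvailable()
query.stop()
{code}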






[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-12-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-20168:
---

Assignee: Yash Sharma

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp when creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8






[jira] [Updated] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-12-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-20168:

Fix Version/s: 2.3.0

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp when creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8






[jira] [Resolved] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-12-25 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20168.
-
  Resolution: Done
Target Version/s: 2.3.0

Resolved with
https://github.com/apache/spark/pull/18029

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, streaming
>
> The Kinesis client can resume from a specified timestamp when creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8
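Roughly, the resulting feature can be used along these lines. This is only a sketch: it assumes the Kinesis DStream builder and the `KinesisInitialPositions.AtTimestamp` position added by this change, and the stream name, region, endpoint, and timestamp are illustrative; `sc` stands in for an existing SparkContext.

{code:scala}
// Sketch only; class and method names assumed from the Kinesis DStream builder.
import java.util.Date

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.{KinesisInitialPositions, KinesisInputDStream}

val ssc = new StreamingContext(sc, Seconds(10))

val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("my-kinesis-stream")
  .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
  .regionName("us-east-1")
  .checkpointAppName("my-app")
  .checkpointInterval(Seconds(10))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  // resume from a point in time instead of LATEST / TRIM_HORIZON
  .initialPosition(new KinesisInitialPositions.AtTimestamp(new Date(1512086400000L)))
  .build()
{code}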






[jira] [Created] (SPARK-22238) EnsureStatefulOpPartitioning shouldn't ask for the child RDD before planning is completed

2017-10-10 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-22238:
---

 Summary: EnsureStatefulOpPartitioning shouldn't ask for the child 
RDD before planning is completed
 Key: SPARK-22238
 URL: https://issues.apache.org/jira/browse/SPARK-22238
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


In EnsureStatefulOpPartitioning, we check that the input RDD of a SparkPlan has 
the expected partitioning for streaming stateful operators. The problem is that 
we are not allowed to access this information during planning.

The reason we added that check is that CoalesceExec can actually create RDDs 
with 0 partitions. We should fix it so that when CoalesceExec reports a 
SinglePartition, there is in fact an input RDD with 1 partition instead of 0 
partitions.
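To illustrate the 0-partition case, a hedged sketch (assumes an active SparkSession `spark`; the exact partition count can vary by version):

{code:scala}
// Sketch: an empty DataFrame can be backed by an RDD with 0 partitions,
// and coalescing it to 1 does not necessarily materialize a partition.
val empty = spark.emptyDataFrame
val coalesced = empty.coalesce(1)

// The physical plan advertises a single partition, but on affected versions
// the underlying RDD can still report 0 partitions:
println(coalesced.rdd.getNumPartitions)
{code}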






[jira] [Resolved] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements

2017-09-20 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-21977.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> SinglePartition optimizations break certain Streaming Stateful Aggregation 
> requirements
> ---
>
> Key: SPARK-21977
> URL: https://issues.apache.org/jira/browse/SPARK-21977
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.3.0
>
>
> This is a bit hard to explain as there are several issues here, I'll try my 
> best. Here are the requirements:
>   1. A StructuredStreaming Source that can generate empty RDDs with 0 
> partitions
>   2. A StructuredStreaming query that uses the above source, performs a 
> stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesce's 
> by 1
> The crux of the problem is that when a dataset has a `coalesce(1)` call, it 
> receives a `SinglePartition` partitioning scheme. This scheme satisfies most 
> required distributions used for aggregations such as HashAggregateExec. This 
> causes a world of problems:
>  Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive 
> 0 partitions, nothing will be executed, the state store will not create any 
> delta files. When this happens, the next trigger fails, because the 
> StateStore fails to load the delta file for the previous trigger
>  Symptom 2. Let's say that there was data. Then in this case, if you stop 
> your stream, and change `coalesce(1)` with `coalesce(2)`, then restart your 
> stream, your stream will fail, because `spark.sql.shuffle.partitions - 1` 
> number of StateStores will fail to find its delta files.
> To fix the issues above, we must check that the partitioning of the child of 
> a `StatefulOperator` satisfies:
>   If the grouping expressions are empty:
> a) AllTuple distribution
> b) Single physical partition
>   If the grouping expressions are non empty:
> a) Clustered distribution
> b) spark.sql.shuffle.partition # of partitions
> whether or not coalesce(1) exists in the plan, and whether or not the input 
> RDD for the trigger has any data.
> Once you fix the above problem by adding an Exchange to the plan, you come 
> across the following bug:
> If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and if 
> you have a trigger with no data, `StateStoreRestoreExec` doesn't return the 
> prior state. However, for this specific aggregation, `HashAggregateExec` 
> after the restore returns a (0, 0) row, since we're performing a count, and 
> there is no data. Then this data gets stored in `StateStoreSaveExec` causing 
> the previous counts to be overwritten and lost.
>   
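For reference, a minimal sketch of the query shape described above (the rate source, checkpoint path, and trigger are illustrative stand-ins, not from the ticket; assumes an active SparkSession `spark`):

{code:scala}
import org.apache.spark.sql.streaming.Trigger

// Sketch: a stateful aggregation downstream of coalesce(1).
val counts = spark.readStream
  .format("rate")          // stand-in for a source that can produce empty micro-batches
  .load()
  .coalesce(1)             // gives the plan a SinglePartition partitioning
  .groupBy()               // empty grouping expressions => global aggregation
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/spark-21977-checkpoint")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
{code}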






[jira] [Updated] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements

2017-09-11 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-21977:

Description: 
This is a bit hard to explain as there are several issues here, so I'll try my 
best. Here are the requirements:

  1. A Structured Streaming Source that can generate empty RDDs with 0 partitions
  2. A Structured Streaming query that uses the above source, performs a 
stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesces 
to 1 partition

The crux of the problem is that when a dataset has a `coalesce(1)` call, it 
receives a `SinglePartition` partitioning scheme. This scheme satisfies most 
required distributions used for aggregations such as HashAggregateExec. This 
causes a world of problems:

 Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive 0 
partitions, nothing will be executed, and the state store will not create any 
delta files. When this happens, the next trigger fails, because the StateStore 
fails to load the delta file for the previous trigger.

 Symptom 2. Let's say that there was data. In this case, if you stop your 
stream, replace `coalesce(1)` with `coalesce(2)`, and then restart your stream, 
the stream will fail, because `spark.sql.shuffle.partitions - 1` of the 
StateStores will fail to find their delta files.

To fix the issues above, we must check that the partitioning of the child of a 
`StatefulOperator` satisfies:
  If the grouping expressions are empty:
    a) AllTuples distribution
    b) a single physical partition
  If the grouping expressions are non-empty:
    a) ClusteredDistribution
    b) spark.sql.shuffle.partitions # of partitions
whether or not coalesce(1) exists in the plan, and whether or not the input RDD 
for the trigger has any data.

Once you fix the above problem by adding an Exchange to the plan, you come 
across the following bug:

If you call `coalesce(1).groupBy().count()` on a streaming DataFrame and you 
have a trigger with no data, `StateStoreRestoreExec` doesn't return the prior 
state. However, for this specific aggregation, `HashAggregateExec` after the 
restore returns a (0, 0) row, since we're performing a count and there is no 
data. This row then gets stored in `StateStoreSaveExec`, causing the previous 
counts to be overwritten and lost.

  was:This is a bit hard to explain as there are several issues here


> SinglePartition optimizations break certain Streaming Stateful Aggregation 
> requirements
> ---
>
> Key: SPARK-21977
> URL: https://issues.apache.org/jira/browse/SPARK-21977
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> This is a bit hard to explain as there are several issues here, I'll try my 
> best. Here are the requirements:
>   1. A StructuredStreaming Source that can generate empty RDDs with 0 
> partitions
>   2. A StructuredStreaming query that uses the above source, performs a 
> stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesce's 
> by 1
> The crux of the problem is that when a dataset has a `coalesce(1)` call, it 
> receives a `SinglePartition` partitioning scheme. This scheme satisfies most 
> required distributions used for aggregations such as HashAggregateExec. This 
> causes a world of problems:
>  Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive 
> 0 partitions, nothing will be executed, the state store will not create any 
> delta files. When this happens, the next trigger fails, because the 
> StateStore fails to load the delta file for the previous trigger
>  Symptom 2. Let's say that there was data. Then in this case, if you stop 
> your stream, and change `coalesce(1)` with `coalesce(2)`, then restart your 
> stream, your stream will fail, because `spark.sql.shuffle.partitions - 1` 
> number of StateStores will fail to find its delta files.
> To fix the issues above, we must check that the partitioning of the child of 
> a `StatefulOperator` satisfies:
>   If the grouping expressions are empty:
> a) AllTuple distribution
> b) Single physical partition
>   If the grouping expressions are non empty:
> a) Clustered distribution
> b) spark.sql.shuffle.partition # of partitions
> whether or not coalesce(1) exists in the plan, and whether or not the input 
> RDD for the trigger has any data.
> Once you fix the above problem by adding an Exchange to the plan, you come 
> across the following bug:
> If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and if 
> you have a trigger with no data, `StateStoreRestoreExec` doesn't return the 
> prior state. However, for this specific aggregation, `HashAggregateExec` 
> after the restore returns a (0, 0) row, since we're performing a count and 
> there is no data. This row then gets stored in `StateStoreSaveExec`, causing 
> the previous counts to be overwritten and lost.
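A hedged sketch of the required-distribution rule spelled out above (not the actual Spark code; the helper function is hypothetical, though AllTuples and ClusteredDistribution are the catalyst distribution classes being referred to):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution}

// Hypothetical helper mirroring the rule: the child of a stateful operator must
// satisfy AllTuples (a single physical partition) for a global aggregation, and
// a ClusteredDistribution over the grouping keys otherwise, regardless of any
// coalesce(1) in the plan or whether the trigger has data.
def requiredStatefulDistribution(groupingExprs: Seq[Expression]): Distribution =
  if (groupingExprs.isEmpty) AllTuples
  else ClusteredDistribution(groupingExprs)
{code}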

[jira] [Updated] (SPARK-21977) SinglePartition optimizations break certain Streaming Stateful Aggregation requirements

2017-09-11 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-21977:

Summary: SinglePartition optimizations break certain Streaming Stateful 
Aggregation requirements  (was: SinglePartition optimizations break certain 
StateStore requirements)

> SinglePartition optimizations break certain Streaming Stateful Aggregation 
> requirements
> ---
>
> Key: SPARK-21977
> URL: https://issues.apache.org/jira/browse/SPARK-21977
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> This is a bit hard to explain as there are several issues here





