[jira] [Commented] (SPARK-37621) ClassCastException when trying to persist the result of a join between two Iceberg tables

2021-12-13 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458516#comment-17458516
 ] 

Ryan Blue commented on SPARK-37621:
---

[~hyukjin.kwon], this affects any source that doesn't always produce 
`UnsafeRow`. The problem is that certain parts of Spark assume that `UnsafeRow` 
will be passed even though the required interface is `InternalRow`. Rather than 
fixing that assumption, the community chose to ensure that there is always a 
projection added so that the conversion to unsafe happens. But if that 
projection is removed by other rules or is not added, then operators that 
assume `UnsafeRow` can fail.

The long-term fix is the same as always: eventually, Spark should use the 
declared type. A simpler fix is to find out why the projection is missing and 
update that. But then we'll see this problem come back later.
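
For reference, a minimal sketch of the conversion that the inserted projection performs (not the planner code itself; the schema and values are made up). When that projection is missing, operators that cast to `UnsafeRow`, like the shuffle serializer in the stack trace below, fail with exactly this ClassCastException.

{code:lang=scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{LongType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String

object UnsafeConversionSketch {
  def main(args: Array[String]): Unit = {
    val schema = new StructType().add("id", LongType).add("data", StringType)

    // A row as a non-unsafe source produces it: a GenericInternalRow, not an UnsafeRow.
    val row: InternalRow = new GenericInternalRow(Array[Any](1L, UTF8String.fromString("a")))

    // The projection Spark adds performs this conversion. If another rule removes
    // it, or it is never added, downstream casts to UnsafeRow throw.
    val toUnsafe = UnsafeProjection.create(schema)
    val unsafe: UnsafeRow = toUnsafe(row)
    println(s"${unsafe.getLong(0)} ${unsafe.getUTF8String(1)}")
  }
}
{code}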

> ClassCastException when trying to persist the result of a join between two 
> Iceberg tables
> -
>
> Key: SPARK-37621
> URL: https://issues.apache.org/jira/browse/SPARK-37621
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.2
>Reporter: Ciprian Gerea
>Priority: Major
>
> I am getting an error when I try to persist the result of a join operation. 
> Note that both tables to be joined and the output table are Iceberg tables.
> SQL code to reproduce:
> String sqlJoin = String.format(
> "SELECT * from " +
> "((select %s from %s.%s where %s ) l " +
> "join (select %s from %s.%s where %s ) r " +
> "using (%s))",
> );
> spark.sql(sqlJoin).writeTo("ciptest.ttt").option("write-format", 
> "parquet").createOrReplace();
> My exception stack is:
> {{Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow}}
> {{at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.writeValue(UnsafeRowSerializer.scala:64)}}
> {{at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)}}
> {{at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)}}
> {{at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}}
> {{at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}}
> {{at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}}
> {{at org.apache.spark.scheduler.Task.run(Task.scala:131)}}
> {{at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)}}
> {{at ….}}
> Explain on the Sql statement gets the following plan:
> {{== Physical Plan ==}}
> {{Project [ ... ]}}
> {{+- SortMergeJoin […], Inner}}
> {{  :- Sort […], false, 0}}
> {{  : +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#38]}}
> {{  :   +- Filter (…)}}
> {{  :+- BatchScan[... ] left [filters=…]}}
> {{  +- *(2) Sort […], false, 0}}
> {{   +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#47]}}
> {{ +- *(1) Filter (…)}}
> {{  +- BatchScan[…] right [filters=…] }}
> Note that several variations of this fail. Besides the repro code listed 
> above, I have tried doing a CTAS and writing the result to parquet files 
> without making a table out of it.






[jira] [Resolved] (SPARK-33779) DataSource V2: API to request distribution and ordering on write

2020-12-14 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-33779.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Merged PR #30706. Thanks [~aokolnychyi]!

> DataSource V2: API to request distribution and ordering on write
> 
>
> Key: SPARK-33779
> URL: https://issues.apache.org/jira/browse/SPARK-33779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
> Fix For: 3.2.0
>
>
> We need to have proper APIs for requesting a specific distribution and 
> ordering on writes for data sources that implement the V2 interface.






[jira] [Created] (SPARK-32168) DSv2 SQL overwrite incorrectly uses static plan with hidden partitions

2020-07-03 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-32168:
-

 Summary: DSv2 SQL overwrite incorrectly uses static plan with 
hidden partitions
 Key: SPARK-32168
 URL: https://issues.apache.org/jira/browse/SPARK-32168
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


The v2 analyzer rule {{ResolveInsertInto}} tries to detect when a static 
overwrite and a dynamic overwrite would produce the same result and will choose 
to use static overwrite in that case. It will only use a dynamic overwrite if 
there is a partition column without a static value and the SQL mode is set to 
dynamic.

{code:lang=scala}
val dynamicPartitionOverwrite = partCols.size > staticPartitions.size &&
  conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
{code}

The problem is that {{partCols}} contains only the names of partitions that appear 
in the column list (identity partitions) and does not include hidden partitions, 
like {{days(ts)}}. As a result, the check doesn't detect hidden partitions and 
doesn't choose dynamic overwrite. Static overwrite is used instead; when a table 
has only hidden partitions, the static filter drops all table data.

This is a correctness bug because Spark will overwrite more data than just the 
set of partitions being written to in dynamic mode. The impact is limited 
because this rule is only used for SQL queries (not plans from 
DataFrameWriters) and only affects tables with hidden partitions.
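
A sketch of the direction a fix could take, using only the public connector API (illustrative only, not the actual rule change): count every partition transform, not just identity columns, when deciding whether dynamic overwrite is needed.

{code:lang=scala}
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform

// Illustrative only: decide overwrite mode from the table's full partitioning,
// so hidden partitions like days(ts) also trigger dynamic overwrite.
def useDynamicOverwrite(
    table: Table,
    staticPartitions: Map[String, String],
    dynamicModeEnabled: Boolean): Boolean = {
  val hasNonStaticPartition = table.partitioning().exists { t: Transform =>
    // hidden transforms (days, hours, bucket, ...) can never have a static value
    t.name() != "identity" ||
      !t.references().flatMap(_.fieldNames()).forall(staticPartitions.contains)
  }
  hasNonStaticPartition && dynamicModeEnabled
}
{code}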






[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-06-19 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140825#comment-17140825
 ] 

Ryan Blue commented on SPARK-32037:
---

What about "healthy" and "unhealthy"? That's basically what we are trying to 
keep track of -- whether a node is healthy enough to run tasks, or if it should 
not be used for some period of time.

I think "trusted" and "untrusted" may also work, but "healthy" is a bit closer 
to what we want.

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether these 
> terms have racist origins, the cultural connotations are inescapable in 
> today's world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.






[jira] [Updated] (SPARK-31255) DataSourceV2: Add metadata columns

2020-03-25 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-31255:
--
Issue Type: New Feature  (was: Bug)

> DataSourceV2: Add metadata columns
> --
>
> Key: SPARK-31255
> URL: https://issues.apache.org/jira/browse/SPARK-31255
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> DSv2 should support reading additional metadata columns that are not in a 
> table's schema. This allows users to project metadata like Kafka's offset, 
> timestamp, and partition. It also allows other sources to expose metadata 
> like file and row position.






[jira] [Created] (SPARK-31255) DataSourceV2: Add metadata columns

2020-03-25 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-31255:
-

 Summary: DataSourceV2: Add metadata columns
 Key: SPARK-31255
 URL: https://issues.apache.org/jira/browse/SPARK-31255
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


DSv2 should support reading additional metadata columns that are not in a 
table's schema. This allows users to project metadata like Kafka's offset, 
timestamp, and partition. It also allows other sources to expose metadata like 
file and row position.






[jira] [Commented] (SPARK-29558) ResolveTables and ResolveRelations should be order-insensitive

2019-11-21 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979455#comment-16979455
 ] 

Ryan Blue commented on SPARK-29558:
---

Thanks for fixing this, [~cloud_fan]!

> ResolveTables and ResolveRelations should be order-insensitive
> --
>
> Key: SPARK-29558
> URL: https://issues.apache.org/jira/browse/SPARK-29558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-29558) ResolveTables and ResolveRelations should be order-insensitive

2019-11-21 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-29558.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> ResolveTables and ResolveRelations should be order-insensitive
> --
>
> Key: SPARK-29558
> URL: https://issues.apache.org/jira/browse/SPARK-29558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-29966) Add version method in TableCatalog to avoid load table twice

2019-11-21 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979419#comment-16979419
 ] 

Ryan Blue commented on SPARK-29966:
---

As I said on the PR, I'm -1 on changing a public extension API to avoid a 
temporary performance regression that should be handled in the implementation.

> Add version method in TableCatalog to avoid load table twice
> 
>
> Key: SPARK-29966
> URL: https://issues.apache.org/jira/browse/SPARK-29966
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Priority: Minor
>
> Currently, resolving a logical plan loads a table twice, once in ResolveTables 
> and once in ResolveRelations. ResolveRelations is the old code path and 
> ResolveTables is the v2 code path; the table is loaded twice because 
> ResolveTables loads the table and then falls back to the ResolveRelations code 
> path for v1 tables.
> The same situation also exists in ResolveSessionCatalog.
> As a result, executing a command takes about twice as long as in Spark 2.4.
> The idea is to add a table version method in TableCatalog, and rules should 
> always get the table version first without loading the table.






[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL

2019-11-14 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974456#comment-16974456
 ] 

Ryan Blue commented on SPARK-29900:
---

To be clear, we think this is going to be a breaking change, right?

> make relation lookup behavior consistent within Spark SQL
> -
>
> Key: SPARK-29900
> URL: https://issues.apache.org/jira/browse/SPARK-29900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently, Spark has 2 different relation resolution behaviors:
> 1. try to look up temp view first, then try table/persistent view.
> 2. try to look up table/persistent view.
> The first behavior is used in SELECT, INSERT and a few commands that support 
> views, like DESC TABLE.
> The second behavior is used in most commands.
> It's confusing to have inconsistent relation resolution behaviors, and the 
> benefit is super small. It's only useful when there are temp view and table 
> with the same name, but users can easily use qualified table name to 
> disambiguate.
> In postgres, the relation resolution behavior is consistent
> {code}
> cloud0fan=# create schema s1;
> CREATE SCHEMA
> cloud0fan=# SET search_path TO s1;
> SET
> cloud0fan=# create table s1.t (i int);
> CREATE TABLE
> cloud0fan=# insert into s1.t values (1);
> INSERT 0 1
> # access table with qualified name
> cloud0fan=# select * from s1.t;
>  i 
> ---
>  1
> (1 row)
> # access table with single name
> cloud0fan=# select * from t;
>  i 
> ---
>  1
> (1 row)
> # create a temp view with conflicting name
> cloud0fan=# create temp view t as select 2 as i;
> CREATE VIEW
> # same as spark, temp view has higher priority during resolution
> cloud0fan=# select * from t;
>  i 
> ---
>  2
> (1 row)
> # DROP TABLE also resolves temp view first
> cloud0fan=# drop table t;
> ERROR:  "t" is not a table
> # DELETE also resolves temp view first
> cloud0fan=# delete from t where i = 0;
> ERROR:  cannot delete from view "t"
> {code}






[jira] [Resolved] (SPARK-29789) should not parse the bucket column name again when creating v2 tables

2019-11-12 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-29789.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> should not parse the bucket column name again when creating v2 tables
> -
>
> Key: SPARK-29789
> URL: https://issues.apache.org/jira/browse/SPARK-29789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown

2019-10-30 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-29277.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed by #25955.

> DataSourceV2: Add early filter and projection pushdown
> --
>
> Key: SPARK-29277
> URL: https://issues.apache.org/jira/browse/SPARK-29277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark uses optimizer rules that need stats before conversion to physical 
> plan. DataSourceV2 should support early pushdown for those rules.






[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands

2019-10-29 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962527#comment-16962527
 ] 

Ryan Blue commented on SPARK-29592:
---

There is not currently a way to alter the partition spec for a table, so I 
don't think we need to worry about this for now.

> ALTER TABLE (set partition location) should look up catalog/table like v2 
> commands
> --
>
> Key: SPARK-29592
> URL: https://issues.apache.org/jira/browse/SPARK-29592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> ALTER TABLE (set partition location) should look up catalog/table like v2 
> commands






[jira] [Created] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown

2019-09-27 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-29277:
-

 Summary: DataSourceV2: Add early filter and projection pushdown
 Key: SPARK-29277
 URL: https://issues.apache.org/jira/browse/SPARK-29277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


Spark uses optimizer rules that need stats before conversion to physical plan. 
DataSourceV2 should support early pushdown for those rules.
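
For reference, a minimal sketch of the DSv2 pushdown hooks such a rule would drive. The interfaces are the DSv2 connector API; ExampleScanBuilder is a made-up class, not part of this change.

{code:lang=scala}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

class ExampleScanBuilder(fullSchema: StructType) extends ScanBuilder
    with SupportsPushDownFilters with SupportsPushDownRequiredColumns {

  private var pushed: Array[Filter] = Array.empty
  private var required: StructType = fullSchema

  // Accept every filter; return the ones Spark must still evaluate (none here).
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    Array.empty
  }

  override def pushedFilters(): Array[Filter] = pushed

  // Record the pruned projection so the source reads fewer columns.
  override def pruneColumns(requiredSchema: StructType): Unit = {
    required = requiredSchema
  }

  override def build(): Scan = new Scan {
    override def readSchema(): StructType = required
  }
}
{code}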






[jira] [Updated] (SPARK-29249) DataFrameWriterV2 should not allow setting table properties for existing tables

2019-09-25 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-29249:
--
Description: tableProperty should return CreateTableWriter, not 
DataFrameWriterV2.

> DataFrameWriterV2 should not allow setting table properties for existing 
> tables
> ---
>
> Key: SPARK-29249
> URL: https://issues.apache.org/jira/browse/SPARK-29249
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> tableProperty should return CreateTableWriter, not DataFrameWriterV2.






[jira] [Created] (SPARK-29249) DataFrameWriterV2 should not allow setting table properties for existing tables

2019-09-25 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-29249:
-

 Summary: DataFrameWriterV2 should not allow setting table 
properties for existing tables
 Key: SPARK-29249
 URL: https://issues.apache.org/jira/browse/SPARK-29249
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue









[jira] [Created] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2019-09-18 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-29157:
-

 Summary: DataSourceV2: Add DataFrameWriterV2 to Python API
 Key: SPARK-29157
 URL: https://issues.apache.org/jira/browse/SPARK-29157
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


After SPARK-28612 is committed, we need to add the corresponding PySpark API.






[jira] [Commented] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses

2019-09-10 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926889#comment-16926889
 ] 

Ryan Blue commented on SPARK-29014:
---

[~cloud_fan], why does this require a major refactor?

It would be best to keep the implementation of this as small as possible and 
not tie it to other work.

> DataSourceV2: Clean up current, default, and session catalog uses
> -
>
> Key: SPARK-29014
> URL: https://issues.apache.org/jira/browse/SPARK-29014
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> Catalog tracking in DSv2 has evolved since the initial changes went in. We 
> need to make sure that handling is consistent across plans using the latest 
> rules:
>  * The _current_ catalog should be used when no catalog is specified
>  * The _default_ catalog is the catalog _current_ is initialized to
>  * If the _default_ catalog is not set, then it is the built-in Spark session 
> catalog, which will be called `spark_catalog` (This is the v2 session catalog)






[jira] [Commented] (SPARK-28970) implement USE CATALOG/NAMESPACE for Data Source V2

2019-09-06 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924681#comment-16924681
 ] 

Ryan Blue commented on SPARK-28970:
---

I think we should, yes.

> implement USE CATALOG/NAMESPACE for Data Source V2
> --
>
> Key: SPARK-28970
> URL: https://issues.apache.org/jira/browse/SPARK-28970
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently Spark has a `USE abc` command to switch the current database.
> We should have something similar for Data Source V2, to switch the current 
> catalog and/or current namespace.
> We can introduce two new commands: `USE CATALOG abc` and `USE NAMESPACE abc`






[jira] [Created] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses

2019-09-06 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-29014:
-

 Summary: DataSourceV2: Clean up current, default, and session 
catalog uses
 Key: SPARK-29014
 URL: https://issues.apache.org/jira/browse/SPARK-29014
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


Catalog tracking in DSv2 has evolved since the initial changes went in. We need 
to make sure that handling is consistent across plans using the latest rules:
 * The _current_ catalog should be used when no catalog is specified
 * The _default_ catalog is the catalog _current_ is initialized to
 * If the _default_ catalog is not set, then it is the built-in Spark session 
catalog, which will be called `spark_catalog` (This is the v2 session catalog)
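
As a rough illustration of how those pieces fit together in configuration (catalog name "mycat" and class com.example.MyCatalog are placeholders; config names follow the eventual Spark 3.x form):

{code:lang=scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalog-config-sketch")
  // registers a catalog plugin named "mycat"
  .config("spark.sql.catalog.mycat", "com.example.MyCatalog")
  // the default catalog, which the current catalog is initialized to
  .config("spark.sql.defaultCatalog", "mycat")
  .getOrCreate()

// No catalog part in the identifier: the current catalog ("mycat") is used.
spark.sql("SELECT * FROM db.t")
// Explicit catalog part: the built-in v2 session catalog.
spark.sql("SELECT * FROM spark_catalog.default.t")
{code}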






[jira] [Created] (SPARK-28979) DataSourceV2: Rename UnresolvedTable

2019-09-04 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-28979:
-

 Summary: DataSourceV2: Rename UnresolvedTable
 Key: SPARK-28979
 URL: https://issues.apache.org/jira/browse/SPARK-28979
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


CatalogTableAsV2 was renamed to UnresolvedTable in SPARK-28666. This name is 
incorrect because the table is not unresolved. Instead, it is a v1 table that 
doesn't expose any v2 capabilities. The name should not include "Unresolved".






[jira] [Resolved] (SPARK-28899) merge the testing in-memory v2 catalogs from catalyst and core

2019-08-29 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-28899.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> merge the testing in-memory v2 catalogs from catalyst and core
> --
>
> Key: SPARK-28899
> URL: https://issues.apache.org/jira/browse/SPARK-28899
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-28878) DataSourceV2 should not insert extra projection for columnar batches

2019-08-26 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-28878:
-

 Summary: DataSourceV2 should not insert extra projection for 
columnar batches
 Key: SPARK-28878
 URL: https://issues.apache.org/jira/browse/SPARK-28878
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


SPARK-23325 added an extra physical projection when reading from a DSv2 source 
because some Spark operators assume that InternalRow instances are actually 
UnsafeRow. The projection ensures that InternalRow is converted to UnsafeRow. 
This isn't needed for the columnar batch read path because this is already done 
when converting from columnar operators to row-based operators in 
InputRDDCodegen.






[jira] [Resolved] (SPARK-28846) Set OMP_NUM_THREADS to executor cores for python

2019-08-22 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-28846.
---
Resolution: Duplicate

> Set OMP_NUM_THREADS to executor cores for python
> 
>
> Key: SPARK-28846
> URL: https://issues.apache.org/jira/browse/SPARK-28846
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Updated] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption

2019-08-21 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-28843:
--
Description: 
While testing hardware with more cores, we found that the amount of memory 
required by PySpark applications increased and tracked the problem to importing 
numpy. The numpy issue is [https://github.com/numpy/numpy/issues/10455]

NumPy uses OpenMP that starts a thread pool with the number of cores on the 
machine (and does not respect cgroups). When we set this lower we see a 
significant reduction in memory consumption.

This parallelism setting should be set to the number of cores allocated to the 
executor, not the number of cores available.

  was:
While testing hardware with more cores, we found that the amount of memory 
required by PySpark applications increased and tracked the problem to importing 
numpy. The numpy issue is [https://github.com/numpy/numpy/issues/10455]

NumPy uses OpenMP that starts a thread pool with the number of cores on the 
machine (and does not respect cgroups). When we set this lower we see a 
reduction in memory consumption.

This parallelism setting should be set to the number of cores allocated to the 
executor, not the number of cores available.


> Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
> --
>
> Key: SPARK-28843
> URL: https://issues.apache.org/jira/browse/SPARK-28843
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> While testing hardware with more cores, we found that the amount of memory 
> required by PySpark applications increased and tracked the problem to 
> importing numpy. The numpy issue is 
> [https://github.com/numpy/numpy/issues/10455]
> NumPy uses OpenMP that starts a thread pool with the number of cores on the 
> machine (and does not respect cgroups). When we set this lower we see a 
> significant reduction in memory consumption.
> This parallelism setting should be set to the number of cores allocated to 
> the executor, not the number of cores available.






[jira] [Created] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption

2019-08-21 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-28843:
-

 Summary: Set OMP_NUM_THREADS to executor cores reduce Python 
memory consumption
 Key: SPARK-28843
 URL: https://issues.apache.org/jira/browse/SPARK-28843
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.3, 2.3.3, 3.0.0
Reporter: Ryan Blue


While testing hardware with more cores, we found that the amount of memory 
required by PySpark applications increased and tracked the problem to importing 
numpy. The numpy issue is [https://github.com/numpy/numpy/issues/10455]

NumPy uses OpenMP that starts a thread pool with the number of cores on the 
machine (and does not respect cgroups). When we set this lower we see a 
reduction in memory consumption.

This parallelism setting should be set to the number of cores allocated to the 
executor, not the number of cores available.
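
A sketch of the kind of setting this describes; the values are illustrative, and spark.executorEnv.* is the standard way to set environment variables on executors.

{code:lang=scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.cores", "4")
  // Match OpenMP's thread pool to the executor's core allocation so numpy
  // (imported by PySpark workers) doesn't size it to the whole machine.
  .config("spark.executorEnv.OMP_NUM_THREADS", "4")
  .getOrCreate()
{code}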






[jira] [Created] (SPARK-28628) Support namespaces in V2SessionCatalog

2019-08-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28628:
-

 Summary: Support namespaces in V2SessionCatalog
 Key: SPARK-28628
 URL: https://issues.apache.org/jira/browse/SPARK-28628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


V2SessionCatalog should implement SupportsNamespaces.
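
For context, a small sketch of what the SupportsNamespaces capability exposes once a catalog implements it (the helper below is just an example, not part of this change):

{code:lang=scala}
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, SupportsNamespaces}

// List namespaces (databases) if the catalog supports the capability.
def listNamespaces(catalog: CatalogPlugin): Seq[String] = catalog match {
  case ns: SupportsNamespaces =>
    // namespaces are multi-part, e.g. Array("a", "b") for a.b
    ns.listNamespaces().map(_.mkString(".")).toSeq
  case _ =>
    Seq.empty
}
{code}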






[jira] [Created] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-08-03 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28612:
-

 Summary: DataSourceV2: Add new DataFrameWriter API for v2
 Key: SPARK-28612
 URL: https://issues.apache.org/jira/browse/SPARK-28612
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


This tracks adding an API like the one proposed in SPARK-23521:

{code:lang=scala}
df.writeTo("catalog.db.table").append() // AppendData
df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartitionsDynamic
df.writeTo("catalog.db.table").overwrite($"date" === "2019-01-01") // OverwriteByExpression
df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
df.writeTo("catalog.db.table").replace() // RTAS
{code}






[jira] [Resolved] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2019-08-03 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-23204.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

I'm closing this because it is implemented by SPARK-28178 and SPARK-28565.

> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options. Similarly, we 
> can add a table property for the datasource implementation to metastore 
> tables and add a rule to convert them to DataSourceV2 relations.






[jira] [Commented] (SPARK-25280) Add support for USING syntax for DataSourceV2

2019-08-03 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899532#comment-16899532
 ] 

Ryan Blue commented on SPARK-25280:
---

[~hyukjin.kwon], is there anything left to do for this? I think that most of 
the functionality has been added at this point.

> Add support for USING syntax for DataSourceV2
> -
>
> Key: SPARK-25280
> URL: https://issues.apache.org/jira/browse/SPARK-25280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> class SourcesTest extends SparkFunSuite {
>   val spark = SparkSession.builder().master("local").getOrCreate()
>   test("Test CREATE TABLE ... USING - v1") {
> spark.read.format(classOf[SimpleDataSourceV1].getCanonicalName).load()
>   }
>   test("Test DataFrameReader - v1") {
> spark.sql(s"CREATE TABLE tableA USING 
> ${classOf[SimpleDataSourceV1].getCanonicalName}")
>   }
>   test("Test CREATE TABLE ... USING - v2") {
> spark.read.format(classOf[SimpleDataSourceV2].getCanonicalName).load()
>   }
>   test("Test DataFrameReader - v2") {
> spark.sql(s"CREATE TABLE tableB USING 
> ${classOf[SimpleDataSourceV2].getCanonicalName}")
>   }
> }
> {code}
> {code}
> org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL 
> Data Source.;
> org.apache.spark.sql.AnalysisException: 
> org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL 
> Data Source.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3296)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3295)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
>   at 
> org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45)
>   at 
> org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at 

[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results

2019-07-18 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888222#comment-16888222
 ] 

Ryan Blue commented on SPARK-14543:
---

{{byName}} was never added to Apache Spark. The change was rejected, so it is 
only available in Netflix's Spark branch. I resolved this with "later" because 
we are including by-name resolution in the DSv2 work. The replacement for 
{{DataFrameWriter}} will default to name-based resolution.

> SQL/Hive insertInto has unexpected results
> --
>
> Key: SPARK-14543
> URL: https://issues.apache.org/jira/browse/SPARK-14543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> *Updated description*
> There should be an option to match input data to output columns by name. The 
> API allows operations on tables, which hide the column resolution problem. 
> It's easy to copy from one table to another without listing the columns, and 
> in the API it is common to work with columns by name rather than by position. 
> I think the API should add a way to match columns by name, which is closer to 
> what users expect. I propose adding something like this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
> *Original description*
> The Hive write path adds a pre-insertion cast (projection) to reconcile 
> incoming data columns with the outgoing table schema. Columns are matched by 
> position and casts are inserted to reconcile the two column schemas.
> When columns aren't correctly aligned, this causes unexpected results. I ran 
> into this by not using a correct {{partitionBy}} call (addressed by 
> SPARK-14459), which caused an error message that an int could not be cast to 
> an array. However, if the columns are vaguely compatible, for example string 
> and float, then no error or warning is produced and data is written to the 
> wrong columns using unexpected casts (string -> bigint -> float).
> A real-world use case that will hit this is when a table definition changes 
> by adding a column in the middle of a table. Spark SQL statements that copied 
> from that table to a destination table will then map the columns differently 
> but insert casts that mask the problem. The last column's data will be 
> dropped without a reliable warning for the user.
> This highlights a few problems:
> * Too many or too few incoming data columns should cause an AnalysisException 
> to be thrown
> * Only "safe" casts should be inserted automatically, like int -> long, using 
> UpCast
> * Pre-insertion casts currently ignore extra columns by using zip
> * The pre-insertion cast logic differs between Hive's MetastoreRelation and 
> LogicalRelation
> Also, I think there should be an option to match input data to output columns 
> by name. The API allows operations on tables, which hide the column 
> resolution problem. It's easy to copy from one table to another without 
> listing the columns, and in the API it is common to work with columns by name 
> rather than by position. I think the API should add a way to match columns by 
> name, which is closer to what users expect. I propose adding something like 
> this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}






[jira] [Commented] (SPARK-28376) Support to write sorted parquet files in each row group

2019-07-15 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885428#comment-16885428
 ] 

Ryan Blue commented on SPARK-28376:
---

I don't think this is a regression. The linked issue was to automatically add 
repartitioning to the SQL plan to avoid too many files, even with a local sort. 
I think that this is no longer needed because we plan to do it in DSv2.
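
For what it's worth, a local sort per output file can already be requested explicitly before the write; a minimal sketch (the DataFrame, column names, and path are placeholders):

{code:lang=scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Cluster rows by a key, then sort within each task so every Parquet file
// (and its row groups) is written in sorted order.
def writeSorted(df: DataFrame, path: String): Unit = {
  df.repartition(col("bucket"))
    .sortWithinPartitions(col("bucket"), col("event_time"))
    .write
    .parquet(path)
}
{code}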

> Support to write sorted parquet files in each row group
> ---
>
> Key: SPARK-28376
> URL: https://issues.apache.org/jira/browse/SPARK-28376
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output, Spark Core
>Affects Versions: 2.4.3
>Reporter: t oo
>Priority: Major
>
> this is for the ability to write parquet with sorted values in each 
> row group
> 
> see 
> [https://stackoverflow.com/questions/52159938/cant-write-ordered-data-to-parquet-in-spark]
> [https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide]
>  (slides 26-27)
>  






[jira] [Created] (SPARK-28374) DataSourceV2: Add method to support INSERT ... IF NOT EXISTS

2019-07-12 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28374:
-

 Summary: DataSourceV2: Add method to support INSERT ... IF NOT 
EXISTS
 Key: SPARK-28374
 URL: https://issues.apache.org/jira/browse/SPARK-28374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


This is a follow-up to [PR #24832 
(comment)|https://github.com/apache/spark/pull/24832/files#r298257179]. The 
SQL parser supports INSERT ... IF NOT EXISTS to validate that an insert did not 
write into existing partitions. This requires the addition of a support trait 
for the write builder, so should be done as a follow-up.






[jira] [Created] (SPARK-28319) DataSourceV2: Support SHOW TABLES

2019-07-09 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28319:
-

 Summary: DataSourceV2: Support SHOW TABLES
 Key: SPARK-28319
 URL: https://issues.apache.org/jira/browse/SPARK-28319
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


SHOW TABLES needs to support v2 catalogs.






[jira] [Commented] (SPARK-28219) Data source v2 user guide

2019-07-02 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877172#comment-16877172
 ] 

Ryan Blue commented on SPARK-28219:
---

I'm closing this as a duplicate. Please use SPARK-27708.

If you want to note specific docs to write, please add them to that issue.

> Data source v2 user guide
> -
>
> Key: SPARK-28219
> URL: https://issues.apache.org/jira/browse/SPARK-28219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>







[jira] [Resolved] (SPARK-28219) Data source v2 user guide

2019-07-02 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-28219.
---
Resolution: Duplicate

> Data source v2 user guide
> -
>
> Key: SPARK-28219
> URL: https://issues.apache.org/jira/browse/SPARK-28219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>







[jira] [Commented] (SPARK-28192) Data Source - State - Write side

2019-06-28 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875046#comment-16875046
 ] 

Ryan Blue commented on SPARK-28192:
---

It sounds like what you want is for a source to be able to communicate the 
required clustering and sort order for a write, is that correct?

I opened an issue for this a while ago, but it probably won't be on the roadmap 
for Spark 3.0: SPARK-23889. We can do that sooner if you're interested in it!

> Data Source - State - Write side
> 
>
> Key: SPARK-28192
> URL: https://issues.apache.org/jira/browse/SPARK-28192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the efforts on addressing batch write on state data source.
> It could include "state repartition" if it doesn't require huge effort with 
> the new DSv2, but it can also be moved out to a separate issue.






[jira] [Created] (SPARK-28139) DataSourceV2: Add AlterTable v2 implementation

2019-06-21 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-28139:
-

 Summary: DataSourceV2: Add AlterTable v2 implementation
 Key: SPARK-28139
 URL: https://issues.apache.org/jira/browse/SPARK-28139
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


SPARK-27857 updated the parser for v2 ALTER TABLE statements. This tracks 
implementing those using a v2 catalog.
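
As a rough sketch of what such an implementation resolves an ALTER TABLE statement to against a v2 catalog (the namespace, table, and column names below are placeholders):

{code:lang=scala}
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog, TableChange}
import org.apache.spark.sql.types.LongType

// ALTER TABLE db.events ADD COLUMN visits bigint, expressed as a catalog call.
def addColumn(catalog: TableCatalog): Unit = {
  val ident = Identifier.of(Array("db"), "events")
  catalog.alterTable(ident, TableChange.addColumn(Array("visits"), LongType))
}
{code}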






[jira] [Updated] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements in catalyst SQL parser

2019-06-21 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27857:
--
Summary: DataSourceV2: Support ALTER TABLE statements in catalyst SQL 
parser  (was: DataSourceV2: Support ALTER TABLE statements)

> DataSourceV2: Support ALTER TABLE statements in catalyst SQL parser
> ---
>
> Key: SPARK-27857
> URL: https://issues.apache.org/jira/browse/SPARK-27857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> ALTER TABLE statements should be supported for v2 tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27965) Add extractors for logical transforms

2019-06-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27965:
-

 Summary: Add extractors for logical transforms
 Key: SPARK-27965
 URL: https://issues.apache.org/jira/browse/SPARK-27965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


Extractors can be used to make any Transform class appear like a case class to 
Spark internals.
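
A hedged sketch of the idea (the extractor below is only an example of the pattern, not one of the extractors added by this issue):

{code:lang=scala}
import org.apache.spark.sql.connector.expressions.{Literal, Transform}

// Match any transform that looks like bucket(n, col), regardless of its class.
object BucketLike {
  def unapply(t: Transform): Option[(Int, String)] =
    if (t.name() != "bucket") {
      None
    } else {
      val numBuckets = t.arguments().collectFirst {
        case lit: Literal[_] => lit.value().toString.toInt
      }
      val column = t.references().headOption.map(_.fieldNames().mkString("."))
      for (n <- numBuckets; c <- column) yield (n, c)
    }
}

// usage in a rule:
//   transform match { case BucketLike(n, col) => ... }
{code}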






[jira] [Created] (SPARK-27964) Create CatalogV2Util

2019-06-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27964:
-

 Summary: Create CatalogV2Util
 Key: SPARK-27964
 URL: https://issues.apache.org/jira/browse/SPARK-27964
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


Need to move utility functions from test.






[jira] [Updated] (SPARK-27919) DataSourceV2: Add v2 session catalog

2019-06-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27919:
--
Affects Version/s: (was: 2.4.3)
   3.0.0

> DataSourceV2: Add v2 session catalog
> 
>
> Key: SPARK-27919
> URL: https://issues.apache.org/jira/browse/SPARK-27919
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> When no default catalog is set, the session catalog (v1) is responsible for 
> table identifiers with no catalog part. When CTAS creates a table with a v2 
> provider, a v2 catalog is required and the default catalog is used. But this 
> may cause Spark to create a table in a catalog that it cannot use to look up 
> the table.
> In this case, a v2 catalog that delegates to the session catalog should be 
> used instead.






[jira] [Created] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly

2019-06-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27960:
-

 Summary: DataSourceV2 ORC implementation doesn't handle schemas 
correctly
 Key: SPARK-27960
 URL: https://issues.apache.org/jira/browse/SPARK-27960
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


While testing SPARK-27919 
(#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 
ORC implementation to validate a v2 catalog that delegates to the session 
catalog. The ORC implementation fails the following test case because it cannot 
infer a schema (there is no data) but it should be using the schema used to 
create the table.

 Test case:
{code}
test("CreateTable: test ORC source") {
  spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName)

  spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2")

  val testCatalog = spark.catalog("session").asTableCatalog
  val table = testCatalog.loadTable(Identifier.of(Array(), "table_name"))

  assert(table.name == "orc ") // <-- should this be table_name?
  assert(table.partitioning.isEmpty)
  assert(table.properties == Map(
"provider" -> orc2,
"database" -> "default",
"table" -> "table_name").asJava)
  assert(table.schema == new StructType().add("id", LongType).add("data", 
StringType)) // <-- fail

  val rdd = 
spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows)
  checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty)
}
{code}

Error:
{code}
Unable to infer schema for ORC. It must be specified manually.;
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must 
be specified manually.;
at 
org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61)
at scala.Option.getOrElse(Option.scala:138)
at 
org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61)
at 
org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54)
at 
org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67)
at 
org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65)
at 
org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82)
{code}






[jira] [Commented] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly

2019-06-05 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856955#comment-16856955
 ] 

Ryan Blue commented on SPARK-27960:
---

[~Gengliang.Wang], FYI

> DataSourceV2 ORC implementation doesn't handle schemas correctly
> 
>
> Key: SPARK-27960
> URL: https://issues.apache.org/jira/browse/SPARK-27960
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> While testing SPARK-27919 
> (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 
> ORC implementation to validate a v2 catalog that delegates to the session 
> catalog. The ORC implementation fails the following test case because it 
> cannot infer a schema (there is no data) but it should be using the schema 
> used to create the table.
>  Test case:
> {code}
> test("CreateTable: test ORC source") {
>   spark.conf.set("spark.sql.catalog.session", 
> classOf[V2SessionCatalog].getName)
>   spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2")
>   val testCatalog = spark.catalog("session").asTableCatalog
>   val table = testCatalog.loadTable(Identifier.of(Array(), "table_name"))
>   assert(table.name == "orc ") // <-- should this be table_name?
>   assert(table.partitioning.isEmpty)
>   assert(table.properties == Map(
> "provider" -> orc2,
> "database" -> "default",
> "table" -> "table_name").asJava)
>   assert(table.schema == new StructType().add("id", LongType).add("data", 
> StringType)) // <-- fail
>   val rdd = 
> spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows)
>   checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty)
> }
> {code}
> Error:
> {code}
> Unable to infer schema for ORC. It must be specified manually.;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It 
> must be specified manually.;
>   at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61)
>   at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54)
>   at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67)
>   at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65)
>   at 
> org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27919) DataSourceV2: Add v2 session catalog

2019-06-02 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27919:
-

 Summary: DataSourceV2: Add v2 session catalog
 Key: SPARK-27919
 URL: https://issues.apache.org/jira/browse/SPARK-27919
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


When no default catalog is set, the session catalog (v1) is responsible for 
table identifiers with no catalog part. When CTAS creates a table with a v2 
provider, a v2 catalog is required and the default catalog is used. But this 
may cause Spark to create a table in a catalog that it cannot use to look up 
the table.

In this case, a v2 catalog that delegates to the session catalog should be used 
instead.
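
As a minimal sketch of the intended behavior (the catalog name and {{V2SessionCatalog}} class follow the test setup used in SPARK-27960; the provider and final wiring are illustrative):
{code:lang=scala}
// Sketch: register a v2 catalog that delegates to the built-in session catalog.
// The "session" name and V2SessionCatalog class are illustrative.
spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName)

// CTAS with a v2 provider and an identifier that has no catalog part: the
// delegating catalog should handle it, so the table is created where the
// session catalog can also look it up later. Provider name is illustrative.
spark.sql("CREATE TABLE table_name USING orc AS SELECT id, data FROM source")
{code}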



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException

2019-05-31 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27909:
-

 Summary: Fix CTE substitution dependence on ResolveRelations 
throwing AnalysisException
 Key: SPARK-27909
 URL: https://issues.apache.org/jira/browse/SPARK-27909
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


CTE substitution currently works by running all analyzer rules on the plan after 
each substitution. It does this to handle a recursive CTE case, but that design 
requires the ResolveRelations rule to throw an AnalysisException when it cannot 
resolve a table; otherwise, CTE substitution runs again and may recurse 
infinitely.

Table resolution should be possible across multiple independent rules. To 
accomplish this, the current ResolveRelations rule detects cases where other 
rules (like ResolveDataSource) will resolve a TableIdentifier and returns the 
UnresolvedRelation unmodified only in those cases.
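
A small illustrative example of the substitution case that motivates re-running the rules: a later CTE can reference an earlier one, so each substitution must be resolved before the next.
{code:lang=scala}
// Illustrative only: `second_cte` references `first_cte`, so `first_cte` must
// be substituted (and resolved) before `second_cte` is substituted into the
// main query.
spark.sql("""
  WITH first_cte AS (SELECT 1 AS id),
       second_cte AS (SELECT id FROM first_cte)
  SELECT * FROM second_cte
""").show()
{code}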



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements

2019-05-27 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27857:
-

 Summary: DataSourceV2: Support ALTER TABLE statements
 Key: SPARK-27857
 URL: https://issues.apache.org/jira/browse/SPARK-27857
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


ALTER TABLE statements should be supported for v2 tables.
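
For example, statements along these lines should resolve and run against v2 tables (the {{testcat.db.t}} identifier is a placeholder for a table in a v2 catalog):
{code:lang=scala}
// Illustrative ALTER TABLE statements that should route to the v2 catalog.
spark.sql("ALTER TABLE testcat.db.t ADD COLUMNS (points int)")
spark.sql("ALTER TABLE testcat.db.t ALTER COLUMN points TYPE bigint")
spark.sql("ALTER TABLE testcat.db.t SET TBLPROPERTIES ('owner' = 'analytics')")
spark.sql("ALTER TABLE testcat.db.t DROP COLUMN points")
{code}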



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions

2019-05-20 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844404#comment-16844404
 ] 

Ryan Blue commented on SPARK-27784:
---

[~cloud_fan], I don't see this happening in master because an additional 
Project is added somewhere. Any idea what adds it?
{code:java}
== Parsed Logical Plan ==
Union
:- Project [id#43, data#42]
: +- Join Inner, (id#43 = id#40)
: :- Project [id#43, coalesce(data#44, _) AS data#42]
: : +- SubqueryAlias `default`.`t1`
: : +- Relation[id#43,data#44] parquet
: +- Project [value#37 AS id#40]
: +- LocalRelation [value#37]
+- Project [id#49, coalesce(cast(data#51 as string), _) AS data#42]
+- Project [id#49, null AS data#51]
+- SubqueryAlias `default`.`t2`
+- Relation[id#49] parquet

== Analyzed Logical Plan ==
id: int, data: string
Union
:- Project [id#43, data#42] 
<--- same ID
: +- Join Inner, (id#43 = id#40)
: :- Project [id#43, coalesce(data#44, _) AS data#42]
: : +- SubqueryAlias `default`.`t1`
: : +- Relation[id#43,data#44] parquet
: +- Project [value#37 AS id#40]
: +- LocalRelation [value#37]
+- Project [id#49 AS id#81, data#42 AS data#82]
+- Project [id#49, coalesce(cast(data#51 as string), _) AS data#42] 
<--- same ID
+- Project [id#49, null AS data#51]
+- SubqueryAlias `default`.`t2`
+- Relation[id#49] parquet

== Optimized Logical Plan ==
Union
:- Project [id#43, data#42]
: +- Join Inner, (id#43 = id#40)
: :- Project [id#43, coalesce(data#44, _) AS data#42]
: : +- Filter isnotnull(id#43)
: : +- Relation[id#43,data#44] parquet
: +- Project [value#37 AS id#40]
: +- LocalRelation [value#37]
+- Project [id#49, _ AS data#82]
+- Relation[id#49] parquet

== Physical Plan ==
Union
:- *(2) Project [id#43, data#42]
: +- *(2) BroadcastHashJoin [id#43], [id#40], Inner, BuildRight
: :- *(2) Project [id#43, coalesce(data#44, _) AS data#42]
: : +- *(2) Filter isnotnull(id#43)
: : +- *(2) FileScan parquet default.t1[id#43,data#44] Batched: true, 
DataFilters: [isnotnull(id#43)], Format: Parquet, Location: 
InMemoryFileIndex[file:/home/blue/workspace/spark/common/kvstore/spark-warehouse/org.apache.spark...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
false] as bigint)))
: +- *(1) Project [value#37 AS id#40]
: +- LocalTableScan [value#37]
+- *(3) Project [id#49, _ AS data#82] <- reused ID eliminated 
by collapsing projections
+- *(3) FileScan parquet default.t2[id#49] Batched: true, DataFilters: [], 
Format: Parquet, Location: 
InMemoryFileIndex[file:/home/blue/workspace/spark/common/kvstore/spark-warehouse/org.apache.spark...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}

> Alias ID reuse can break correctness when substituting foldable expressions
> ---
>
> Key: SPARK-27784
> URL: https://issues.apache.org/jira/browse/SPARK-27784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.3.2
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness
>
> This is a correctness bug when reusing a set of project expressions in the 
> DataFrame API.
> Use case: a user was migrating a table to a new version with an additional 
> column ("data" in the repro case). To migrate the user unions the old table 
> ("t2") with the new table ("t1"), and applies a common set of projections to 
> ensure the union doesn't hit an issue with ordering (SPARK-22335). In some 
> cases, this produces an incorrect query plan:
> {code:java}
> Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1")
> Seq(1, 2, 3).toDF("id").write.saveAsTable("t2")
> val dim = Seq(2, 3, 4).toDF("id")
> val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data"))
> val t1 = spark.table("t1").select(outputCols:_*)
> val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*)
> t1.join(dim, t1("id") === dim("id")).select(t1("id"), 
> t1("data")).union(t2).explain(true){code}
> {code:java}
> == Physical Plan ==
> Union
> :- *Project [id#330, _ AS data#237] < THE CONSTANT IS 
> FROM THE OTHER SIDE OF THE UNION
> : +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight
> : :- *Project [id#330]
> : :  +- *Filter isnotnull(id#330)
> : : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, 
> Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: 
> [IsNotNull(id)], ReadSchema: struct
> : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)))
> :+- LocalTableScan [id#234]
> +- *Project [id#340, _ AS data#237]
>+- *FileScan parquet t2[id#340] Batched: true, Format: 

[jira] [Updated] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions

2019-05-20 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27784:
--
Description: 
This is a correctness bug when reusing a set of project expressions in the 
DataFrame API.

Use case: a user was migrating a table to a new version with an additional 
column ("data" in the repro case). To migrate, the user unions the old table 
("t2") with the new table ("t1") and applies a common set of projections to 
ensure the union doesn't hit an ordering issue (SPARK-22335). In some cases, 
this produces an incorrect query plan:
{code:java}
Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1")
Seq(1, 2, 3).toDF("id").write.saveAsTable("t2")

val dim = Seq(2, 3, 4).toDF("id")
val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data"))

val t1 = spark.table("t1").select(outputCols:_*)
val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*)

t1.join(dim, t1("id") === dim("id")).select(t1("id"), 
t1("data")).union(t2).explain(true){code}
{code:java}
== Physical Plan ==
Union
:- *Project [id#330, _ AS data#237] < THE CONSTANT IS 
FROM THE OTHER SIDE OF THE UNION
: +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight
: :- *Project [id#330]
: :  +- *Filter isnotnull(id#330)
: : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, 
Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: 
[IsNotNull(id)], ReadSchema: struct
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
false] as bigint)))
:+- LocalTableScan [id#234]
+- *Project [id#340, _ AS data#237]
   +- *FileScan parquet t2[id#340] Batched: true, Format: Parquet, Location: 
CatalogFileIndex[s3://.../t2], PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct{code}
The problem happens because "outputCols" has an alias. The ID for that alias is 
created when the projection Seq is created, so it is reused in both sides of 
the union.

When FoldablePropagation runs, it identifies that "data" in the t2 side of the 
union is a foldable expression and replaces all references to it, including the 
references in the t1 side of the union.

The join to a dimension table is necessary to reproduce the problem because it 
requires a Projection on top of the join that uses an AttributeReference for 
data#237. Otherwise, the projections are collapsed and the projection includes 
an Alias that does not get rewritten by FoldablePropagation.
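
A hedged workaround sketch (not a fix for the rule itself): build the projection list separately for each side of the union so each Alias gets its own expression ID.
{code:lang=scala}
// Workaround sketch only: using a def means each call creates fresh Alias
// expression IDs, so FoldablePropagation cannot rewrite one side of the union
// with a foldable value from the other side.
import spark.implicits._
import org.apache.spark.sql.functions.{coalesce, lit}

def outputCols() = Seq($"id", coalesce($"data", lit("_")).as("data"))

val t1 = spark.table("t1").select(outputCols(): _*)
val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols(): _*)
{code}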

> Alias ID reuse can break correctness when substituting foldable expressions
> ---
>
> Key: SPARK-27784
> URL: https://issues.apache.org/jira/browse/SPARK-27784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.3.2
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness
>
> This is a correctness bug when reusing a set of project expressions in the 
> DataFrame API.
> Use case: a user was migrating a table to a new version with an additional 
> column ("data" in the repro case). To migrate the user unions the old table 
> ("t2") with the new table ("t1"), and applies a common set of projections to 
> ensure the union doesn't hit an issue with ordering (SPARK-22335). In some 
> cases, this produces an incorrect query plan:
> {code:java}
> Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1")
> Seq(1, 2, 3).toDF("id").write.saveAsTable("t2")
> val dim = Seq(2, 3, 4).toDF("id")
> val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data"))
> val t1 = spark.table("t1").select(outputCols:_*)
> val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*)
> t1.join(dim, t1("id") === dim("id")).select(t1("id"), 
> t1("data")).union(t2).explain(true){code}
> {code:java}
> == Physical Plan ==
> Union
> :- *Project [id#330, _ AS data#237] < THE CONSTANT IS 
> FROM THE OTHER SIDE OF THE UNION
> : +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight
> : :- *Project [id#330]
> : :  +- *Filter isnotnull(id#330)
> : : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, 
> Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: 
> [IsNotNull(id)], ReadSchema: struct
> : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)))
> :+- LocalTableScan [id#234]
> +- *Project [id#340, _ AS data#237]
>+- *FileScan parquet t2[id#340] Batched: true, Format: Parquet, Location: 
> CatalogFileIndex[s3://.../t2], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> The problem happens because "outputCols" has an alias. The ID for that alias 
> is created when the projection Seq is created, so it is reused in 

[jira] [Created] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions

2019-05-20 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27784:
-

 Summary: Alias ID reuse can break correctness when substituting 
foldable expressions
 Key: SPARK-27784
 URL: https://issues.apache.org/jira/browse/SPARK-27784
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.1.1
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27732) DataSourceV2: Add CreateTable logical operation

2019-05-15 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27732:
-

 Summary: DataSourceV2: Add CreateTable logical operation
 Key: SPARK-27732
 URL: https://issues.apache.org/jira/browse/SPARK-27732
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27724) Add RTAS logical operation

2019-05-15 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27724:
-

 Summary: Add RTAS logical operation
 Key: SPARK-27724
 URL: https://issues.apache.org/jira/browse/SPARK-27724
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27724) DataSourceV2: Add RTAS logical operation

2019-05-15 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27724:
--
Summary: DataSourceV2: Add RTAS logical operation  (was: Add RTAS logical 
operation)

> DataSourceV2: Add RTAS logical operation
> 
>
> Key: SPARK-27724
> URL: https://issues.apache.org/jira/browse/SPARK-27724
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24923) DataSourceV2: Add CTAS logical operation

2019-05-15 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-24923:
--
Summary: DataSourceV2: Add CTAS logical operation  (was: DataSourceV2: Add 
CTAS and RTAS logical operations)

> DataSourceV2: Add CTAS logical operation
> 
>
> Key: SPARK-24923
> URL: https://issues.apache.org/jira/browse/SPARK-24923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> When SPARK-24252 and SPARK-24251 are in, next plans to implement from the 
> SPIP are CTAS and RTAS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27708) Add documentation for v2 data sources

2019-05-14 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27708:
-

 Summary: Add documentation for v2 data sources
 Key: SPARK-27708
 URL: https://issues.apache.org/jira/browse/SPARK-27708
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


Before the 3.0 release, the new v2 data sources should be documented. This 
includes:
 * How to plug in catalog implementations
 * Catalog plugin configuration
 * Multi-part identifier behavior
 * Partition transforms
 * Table properties that are used to pass table info (e.g. "provider")
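
As a starting point for the catalog plugin and identifier sections, the documentation could show something like the following (the {{prod}} catalog name, class name, and {{.uri}} option are placeholders, not an existing plugin):
{code:lang=scala}
// Illustrative catalog plugin configuration.
spark.conf.set("spark.sql.catalog.prod", "com.example.ProdCatalog")
spark.conf.set("spark.sql.catalog.prod.uri", "thrift://metastore:9083") // plugin-specific option

// Multi-part identifier behavior: catalog.namespace.table
spark.sql("SELECT * FROM prod.db.events").show()
{code}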



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27693) DataSourceV2: Add default catalog property

2019-05-13 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27693:
-

 Summary: DataSourceV2: Add default catalog property
 Key: SPARK-27693
 URL: https://issues.apache.org/jira/browse/SPARK-27693
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


Add a default catalog property for DataSourceV2.
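
A rough sketch of how such a property could behave (the property name here is an assumption, not something settled by this issue):
{code:lang=scala}
// Sketch: with a default catalog configured, identifiers that have no catalog
// part resolve against it instead of the v1 session catalog.
// The property name and the catalog/class names are illustrative.
spark.conf.set("spark.sql.catalog.prod", "com.example.ProdCatalog")
spark.conf.set("spark.sql.defaultCatalog", "prod")

spark.sql("SELECT * FROM db.events")   // resolves as prod.db.events
{code}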



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-05-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27661:
-

 Summary: Add SupportsNamespaces interface for v2 catalogs
 Key: SPARK-27661
 URL: https://issues.apache.org/jira/browse/SPARK-27661
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


Some catalogs support namespace operations, like creating or dropping 
namespaces. The v2 API should have a way to expose these operations to Spark.
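
A rough sketch of the shape such an interface could take (method names and signatures are illustrative, not the final API):
{code:lang=scala}
// Illustrative sketch of a namespace-aware catalog mix-in.
trait SupportsNamespaces {
  def listNamespaces(): Array[Array[String]]
  def listNamespaces(namespace: Array[String]): Array[Array[String]]
  def namespaceExists(namespace: Array[String]): Boolean
  def createNamespace(namespace: Array[String], metadata: java.util.Map[String, String]): Unit
  def dropNamespace(namespace: Array[String]): Boolean
}
{code}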



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27658) Catalog API to load functions

2019-05-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27658:
-

 Summary: Catalog API to load functions
 Key: SPARK-27658
 URL: https://issues.apache.org/jira/browse/SPARK-27658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


SPARK-24252 added an API that catalog plugins can implement to expose table 
operations. Catalogs should also be able to provide function implementations to 
Spark.
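
A rough sketch of what a function-loading catalog API could look like ({{UnboundFunction}} below is only a placeholder for whatever function representation the final API defines; {{Identifier}} is the v2 catalog identifier added by SPARK-24252):
{code:lang=scala}
// Illustrative only; names and types are placeholders, not a proposed design.
trait UnboundFunction

trait FunctionCatalog {
  def listFunctions(namespace: Array[String]): Array[Identifier]
  def loadFunction(ident: Identifier): UnboundFunction
}
{code}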



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2

2019-05-08 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835700#comment-16835700
 ] 

Ryan Blue commented on SPARK-23098:
---

I don't think there's a DSv2-related obstacle to implementing this.

> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27471) Reorganize public v2 catalog API

2019-04-19 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822271#comment-16822271
 ] 

Ryan Blue commented on SPARK-27471:
---

Thanks [~hyukjin.kwon]. I meant to set the target version, not the fix version. 
I've updated that.

> Reorganize public v2 catalog API
> 
>
> Key: SPARK-27471
> URL: https://issues.apache.org/jira/browse/SPARK-27471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> In the review for SPARK-27181, Reynold suggested some package moves. We've 
> decided (at the v2 community sync) not to delay by having this discussion now 
> because we want to get the new catalog API in so we can work on more logical 
> plans in parallel. But we do need to make sure we have a sane package scheme 
> for the next release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27471) Reorganize public v2 catalog API

2019-04-19 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27471:
--
Target Version/s: 3.0.0

> Reorganize public v2 catalog API
> 
>
> Key: SPARK-27471
> URL: https://issues.apache.org/jira/browse/SPARK-27471
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> In the review for SPARK-27181, Reynold suggested some package moves. We've 
> decided (at the v2 community sync) not to delay by having this discussion now 
> because we want to get the new catalog API in so we can work on more logical 
> plans in parallel. But we do need to make sure we have a sane package scheme 
> for the next release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27471) Reorganize public v2 catalog API

2019-04-15 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27471:
-

 Summary: Reorganize public v2 catalog API
 Key: SPARK-27471
 URL: https://issues.apache.org/jira/browse/SPARK-27471
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Ryan Blue
 Fix For: 3.0.0


In the review for SPARK-27181, Reynold suggested some package moves. We've 
decided (at the v2 community sync) not to delay by having this discussion now 
because we want to get the new catalog API in so we can work on more logical 
plans in parallel. But we do need to make sure we have a sane package scheme 
for the next release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27386) Improve partition transform parsing

2019-04-04 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27386:
-

 Summary: Improve partition transform parsing
 Key: SPARK-27386
 URL: https://issues.apache.org/jira/browse/SPARK-27386
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


SPARK-27181 adds support to the SQL parser for transformation functions in the 
{{PARTITION BY}} clause. The rules to match this are specific to transforms and 
can match only literals or qualified names (field references). This should be 
improved to match a broader set of expressions so that Spark can produce better 
error messages than an expected symbol list.

For example, {{PARTITION BY (2 + 3)}} should produce "invalid transformation 
expression: 2 + 3" instead of "expecting qualified name".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25006) Add optional catalog to TableIdentifier

2019-03-29 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-25006.
---
Resolution: Won't Fix

Closing this because SPARK-26946 replaces it.

> Add optional catalog to TableIdentifier
> ---
>
> Key: SPARK-25006
> URL: https://issues.apache.org/jira/browse/SPARK-25006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> For multi-catalog support, Spark table identifiers need to identify the 
> catalog for a table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27181) Add public expression and transform API for DSv2 partitioning

2019-03-16 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27181:
-

 Summary: Add public expression and transform API for DSv2 
partitioning
 Key: SPARK-27181
 URL: https://issues.apache.org/jira/browse/SPARK-27181
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26778) Implement file source V2 partitioning

2019-03-14 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792843#comment-16792843
 ] 

Ryan Blue commented on SPARK-26778:
---

[~Gengliang.Wang], can you clarify what this issue is tracking?

> Implement file source V2 partitioning 
> --
>
> Key: SPARK-26778
> URL: https://issues.apache.org/jira/browse/SPARK-26778
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27108) Add parsed CreateTable plans to Catalyst

2019-03-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27108:
-

 Summary: Add parsed CreateTable plans to Catalyst
 Key: SPARK-27108
 URL: https://issues.apache.org/jira/browse/SPARK-27108
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Ryan Blue


The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} 
commands. Creates are handled only by {{SparkSqlParser}} because the logical 
plans are defined in the v1 datasource package 
(org.apache.spark.sql.execution.datasources).

The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like 
converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult (and 
error-prone) to produce v2 plans, because it requires converting the AST to v1 
and then converting v1 to v2.

Instead, the Catalyst parser should create plans that represent exactly what 
was parsed, after validation such as ensuring there are no duplicate clauses. 
Those plans should then be converted to v1 or v2 plans in the analyzer. This 
structure avoids errors caused by multiple layers of translation and keeps v1 
and v2 plans separate, ensuring that v1 behavior does not change.
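
A rough sketch of the parse-only node this describes (field names are illustrative, not the final Catalyst classes):
{code:lang=scala}
import org.apache.spark.sql.types.StructType

// Illustrative: a statement node that records exactly what was written.
// Analyzer rules would later convert it into a v1 command or a v2 plan.
case class CreateTableStatement(
    tableName: Seq[String],            // multi-part identifier as parsed
    tableSchema: StructType,
    partitioning: Seq[String],         // raw partition transforms/columns
    provider: Option[String],
    properties: Map[String, String],
    ifNotExists: Boolean)
{code}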



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-27067.
---
Resolution: Fixed

I'm resolving this issue because the vote to adopt the proposal passed.

I've added links to the google doc proposal (now view-only) and vote thread, 
and uploaded a copy of the proposal as a PDF.

> SPIP: Catalog API for table metadata
> 
>
> Key: SPARK-27067
> URL: https://issues.apache.org/jira/browse/SPARK-27067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Spark API for Table Metadata.pdf
>
>
> Goal: Define a catalog API to create, alter, load, and drop tables



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27067:
--
Attachment: SPIP_ Spark API for Table Metadata.pdf

> SPIP: Catalog API for table metadata
> 
>
> Key: SPARK-27067
> URL: https://issues.apache.org/jira/browse/SPARK-27067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Spark API for Table Metadata.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27066:
--
Description: 
Goals:
 * Propose semantics for identifiers and a listing API to support multiple 
catalogs
 ** Support any namespace scheme used by an external catalog
 ** Avoid traversing namespaces via multiple listing calls from Spark
 * Outline migration from the current behavior to Spark with multiple catalogs
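
For illustration, the kind of identifiers and listing behavior the proposal covers (catalog and namespace names are placeholders):
{code:lang=scala}
// Illustrative multi-part identifiers under the proposed semantics.
spark.sql("SELECT * FROM prod.warehouse.db.events")  // catalog `prod`, namespace `warehouse.db`
spark.sql("SELECT * FROM db.events")                 // no catalog part: existing session behavior
spark.sql("SHOW TABLES IN prod.warehouse.db")        // one listing call handled by the catalog
{code}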

> SPIP: Identifiers for multi-catalog support
> ---
>
> Key: SPARK-27066
> URL: https://issues.apache.org/jira/browse/SPARK-27066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
>
>
> Goals:
>  * Propose semantics for identifiers and a listing API to support multiple 
> catalogs
>  ** Support any namespace scheme used by an external catalog
>  ** Avoid traversing namespaces via multiple listing calls from Spark
>  * Outline migration from the current behavior to Spark with multiple catalogs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27066:
-

 Summary: SPIP: Identifiers for multi-catalog support
 Key: SPARK-27066
 URL: https://issues.apache.org/jira/browse/SPARK-27066
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27067:
--
Description: Goal: Define a catalog API to create, alter, load, and drop 
tables
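
A rough sketch of the surface area implied by that goal (names are illustrative; the attached SPIP contains the actual proposal):
{code:lang=scala}
import java.util
import org.apache.spark.sql.types.StructType

// Placeholders for types the proposal would define.
trait Identifier
trait Table
trait TableChange

// Sketch of a catalog API to create, alter, load, and drop tables.
trait TableCatalog {
  def loadTable(ident: Identifier): Table
  def createTable(ident: Identifier, schema: StructType, properties: util.Map[String, String]): Table
  def alterTable(ident: Identifier, changes: TableChange*): Table
  def dropTable(ident: Identifier): Boolean
}
{code}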

> SPIP: Catalog API for table metadata
> 
>
> Key: SPARK-27067
> URL: https://issues.apache.org/jira/browse/SPARK-27067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Spark API for Table Metadata.pdf
>
>
> Goal: Define a catalog API to create, alter, load, and drop tables



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27067) SPIP: Catalog API for table metadata

2019-03-05 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27067:
-

 Summary: SPIP: Catalog API for table metadata
 Key: SPARK-27067
 URL: https://issues.apache.org/jira/browse/SPARK-27067
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-27066.
---
Resolution: Fixed

I'm resolving this issue because the vote to adopt the proposal passed.

I've added links to the google doc proposal (now view-only) and vote thread, 
and uploaded a copy of the proposal as a PDF.

> SPIP: Identifiers for multi-catalog support
> ---
>
> Key: SPARK-27066
> URL: https://issues.apache.org/jira/browse/SPARK-27066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2

2019-03-05 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784736#comment-16784736
 ] 

Ryan Blue commented on SPARK-23521:
---

I've turned off commenting on the google doc to preserve its state, with the 
existing comments. I'm also adding a PDF of the final proposal to this issue.

> SPIP: Standardize SQL logical plans with DataSourceV2
> -
>
> Key: SPARK-23521
> URL: https://issues.apache.org/jira/browse/SPARK-23521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2 
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
>  on the dev list. The proposal is to standardize the logical plans used for 
> write operations to make the planner more maintainable and to make Spark's 
> write behavior predictable and reliable. It proposes the following principles:
>  # Use well-defined logical plan nodes for all high-level operations: insert, 
> create, CTAS, overwrite table, etc.
>  # Use planner rules that match on these high-level nodes, so that it isn’t 
> necessary to create rules to match each eventual code path individually.
>  # Clearly define Spark’s behavior for these logical plan nodes. Physical 
> nodes should implement that behavior so that all code paths eventually make 
> the same guarantees.
>  # Specialize implementation when creating a physical plan, not logical 
> plans. This will avoid behavior drift and ensure planner code is shared 
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical 
> operations, most of which are already defined in SQL or implemented by some 
> write path in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-27066:
--
Attachment: SPIP_ Identifiers for multi-catalog Spark.pdf

> SPIP: Identifiers for multi-catalog support
> ---
>
> Key: SPARK-27066
> URL: https://issues.apache.org/jira/browse/SPARK-27066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2

2019-03-05 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23521:
--
Attachment: SPIP_ Standardize logical plans.pdf

> SPIP: Standardize SQL logical plans with DataSourceV2
> -
>
> Key: SPARK-23521
> URL: https://issues.apache.org/jira/browse/SPARK-23521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2 
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
>  on the dev list. The proposal is to standardize the logical plans used for 
> write operations to make the planner more maintainable and to make Spark's 
> write behavior predictable and reliable. It proposes the following principles:
>  # Use well-defined logical plan nodes for all high-level operations: insert, 
> create, CTAS, overwrite table, etc.
>  # Use planner rules that match on these high-level nodes, so that it isn’t 
> necessary to create rules to match each eventual code path individually.
>  # Clearly define Spark’s behavior for these logical plan nodes. Physical 
> nodes should implement that behavior so that all code paths eventually make 
> the same guarantees.
>  # Specialize implementation when creating a physical plan, not logical 
> plans. This will avoid behavior drift and ensure planner code is shared 
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical 
> operations, most of which are already defined in SQL or implemented by some 
> write path in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26874) With PARQUET-1414, Spark can erroneously write empty pages

2019-02-13 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-26874:
--
Summary: With PARQUET-1414, Spark can erroneously write empty pages  (was: 
When we upgrade Parquet to 1.11+, Spark can erroneously write empty pages)

> With PARQUET-1414, Spark can erroneously write empty pages
> --
>
> Key: SPARK-26874
> URL: https://issues.apache.org/jira/browse/SPARK-26874
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> This issue will only come up when Spark upgrades its Parquet dependency to 
> the latest. This issue is being filed to proactively fix the bug before we 
> upgrade - it's not something that would easily be found in the current unit 
> tests and can be missed until the community scale tests in an e.g. RC phase.
> Parquet introduced a new feature to limit the number of rows written to a 
> page in a column chunk - see PARQUET-1414. Previously, Parquet would only 
> flush pages to the column store after the page writer had filled its buffer 
> with a certain amount of bytes. The idea of the Parquet patch was to make 
> page writers flush to the column store upon the writer being given a certain 
> number of rows - the default value is 2.
> The patch makes the Spark Parquet Data Source erroneously write empty pages 
> to column chunks, making the Parquet file ultimately unreadable with 
> exceptions like these:
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> file:/private/var/folders/7f/rrg37sj15r33cwlxbhg7fyg5080xxg/T/spark-2c1b1e5d-c132-42fe-98e1-b523b9baa176/parquet-data/part-2-b8b4bb3c-b49f-440b-b96c-53fcd19786ad-c000.snappy.parquet
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>  ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking 
> stream.
>  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
>  at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80)
>  at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62)
>  at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridValuesReader.readInteger(RunLengthBitPackingHybridValuesReader.java:53)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase$ValuesReaderIntIterator.nextInt(ColumnReaderBase.java:733)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:567)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:705)
>  at 
> org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
>  at 
> org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
>  at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:84)
>  at 
> org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
>  at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
>  at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>  ... 22 more
> {code}
> What's happening here is that the reader is being given a page with no 
> values, which Parquet can never handle.
> The root cause is due to the way Spark treats empty (null) records in 
> optional fields. Concretely, in {{ParquetWriteSupport#write}}, we always 
> indicate to the recordConsumer that we are starting a message 
> ({{recordConsumer#startMessage}}). If Spark is given a null field in the row, 
> it still indicates to the record consumer after having ignored the row that 
> the message is finished ({{recordConsumer#endMessage}}). The ending of the 
> message causes all column 

[jira] [Commented] (SPARK-26874) When we upgrade Parquet to 1.11+, Spark can erroneously write empty pages

2019-02-13 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767798#comment-16767798
 ] 

Ryan Blue commented on SPARK-26874:
---

To be clear, Parquet has not released any 1.11.x versions, so this is a problem 
with Parquet master, not with a Parquet release.

> When we upgrade Parquet to 1.11+, Spark can erroneously write empty pages
> -
>
> Key: SPARK-26874
> URL: https://issues.apache.org/jira/browse/SPARK-26874
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> This issue will only come up when Spark upgrades its Parquet dependency to 
> the latest. This issue is being filed to proactively fix the bug before we 
> upgrade - it's not something that would easily be found in the current unit 
> tests and can be missed until the community scale tests in an e.g. RC phase.
> Parquet introduced a new feature to limit the number of rows written to a 
> page in a column chunk - see PARQUET-1414. Previously, Parquet would only 
> flush pages to the column store after the page writer had filled its buffer 
> with a certain amount of bytes. The idea of the Parquet patch was to make 
> page writers flush to the column store upon the writer being given a certain 
> number of rows - the default value is 2.
> The patch makes the Spark Parquet Data Source erroneously write empty pages 
> to column chunks, making the Parquet file ultimately unreadable with 
> exceptions like these:
>  
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> file:/private/var/folders/7f/rrg37sj15r33cwlxbhg7fyg5080xxg/T/spark-2c1b1e5d-c132-42fe-98e1-b523b9baa176/parquet-data/part-2-b8b4bb3c-b49f-440b-b96c-53fcd19786ad-c000.snappy.parquet
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>  ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking 
> stream.
>  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
>  at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80)
>  at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62)
>  at 
> org.apache.parquet.column.values.rle.RunLengthBitPackingHybridValuesReader.readInteger(RunLengthBitPackingHybridValuesReader.java:53)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase$ValuesReaderIntIterator.nextInt(ColumnReaderBase.java:733)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:567)
>  at 
> org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:705)
>  at 
> org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
>  at 
> org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
>  at 
> org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:84)
>  at 
> org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
>  at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
>  at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>  ... 22 more
> {code}
> What's happening here is that the reader is being given a page with no 
> values, which Parquet can never handle.
> The root cause is due to the way Spark treats empty (null) records in 
> optional fields. Concretely, in {{ParquetWriteSupport#write}}, we always 
> indicate to the recordConsumer that we are starting a message 
> ({{recordConsumer#startMessage}}). If Spark is given a null field in the row, 
> it still indicates to the record consumer after having ignored the row that 
> the message is finished ({{recordConsumer#endMessage}}). The ending of the 

[jira] [Created] (SPARK-26873) FileFormatWriter creates inconsistent MR job IDs

2019-02-13 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-26873:
-

 Summary: FileFormatWriter creates inconsistent MR job IDs
 Key: SPARK-26873
 URL: https://issues.apache.org/jira/browse/SPARK-26873
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.2.3
Reporter: Ryan Blue


FileFormatWriter uses the current time to create the Job ID that is passed to 
Hadoop committers. That Job ID is then used to produce the task and task 
attempt IDs used in commits.

The problem is that Spark [generates this Job 
ID|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L209]
 in {{executeTask}} for every task:
{code:lang=scala}
  /** Writes data out in a single Spark task. */
  private def executeTask(
  description: WriteJobDescription,
  sparkStageId: Int,
  sparkPartitionId: Int,
  sparkAttemptNumber: Int,
  committer: FileCommitProtocol,
  iterator: Iterator[InternalRow]): WriteTaskResult = {

val jobId = SparkHadoopWriterUtils.createJobID(new Date, sparkStageId)
val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
val taskAttemptId = new TaskAttemptID(taskId, sparkAttemptNumber)

...
{code}

Because this is called in each task, the Job ID used is not consistent across 
tasks, which violates the contract expected by Hadoop committers.

If a committer relies on identical task IDs across attempts, this breaks 
correctness. For example, a Hadoop committer should be able to rename an 
output file to a path based on the task ID to ensure that only one copy is 
committed.

We hit this issue when preemption caused a task to die just after the commit 
operation. The commit coordinator authorized a second task commit because the 
first did not complete due to preemption.
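
A hedged sketch of the direction a fix could take (not the actual patch; identifiers are those from the snippet above): compute the job-level timestamp once on the driver and derive every task's IDs from that shared value.
{code:lang=scala}
// Sketch only: the driver picks one instant per write job and ships it to all
// tasks, so createJobID yields the same Job ID in every task and Hadoop
// committers see consistent task and task attempt IDs.
val jobIdInstant = new Date().getTime   // computed once, on the driver

// inside executeTask, instead of `new Date` per task:
val jobId = SparkHadoopWriterUtils.createJobID(new Date(jobIdInstant), sparkStageId)
val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
val taskAttemptId = new TaskAttemptID(taskId, sparkAttemptNumber)
{code}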



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.

2019-02-01 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-26811:
-

 Summary: Add DataSourceV2 capabilities to check support for batch 
append, overwrite, truncate during analysis.
 Key: SPARK-26811
 URL: https://issues.apache.org/jira/browse/SPARK-26811
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-30 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756654#comment-16756654
 ] 

Ryan Blue commented on SPARK-26677:
---

Thanks, sorry about the mistake.

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-30 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-26677:
--
Fix Version/s: 2.4.1

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.1
>
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-25 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752606#comment-16752606
 ] 

Ryan Blue commented on SPARK-26677:
---

To clarify [~dongjoon]'s comment: All recent versions of Parquet are affected 
by this {{not(eqNullSafe(...))}} bug. Only Parquet 1.10.0 is affected by 
PARQUET-1309.

This filter bug has been present since Parquet introduced dictionary filtering.
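
One way to confirm that the dictionary filter is the code path at fault (a sketch,
assuming Spark's Parquet reader honors the standard parquet-mr property) is to
disable dictionary-based row-group filtering for the session and re-run the query:

{code:scala}
import org.apache.spark.sql.functions.{col, not}

// Property name comes from parquet-mr (ParquetInputFormat.DICTIONARY_FILTERING_ENABLED).
// If the result becomes correct with this disabled, dictionary filtering is the culprit.
spark.sparkContext.hadoopConfiguration.set("parquet.filter.dictionary.enabled", "false")
spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show()
{code}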

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (SPARK-26682) Task attempt ID collision causes lost data

2019-01-21 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-26682:
-

 Summary: Task attempt ID collision causes lost data
 Key: SPARK-26682
 URL: https://issues.apache.org/jira/browse/SPARK-26682
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.1.3
Reporter: Ryan Blue


We recently tracked missing data to a collision in the fake Hadoop task attempt 
ID created when using Hadoop OutputCommitters. This is similar to SPARK-24589.

A stage had one task fail to fetch a shuffle block, causing a FetchFailedException, 
so Spark resubmitted the stage. Because only one task was affected, the original 
stage attempt continued running tasks that had also been resubmitted in the new 
attempt. As a result, one task had two attempts running concurrently on the same 
executor, and both had the same attempt number because they came from different 
stage attempts. With the same attempt number, both attempts used the same temp 
locations. One attempt failed because a file path already existed, and while 
cleaning up it removed the shared temp location, deleting the other attempt's 
data. When the surviving attempt succeeded, it committed partial data.

The root cause is that both attempts had the same partition and attempt numbers 
despite running in different stage attempts, and those numbers were used to create 
the Hadoop task attempt ID on which the temp location is based. The fix is to use 
Spark's global task attempt ID, which is a counter, instead of the attempt number, 
because the attempt number is reused across stage attempts. A sketch of that 
change follows.
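
For illustration, a minimal sketch of that fix, building the fake Hadoop task
attempt ID from Spark's globally unique task attempt id (the jobTrackerID string
and method name are placeholders):

{code:scala}
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
import org.apache.spark.TaskContext

// Derive the Hadoop attempt number from TaskContext.taskAttemptId(), a global counter,
// rather than TaskContext.attemptNumber(), which restarts with every stage attempt and
// can therefore collide across concurrent stage attempts.
def fakeHadoopTaskAttemptId(jobTrackerID: String, jobId: Int, ctx: TaskContext): TaskAttemptID = {
  val taskId = new TaskID(new JobID(jobTrackerID, jobId), TaskType.MAP, ctx.partitionId())
  new TaskAttemptID(taskId, ctx.taskAttemptId().toInt)  // truncation to Int is fine for a sketch
}
{code}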



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (SPARK-26681) Support Ammonite scopes in OuterScopes

2019-01-21 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-26681:
-

 Summary: Support Ammonite scopes in OuterScopes
 Key: SPARK-26681
 URL: https://issues.apache.org/jira/browse/SPARK-26681
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Ryan Blue


When the Ammonite REPL is used with Spark, users have to call 
{{OuterScopes.addOuterScope}} 
[manually|https://github.com/alexarchambault/ammonite-spark#troubleshooting] in 
order to get a working Dataset. A similar problem is already solved for the 
Spark REPL, which recognizes class names and returns the correct scope. Spark 
should support Ammonite scopes also.
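
For reference, the manual workaround looks roughly like this in an Ammonite
session (a sketch; it assumes a SparkSession named {{spark}} is already available):

{code:scala}
import org.apache.spark.sql.catalyst.encoders.OuterScopes

// Register the Ammonite command wrapper so Dataset encoders can find the outer
// instance of case classes defined in the REPL.
OuterScopes.addOuterScope(this)

case class Person(name: String, age: Int)
import spark.implicits._
val people = spark.createDataset(Seq(Person("a", 1)))
{code}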



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-21 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-26679:
-

 Summary: Deconflict spark.executor.pyspark.memory and 
spark.python.worker.memory
 Key: SPARK-26679
 URL: https://issues.apache.org/jira/browse/SPARK-26679
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Ryan Blue


In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory 
space of a Python worker. There is another RDD setting, spark.python.worker.memory, 
that controls when Spark decides to spill data to disk. These two settings are 
currently similar, but not related to one another.

PySpark should probably use spark.executor.pyspark.memory to limit or default 
the setting of spark.python.worker.memory, because the latter property controls 
spilling and should be lower than the total memory limit. Renaming 
spark.python.worker.memory would also improve clarity, because its name sounds 
like it should control the limit, but it actually behaves more like the JVM 
setting spark.memory.fraction.
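
Until the two settings are deconflicted, the relationship looks roughly like this 
(a configuration sketch; the values are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Keep the RDD spill threshold well below the hard per-executor Python memory cap.
val spark = SparkSession.builder()
  .config("spark.executor.pyspark.memory", "2g")  // hard limit on Python worker memory (2.4.0+)
  .config("spark.python.worker.memory", "512m")   // point at which RDD aggregation spills to disk
  .getOrCreate()
{code}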



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (SPARK-23398) DataSourceV2 should provide a way to get a source's schema.

2019-01-18 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-23398.
---
Resolution: Fixed

SPARK-25528 adds a Table interface that can report its schema.
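
Roughly, the shape of that resolution (a sketch, not the exact interface): a table
abstraction that reports its schema directly, so the planner never has to
instantiate a reader just to validate a write.

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative only.
trait Table {
  def name(): String
  def schema(): StructType
}
{code}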

> DataSourceV2 should provide a way to get a source's schema.
> ---
>
> Key: SPARK-23398
> URL: https://issues.apache.org/jira/browse/SPARK-23398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> To validate writes with DataSourceV2, the planner needs to get a source's 
> schema. The current API has no direct way to get that schema. SPARK-23321 
> instantiates a reader to get the schema, but sources are not required to 
> implement {{ReadSupport}} or {{ReadSupportWithSchema}}. V2 should either add 
> a method to get the schema of a source, or require sources implement 
> {{ReadSupport}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Created] (SPARK-26666) DataSourceV2: Add overwrite and dynamic overwrite.

2019-01-18 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-2:
-

 Summary: DataSourceV2: Add overwrite and dynamic overwrite.
 Key: SPARK-2
 URL: https://issues.apache.org/jira/browse/SPARK-2
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Ryan Blue


DataSourceV2 will support overwrite operations using the following two 
interfaces (from the write side design doc):

{code:lang=java}
public interface SupportsOverwrite extends WriteBuilder {
  WriteBuilder overwrite(Filter[] filters);
}

public interface SupportsDynamicOverwrite extends WriteBuilder {
  WriteBuilder overwriteDynamicPartitions();
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (SPARK-23321) DataSourceV2 should apply some validation when writing.

2019-01-18 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-23321.
---
Resolution: Fixed

Done for Append plans. Will be included in new logical plans as they are added.

> DataSourceV2 should apply some validation when writing.
> ---
>
> Key: SPARK-23321
> URL: https://issues.apache.org/jira/browse/SPARK-23321
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 writes are not validated. These writes should be validated using 
> the standard preprocess rules that are used for Hive and DataSource tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-07 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678728#comment-16678728
 ] 

Ryan Blue commented on SPARK-25966:
---

[~andrioni], were there any failed tasks or executors in the job that wrote 
this file? It looks to me like a problem in closing the file or with an 
executor dying before finishing a file. If that happened and the data wasn't 
cleaned up, then it could lead to this problem.

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
>   at 
> 

[jira] [Commented] (SPARK-25531) new write APIs for data source v2

2018-09-26 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629418#comment-16629418
 ] 

Ryan Blue commented on SPARK-25531:
---

[~cloud_fan], what was the intent for this umbrella issue? You described it as 
tracking progress on "Standardize SQL logical plans", but the current description 
is "new write APIs" instead. Also, these issues were already tracked under 
SPARK-22386, the umbrella for improving DSv2, which covers the new logical plans 
and other supporting issues like adding interfaces for required clustering and 
sorting (SPARK-23889).

Is your intent to close the other issue because it is too old?

> new write APIs for data source v2
> -
>
> Key: SPARK-25531
> URL: https://issues.apache.org/jira/browse/SPARK-25531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current data source write API heavily depends on {{SaveMode}}, which 
> doesn't have clear semantics, especially when writing to tables.
> We should design a new set of write APIs without {{SaveMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2

2018-09-25 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-23521.
---
Resolution: Fixed

Marking this as "Fixed" because the vote passed.

> SPIP: Standardize SQL logical plans with DataSourceV2
> -
>
> Key: SPARK-23521
> URL: https://issues.apache.org/jira/browse/SPARK-23521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: SPIP
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2 
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
>  on the dev list. The proposal is to standardize the logical plans used for 
> write operations to make the planner more maintainable and to make Spark's 
> write behavior predictable and reliable. It proposes the following principles:
>  # Use well-defined logical plan nodes for all high-level operations: insert, 
> create, CTAS, overwrite table, etc.
>  # Use planner rules that match on these high-level nodes, so that it isn’t 
> necessary to create rules to match each eventual code path individually.
>  # Clearly define Spark’s behavior for these logical plan nodes. Physical 
> nodes should implement that behavior so that all code paths eventually make 
> the same guarantees.
>  # Specialize implementation when creating a physical plan, not logical 
> plans. This will avoid behavior drift and ensure planner code is shared 
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical 
> operations, most of which are already defined in SQL or implemented by some 
> write path in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (SPARK-15420) Repartition and sort before Parquet writes

2018-09-19 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-15420.
---
Resolution: Won't Fix

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>Priority: Major
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.
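
For comparison, the manual pattern this issue asks the planner to handle
automatically looks roughly like this (a sketch; the column names and output path
are placeholders, and it assumes {{spark.implicits._}} is imported and {{df}} is
the DataFrame being written):

{code:scala}
// Cluster rows by the output partition columns and sort within each task so that only
// one Parquet file is open, and buffering rows, per task at a time.
df.repartition($"day")
  .sortWithinPartitions($"day", $"hour")
  .write
  .partitionBy("day")
  .parquet("/tmp/out")  // placeholder output path
{code}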



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes

2018-09-19 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-15420:
--
Target Version/s:   (was: 2.4.0)

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>Priority: Major
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590531#comment-16590531
 ] 

Ryan Blue commented on SPARK-25213:
---

Sorry, I just realized the point is that the filter could have a python UDF in 
it. In that case, we need to add the project before the filter runs. I'll take 
a look at it.

> DataSourceV2 doesn't seem to produce unsafe rows 
> -
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> Reproduce (Need to compile test-classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:java}
> datasource_v2_df = spark.read \
> .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") 
> \
> .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
> 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> Seems like Data Source V2 doesn't produce unsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-23 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590406#comment-16590406
 ] 

Ryan Blue commented on SPARK-25213:
---

[~cloud_fan], that PR adds a Project node on top of the v2 scan so that rows are 
converted to unsafe. We should be able to look at the physical plan to see whether 
that Project is there. If it is missing, we should find out why; if it is there, 
we should find out why it isn't producing unsafe rows.
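
A quick way to check from a Scala shell (a sketch, using the same test source as
the repro):

{code:scala}
val df = spark.read
  .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2")
  .load()

// Look for a Project directly above the DataSourceV2 scan; if it is missing, the scan's
// rows reach downstream operators without being converted to UnsafeRow.
df.explain()
println(df.queryExecution.executedPlan)
{code}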

> DataSourceV2 doesn't seem to produce unsafe rows 
> -
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> Reproduce (Need to compile test-classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:java}
> datasource_v2_df = spark.read \
> .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") 
> \
> .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
> 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> Seems like Data Source V2 doesn't produce unsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589088#comment-16589088
 ] 

Ryan Blue commented on SPARK-25188:
---

One update to that proposal: {{BatchOverwriteSupport}} should be split into two 
interfaces: one for dynamic overwrite and one for overwrite using filter 
expressions. That supports the two use cases separately, since some sources 
won't support dynamic partition overwrite.

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-25188) Add WriteConfig

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589086#comment-16589086
 ] 

Ryan Blue commented on SPARK-25188:
---

Here's the original proposal for adding a write config:

The read side has a {{ScanConfig}}, but the write side doesn't have an 
equivalent object that tracks a particular write. I think if we introduce one, 
the API would be more similar between the read and write sides, and we would 
have a better API for overwrite operations. I propose adding a {{WriteConfig}} 
object and passing it like this:

{code:lang=java}
interface BatchWriteSupport {
  WriteConfig newWriteConfig(Map<String, String> writeOptions);

  DataWriterFactory createWriterFactory(WriteConfig config);

  void commit(WriteConfig config, WriterCommitMessage[] messages);
}
{code}

That allows us to pass options for the write that affect how the WriterFactory 
operates. For example, in Iceberg I could request using Orc as the underlying 
format instead of Parquet. (I also suggested an addition like this for the read 
side.)

The other benefit of adding {{WriteConfig}} is that it provides a clean way of 
adding the ReplaceData operations. The ones I'm currently working on are 
ReplaceDynamicPartitions and ReplaceData. The first one removes any data in 
partitions that are being written to, and the second one replaces data based on 
a filter: e.g. {{df.writeTo(t).overwrite($"day" == "2018-08-15")}}. The 
information about replacement could be carried by {{WriteConfig}} to {{commit}} 
and would be created with a support interface:

{code:lang=java}
interface BatchOverwriteSupport extends BatchWriteSupport {
  WriteConfig newOverwrite(Map<String, String> writeOptions, Filter[] filters);

  WriteConfig newDynamicOverwrite(Map<String, String> writeOptions);
}
{code}

This is much cleaner than staging a delete and then running a write to complete 
the operation. All of the information about what to overwrite is just passed to 
the commit operation that can handle it at once. This is much better for 
dynamic partition replacement because the partitions to be replaced aren't even 
known by Spark before the write.

Last, this adds a place for write life-cycle operations that matches the 
ScanConfig read life-cycle. This could be used to perform operations like 
getting a write lock on a Hive table if someone wanted to support Hive's 
locking mechanism in the future.

> Add WriteConfig
> ---
>
> Key: SPARK-25188
> URL: https://issues.apache.org/jira/browse/SPARK-25188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current write API is not flexible enough to implement more complex write 
> operations like `replaceWhere`. We can follow the read API and add a 
> `WriteConfig` to make it more flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SPARK-25190) Better operator pushdown API

2018-08-22 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589070#comment-16589070
 ] 

Ryan Blue commented on SPARK-25190:
---

The main problem I have with the current pushdown API is that Spark gets 
information back from pushdown before all pushdown is finished. I like the idea 
to have an immutable ScanConfig that is the result of pushdown operations, so 
it is clear that pushdown calls are finished before getting a reader factory or 
asking for statistics.

Unlike those methods that accept a ScanConfig, using {{SupportsPushDownXYZ}} 
for pushdown means that Spark is getting the results of pushdown operations 
back before all pushdown is complete. That means that the same confusion over 
pushdown order still exists, although the problem is fixed for some operations. 
I think that all feedback from pushdown should come from the ScanConfig.
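
In other words, something shaped like this (illustrative names only), where Spark
reads nothing back until {{build()}} has been called:

{code:scala}
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// All pushdown happens on the builder; the immutable ScanConfig is the single source
// of truth for what was actually pushed.
trait ScanConfigBuilder {
  def pushFilters(filters: Array[Filter]): Unit
  def pruneColumns(requiredSchema: StructType): Unit
  def build(): ScanConfig
}

trait ScanConfig {
  def pushedFilters: Array[Filter]    // filters the source will apply
  def postScanFilters: Array[Filter]  // filters Spark must still evaluate
  def readSchema: StructType          // columns the source will actually produce
}
{code}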

> Better operator pushdown API
> 
>
> Key: SPARK-25190
> URL: https://issues.apache.org/jira/browse/SPARK-25190
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current operator pushdown API is a little hard to use. It defines several 
> {{SupportsPushdownXYZ}} interfaces and asks the implementation to be mutable 
> to store the pushdown results. We should design a builder-like API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)



