[jira] [Commented] (SPARK-37621) ClassCastException when trying to persist the result of a join between two Iceberg tables
[ https://issues.apache.org/jira/browse/SPARK-37621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458516#comment-17458516 ] Ryan Blue commented on SPARK-37621: --- [~hyukjin.kwon], this affects any source that doesn't always produce `UnsafeRow`. The problem is that certain parts of Spark assume that `UnsafeRow` will be passed even though the required interface is `InternalRow`. Rather than fixing that assumption, the community chose to ensure that there is always a projection added so that the conversion to unsafe happens. But if that projection is removed by other rules or is not added, then operators that assume `UnsafeRow` can fail. The long-term fix is the same as always: eventually, Spark should use the declared type. A simpler fix is to find out why the projection is missing and update that. But then we'll see this problem come back later. > ClassCastException when trying to persist the result of a join between two > Iceberg tables > - > > Key: SPARK-37621 > URL: https://issues.apache.org/jira/browse/SPARK-37621 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.2 >Reporter: Ciprian Gerea >Priority: Major > > I am getting an error when I try to persist the results of a join operation. > Note that both tables to be joined and the output table are Iceberg tables. > SQL code to repro. 
> String sqlJoin = String.format( > "SELECT * from " + > "((select %s from %s.%s where %s ) l " + > "join (select %s from %s.%s where %s ) r " + > "using (%s))", > ); > spark.sql(sqlJoin).writeTo("ciptest.ttt").option("write-format", > "parquet").createOrReplace(); > My exception stack is: > {{Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow}} > {{at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.writeValue(UnsafeRowSerializer.scala:64)}} > {{at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)}} > {{at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)}} > {{at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)}} > {{at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)}} > {{at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)}} > {{at org.apache.spark.scheduler.Task.run(Task.scala:131)}} > {{at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)}} > {{at ….}} > Explain on the Sql statement gets the following plan: > {{== Physical Plan ==}} > {{Project [ ... ]}} > {{+- SortMergeJoin […], Inner}} > {{ :- Sort […], false, 0}} > {{ : +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#38]}} > {{ : +- Filter (…)}} > {{ :+- BatchScan[... ] left [filters=…]}} > {{ +- *(2) Sort […], false, 0}} > {{ +- Exchange hashpartitioning(…, 10), ENSURE_REQUIREMENTS, [id=#47]}} > {{ +- *(1) Filter (…)}} > {{ +- BatchScan[…] right [filters=…] }} > {{Note that several variations of this fail. 
Besides the repro code listed > above I have tried doing CTAS and trying to write the result into parquet > files without making a table out of it.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
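The failure mode described in the SPARK-37621 comment can be illustrated with a toy model. This is only a sketch: `Row`, `GenericRow`, `BinaryRow`, `serialize`, and `toBinary` are hypothetical stand-ins for `InternalRow`, `GenericInternalRow`, `UnsafeRow`, the shuffle serializer, and the unsafe projection. None of them are Spark classes or Spark's actual implementation.

```scala
// Toy model only: these types mirror the shape of the problem, not Spark's code.
object UnsafeRowAssumptionSketch {
  sealed trait Row { def values: Seq[Any] }                  // stands in for InternalRow
  final case class GenericRow(values: Seq[Any]) extends Row  // stands in for GenericInternalRow
  final case class BinaryRow(values: Seq[Any]) extends Row   // stands in for UnsafeRow

  // Declares the general type but assumes the specific one,
  // like the serializer at the top of the reported stack trace.
  def serialize(row: Row): Int = row.asInstanceOf[BinaryRow].values.size

  // Stands in for the projection that converts any row to the binary format.
  def toBinary(row: Row): BinaryRow = row match {
    case b: BinaryRow => b
    case other        => BinaryRow(other.values)
  }

  def main(args: Array[String]): Unit = {
    val fromSource: Row = GenericRow(Seq(1, "a"))
    // Without the projection, the cast fails with a ClassCastException.
    println(scala.util.Try(serialize(fromSource)).isFailure)
    // With the projection in place, the same operator succeeds.
    println(serialize(toBinary(fromSource)))
  }
}
```

In this model the "long-term fix" from the comment would correspond to rewriting `serialize` so that it only relies on the `Row` interface, while the "simpler fix" corresponds to making sure `toBinary` is always applied first.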
[jira] [Resolved] (SPARK-33779) DataSource V2: API to request distribution and ordering on write
[ https://issues.apache.org/jira/browse/SPARK-33779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-33779. --- Fix Version/s: 3.2.0 Resolution: Fixed Merged PR #30706. Thanks [~aokolnychyi]! > DataSource V2: API to request distribution and ordering on write > > > Key: SPARK-33779 > URL: https://issues.apache.org/jira/browse/SPARK-33779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Major > Fix For: 3.2.0 > > > We need to have proper APIs for requesting a specific distribution and > ordering on writes for data sources that implement the V2 interface.
[jira] [Created] (SPARK-32168) DSv2 SQL overwrite incorrectly uses static plan with hidden partitions
Ryan Blue created SPARK-32168: - Summary: DSv2 SQL overwrite incorrectly uses static plan with hidden partitions Key: SPARK-32168 URL: https://issues.apache.org/jira/browse/SPARK-32168 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue The v2 analyzer rule {{ResolveInsertInto}} tries to detect when a static overwrite and a dynamic overwrite would produce the same result and will choose to use static overwrite in that case. It will only use a dynamic overwrite if there is a partition column without a static value and the SQL mode is set to dynamic. {code:lang=scala} val dynamicPartitionOverwrite = partCols.size > staticPartitions.size && conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC {code} The problem is that {{partCols}} contains only the names of partitions that are in the column list (identity partitions) and does not include hidden partitions, like {{days(ts)}}. As a result, the rule doesn't detect hidden partitions and doesn't use dynamic overwrite for them. Static overwrite is used instead; when a table has only hidden partitions, the static filter drops all table data. This is a correctness bug because Spark will overwrite more data than just the set of partitions being written to in dynamic mode. The impact is limited because this rule is only used for SQL queries (not plans from DataFrameWriters) and only affects tables with hidden partitions.
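The check described in SPARK-32168 can be modeled in a few lines to show why a table with only hidden partitions ends up on the static path. This is a minimal sketch under stated assumptions: `PartitionField`, `dynamicAsReported`, and `dynamicCountingHidden` are hypothetical names, and the second function is just one possible alternative, not the fix that was merged into Spark.

```scala
// Hypothetical model of the overwrite decision; not Spark's ResolveInsertInto rule.
object OverwriteModeSketch {
  // Identity partitions map directly to a table column;
  // hidden partitions are derived from one, e.g. days(ts).
  final case class PartitionField(name: String, isIdentity: Boolean)

  // Mirrors the check quoted in the issue: only identity partitions are counted.
  def dynamicAsReported(
      fields: Seq[PartitionField],
      staticValues: Set[String],
      dynamicMode: Boolean): Boolean = {
    val partCols = fields.filter(_.isIdentity).map(_.name)
    partCols.size > staticValues.size && dynamicMode
  }

  // Assumed alternative: any partition field without a static value,
  // hidden or not, keeps the overwrite dynamic.
  def dynamicCountingHidden(
      fields: Seq[PartitionField],
      staticValues: Set[String],
      dynamicMode: Boolean): Boolean =
    dynamicMode && fields.exists(f => !staticValues.contains(f.name))

  def main(args: Array[String]): Unit = {
    val onlyHidden = Seq(PartitionField("days(ts)", isIdentity = false))
    println(dynamicAsReported(onlyHidden, Set.empty, dynamicMode = true))
    println(dynamicCountingHidden(onlyHidden, Set.empty, dynamicMode = true))
  }
}
```

With a single hidden partition and no static values, the reported check compares 0 > 0 and picks static overwrite, which is the case the issue describes as dropping all table data.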
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140825#comment-17140825 ] Ryan Blue commented on SPARK-32037: --- What about "healthy" and "unhealthy"? That's basically what we are trying to keep track of -- whether a node is healthy enough to run tasks, or if it should not be used for some period of time. I think "trusted" and "untrusted" may also work, but "healthy" is a bit closer to what we want. > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether these > terms have racist origins, the cultural connotations are inescapable in > today's world. > I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation.
[jira] [Updated] (SPARK-31255) DataSourceV2: Add metadata columns
[ https://issues.apache.org/jira/browse/SPARK-31255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-31255: -- Issue Type: New Feature (was: Bug) > DataSourceV2: Add metadata columns > -- > > Key: SPARK-31255 > URL: https://issues.apache.org/jira/browse/SPARK-31255 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > DSv2 should support reading additional metadata columns that are not in a > table's schema. This allows users to project metadata like Kafka's offset, > timestamp, and partition. It also allows other sources to expose metadata > like file and row position.
[jira] [Created] (SPARK-31255) DataSourceV2: Add metadata columns
Ryan Blue created SPARK-31255: - Summary: DataSourceV2: Add metadata columns Key: SPARK-31255 URL: https://issues.apache.org/jira/browse/SPARK-31255 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue DSv2 should support reading additional metadata columns that are not in a table's schema. This allows users to project metadata like Kafka's offset, timestamp, and partition. It also allows other sources to expose metadata like file and row position.
[jira] [Commented] (SPARK-29558) ResolveTables and ResolveRelations should be order-insensitive
[ https://issues.apache.org/jira/browse/SPARK-29558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979455#comment-16979455 ] Ryan Blue commented on SPARK-29558: --- Thanks for fixing this, [~cloud_fan]! > ResolveTables and ResolveRelations should be order-insensitive > -- > > Key: SPARK-29558 > URL: https://issues.apache.org/jira/browse/SPARK-29558 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29558) ResolveTables and ResolveRelations should be order-insensitive
[ https://issues.apache.org/jira/browse/SPARK-29558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-29558. --- Fix Version/s: 3.0.0 Resolution: Fixed > ResolveTables and ResolveRelations should be order-insensitive > -- > > Key: SPARK-29558 > URL: https://issues.apache.org/jira/browse/SPARK-29558 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-29966) Add version method in TableCatalog to avoid load table twice
[ https://issues.apache.org/jira/browse/SPARK-29966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979419#comment-16979419 ] Ryan Blue commented on SPARK-29966: --- As I said on the PR, I'm -1 on changing a public extension API to avoid a temporary performance regression that should be handled in the implementation. > Add version method in TableCatalog to avoid load table twice > > > Key: SPARK-29966 > URL: https://issues.apache.org/jira/browse/SPARK-29966 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > Currently, resolving a logical plan loads the table twice, once in ResolveTables and > once in ResolveRelations. ResolveRelations is the old code path and ResolveTables is > the v2 code path. The table is loaded twice because ResolveTables loads the table > and then falls back to the ResolveRelations code path for v1 tables. > The same situation also exists in ResolveSessionCatalog. > As a result, executing a command takes twice as long as in Spark 2.4. > The idea is to add a table version method to TableCatalog so that rules > can get the table version first without loading the table.
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974456#comment-16974456 ] Ryan Blue commented on SPARK-29900: --- To be clear, we think this is going to be a breaking change, right? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. 
> In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 row) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher priority during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code}
[jira] [Resolved] (SPARK-29789) should not parse the bucket column name again when creating v2 tables
[ https://issues.apache.org/jira/browse/SPARK-29789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-29789. --- Fix Version/s: 3.0.0 Resolution: Fixed > should not parse the bucket column name again when creating v2 tables > - > > Key: SPARK-29789 > URL: https://issues.apache.org/jira/browse/SPARK-29789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown
[ https://issues.apache.org/jira/browse/SPARK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-29277. --- Fix Version/s: 3.0.0 Resolution: Fixed Fixed by #25955. > DataSourceV2: Add early filter and projection pushdown > -- > > Key: SPARK-29277 > URL: https://issues.apache.org/jira/browse/SPARK-29277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Spark uses optimizer rules that need stats before conversion to physical > plan. DataSourceV2 should support early pushdown for those rules.
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962527#comment-16962527 ] Ryan Blue commented on SPARK-29592: --- There is not currently a way to alter the partition spec for a table, so I don't think we need to worry about this for now. > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands
[jira] [Created] (SPARK-29277) DataSourceV2: Add early filter and projection pushdown
Ryan Blue created SPARK-29277: - Summary: DataSourceV2: Add early filter and projection pushdown Key: SPARK-29277 URL: https://issues.apache.org/jira/browse/SPARK-29277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue Spark uses optimizer rules that need stats before conversion to physical plan. DataSourceV2 should support early pushdown for those rules.
[jira] [Updated] (SPARK-29249) DataFrameWriterV2 should not allow setting table properties for existing tables
[ https://issues.apache.org/jira/browse/SPARK-29249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-29249: -- Description: tableProperty should return CreateTableWriter, not DataFrameWriterV2. > DataFrameWriterV2 should not allow setting table properties for existing > tables > --- > > Key: SPARK-29249 > URL: https://issues.apache.org/jira/browse/SPARK-29249 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > tableProperty should return CreateTableWriter, not DataFrameWriterV2.
[jira] [Created] (SPARK-29249) DataFrameWriterV2 should not allow setting table properties for existing tables
Ryan Blue created SPARK-29249: - Summary: DataFrameWriterV2 should not allow setting table properties for existing tables Key: SPARK-29249 URL: https://issues.apache.org/jira/browse/SPARK-29249 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Created] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API
Ryan Blue created SPARK-29157: - Summary: DataSourceV2: Add DataFrameWriterV2 to Python API Key: SPARK-29157 URL: https://issues.apache.org/jira/browse/SPARK-29157 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Ryan Blue After SPARK-28612 is committed, we need to add the corresponding PySpark API.
[jira] [Commented] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses
[ https://issues.apache.org/jira/browse/SPARK-29014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926889#comment-16926889 ] Ryan Blue commented on SPARK-29014: --- [~cloud_fan], why does this require a major refactor? It would be best to keep the implementation of this as small as possible and not tie it to other work. > DataSourceV2: Clean up current, default, and session catalog uses > - > > Key: SPARK-29014 > URL: https://issues.apache.org/jira/browse/SPARK-29014 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Blocker > > Catalog tracking in DSv2 has evolved since the initial changes went in. We > need to make sure that handling is consistent across plans using the latest > rules: > * The _current_ catalog should be used when no catalog is specified > * The _default_ catalog is the catalog _current_ is initialized to > * If the _default_ catalog is not set, then it is the built-in Spark session > catalog, which will be called `spark_catalog` (This is the v2 session catalog)
[jira] [Commented] (SPARK-28970) implement USE CATALOG/NAMESPACE for Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-28970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924681#comment-16924681 ] Ryan Blue commented on SPARK-28970: --- I think we should, yes. > implement USE CATALOG/NAMESPACE for Data Source V2 > -- > > Key: SPARK-28970 > URL: https://issues.apache.org/jira/browse/SPARK-28970 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Currently Spark has a `USE abc` command to switch the current database. > We should have something similar for Data Source V2, to switch the current > catalog and/or current namespace. > We can introduce 2 new commands: `USE CATALOG abc` and `USE NAMESPACE abc`
[jira] [Created] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses
Ryan Blue created SPARK-29014: - Summary: DataSourceV2: Clean up current, default, and session catalog uses Key: SPARK-29014 URL: https://issues.apache.org/jira/browse/SPARK-29014 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue Catalog tracking in DSv2 has evolved since the initial changes went in. We need to make sure that handling is consistent across plans using the latest rules: * The _current_ catalog should be used when no catalog is specified * The _default_ catalog is the catalog _current_ is initialized to * If the _default_ catalog is not set, then it is the built-in Spark session catalog, which will be called `spark_catalog` (This is the v2 session catalog)
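The three catalog rules listed in SPARK-29014 can be written down as a small state holder. This is a sketch under stated assumptions: `CatalogState`, `defaultCatalog`, `setCurrent`, and `resolve` are hypothetical names chosen for illustration and are not Spark's catalog manager API. Only the `spark_catalog` name comes from the issue text.

```scala
// Hypothetical model of the current/default/session catalog rules.
object CatalogResolutionSketch {
  // Name of the built-in v2 session catalog, as given in the issue.
  val SessionCatalogName = "spark_catalog"

  // Rule 3: if no default catalog is configured, the session catalog is the default.
  def defaultCatalog(configuredDefault: Option[String]): String =
    configuredDefault.getOrElse(SessionCatalogName)

  final class CatalogState(configuredDefault: Option[String]) {
    // Rule 2: the current catalog is initialized to the default catalog.
    private var current: String = defaultCatalog(configuredDefault)

    def currentCatalog: String = current

    // Models switching catalogs, e.g. through a USE-style command.
    def setCurrent(name: String): Unit = { current = name }

    // Rule 1: the current catalog is used when no catalog is specified.
    def resolve(explicitCatalog: Option[String]): String =
      explicitCatalog.getOrElse(current)
  }

  def main(args: Array[String]): Unit = {
    val state = new CatalogState(configuredDefault = None)
    println(state.resolve(None))
    state.setCurrent("prod")
    println(state.resolve(None))
    println(state.resolve(Some("other")))
  }
}
```

Keeping the default as a pure function of configuration and the current catalog as the only mutable piece makes the three rules easy to check independently.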
[jira] [Created] (SPARK-28979) DataSourceV2: Rename UnresolvedTable
Ryan Blue created SPARK-28979: - Summary: DataSourceV2: Rename UnresolvedTable Key: SPARK-28979 URL: https://issues.apache.org/jira/browse/SPARK-28979 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue CatalogTableAsV2 was renamed to UnresolvedTable in SPARK-28666. This name is incorrect because the table is not unresolved. Instead, it is a v1 table that doesn't expose any v2 capabilities. The name should not include "Unresolved".
[jira] [Resolved] (SPARK-28899) merge the testing in-memory v2 catalogs from catalyst and core
[ https://issues.apache.org/jira/browse/SPARK-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-28899. --- Fix Version/s: 3.0.0 Resolution: Fixed > merge the testing in-memory v2 catalogs from catalyst and core > -- > > Key: SPARK-28899 > URL: https://issues.apache.org/jira/browse/SPARK-28899 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-28878) DataSourceV2 should not insert extra projection for columnar batches
Ryan Blue created SPARK-28878: - Summary: DataSourceV2 should not insert extra projection for columnar batches Key: SPARK-28878 URL: https://issues.apache.org/jira/browse/SPARK-28878 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue SPARK-23325 added an extra physical projection when reading from a DSv2 source because some Spark operators assume that InternalRow instances are actually UnsafeRow. The projection ensures that InternalRow is converted to UnsafeRow. This isn't needed for the columnar batch read path because this is already done when converting from columnar operators to row-based operators in InputRDDCodegen.
[jira] [Resolved] (SPARK-28846) Set OMP_NUM_THREADS to executor cores for python
[ https://issues.apache.org/jira/browse/SPARK-28846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-28846. --- Resolution: Duplicate > Set OMP_NUM_THREADS to executor cores for python > > > Key: SPARK-28846 > URL: https://issues.apache.org/jira/browse/SPARK-28846 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Updated] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
[ https://issues.apache.org/jira/browse/SPARK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-28843: -- Description: While testing hardware with more cores, we found that the amount of memory required by PySpark applications increased and tracked the problem to importing numpy. The numpy issue is [https://github.com/numpy/numpy/issues/10455] NumPy uses OpenMP that starts a thread pool with the number of cores on the machine (and does not respect cgroups). When we set this lower we see a significant reduction in memory consumption. This parallelism setting should be set to the number of cores allocated to the executor, not the number of cores available. was: While testing hardware with more cores, we found that the amount of memory required by PySpark applications increased and tracked the problem to importing numpy. The numpy issue is [https://github.com/numpy/numpy/issues/10455] NumPy uses OpenMP that starts a thread pool with the number of cores on the machine (and does not respect cgroups). When we set this lower we see a reduction in memory consumption. This parallelism setting should be set to the number of cores allocated to the executor, not the number of cores available. > Set OMP_NUM_THREADS to executor cores reduce Python memory consumption > -- > > Key: SPARK-28843 > URL: https://issues.apache.org/jira/browse/SPARK-28843 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Ryan Blue >Priority: Major > > While testing hardware with more cores, we found that the amount of memory > required by PySpark applications increased and tracked the problem to > importing numpy. The numpy issue is > [https://github.com/numpy/numpy/issues/10455] > NumPy uses OpenMP that starts a thread pool with the number of cores on the > machine (and does not respect cgroups). When we set this lower we see a > significant reduction in memory consumption. 
> This parallelism setting should be set to the number of cores allocated to > the executor, not the number of cores available.
[jira] [Created] (SPARK-28843) Set OMP_NUM_THREADS to executor cores reduce Python memory consumption
Ryan Blue created SPARK-28843: - Summary: Set OMP_NUM_THREADS to executor cores reduce Python memory consumption Key: SPARK-28843 URL: https://issues.apache.org/jira/browse/SPARK-28843 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.3, 2.3.3, 3.0.0 Reporter: Ryan Blue While testing hardware with more cores, we found that the amount of memory required by PySpark applications increased and tracked the problem to importing numpy. The numpy issue is [https://github.com/numpy/numpy/issues/10455] NumPy uses OpenMP that starts a thread pool with the number of cores on the machine (and does not respect cgroups). When we set this lower we see a reduction in memory consumption. This parallelism setting should be set to the number of cores allocated to the executor, not the number of cores available.
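The proposal in SPARK-28843 amounts to deriving one environment variable for the Python worker from the executor's core allocation. The following is a minimal sketch, not Spark's implementation: `pythonWorkerEnv` is a hypothetical helper, and leaving an already-set `OMP_NUM_THREADS` untouched is an assumption about how user overrides might be respected.

```scala
// Hypothetical helper; only the OMP_NUM_THREADS variable name comes from the issue.
object OmpThreadsSketch {
  val OmpNumThreads = "OMP_NUM_THREADS"

  // Returns the environment for a Python worker process.
  // Assumption: a value the user already set wins over the derived one.
  def pythonWorkerEnv(baseEnv: Map[String, String], executorCores: Int): Map[String, String] =
    if (baseEnv.contains(OmpNumThreads)) baseEnv
    else baseEnv + (OmpNumThreads -> executorCores.toString)

  def main(args: Array[String]): Unit = {
    // Executor allocated 4 cores on a machine that may have many more.
    println(pythonWorkerEnv(Map("PATH" -> "/usr/bin"), executorCores = 4))
    // A user-provided setting is left as-is.
    println(pythonWorkerEnv(Map(OmpNumThreads -> "1"), executorCores = 4))
  }
}
```

The point of deriving the value from the allocation rather than from the machine is that the OpenMP thread pool is then sized to what the executor is actually allowed to use.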
[jira] [Created] (SPARK-28628) Support namespaces in V2SessionCatalog
Ryan Blue created SPARK-28628: - Summary: Support namespaces in V2SessionCatalog Key: SPARK-28628 URL: https://issues.apache.org/jira/browse/SPARK-28628 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue V2SessionCatalog should implement SupportsNamespaces.
[jira] [Created] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
Ryan Blue created SPARK-28612: - Summary: DataSourceV2: Add new DataFrameWriter API for v2 Key: SPARK-28612 URL: https://issues.apache.org/jira/browse/SPARK-28612 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue This tracks adding an API like the one proposed in SPARK-23521:
{code:lang=scala}
df.writeTo("catalog.db.table").append() // AppendData
df.writeTo("catalog.db.table").overwriteDynamic() // OverwritePartitionsDynamic
df.writeTo("catalog.db.table").overwrite($"date" === "2019-01-01") // OverwriteByExpression
df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
df.writeTo("catalog.db.table").replace() // RTAS
{code}
[jira] [Resolved] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-23204. --- Resolution: Fixed Fix Version/s: 3.0.0 I'm closing this because it is implemented by SPARK-28178 and SPARK-28565. > DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter > > > Key: SPARK-23204 > URL: https://issues.apache.org/jira/browse/SPARK-23204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > DataSourceV2 is currently only configured with a path, passed in options as > {{path}}. For many data sources, like JDBC, a table name is more appropriate. > I propose testing the "location" passed to load(String) and save(String) to > see if it is a path and if not, parsing it as a table name and passing > "database" and "table" options to readers and writers. > This also creates a way to pass the table identifier when using DataSourceV2 > tables from SQL. For example, {{SELECT * FROM db.table}} creates an > {{UnresolvedRelation(db,table)}} that could be resolved using the default > source, passing the db and table name using the same options. Similarly, we > can add a table property for the datasource implementation to metastore > tables and add a rule to convert them to DataSourceV2 relations.
[jira] [Commented] (SPARK-25280) Add support for USING syntax for DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-25280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899532#comment-16899532 ] Ryan Blue commented on SPARK-25280: --- [~hyukjin.kwon], is there anything left to do for this? I think that most of the functionality has been added at this point. > Add support for USING syntax for DataSourceV2 > - > > Key: SPARK-25280 > URL: https://issues.apache.org/jira/browse/SPARK-25280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > class SourcesTest extends SparkFunSuite { > val spark = SparkSession.builder().master("local").getOrCreate() > test("Test CREATE TABLE ... USING - v1") { > spark.read.format(classOf[SimpleDataSourceV1].getCanonicalName).load() > } > test("Test DataFrameReader - v1") { > spark.sql(s"CREATE TABLE tableA USING > ${classOf[SimpleDataSourceV1].getCanonicalName}") > } > test("Test CREATE TABLE ... USING - v2") { > spark.read.format(classOf[SimpleDataSourceV2].getCanonicalName).load() > } > test("Test DataFrameReader - v2") { > spark.sql(s"CREATE TABLE tableB USING > ${classOf[SimpleDataSourceV2].getCanonicalName}") > } > } > {code} > {code} > org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL > Data Source.; > org.apache.spark.sql.AnalysisException: > org.apache.spark.sql.sources.v2.SimpleDataSourceV2 is not a valid Spark SQL > Data Source.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at 
org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$54.apply(Dataset.scala:3296) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3295) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) > at > org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45) > at > org.apache.spark.sql.sources.v2.SourcesTest$$anonfun$4.apply(DataSourceV2Suite.scala:45) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > 
at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888222#comment-16888222 ] Ryan Blue commented on SPARK-14543: --- {{byName}} was never added to Apache Spark. The change was rejected, so it is only available in Netflix's Spark branch. I resolved this with "later" because we are including by-name resolution in the DSv2 work. The replacement for {{DataFrameWriter}} will default to name-based resolution. > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. 
However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code}
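The difference between positional and name-based matching can be shown without Spark. The sketch below is a simplified model using the column lists of the src and dst tables from the description above. The functions `match_by_position` and `match_by_name` are hypothetical helpers, not Spark APIs, and they raise on a column-count or name mismatch as the description suggests the analyzer should.

```python
def match_by_position(src_cols, dst_cols):
    """Pair source and destination columns by position, failing on a count mismatch."""
    if len(src_cols) != len(dst_cols):
        raise ValueError(f"expected {len(dst_cols)} columns, got {len(src_cols)}")
    return list(zip(src_cols, dst_cols))


def match_by_name(src_cols, dst_cols):
    """Pair each destination column with the source column of the same name."""
    missing = [c for c in dst_cols if c not in src_cols]
    extra = [c for c in src_cols if c not in dst_cols]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, extra columns: {extra}")
    return [(c, c) for c in dst_cols]


src = ["id", "count", "total"]  # column order of table src
dst = ["id", "total", "count"]  # column order of table dst

# Pairs are (source column, destination column).
print(match_by_position(src, dst))  # count is paired with total, and total with count
print(match_by_name(src, dst))      # every column is paired with its namesake
```

With compatible types such as int and bigint, the positional pairing would be accepted silently, which is the failure mode the issue describes.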
[jira] [Commented] (SPARK-28376) Support to write sorted parquet files in each row group
[ https://issues.apache.org/jira/browse/SPARK-28376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885428#comment-16885428 ] Ryan Blue commented on SPARK-28376: --- I don't think this is a regression. The linked issue was to automatically add repartitioning to the SQL plan to avoid too many files, even with a local sort. I think that this is no longer needed because we plan to do it in DSv2. > Support to write sorted parquet files in each row group > --- > > Key: SPARK-28376 > URL: https://issues.apache.org/jira/browse/SPARK-28376 > Project: Spark > Issue Type: New Feature > Components: Input/Output, Spark Core >Affects Versions: 2.4.3 >Reporter: t oo >Priority: Major > > this is for the ability to write parquet with sorted values in each > row group > > see > [https://stackoverflow.com/questions/52159938/cant-write-ordered-data-to-parquet-in-spark] > [https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide] > (slides 26-27) >
[jira] [Created] (SPARK-28374) DataSourceV2: Add method to support INSERT ... IF NOT EXISTS
Ryan Blue created SPARK-28374: - Summary: DataSourceV2: Add method to support INSERT ... IF NOT EXISTS Key: SPARK-28374 URL: https://issues.apache.org/jira/browse/SPARK-28374 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue This is a follow-up to [PR #24832 (comment)|https://github.com/apache/spark/pull/24832/files#r298257179]. The SQL parser supports INSERT ... IF NOT EXISTS to validate that an insert did not write into existing partitions. This requires the addition of a support trait for the write builder, so should be done as a follow-up.
[jira] [Created] (SPARK-28319) DataSourceV2: Support SHOW TABLES
Ryan Blue created SPARK-28319: - Summary: DataSourceV2: Support SHOW TABLES Key: SPARK-28319 URL: https://issues.apache.org/jira/browse/SPARK-28319 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue SHOW TABLES needs to support v2 catalogs.
[jira] [Commented] (SPARK-28219) Data source v2 user guide
[ https://issues.apache.org/jira/browse/SPARK-28219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877172#comment-16877172 ] Ryan Blue commented on SPARK-28219: --- I'm closing this as a duplicate. Please use SPARK-27708. If you want to note specific docs to write, please add them to that issue. > Data source v2 user guide > - > > Key: SPARK-28219 > URL: https://issues.apache.org/jira/browse/SPARK-28219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker >
[jira] [Resolved] (SPARK-28219) Data source v2 user guide
[ https://issues.apache.org/jira/browse/SPARK-28219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-28219. --- Resolution: Duplicate > Data source v2 user guide > - > > Key: SPARK-28219 > URL: https://issues.apache.org/jira/browse/SPARK-28219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker >
[jira] [Commented] (SPARK-28192) Data Source - State - Write side
[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875046#comment-16875046 ] Ryan Blue commented on SPARK-28192: --- It sounds like what you want is for a source to be able to communicate the required clustering and sort order for a write, is that correct? I opened an issue for this a while ago, but it probably won't be on the roadmap for Spark 3.0: SPARK-23889. We can do that sooner if you're interested in it! > Data Source - State - Write side > > > Key: SPARK-28192 > URL: https://issues.apache.org/jira/browse/SPARK-28192 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the efforts on addressing batch write on state data source. > It could include "state repartition" if it doesn't require huge effort for > new DSv2, but it can also be moved out to a separate issue.
[jira] [Created] (SPARK-28139) DataSourceV2: Add AlterTable v2 implementation
Ryan Blue created SPARK-28139: - Summary: DataSourceV2: Add AlterTable v2 implementation Key: SPARK-28139 URL: https://issues.apache.org/jira/browse/SPARK-28139 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue SPARK-27857 updated the parser for v2 ALTER TABLE statements. This tracks implementing those using a v2 catalog.
[jira] [Updated] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements in catalyst SQL parser
[ https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27857: -- Summary: DataSourceV2: Support ALTER TABLE statements in catalyst SQL parser (was: DataSourceV2: Support ALTER TABLE statements) > DataSourceV2: Support ALTER TABLE statements in catalyst SQL parser > --- > > Key: SPARK-27857 > URL: https://issues.apache.org/jira/browse/SPARK-27857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > ALTER TABLE statements should be supported for v2 tables.
[jira] [Created] (SPARK-27965) Add extractors for logical transforms
Ryan Blue created SPARK-27965: - Summary: Add extractors for logical transforms Key: SPARK-27965 URL: https://issues.apache.org/jira/browse/SPARK-27965 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue Extractors can be used to make any Transform class appear like a case class to Spark internals.
[jira] [Created] (SPARK-27964) Create CatalogV2Util
Ryan Blue created SPARK-27964: - Summary: Create CatalogV2Util Key: SPARK-27964 URL: https://issues.apache.org/jira/browse/SPARK-27964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue Need to move utility functions from test.
[jira] [Updated] (SPARK-27919) DataSourceV2: Add v2 session catalog
[ https://issues.apache.org/jira/browse/SPARK-27919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27919: -- Affects Version/s: (was: 2.4.3) 3.0.0 > DataSourceV2: Add v2 session catalog > > > Key: SPARK-27919 > URL: https://issues.apache.org/jira/browse/SPARK-27919 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > When no default catalog is set, the session catalog (v1) is responsible for > table identifiers with no catalog part. When CTAS creates a table with a v2 > provider, a v2 catalog is required and the default catalog is used. But this > may cause Spark to create a table in a catalog that it cannot use to look up > the table. > In this case, a v2 catalog that delegates to the session catalog should be > used instead.
[jira] [Created] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
Ryan Blue created SPARK-27960: - Summary: DataSourceV2 ORC implementation doesn't handle schemas correctly Key: SPARK-27960 URL: https://issues.apache.org/jira/browse/SPARK-27960 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue While testing SPARK-27919 (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 ORC implementation to validate a v2 catalog that delegates to the session catalog. The ORC implementation fails the following test case because it cannot infer a schema (there is no data) but it should be using the schema used to create the table. Test case: {code} test("CreateTable: test ORC source") { spark.conf.set("spark.sql.catalog.session", classOf[V2SessionCatalog].getName) spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") val testCatalog = spark.catalog("session").asTableCatalog val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) assert(table.name == "orc ") // <-- should this be table_name? assert(table.partitioning.isEmpty) assert(table.properties == Map( "provider" -> orc2, "database" -> "default", "table" -> "table_name").asJava) assert(table.schema == new StructType().add("id", LongType).add("data", StringType)) // <-- fail val rdd = spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) } {code} Error: {code} Unable to infer schema for ORC. It must be specified manually.; org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It must be specified manually.; at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) at scala.Option.getOrElse(Option.scala:138) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) at org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) {code}
[jira] [Commented] (SPARK-27960) DataSourceV2 ORC implementation doesn't handle schemas correctly
[ https://issues.apache.org/jira/browse/SPARK-27960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856955#comment-16856955 ] Ryan Blue commented on SPARK-27960: --- [~Gengliang.Wang], FYI > DataSourceV2 ORC implementation doesn't handle schemas correctly > > > Key: SPARK-27960 > URL: https://issues.apache.org/jira/browse/SPARK-27960 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major > > While testing SPARK-27919 > (#[24768|https://github.com/apache/spark/pull/24768]), I tried to use the v2 > ORC implementation to validate a v2 catalog that delegates to the session > catalog. The ORC implementation fails the following test case because it > cannot infer a schema (there is no data) but it should be using the schema > used to create the table. > Test case: > {code} > test("CreateTable: test ORC source") { > spark.conf.set("spark.sql.catalog.session", > classOf[V2SessionCatalog].getName) > spark.sql(s"CREATE TABLE table_name (id bigint, data string) USING $orc2") > val testCatalog = spark.catalog("session").asTableCatalog > val table = testCatalog.loadTable(Identifier.of(Array(), "table_name")) > assert(table.name == "orc ") // <-- should this be table_name? > assert(table.partitioning.isEmpty) > assert(table.properties == Map( > "provider" -> orc2, > "database" -> "default", > "table" -> "table_name").asJava) > assert(table.schema == new StructType().add("id", LongType).add("data", > StringType)) // <-- fail > val rdd = > spark.sparkContext.parallelize(table.asInstanceOf[InMemoryTable].rows) > checkAnswer(spark.internalCreateDataFrame(rdd, table.schema), Seq.empty) > } > {code} > Error: > {code} > Unable to infer schema for ORC. It must be specified manually.; > org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It > must be specified manually.; > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$5(FileTable.scala:61) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:61) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:54) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:67) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:65) > at > org.apache.spark.sql.sources.v2.DataSourceV2SQLSuite.$anonfun$new$5(DataSourceV2SQLSuite.scala:82) > {code}
[jira] [Created] (SPARK-27919) DataSourceV2: Add v2 session catalog
Ryan Blue created SPARK-27919: - Summary: DataSourceV2: Add v2 session catalog Key: SPARK-27919 URL: https://issues.apache.org/jira/browse/SPARK-27919 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue When no default catalog is set, the session catalog (v1) is responsible for table identifiers with no catalog part. When CTAS creates a table with a v2 provider, a v2 catalog is required and the default catalog is used. But this may cause Spark to create a table in a catalog that it cannot use to look up the table. In this case, a v2 catalog that delegates to the session catalog should be used instead.
[jira] [Created] (SPARK-27909) Fix CTE substitution dependence on ResolveRelations throwing AnalysisException
Ryan Blue created SPARK-27909: - Summary: Fix CTE substitution dependence on ResolveRelations throwing AnalysisException Key: SPARK-27909 URL: https://issues.apache.org/jira/browse/SPARK-27909 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue CTE substitution currently works by running all analyzer rules on plans after each substitution. It does this to fix a recursive CTE case, but this design requires the ResolveRelations rule to throw an AnalysisException when it cannot resolve a table or else the CTE substitution will run again and may possibly recurse infinitely. Table resolution should be possible across multiple independent rules. To accomplish this, the current ResolveRelations rule detects cases where other rules (like ResolveDataSource) will resolve a TableIdentifier and returns the UnresolvedRelation unmodified only in those cases.
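One way to picture the goal stated in SPARK-27909, table resolution spread across independent rules, is a loop in which each rule returns the node unchanged when it cannot resolve it, and a single check raises only after a fixed point is reached. The sketch below is a simplified model under that assumption and is not Spark's analyzer. The classes `Unresolved` and `Resolved`, both rule functions, and the catalog contents are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unresolved:
    name: str


@dataclass(frozen=True)
class Resolved:
    name: str
    source: str


def resolve_from_catalog(node, tables=frozenset({"db.events"})):
    # Resolves only tables known to the hypothetical catalog; otherwise leaves the node alone.
    if isinstance(node, Unresolved) and node.name in tables:
        return Resolved(node.name, "catalog")
    return node


def resolve_from_path(node):
    # Resolves only identifiers that look like file paths; otherwise leaves the node alone.
    if isinstance(node, Unresolved) and node.name.startswith("/"):
        return Resolved(node.name, "path")
    return node


def analyze(node, rules, max_iterations=10):
    """Apply all rules repeatedly until nothing changes, then check the result once."""
    for _ in range(max_iterations):
        updated = node
        for rule in rules:
            updated = rule(updated)
        if updated == node:  # fixed point reached, no rule made progress
            break
        node = updated
    if isinstance(node, Unresolved):
        raise LookupError(f"Table not found: {node.name}")
    return node


rules = [resolve_from_catalog, resolve_from_path]
print(analyze(Unresolved("db.events"), rules))
print(analyze(Unresolved("/data/events"), rules))
```

Because no individual rule raises, the order of the rules does not decide whether a table can be found; only the final check reports a failure.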
[jira] [Created] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements
Ryan Blue created SPARK-27857: - Summary: DataSourceV2: Support ALTER TABLE statements Key: SPARK-27857 URL: https://issues.apache.org/jira/browse/SPARK-27857 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue ALTER TABLE statements should be supported for v2 tables.
[jira] [Commented] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions
[ https://issues.apache.org/jira/browse/SPARK-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844404#comment-16844404 ] Ryan Blue commented on SPARK-27784: --- [~cloud_fan], I don't see this happening in master because an additional Project is added somewhere. Any idea what adds it? {code:java} == Parsed Logical Plan == Union :- Project [id#43, data#42] : +- Join Inner, (id#43 = id#40) : :- Project [id#43, coalesce(data#44, _) AS data#42] : : +- SubqueryAlias `default`.`t1` : : +- Relation[id#43,data#44] parquet : +- Project [value#37 AS id#40] : +- LocalRelation [value#37] +- Project [id#49, coalesce(cast(data#51 as string), _) AS data#42] +- Project [id#49, null AS data#51] +- SubqueryAlias `default`.`t2` +- Relation[id#49] parquet == Analyzed Logical Plan == id: int, data: string Union :- Project [id#43, data#42] <--- same ID : +- Join Inner, (id#43 = id#40) : :- Project [id#43, coalesce(data#44, _) AS data#42] : : +- SubqueryAlias `default`.`t1` : : +- Relation[id#43,data#44] parquet : +- Project [value#37 AS id#40] : +- LocalRelation [value#37] +- Project [id#49 AS id#81, data#42 AS data#82] +- Project [id#49, coalesce(cast(data#51 as string), _) AS data#42] <--- same ID +- Project [id#49, null AS data#51] +- SubqueryAlias `default`.`t2` +- Relation[id#49] parquet == Optimized Logical Plan == Union :- Project [id#43, data#42] : +- Join Inner, (id#43 = id#40) : :- Project [id#43, coalesce(data#44, _) AS data#42] : : +- Filter isnotnull(id#43) : : +- Relation[id#43,data#44] parquet : +- Project [value#37 AS id#40] : +- LocalRelation [value#37] +- Project [id#49, _ AS data#82] +- Relation[id#49] parquet == Physical Plan == Union :- *(2) Project [id#43, data#42] : +- *(2) BroadcastHashJoin [id#43], [id#40], Inner, BuildRight : :- *(2) Project [id#43, coalesce(data#44, _) AS data#42] : : +- *(2) Filter isnotnull(id#43) : : +- *(2) FileScan parquet default.t1[id#43,data#44] Batched: true, DataFilters: [isnotnull(id#43)], Format: 
Parquet, Location: InMemoryFileIndex[file:/home/blue/workspace/spark/common/kvstore/spark-warehouse/org.apache.spark..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- *(1) Project [value#37 AS id#40] : +- LocalTableScan [value#37] +- *(3) Project [id#49, _ AS data#82] <- reused ID eliminated by collapsing projections +- *(3) FileScan parquet default.t2[id#49] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/blue/workspace/spark/common/kvstore/spark-warehouse/org.apache.spark..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct {code} > Alias ID reuse can break correctness when substituting foldable expressions > --- > > Key: SPARK-27784 > URL: https://issues.apache.org/jira/browse/SPARK-27784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.3.2 >Reporter: Ryan Blue >Priority: Major > Labels: correctness > > This is a correctness bug when reusing a set of project expressions in the > DataFrame API. > Use case: a user was migrating a table to a new version with an additional > column ("data" in the repro case). To migrate the user unions the old table > ("t2") with the new table ("t1"), and applies a common set of projections to > ensure the union doesn't hit an issue with ordering (SPARK-22335). 
In some > cases, this produces an incorrect query plan: > {code:java} > Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1") > Seq(1, 2, 3).toDF("id").write.saveAsTable("t2") > val dim = Seq(2, 3, 4).toDF("id") > val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data")) > val t1 = spark.table("t1").select(outputCols:_*) > val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*) > t1.join(dim, t1("id") === dim("id")).select(t1("id"), > t1("data")).union(t2).explain(true){code} > {code:java} > == Physical Plan == > Union > :- *Project [id#330, _ AS data#237] < THE CONSTANT IS > FROM THE OTHER SIDE OF THE UNION > : +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight > : :- *Project [id#330] > : : +- *Filter isnotnull(id#330) > : : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, > Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: > [IsNotNull(id)], ReadSchema: struct > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint))) > :+- LocalTableScan [id#234] > +- *Project [id#340, _ AS data#237] >+- *FileScan parquet t2[id#340] Batched: true, Format:
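The issue title and the "same ID" annotations in the plans above point at one alias ID being shared by both union branches. The sketch below is a simplified model of why that is dangerous when a rule substitutes foldable (constant) expressions by ID. It is not Catalyst code, and `Alias`, `new_alias`, and `propagate_foldable` are hypothetical names.

```python
import itertools
from dataclasses import dataclass

_next_id = itertools.count(1)


@dataclass(frozen=True)
class Alias:
    name: str
    expr_id: int


def new_alias(name):
    # The ID is fixed when the alias object is created, not when it is used.
    return Alias(name, next(_next_id))


def propagate_foldable(branch_refs, foldable):
    """Replace references whose ID is known to be constant; the lookup is by ID only."""
    return [foldable.get(ref.expr_id, f"{ref.name}#{ref.expr_id}") for ref in branch_refs]


# Case 1: one alias object is reused by both union branches.
shared = new_alias("data")
foldable = {shared.expr_id: "_"}  # in the t2 branch, data folds to the constant "_"
print(propagate_foldable([shared], foldable))  # the t1 reference is replaced by the constant

# Case 2: a fresh alias per branch keeps the IDs distinct.
t1_alias, t2_alias = new_alias("data"), new_alias("data")
foldable = {t2_alias.expr_id: "_"}
print(propagate_foldable([t1_alias], foldable))  # the t1 reference is left alone
```

In this model the fix on the user side is to build the aliased expression separately for each branch, so that each branch gets its own ID.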
[jira] [Updated] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions
[ https://issues.apache.org/jira/browse/SPARK-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27784: -- Description: This is a correctness bug when reusing a set of project expressions in the DataFrame API. Use case: a user was migrating a table to a new version with an additional column ("data" in the repro case). To migrate the user unions the old table ("t2") with the new table ("t1"), and applies a common set of projections to ensure the union doesn't hit an issue with ordering (SPARK-22335). In some cases, this produces an incorrect query plan: {code:java} Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1") Seq(1, 2, 3).toDF("id").write.saveAsTable("t2") val dim = Seq(2, 3, 4).toDF("id") val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data")) val t1 = spark.table("t1").select(outputCols:_*) val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*) t1.join(dim, t1("id") === dim("id")).select(t1("id"), t1("data")).union(t2).explain(true){code} {code:java} == Physical Plan == Union :- *Project [id#330, _ AS data#237] < THE CONSTANT IS FROM THE OTHER SIDE OF THE UNION : +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight : :- *Project [id#330] : : +- *Filter isnotnull(id#330) : : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) :+- LocalTableScan [id#234] +- *Project [id#340, _ AS data#237] +- *FileScan parquet t2[id#340] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://.../t2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct{code} The problem happens because "outputCols" has an alias. The ID for that alias is created when the projection Seq is created, so it is reused in both sides of the union. 
When FoldablePropagation runs, it identifies that "data" in the t2 side of the union is a foldable expression and replaces all references to it, including the references in the t1 side of the union.

The join to a dimension table is necessary to reproduce the problem because it requires a Projection on top of the join that uses an AttributeReference for data#237. Otherwise, the projections are collapsed and the projection includes an Alias that does not get rewritten by FoldablePropagation.

> Alias ID reuse can break correctness when substituting foldable expressions
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-27784
>                 URL: https://issues.apache.org/jira/browse/SPARK-27784
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1, 2.3.2
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: correctness
>
> This is a correctness bug when reusing a set of project expressions in the
> DataFrame API.
> Use case: a user was migrating a table to a new version with an additional
> column ("data" in the repro case). To migrate the user unions the old table
> ("t2") with the new table ("t1"), and applies a common set of projections to
> ensure the union doesn't hit an issue with ordering (SPARK-22335). In some
> cases, this produces an incorrect query plan:
> {code:java}
> Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1")
> Seq(1, 2, 3).toDF("id").write.saveAsTable("t2")
> val dim = Seq(2, 3, 4).toDF("id")
> val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data"))
> val t1 = spark.table("t1").select(outputCols:_*)
> val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*)
> t1.join(dim, t1("id") === dim("id")).select(t1("id"),
> t1("data")).union(t2).explain(true){code}
> {code:java}
> == Physical Plan ==
> Union
> :- *Project [id#330, _ AS data#237]    < THE CONSTANT IS
> FROM THE OTHER SIDE OF THE UNION
> :  +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight
> :     :- *Project [id#330]
> :     :  +- *Filter isnotnull(id#330)
> :     :     +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet,
> Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters:
> [IsNotNull(id)], ReadSchema: struct
> :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int,
> false] as bigint)))
> :        +- LocalTableScan [id#234]
> +- *Project [id#340, _ AS data#237]
>    +- *FileScan parquet t2[id#340] Batched: true, Format: Parquet, Location:
> CatalogFileIndex[s3://.../t2], PartitionFilters: [], PushedFilters: [],
> ReadSchema: struct{code}
> The problem happens because "outputCols" has an alias. The ID for that alias
> is created when the projection Seq is created, so it is reused in
[jira] [Created] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions
Ryan Blue created SPARK-27784: - Summary: Alias ID reuse can break correctness when substituting foldable expressions Key: SPARK-27784 URL: https://issues.apache.org/jira/browse/SPARK-27784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.1.1 Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
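The SPARK-27784 description says the alias ID is fixed when the projection Seq is created, so both sides of the union end up sharing it. The following is a minimal sketch of that mechanism only. It does not use Spark: `ExprIds` and `ToyAlias` are hypothetical stand-ins for Catalyst's expression IDs and aliases. It contrasts a Seq built once (`val`) with one rebuilt per use (`def`).

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for Catalyst's expression ID allocator; not a Spark class.
object ExprIds {
  private val counter = new AtomicLong(0)
  def next(): Long = counter.incrementAndGet()
}

// A toy alias draws its ID once, when it is constructed.
final case class ToyAlias(name: String, exprId: Long = ExprIds.next())

object AliasReuseSketch {
  // Built once: every plan that selects this Seq shares the same alias ID.
  val sharedCols: Seq[ToyAlias] = Seq(ToyAlias("data"))

  // Rebuilt on every call: each plan gets an alias with a fresh ID.
  def freshCols: Seq[ToyAlias] = Seq(ToyAlias("data"))

  def main(args: Array[String]): Unit = {
    val t1Side = sharedCols
    val t2Side = sharedCols
    println(t1Side.head.exprId == t2Side.head.exprId)       // true: the ID is reused
    println(freshCols.head.exprId == freshCols.head.exprId) // false: a new ID per call
  }
}
```

In the repro, the analogous change would be declaring `outputCols` with `def` instead of `val`. Whether that avoids the incorrect plan on a given Spark version is an assumption here; the issue itself does not confirm it.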
[jira] [Created] (SPARK-27732) DataSourceV2: Add CreateTable logical operation
Ryan Blue created SPARK-27732: - Summary: DataSourceV2: Add CreateTable logical operation Key: SPARK-27732 URL: https://issues.apache.org/jira/browse/SPARK-27732 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue
[jira] [Created] (SPARK-27724) Add RTAS logical operation
Ryan Blue created SPARK-27724: - Summary: Add RTAS logical operation Key: SPARK-27724 URL: https://issues.apache.org/jira/browse/SPARK-27724 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue
[jira] [Updated] (SPARK-27724) DataSourceV2: Add RTAS logical operation
[ https://issues.apache.org/jira/browse/SPARK-27724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27724: -- Summary: DataSourceV2: Add RTAS logical operation (was: Add RTAS logical operation) > DataSourceV2: Add RTAS logical operation > > > Key: SPARK-27724 > URL: https://issues.apache.org/jira/browse/SPARK-27724 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Ryan Blue >Priority: Major
[jira] [Updated] (SPARK-24923) DataSourceV2: Add CTAS logical operation
[ https://issues.apache.org/jira/browse/SPARK-24923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-24923: -- Summary: DataSourceV2: Add CTAS logical operation (was: DataSourceV2: Add CTAS and RTAS logical operations) > DataSourceV2: Add CTAS logical operation > > > Key: SPARK-24923 > URL: https://issues.apache.org/jira/browse/SPARK-24923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > When SPARK-24252 and SPARK-24251 are in, next plans to implement from the > SPIP are CTAS and RTAS.
[jira] [Created] (SPARK-27708) Add documentation for v2 data sources
Ryan Blue created SPARK-27708:
---------------------------------

             Summary: Add documentation for v2 data sources
                 Key: SPARK-27708
                 URL: https://issues.apache.org/jira/browse/SPARK-27708
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: Ryan Blue

Before the 3.0 release, the new v2 data sources should be documented. This includes:
 * How to plug in catalog implementations
 * Catalog plugin configuration
 * Multi-part identifier behavior
 * Partition transforms
 * Table properties that are used to pass table info (e.g. "provider")
[jira] [Created] (SPARK-27693) DataSourceV2: Add default catalog property
Ryan Blue created SPARK-27693: - Summary: DataSourceV2: Add default catalog property Key: SPARK-27693 URL: https://issues.apache.org/jira/browse/SPARK-27693 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue Add a default catalog property for DataSourceV2.
[jira] [Created] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs
Ryan Blue created SPARK-27661: - Summary: Add SupportsNamespaces interface for v2 catalogs Key: SPARK-27661 URL: https://issues.apache.org/jira/browse/SPARK-27661 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue Some catalogs support namespace operations, like creating or dropping namespaces. The v2 API should have a way to expose these operations to Spark.
[jira] [Created] (SPARK-27658) Catalog API to load functions
Ryan Blue created SPARK-27658: - Summary: Catalog API to load functions Key: SPARK-27658 URL: https://issues.apache.org/jira/browse/SPARK-27658 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Ryan Blue SPARK-24252 added an API that catalog plugins can implement to expose table operations. Catalogs should also be able to provide function implementations to Spark.
[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2
[ https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835700#comment-16835700 ] Ryan Blue commented on SPARK-23098: --- I don't think there's a DSv2-related obstacle to implementing this. > Migrate Kafka batch source to v2 > > > Key: SPARK-23098 > URL: https://issues.apache.org/jira/browse/SPARK-23098 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major
[jira] [Commented] (SPARK-27471) Reorganize public v2 catalog API
[ https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822271#comment-16822271 ] Ryan Blue commented on SPARK-27471: --- Thanks [~hyukjin.kwon]. I meant to set the target version, not the fix version. I've updated that. > Reorganize public v2 catalog API > > > Key: SPARK-27471 > URL: https://issues.apache.org/jira/browse/SPARK-27471 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Ryan Blue >Priority: Blocker > > In the review for SPARK-27181, Reynold suggested some package moves. We've > decided (at the v2 community sync) not to delay by having this discussion now > because we want to get the new catalog API in so we can work on more logical > plans in parallel. But we do need to make sure we have a sane package scheme > for the next release.
[jira] [Updated] (SPARK-27471) Reorganize public v2 catalog API
[ https://issues.apache.org/jira/browse/SPARK-27471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27471: -- Target Version/s: 3.0.0 > Reorganize public v2 catalog API > > > Key: SPARK-27471 > URL: https://issues.apache.org/jira/browse/SPARK-27471 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Ryan Blue >Priority: Blocker > > In the review for SPARK-27181, Reynold suggested some package moves. We've > decided (at the v2 community sync) not to delay by having this discussion now > because we want to get the new catalog API in so we can work on more logical > plans in parallel. But we do need to make sure we have a sane package scheme > for the next release.
[jira] [Created] (SPARK-27471) Reorganize public v2 catalog API
Ryan Blue created SPARK-27471: - Summary: Reorganize public v2 catalog API Key: SPARK-27471 URL: https://issues.apache.org/jira/browse/SPARK-27471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Ryan Blue Fix For: 3.0.0 In the review for SPARK-27181, Reynold suggested some package moves. We've decided (at the v2 community sync) not to delay by having this discussion now because we want to get the new catalog API in so we can work on more logical plans in parallel. But we do need to make sure we have a sane package scheme for the next release.
[jira] [Created] (SPARK-27386) Improve partition transform parsing
Ryan Blue created SPARK-27386: - Summary: Improve partition transform parsing Key: SPARK-27386 URL: https://issues.apache.org/jira/browse/SPARK-27386 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue SPARK-27181 adds support to the SQL parser for transformation functions in the {{PARTITION BY}} clause. The rules to match this are specific to transforms and can match only literals or qualified names (field references). This should be improved to match a broader set of expressions so that Spark can produce better error messages than an expected symbol list. For example, {{PARTITION BY (2 + 3)}} should produce "invalid transformation expression: 2 + 3" instead of "expecting qualified name".
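One way to read the SPARK-27386 proposal is "parse broadly, then validate": accept any expression in the grammar and reject non-transforms in a later step that can name the offending expression. The sketch below illustrates that idea only. `Expr`, `Lit`, `Ref`, `Add`, `Apply`, and `TransformCheck` are hypothetical types, not Spark's parser or AST classes, and `bucket` is used purely as an example function name.

```scala
// Hypothetical miniature AST; not Spark's parser or Catalyst classes.
sealed trait Expr
final case class Lit(value: Int) extends Expr
final case class Ref(parts: Seq[String]) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Apply(name: String, args: Seq[Expr]) extends Expr

object TransformCheck {
  // Render an expression back to text so it can be quoted in the error message.
  def show(e: Expr): String = e match {
    case Lit(v)         => v.toString
    case Ref(parts)     => parts.mkString(".")
    case Add(l, r)      => s"${show(l)} + ${show(r)}"
    case Apply(n, args) => s"$n(${args.map(show).mkString(", ")})"
  }

  // Transform arguments are limited to literals and field references.
  private def isArg(e: Expr): Boolean = e match {
    case _: Lit | _: Ref => true
    case _               => false
  }

  // Validation happens after parsing, so the message can name the bad expression.
  def validate(e: Expr): Either[String, Expr] = e match {
    case r: Ref                                   => Right(r)
    case a @ Apply(_, args) if args.forall(isArg) => Right(a)
    case other => Left(s"invalid transformation expression: ${show(other)}")
  }
}
```

With these toy types, `TransformCheck.validate(Add(Lit(2), Lit(3)))` yields the message quoted in the issue, while `TransformCheck.validate(Apply("bucket", Seq(Lit(16), Ref(Seq("id")))))` is accepted.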
[jira] [Resolved] (SPARK-25006) Add optional catalog to TableIdentifier
[ https://issues.apache.org/jira/browse/SPARK-25006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-25006. --- Resolution: Won't Fix Closing this because SPARK-26946 replaces it. > Add optional catalog to TableIdentifier > --- > > Key: SPARK-25006 > URL: https://issues.apache.org/jira/browse/SPARK-25006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > For multi-catalog support, Spark table identifiers need to identify the > catalog for a table.
[jira] [Created] (SPARK-27181) Add public expression and transform API for DSv2 partitioning
Ryan Blue created SPARK-27181: - Summary: Add public expression and transform API for DSv2 partitioning Key: SPARK-27181 URL: https://issues.apache.org/jira/browse/SPARK-27181 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Ryan Blue
[jira] [Commented] (SPARK-26778) Implement file source V2 partitioning
[ https://issues.apache.org/jira/browse/SPARK-26778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792843#comment-16792843 ] Ryan Blue commented on SPARK-26778: --- [~Gengliang.Wang], can you clarify what this issue is tracking? > Implement file source V2 partitioning > -- > > Key: SPARK-26778 > URL: https://issues.apache.org/jira/browse/SPARK-26778 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major
[jira] [Created] (SPARK-27108) Add parsed CreateTable plans to Catalyst
Ryan Blue created SPARK-27108:
---------------------------------

             Summary: Add parsed CreateTable plans to Catalyst
                 Key: SPARK-27108
                 URL: https://issues.apache.org/jira/browse/SPARK-27108
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: Ryan Blue

The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} commands. Creates are handled only by {{SparkSqlParser}} because the logical plans are defined in the v1 datasource package (org.apache.spark.sql.execution.datasources). The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult (and error-prone) to produce v2 plans because it requires converting the AST to v1 and then converting v1 to v2.

Instead, the catalyst parser should create plans that represent exactly what was parsed, after validation like ensuring no duplicate clauses. Then those plans should be converted to v1 or v2 plans in the analyzer. This structure will avoid errors caused by multiple layers of translation and keeps v1 and v2 plans separate to ensure that v1 has no behavior changes.
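The structure proposed in SPARK-27108 can be sketched as two steps: the parser emits a plan that records only what was written, and a later resolution step picks the v1 or v2 form. The code below is an illustration under assumptions, not Spark's implementation. `ParsedCreateTable`, `V1Create`, `V2Create`, and `ResolveCreate` are hypothetical names, and the rule "the first identifier part names a v2 catalog" is an assumed routing policy chosen for the example.

```scala
// Hypothetical parsed plan: records exactly what the statement said, nothing v1-specific.
final case class ParsedCreateTable(table: Seq[String], ifNotExists: Boolean)

// Hypothetical resolved plans; the save mode is a plain string to avoid a Spark dependency.
sealed trait ResolvedPlan
final case class V1Create(table: String, saveMode: String) extends ResolvedPlan
final case class V2Create(catalog: String, ident: Seq[String], ignoreIfExists: Boolean)
    extends ResolvedPlan

object ResolveCreate {
  // Assumed policy: a multi-part name whose first part is a known v2 catalog goes to v2.
  def resolve(parsed: ParsedCreateTable, v2Catalogs: Set[String]): ResolvedPlan =
    parsed.table match {
      case catalog +: rest if rest.nonEmpty && v2Catalogs.contains(catalog) =>
        // v2 keeps the parsed flag as-is; no detour through a v1 representation.
        V2Create(catalog, rest, parsed.ifNotExists)
      case parts =>
        // Only the v1 branch translates IF NOT EXISTS into a save mode.
        val mode = if (parsed.ifNotExists) "Ignore" else "ErrorIfExists"
        V1Create(parts.mkString("."), mode)
    }
}
```

Keeping the `IF NOT EXISTS` to save-mode translation inside the v1 branch mirrors the issue's point that v1-specific logic should not live in the parser.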
[jira] [Resolved] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27067. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Attachment: SPIP_ Spark API for Table Metadata.pdf > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Description: Goals: * Propose semantics for identifiers and a listing API to support multiple catalogs ** Support any namespace scheme used by an external catalog ** Avoid traversing namespaces via multiple listing calls from Spark * Outline migration from the current behavior to Spark with multiple catalogs > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf > > > Goals: > * Propose semantics for identifiers and a listing API to support multiple > catalogs > ** Support any namespace scheme used by an external catalog > ** Avoid traversing namespaces via multiple listing calls from Spark > * Outline migration from the current behavior to Spark with multiple catalogs
[jira] [Created] (SPARK-27066) SPIP: Identifiers for multi-catalog support
Ryan Blue created SPARK-27066: - Summary: SPIP: Identifiers for multi-catalog support Key: SPARK-27066 URL: https://issues.apache.org/jira/browse/SPARK-27066 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Updated] (SPARK-27067) SPIP: Catalog API for table metadata
[ https://issues.apache.org/jira/browse/SPARK-27067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27067: -- Description: Goal: Define a catalog API to create, alter, load, and drop tables > SPIP: Catalog API for table metadata > > > Key: SPARK-27067 > URL: https://issues.apache.org/jira/browse/SPARK-27067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Spark API for Table Metadata.pdf > > > Goal: Define a catalog API to create, alter, load, and drop tables
[jira] [Created] (SPARK-27067) SPIP: Catalog API for table metadata
Ryan Blue created SPARK-27067: - Summary: SPIP: Catalog API for table metadata Key: SPARK-27067 URL: https://issues.apache.org/jira/browse/SPARK-27067 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ryan Blue
[jira] [Resolved] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-27066. --- Resolution: Fixed I'm resolving this issue because the vote to adopt the proposal passed. I've added links to the google doc proposal (now view-only) and vote thread, and uploaded a copy of the proposal as a PDF. > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
[jira] [Commented] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784736#comment-16784736 ]

Ryan Blue commented on SPARK-23521:
-----------------------------------

I've turned off commenting on the google doc to preserve its state, with the existing comments. I'm also adding a PDF of the final proposal to this issue.

> SPIP: Standardize SQL logical plans with DataSourceV2
> -----------------------------------------------------
>
>                 Key: SPARK-23521
>                 URL: https://issues.apache.org/jira/browse/SPARK-23521
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: SPIP
>         Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
> on the dev list. The proposal is to standardize the logical plans used for
> write operations to make the planner more maintainable and to make Spark's
> write behavior predictable and reliable. It proposes the following principles:
> # Use well-defined logical plan nodes for all high-level operations: insert,
> create, CTAS, overwrite table, etc.
> # Use planner rules that match on these high-level nodes, so that it isn’t
> necessary to create rules to match each eventual code path individually.
> # Clearly define Spark’s behavior for these logical plan nodes. Physical
> nodes should implement that behavior so that all code paths eventually make
> the same guarantees.
> # Specialize implementation when creating a physical plan, not logical
> plans. This will avoid behavior drift and ensure planner code is shared
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical
> operations, most of which are already defined in SQL or implemented by some
> write path in Spark.
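The second and fourth principles in the SPARK-23521 summary (match planner rules on high-level nodes, and specialize only when creating the physical plan) can be illustrated with a small sketch. Everything here is hypothetical: `Insert`, `Ctas`, `Overwrite`, `FileWriteExec`, `V2WriteExec`, and `WritePlanner` are toy names rather than Spark classes, and the `isV2` predicate is an assumed way of choosing an implementation.

```scala
// Hypothetical high-level logical write nodes; not Spark's classes.
sealed trait LogicalWrite
final case class Insert(table: String) extends LogicalWrite
final case class Ctas(table: String) extends LogicalWrite
final case class Overwrite(table: String) extends LogicalWrite

// Hypothetical physical nodes, one per implementation family.
sealed trait PhysicalWrite
final case class FileWriteExec(op: String, table: String) extends PhysicalWrite
final case class V2WriteExec(op: String, table: String) extends PhysicalWrite

object WritePlanner {
  // A single rule matches each high-level node; the choice of implementation
  // is made here, at physical planning time, and nowhere earlier.
  def plan(node: LogicalWrite, isV2: String => Boolean): PhysicalWrite = {
    val (op, table) = node match {
      case Insert(t)    => ("insert", t)
      case Ctas(t)      => ("ctas", t)
      case Overwrite(t) => ("overwrite", t)
    }
    if (isV2(table)) V2WriteExec(op, table) else FileWriteExec(op, table)
  }
}
```

Because `LogicalWrite` is sealed, adding a new high-level operation makes the compiler flag the planner's match as non-exhaustive, which is one way to keep code paths from drifting apart.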
[jira] [Updated] (SPARK-27066) SPIP: Identifiers for multi-catalog support
[ https://issues.apache.org/jira/browse/SPARK-27066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-27066: -- Attachment: SPIP_ Identifiers for multi-catalog Spark.pdf > SPIP: Identifiers for multi-catalog support > --- > > Key: SPARK-27066 > URL: https://issues.apache.org/jira/browse/SPARK-27066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Identifiers for multi-catalog Spark.pdf
[jira] [Updated] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated SPARK-23521:
------------------------------
    Attachment: SPIP_ Standardize logical plans.pdf

> SPIP: Standardize SQL logical plans with DataSourceV2
> -----------------------------------------------------
>
>                 Key: SPARK-23521
>                 URL: https://issues.apache.org/jira/browse/SPARK-23521
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: SPIP
>         Attachments: SPIP_ Standardize logical plans.pdf
>
>
> Executive Summary: This SPIP is based on [discussion about the DataSourceV2
> implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E]
> on the dev list. The proposal is to standardize the logical plans used for
> write operations to make the planner more maintainable and to make Spark's
> write behavior predictable and reliable. It proposes the following principles:
> # Use well-defined logical plan nodes for all high-level operations: insert,
> create, CTAS, overwrite table, etc.
> # Use planner rules that match on these high-level nodes, so that it isn’t
> necessary to create rules to match each eventual code path individually.
> # Clearly define Spark’s behavior for these logical plan nodes. Physical
> nodes should implement that behavior so that all code paths eventually make
> the same guarantees.
> # Specialize implementation when creating a physical plan, not logical
> plans. This will avoid behavior drift and ensure planner code is shared
> across physical implementations.
> The SPIP doc presents a small but complete set of those high-level logical
> operations, most of which are already defined in SQL or implemented by some
> write path in Spark.
[jira] [Updated] (SPARK-26874) With PARQUET-1414, Spark can erroneously write empty pages
[ https://issues.apache.org/jira/browse/SPARK-26874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated SPARK-26874:
------------------------------
    Summary: With PARQUET-1414, Spark can erroneously write empty pages  (was: When we upgrade Parquet to 1.11+, Spark can erroneously write empty pages)

> With PARQUET-1414, Spark can erroneously write empty pages
> ----------------------------------------------------------
>
>                 Key: SPARK-26874
>                 URL: https://issues.apache.org/jira/browse/SPARK-26874
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.4.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> This issue will only come up when Spark upgrades its Parquet dependency to
> the latest. This issue is being filed to proactively fix the bug before we
> upgrade - it's not something that would easily be found in the current unit
> tests and can be missed until the community scale tests in an e.g. RC phase.
> Parquet introduced a new feature to limit the number of rows written to a
> page in a column chunk - see PARQUET-1414. Previously, Parquet would only
> flush pages to the column store after the page writer had filled its buffer
> with a certain amount of bytes. The idea of the Parquet patch was to make
> page writers flush to the column store upon the writer being given a certain
> number of rows - the default value is 2.
> The patch makes the Spark Parquet Data Source erroneously write empty pages
> to column chunks, making the Parquet file ultimately unreadable with
> exceptions like these:
>
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/private/var/folders/7f/rrg37sj15r33cwlxbhg7fyg5080xxg/T/spark-2c1b1e5d-c132-42fe-98e1-b523b9baa176/parquet-data/part-2-b8b4bb3c-b49f-440b-b96c-53fcd19786ad-c000.snappy.parquet
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>  ... 18 more
> Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
>  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
>  at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80)
>  at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62)
>  at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridValuesReader.readInteger(RunLengthBitPackingHybridValuesReader.java:53)
>  at org.apache.parquet.column.impl.ColumnReaderBase$ValuesReaderIntIterator.nextInt(ColumnReaderBase.java:733)
>  at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:567)
>  at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:705)
>  at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
>  at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
>  at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:84)
>  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
>  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
>  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
>  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>  ... 22 more
> {code}
> What's happening here is that the reader is being given a page with no
> values, which Parquet can never handle.
> The root cause is due to the way Spark treats empty (null) records in
> optional fields. Concretely, in {{ParquetWriteSupport#write}}, we always
> indicate to the recordConsumer that we are starting a message
> ({{recordConsumer#startMessage}}). If Spark is given a null field in the row,
> it still indicates to the record consumer after having ignored the row that
> the message is finished ({{recordConsumer#endMessage}}). The ending of the
> message causes all column
[jira] [Commented] (SPARK-26874) When we upgrade Parquet to 1.11+, Spark can erroneously write empty pages
[ https://issues.apache.org/jira/browse/SPARK-26874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767798#comment-16767798 ] Ryan Blue commented on SPARK-26874: --- To be clear, Parquet has not released any 1.11.x versions, so this is a problem with master, not a Parquet release. > When we upgrade Parquet to 1.11+, Spark can erroneously write empty pages > - > > Key: SPARK-26874 > URL: https://issues.apache.org/jira/browse/SPARK-26874 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > This issue will only come up when Spark upgrades its Parquet dependency to > the latest. This issue is being filed to proactively fix the bug before we > upgrade - it's not something that would easily be found in the current unit > tests and can be missed until the community scale tests in, e.g., an RC phase. > Parquet introduced a new feature to limit the number of rows written to a > page in a column chunk - see PARQUET-1414. Previously, Parquet would only > flush pages to the column store after the page writer had filled its buffer > with a certain amount of bytes. The idea of the Parquet patch was to make > page writers flush to the column store upon the writer being given a certain > number of rows - the default value is 20000. 
> The patch makes the Spark Parquet Data Source erroneously write empty pages > to column chunks, making the Parquet file ultimately unreadable with > exceptions like these: > > {code:java} > Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value > at 0 in block -1 in file > file:/private/var/folders/7f/rrg37sj15r33cwlxbhg7fyg5080xxg/T/spark-2c1b1e5d-c132-42fe-98e1-b523b9baa176/parquet-data/part-2-b8b4bb3c-b49f-440b-b96c-53fcd19786ad-c000.snappy.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > ... 18 more > Caused by: java.lang.IllegalArgumentException: Reading past RLE/BitPacking > stream. 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53) > at > org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:80) > at > org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:62) > at > org.apache.parquet.column.values.rle.RunLengthBitPackingHybridValuesReader.readInteger(RunLengthBitPackingHybridValuesReader.java:53) > at > org.apache.parquet.column.impl.ColumnReaderBase$ValuesReaderIntIterator.nextInt(ColumnReaderBase.java:733) > at > org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:567) > at > org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:705) > at > org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30) > at > org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47) > at > org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:84) > at > org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271) > at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147) > at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109) > at > org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165) > at > org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222) > ... 22 more > {code} > What's happening here is that the reader is being given a page with no > values, which Parquet can never handle. > The root cause is the way Spark treats empty (null) records in > optional fields. 
Concretely, in {{ParquetWriteSupport#write}}, we always > indicate to the recordConsumer that we are starting a message > ({{recordConsumer#startMessage}}). If Spark is given a null field in the row, > it still indicates to the record consumer after having ignored the row that > the message is finished ({{recordConsumer#endMessage}}). The ending of the
[jira] [Created] (SPARK-26873) FileFormatWriter creates inconsistent MR job IDs
Ryan Blue created SPARK-26873: - Summary: FileFormatWriter creates inconsistent MR job IDs Key: SPARK-26873 URL: https://issues.apache.org/jira/browse/SPARK-26873 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.2, 2.2.3 Reporter: Ryan Blue

FileFormatWriter uses the current time to create a Job ID that is used when calling Hadoop committers. This ID is used to produce task and task attempt IDs used in commits. The problem is that Spark [generates this Job ID|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L209] in {{executeTask}} for every task:

{code:lang=scala}
  /** Writes data out in a single Spark task. */
  private def executeTask(
      description: WriteJobDescription,
      sparkStageId: Int,
      sparkPartitionId: Int,
      sparkAttemptNumber: Int,
      committer: FileCommitProtocol,
      iterator: Iterator[InternalRow]): WriteTaskResult = {

    val jobId = SparkHadoopWriterUtils.createJobID(new Date, sparkStageId)
    val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
    val taskAttemptId = new TaskAttemptID(taskId, sparkAttemptNumber)
    ...
{code}

Because this is called in each task, the Job ID used is not consistent across tasks, which violates the contract expected by Hadoop committers. A committer that relies on identical task IDs across attempts for correctness will produce incorrect results. For example, a Hadoop committer should be able to rename an output file to a path based on the task ID to ensure that only one copy is committed.

We hit this issue when preemption caused a task to die just after the commit operation. The commit coordinator authorized a second task commit because the first did not complete due to preemption.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
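To make the inconsistency concrete, here is a minimal sketch in plain Scala. It does not use the real Hadoop {{JobID}} or {{TaskAttemptID}} classes; {{JobId}}, {{TaskAttemptId}}, {{perTaskId}}, and {{sharedId}} are hypothetical stand-ins, and the timestamp format is only an assumption chosen for illustration. It contrasts deriving the job ID from each task's own clock with creating it once and handing it to every task.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// Hypothetical stand-ins for Hadoop's JobID and TaskAttemptID; not the real classes.
final case class JobId(trackerId: String, stageId: Int)
final case class TaskAttemptId(job: JobId, partition: Int, attempt: Int)

object JobIdSketch {
  // Timestamp-based tracker ID with second resolution (format assumed for illustration).
  def createJobId(time: Date, stageId: Int): JobId = {
    val fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
    JobId(fmt.format(time), stageId)
  }

  // Problematic shape: every task derives the job ID from its own start time.
  def perTaskId(taskStart: Date, stageId: Int, partition: Int, attempt: Int): TaskAttemptId =
    TaskAttemptId(createJobId(taskStart, stageId), partition, attempt)

  // Consistent shape: the job ID is created once and passed to every task.
  def sharedId(job: JobId, partition: Int, attempt: Int): TaskAttemptId =
    TaskAttemptId(job, partition, attempt)
}
```

With the per-task shape, two attempts of the same partition that start a few seconds apart end up under different job IDs, so a committer cannot match them by ID. With the shared shape they differ only in the attempt number.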
[jira] [Created] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.
Ryan Blue created SPARK-26811: - Summary: Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis. Key: SPARK-26811 URL: https://issues.apache.org/jira/browse/SPARK-26811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Ryan Blue
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756654#comment-16756654 ] Ryan Blue commented on SPARK-26677: --- Thanks, sorry about the mistake. > Incorrect results of not(eqNullSafe) when data read from Parquet file > -- > > Key: SPARK-26677 > URL: https://issues.apache.org/jira/browse/SPARK-26677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu > 18.04). >Reporter: Michal Kapalka >Priority: Blocker > Labels: correctness > > Example code (spark-shell from Spark 2.4.0): > {code:java} > scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t") > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > +-----+ > {code} > Running the same with Spark 2.2.0 or 2.3.2 gives the correct result: > {code:java} > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > | null| > +-----+ > {code} > Also, with a different input sequence and Spark 2.4.0 we get the correct > result: > {code:java} > scala> Seq("A", null).toDS.repartition(1).write.parquet("t") > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > | null| > +-----+ > {code}
[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-26677: -- Fix Version/s: 2.4.1 > Incorrect results of not(eqNullSafe) when data read from Parquet file > -- > > Key: SPARK-26677 > URL: https://issues.apache.org/jira/browse/SPARK-26677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu > 18.04). >Reporter: Michal Kapalka >Priority: Blocker > Labels: correctness > Fix For: 2.4.1 > > > Example code (spark-shell from Spark 2.4.0): > {code:java} > scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t") > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > +-----+ > {code} > Running the same with Spark 2.2.0 or 2.3.2 gives the correct result: > {code:java} > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > | null| > +-----+ > {code} > Also, with a different input sequence and Spark 2.4.0 we get the correct > result: > {code:java} > scala> Seq("A", null).toDS.repartition(1).write.parquet("t") > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > | null| > +-----+ > {code}
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752606#comment-16752606 ] Ryan Blue commented on SPARK-26677: --- To clarify [~dongjoon]'s comment: All recent versions of Parquet are affected by this {{not(eqNullSafe(...))}} bug. Only Parquet 1.10.0 is affected by PARQUET-1309. This filter bug has been present since Parquet introduced dictionary filtering. > Incorrect results of not(eqNullSafe) when data read from Parquet file > -- > > Key: SPARK-26677 > URL: https://issues.apache.org/jira/browse/SPARK-26677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu > 18.04). >Reporter: Michal Kapalka >Priority: Blocker > Labels: correctness > > Example code (spark-shell from Spark 2.4.0): > {code:java} > scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t") > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > +-----+ > {code} > Running the same with Spark 2.2.0 or 2.3.2 gives the correct result: > {code:java} > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > | null| > +-----+ > {code} > Also, with a different input sequence and Spark 2.4.0 we get the correct > result: > {code:java} > scala> Seq("A", null).toDS.repartition(1).write.parquet("t") > scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show > +-----+ > |value| > +-----+ > | null| > +-----+ > {code}
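For reference, the semantics the query is expected to have can be written down without Spark or Parquet. This is a minimal sketch in plain Scala where {{None}} stands for SQL NULL; {{NullSafeSketch}} and its methods are illustrative names only, and nothing here reproduces Parquet's filter code.

```scala
// Null-safe equality modeled with Option: None stands for SQL NULL.
object NullSafeSketch {
  // Meaning of `a <=> b`: NULL equals only NULL, and the result is never NULL.
  def eqNullSafe(a: Option[String], b: Option[String]): Boolean = a == b

  // Rows that should survive `where(not(col("value").eqNullSafe(literal)))`.
  def keepNotEqual(values: Seq[Option[String]], literal: String): Seq[Option[String]] =
    values.filterNot(v => eqNullSafe(v, Some(literal)))
}
```

For the input {{"A", "A", null}} the null row must be kept. Since the comment above attributes the bug to dictionary filtering, a plausible reading (an assumption, not confirmed in this thread) is that a filter looking only at dictionary values sees just "A" and never accounts for the nulls, which are not dictionary entries.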
[jira] [Created] (SPARK-26682) Task attempt ID collision causes lost data
Ryan Blue created SPARK-26682: - Summary: Task attempt ID collision causes lost data Key: SPARK-26682 URL: https://issues.apache.org/jira/browse/SPARK-26682 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.2, 2.1.3 Reporter: Ryan Blue We recently tracked missing data to a collision in the fake Hadoop task attempt ID created when using Hadoop OutputCommitters. This is similar to SPARK-24589. A stage had one task fail to get one shard from a shuffle, causing a FetchFailedException, and Spark resubmitted the stage. Because only one task was affected, the original stage attempt continued running tasks that had been resubmitted. Another task ran two attempts concurrently on the same executor, but both had the same attempt number because they were from different stage attempts. Because the attempt number was the same, the two attempts used the same temp locations. That caused one attempt to fail because a file path already existed, and that attempt then removed the shared temp location and deleted the other attempt's data. When the second attempt succeeded, it committed partial data. The problem was that both attempts had the same partition and attempt numbers, despite being run in different stage attempts, and that was used to create a Hadoop task attempt ID on which the temp location was based. The fix is to use Spark's global task attempt ID, which is a counter, instead of attempt number because attempt number is reused in stage attempts.
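The collision can be sketched as a toy model in plain Scala. The {{Attempt}} case class, its field names, and the path layout are hypothetical and are not Spark's or Hadoop's. In Spark itself the two values correspond to {{TaskContext.attemptNumber()}} and {{TaskContext.taskAttemptId()}}.

```scala
// Toy model: field names and path layout are illustrative, not Spark's or Hadoop's.
final case class Attempt(stageAttempt: Int, partition: Int, attemptNumber: Int, globalTaskId: Long)

object TempPathSketch {
  // Temp location keyed on (partition, attempt number): reused across stage attempts.
  def pathFromAttemptNumber(a: Attempt): String =
    s"_temporary/attempt_${a.partition}_${a.attemptNumber}"

  // Temp location keyed on a globally unique task attempt counter.
  def pathFromGlobalId(a: Attempt): String =
    s"_temporary/attempt_${a.partition}_${a.globalTaskId}"
}
```

Two attempts of partition 7 that both carry attempt number 0, one from each stage attempt, map to the same path under the first scheme and to different paths under the second.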
[jira] [Created] (SPARK-26681) Support Ammonite scopes in OuterScopes
Ryan Blue created SPARK-26681: - Summary: Support Ammonite scopes in OuterScopes Key: SPARK-26681 URL: https://issues.apache.org/jira/browse/SPARK-26681 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Ryan Blue When the Ammonite REPL is used with Spark, users have to call {{OuterScopes.addOuterScope}} [manually|https://github.com/alexarchambault/ammonite-spark#troubleshooting] in order to get a working Dataset. A similar problem is already solved for the Spark REPL, which recognizes class names and returns the correct scope. Spark should support Ammonite scopes also.
[jira] [Created] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
Ryan Blue created SPARK-26679: - Summary: Deconflict spark.executor.pyspark.memory and spark.python.worker.memory Key: SPARK-26679 URL: https://issues.apache.org/jira/browse/SPARK-26679 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Ryan Blue In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory space of a Python worker. There is another RDD setting, spark.python.worker.memory, that controls when Spark decides to spill data to disk. These settings currently look similar but are not related to one another. PySpark should probably use spark.executor.pyspark.memory to limit or default the setting of spark.python.worker.memory because the latter property controls spilling and should be lower than the total memory limit. Renaming spark.python.worker.memory would also help clarity because it sounds like it should control the limit, but is more like the JVM setting spark.memory.fraction.
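One way the proposed relationship could look, as a sketch only: the rule below caps the spill threshold by the total limit when a limit is set. {{PySparkMemorySketch}} and {{effectiveSpillThresholdMb}} are hypothetical names, and capping (rather than, say, defaulting to a fraction of the limit) is just one of the options the issue leaves open.

```scala
object PySparkMemorySketch {
  // Hypothetical rule: the spill threshold (spark.python.worker.memory) never
  // exceeds the total worker limit (spark.executor.pyspark.memory) when one is set.
  def effectiveSpillThresholdMb(workerMemoryMb: Long, pysparkLimitMb: Option[Long]): Long =
    pysparkLimitMb.fold(workerMemoryMb)(limit => math.min(workerMemoryMb, limit))
}
```

With no limit configured the spill threshold is used as given, so existing jobs would behave as before under this rule.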
[jira] [Resolved] (SPARK-23398) DataSourceV2 should provide a way to get a source's schema.
[ https://issues.apache.org/jira/browse/SPARK-23398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-23398. --- Resolution: Fixed SPARK-25528 adds a Table interface that can report its schema. > DataSourceV2 should provide a way to get a source's schema. > --- > > Key: SPARK-23398 > URL: https://issues.apache.org/jira/browse/SPARK-23398 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > To validate writes with DataSourceV2, the planner needs to get a source's > schema. The current API has no direct way to get that schema. SPARK-23321 > instantiates a reader to get the schema, but sources are not required to > implement {{ReadSupport}} or {{ReadSupportWithSchema}}. V2 should either add > a method to get the schema of a source, or require sources implement > {{ReadSupport}}.
[jira] [Created] (SPARK-26666) DataSourceV2: Add overwrite and dynamic overwrite.
Ryan Blue created SPARK-26666: - Summary: DataSourceV2: Add overwrite and dynamic overwrite. Key: SPARK-26666 URL: https://issues.apache.org/jira/browse/SPARK-26666 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Ryan Blue

DataSourceV2 will support overwrite operations using the following two interfaces (from the write side design doc):

{code:lang=java}
public interface SupportsOverwrite extends WriteBuilder {
  WriteBuilder overwrite(Filter[] filters);
}

public interface SupportsDynamicOverwrite extends WriteBuilder {
  WriteBuilder overwriteDynamicPartitions();
}
{code}
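As a rough illustration of how these mix-ins might be used, here is a self-contained Scala model. The traits mirror the two interfaces above, but {{Filter}}, {{EqualTo}}, {{RecordingWriteBuilder}}, and {{planOverwrite}} are simplified, hypothetical stand-ins and not the actual DataSourceV2 classes.

```scala
// Simplified stand-ins for the interfaces in the issue; Filter is reduced to one case.
sealed trait Filter
final case class EqualTo(attribute: String, value: Any) extends Filter

trait WriteBuilder
trait SupportsOverwrite extends WriteBuilder {
  def overwrite(filters: Array[Filter]): WriteBuilder
}
trait SupportsDynamicOverwrite extends WriteBuilder {
  def overwriteDynamicPartitions(): WriteBuilder
}

// Hypothetical builder that only records which overwrite mode was requested.
final class RecordingWriteBuilder(val mode: String = "append")
    extends SupportsOverwrite with SupportsDynamicOverwrite {
  def overwrite(filters: Array[Filter]): WriteBuilder =
    new RecordingWriteBuilder(s"overwrite where ${filters.mkString(" AND ")}")
  def overwriteDynamicPartitions(): WriteBuilder =
    new RecordingWriteBuilder("dynamic partition overwrite")
}

object OverwriteSketch {
  // Planner-side check: call overwrite only when the builder declares support.
  def planOverwrite(builder: WriteBuilder, filters: Array[Filter]): Either[String, WriteBuilder] =
    builder match {
      case b: SupportsOverwrite => Right(b.overwrite(filters))
      case _                    => Left("source does not support overwrite by filter")
    }
}
```

Keeping each capability in its own mix-in lets a caller reject an unsupported operation by a simple type check instead of discovering the problem when the write runs.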
[jira] [Resolved] (SPARK-23321) DataSourceV2 should apply some validation when writing.
[ https://issues.apache.org/jira/browse/SPARK-23321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-23321. --- Resolution: Fixed Done for Append plans. Will be included in new logical plans as they are added. > DataSourceV2 should apply some validation when writing. > --- > > Key: SPARK-23321 > URL: https://issues.apache.org/jira/browse/SPARK-23321 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceV2 writes are not validated. These writes should be validated using > the standard preprocess rules that are used for Hive and DataSource tables.
[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets
[ https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678728#comment-16678728 ] Ryan Blue commented on SPARK-25966: --- [~andrioni], were there any failed tasks or executors in the job that wrote this file? It looks to me like a problem in closing the file or with an executor dying before finishing a file. If that happened and the data wasn't cleaned up, then it could lead to this problem. > "EOF Reached the end of stream with bytes left to read" while reading/writing > to Parquets > - > > Key: SPARK-25966 > URL: https://issues.apache.org/jira/browse/SPARK-25966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on > top of a Mesos cluster. Both input and output Parquet files are on S3. >Reporter: Alessandro Andrioni >Priority: Major > > I was persistently getting the following exception while trying to run one > Spark job we have using Spark 2.4.0. It went away after I regenerated from > scratch all the input Parquet files (generated by another Spark job also > using Spark 2.4.0). > Is there a chance that Spark is writing (quite rarely) corrupted Parquet > files? > {code:java} > org.apache.spark.SparkException: Job aborted. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276) > at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557) > (...) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 > in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: > Reached the end of stream with 996 bytes left to read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at >
[jira] [Commented] (SPARK-25531) new write APIs for data source v2
[ https://issues.apache.org/jira/browse/SPARK-25531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629418#comment-16629418 ] Ryan Blue commented on SPARK-25531: --- [~cloud_fan], what was the intent for this umbrella issue? You described it as progress of "Standardize SQL logical plans" but the current description is "new write APIs" instead. Also, these issues were already tracked under the umbrella SPARK-22386 to improve DSv2, which covers the new logical plans and other support issues like adding interfaces for required clustering and sorting (SPARK-23889). Is your intent to close the other issue because it is too old? > new write APIs for data source v2 > - > > Key: SPARK-25531 > URL: https://issues.apache.org/jira/browse/SPARK-25531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Wenchen Fan >Priority: Major > > The current data source write API heavily depends on {{SaveMode}}, which > doesn't have clear semantics, especially when writing to tables. > We should design a new set of write APIs without {{SaveMode}}.
[jira] [Resolved] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-23521. --- Resolution: Fixed Marking this as "Fixed" because the vote passed. > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark.
[jira] [Resolved] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-15420. --- Resolution: Won't Fix > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue >Priority: Major > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But, this isn't > a full solution for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations.
[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue updated SPARK-15420: -- Target Version/s: (was: 2.4.0) > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue >Priority: Major > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But, this isn't > a full solution for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations.
[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows
[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590531#comment-16590531 ] Ryan Blue commented on SPARK-25213: --- Sorry, I just realized the point is that the filter could have a Python UDF in it. In that case, we need to add the project before the filter runs. I'll take a look at it. > DataSourceV2 doesn't seem to produce unsafe rows > - > > Key: SPARK-25213 > URL: https://issues.apache.org/jira/browse/SPARK-25213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Li Jin >Priority: Major > > Reproduce (Need to compile test-classes): > bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes > {code:java} > datasource_v2_df = spark.read \ > .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") > \ > .load() > result = datasource_v2_df.withColumn('x', udf(lambda x: x, > 'int')(datasource_v2_df['i'])) > result.show() > {code} > The above code fails with: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > {code} > Seems like Data Source V2 doesn't produce unsafeRows here.
[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows
[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590406#comment-16590406 ]

Ryan Blue commented on SPARK-25213:
-----------------------------------

[~cloud_fan], that PR adds a Project node on top of the v2 scan so that the rows are converted to unsafe. We should be able to look at the physical plan to see whether the Project node is there. If not, we should find out why it is missing. If it is there, we should find out why it isn't producing unsafe rows.
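The failure mode discussed in this thread can be reproduced in miniature. The following is a sketch with hypothetical stand-in classes ({{Row}}, {{GenericRow}}, {{PackedRow}}); Spark's real {{InternalRow}}, {{GenericInternalRow}} and {{UnsafeRow}} are much richer. It assumes an operator whose declared input is the general row interface but which casts to the packed implementation, and shows how a projection placed in front of that operator avoids the cast failure.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-ins for Spark's row classes; not Spark code.
class UnsafeProjectionDemo {

    interface Row {
        int getInt(int ordinal);
    }

    // What a source may hand back: a plain object-backed row.
    static final class GenericRow implements Row {
        private final int[] values;
        GenericRow(int... values) { this.values = values.clone(); }
        public int getInt(int ordinal) { return values[ordinal]; }
    }

    // What some operators silently expect: a row in a packed binary layout.
    static final class PackedRow implements Row {
        private final ByteBuffer bytes;
        PackedRow(ByteBuffer bytes) { this.bytes = bytes; }
        public int getInt(int ordinal) { return bytes.getInt(ordinal * Integer.BYTES); }
        int sizeInBytes() { return bytes.capacity(); }
    }

    // Declared to accept any Row, but casts to PackedRow internally.
    static int serializedSize(Row row) {
        return ((PackedRow) row).sizeInBytes(); // ClassCastException for GenericRow
    }

    // Plays the role of the Project node: copies any Row into the packed layout.
    static PackedRow project(Row row, int numFields) {
        ByteBuffer buffer = ByteBuffer.allocate(numFields * Integer.BYTES);
        for (int i = 0; i < numFields; i++) {
            buffer.putInt(i * Integer.BYTES, row.getInt(i));
        }
        return new PackedRow(buffer);
    }

    public static void main(String[] args) {
        Row fromSource = new GenericRow(7, 42);
        try {
            serializedSize(fromSource);
        } catch (ClassCastException e) {
            System.out.println("cast failed without the projection");
        }
        // With the projection in front, the same operator works: two ints, 8 bytes.
        System.out.println(serializedSize(project(fromSource, 2)));
    }
}
```

The sketch also shows why the approach is fragile: nothing in the signature of {{serializedSize}} tells a planner that the projection is required.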
[jira] [Commented] (SPARK-25188) Add WriteConfig
[ https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589088#comment-16589088 ]

Ryan Blue commented on SPARK-25188:
-----------------------------------

One update to that proposal: {{BatchOverwriteSupport}} should be split into two interfaces, one for dynamic overwrite and one for overwrite using filter expressions. That supports the two use cases separately, since some sources won't support dynamic partition overwrite.

> Add WriteConfig
> ---------------
>
>                 Key: SPARK-25188
>                 URL: https://issues.apache.org/jira/browse/SPARK-25188
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> The current write API is not flexible enough to implement more complex write
> operations like `replaceWhere`. We can follow the read API and add a
> `WriteConfig` to make it more flexible.
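One way to picture the split suggested in this comment is sketched below. The interface name {{BatchDynamicOverwriteSupport}}, the string-based filters, and the two helper checks are assumptions made for illustration; the comment does not name the second interface, and this is not the API that shipped in Spark. The point is that a source opts into each capability on its own, and a planner can test for each one separately.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; interface and method names are illustrative only.
class OverwriteSupportDemo {

    interface WriteConfig { }

    interface BatchWriteSupport {
        WriteConfig newWriteConfig(Map<String, String> writeOptions);
    }

    // Overwrite rows that match filter expressions (filters shown as strings here).
    interface BatchOverwriteSupport extends BatchWriteSupport {
        WriteConfig newOverwrite(Map<String, String> writeOptions, List<String> filters);
    }

    // Overwrite whichever partitions the write ends up touching.
    interface BatchDynamicOverwriteSupport extends BatchWriteSupport {
        WriteConfig newDynamicOverwrite(Map<String, String> writeOptions);
    }

    // A source that supports overwrite by filter but not dynamic overwrite.
    static final class FilterOnlySource implements BatchOverwriteSupport {
        public WriteConfig newWriteConfig(Map<String, String> writeOptions) {
            return new WriteConfig() { };
        }
        public WriteConfig newOverwrite(Map<String, String> writeOptions, List<String> filters) {
            return new WriteConfig() { };
        }
    }

    // Planner-side checks: each capability is tested independently.
    static boolean supportsFilterOverwrite(BatchWriteSupport source) {
        return source instanceof BatchOverwriteSupport;
    }

    static boolean supportsDynamicOverwrite(BatchWriteSupport source) {
        return source instanceof BatchDynamicOverwriteSupport;
    }

    public static void main(String[] args) {
        BatchWriteSupport source = new FilterOnlySource();
        System.out.println(supportsFilterOverwrite(source));  // true
        System.out.println(supportsDynamicOverwrite(source)); // false
    }
}
```

With a single combined interface, {{FilterOnlySource}} would have to implement a dynamic overwrite method it cannot honor and fail at run time instead.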
[jira] [Commented] (SPARK-25188) Add WriteConfig
[ https://issues.apache.org/jira/browse/SPARK-25188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589086#comment-16589086 ]

Ryan Blue commented on SPARK-25188:
-----------------------------------

Here's the original proposal for adding a write config:

The read side has a {{ScanConfig}}, but the write side doesn't have an equivalent object that tracks a particular write. I think if we introduce one, the API would be more similar between the read and write sides, and we would have a better API for overwrite operations. I propose adding a {{WriteConfig}} object and passing it like this:

{code:lang=java}
interface BatchWriteSupport {
  // Creates the object that tracks one particular write.
  WriteConfig newWriteConfig(Map<String, String> writeOptions);
  DataWriterFactory createWriterFactory(WriteConfig config);
  void commit(WriteConfig config, WriterCommitMessage[] messages);
}
{code}

That allows us to pass options for the write that affect how the WriterFactory operates. For example, in Iceberg I could request using ORC as the underlying format instead of Parquet. (I also suggested an addition like this for the read side.)

The other benefit of adding {{WriteConfig}} is that it provides a clean way of adding the ReplaceData operations. The ones I'm currently working on are ReplaceDynamicPartitions and ReplaceData. The first one removes any data in partitions that are being written to, and the second one replaces data based on a filter: e.g. {{df.writeTo(t).overwrite($"day" === "2018-08-15")}}. The information about replacement could be carried by {{WriteConfig}} to {{commit}} and would be created with a support interface:

{code:lang=java}
interface BatchOverwriteSupport extends BatchWriteSupport {
  // Replaces the rows that match the filters.
  WriteConfig newOverwrite(Map<String, String> writeOptions, Filter[] filters);
  // Replaces every partition that the write touches.
  WriteConfig newDynamicOverwrite(Map<String, String> writeOptions);
}
{code}

This is much cleaner than staging a delete and then running a write to complete the operation. All of the information about what to overwrite is just passed to the commit operation, which can handle it at once. This is much better for dynamic partition replacement because the partitions to be replaced aren't even known by Spark before the write.

Last, this adds a place for write life-cycle operations that matches the ScanConfig read life-cycle. This could be used to perform operations like getting a write lock on a Hive table if someone wanted to support Hive's locking mechanism in the future.
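The commit-time behavior described in this proposal can be sketched with an in-memory table. Everything here is hypothetical ({{InMemoryTable}}, the {{Mode}} enum, rows as strings) and it is not Spark's or Iceberg's code. It assumes that each task reports the partition it wrote in its commit message, so the set of partitions to replace is only known inside {{commit}}, where the delete and the insert happen in one step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the proposal; not Spark's or Iceberg's code.
class WriteConfigDemo {

    enum Mode { APPEND, DYNAMIC_OVERWRITE }

    // Immutable description of one write, created before any task runs.
    static final class WriteConfig {
        final Mode mode;
        WriteConfig(Mode mode) { this.mode = mode; }
    }

    // What each task reports back: the partition it wrote and its rows.
    static final class WriterCommitMessage {
        final String partition;
        final List<String> rows;
        WriterCommitMessage(String partition, List<String> rows) {
            this.partition = partition;
            this.rows = rows;
        }
    }

    // Table contents keyed by partition value.
    static final class InMemoryTable {
        final Map<String, List<String>> partitions = new HashMap<>();

        WriteConfig newWriteConfig() { return new WriteConfig(Mode.APPEND); }
        WriteConfig newDynamicOverwrite() { return new WriteConfig(Mode.DYNAMIC_OVERWRITE); }

        // The config tells commit what kind of write this is; the messages tell
        // it which partitions were touched. No separate staged delete is needed.
        void commit(WriteConfig config, List<WriterCommitMessage> messages) {
            Map<String, List<String>> written = new HashMap<>();
            for (WriterCommitMessage message : messages) {
                written.computeIfAbsent(message.partition, k -> new ArrayList<>())
                       .addAll(message.rows);
            }
            for (Map.Entry<String, List<String>> entry : written.entrySet()) {
                if (config.mode == Mode.DYNAMIC_OVERWRITE) {
                    partitions.put(entry.getKey(), entry.getValue());
                } else {
                    partitions.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                              .addAll(entry.getValue());
                }
            }
        }
    }

    public static void main(String[] args) {
        InMemoryTable table = new InMemoryTable();
        table.commit(table.newWriteConfig(), List.of(
            new WriterCommitMessage("2018-08-14", List.of("a")),
            new WriterCommitMessage("2018-08-15", List.of("b"))));
        table.commit(table.newDynamicOverwrite(), List.of(
            new WriterCommitMessage("2018-08-15", List.of("c"))));
        System.out.println(table.partitions.get("2018-08-14")); // [a], untouched
        System.out.println(table.partitions.get("2018-08-15")); // [c], replaced
    }
}
```

Only the partition that received new rows is replaced; the other partition keeps its data, which is the dynamic partition overwrite semantics the comment describes.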
[jira] [Commented] (SPARK-25190) Better operator pushdown API
[ https://issues.apache.org/jira/browse/SPARK-25190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589070#comment-16589070 ]

Ryan Blue commented on SPARK-25190:
-----------------------------------

The main problem I have with the current pushdown API is that Spark gets information back from pushdown before all pushdown is finished. I like the idea of having an immutable ScanConfig that is the result of pushdown operations, so it is clear that pushdown calls are finished before getting a reader factory or asking for statistics.

Unlike those methods that accept a ScanConfig, using {{SupportsPushDownXYZ}} for pushdown means that Spark is getting the results of pushdown operations back before all pushdown is complete. That means that the same confusion over pushdown order still exists, although the problem is fixed for some operations. I think that all feedback from pushdown should come from the ScanConfig.

> Better operator pushdown API
> ----------------------------
>
>                 Key: SPARK-25190
>                 URL: https://issues.apache.org/jira/browse/SPARK-25190
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> The current operator pushdown API is a little hard to use. It defines several
> {{SupportsPushDownXYZ}} interfaces and asks the implementation to be mutable
> to store the pushdown result. We should design a builder-like API.
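The builder-style design discussed here might look roughly like the following. This is a sketch under stated assumptions, not the API Spark adopted: {{ScanConfigBuilder}}, its method names, and the string-based filters are hypothetical, and the source is assumed to be able to evaluate filters only on a single partition column named {{day}}. The push methods return the builder rather than a result, so nothing can be read back until {{build()}} produces the immutable {{ScanConfig}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a builder-style pushdown API; names are illustrative only.
class ScanConfigDemo {

    // Immutable result of pushdown. All feedback (what the source accepted and
    // what Spark must still evaluate) is read from here.
    static final class ScanConfig {
        private final List<String> pushedFilters;
        private final List<String> residualFilters;
        private final List<String> requiredColumns;

        ScanConfig(List<String> pushed, List<String> residual, List<String> columns) {
            this.pushedFilters = Collections.unmodifiableList(new ArrayList<>(pushed));
            this.residualFilters = Collections.unmodifiableList(new ArrayList<>(residual));
            this.requiredColumns = Collections.unmodifiableList(new ArrayList<>(columns));
        }

        List<String> pushedFilters() { return pushedFilters; }
        List<String> residualFilters() { return residualFilters; }
        List<String> requiredColumns() { return requiredColumns; }
    }

    // Mutable only while pushdown is in progress.
    static final class ScanConfigBuilder {
        private final List<String> filters = new ArrayList<>();
        private List<String> columns = new ArrayList<>();

        ScanConfigBuilder pushFilter(String filter) {
            filters.add(filter);
            return this;
        }

        ScanConfigBuilder pruneColumns(List<String> required) {
            columns = new ArrayList<>(required);
            return this;
        }

        // Decisions are made once, after every request is known.
        // Assumption: only filters on the partition column "day" can be pushed.
        ScanConfig build() {
            List<String> pushed = new ArrayList<>();
            List<String> residual = new ArrayList<>();
            for (String filter : filters) {
                if (filter.startsWith("day ")) {
                    pushed.add(filter);
                } else {
                    residual.add(filter);
                }
            }
            return new ScanConfig(pushed, residual, columns);
        }
    }

    public static void main(String[] args) {
        ScanConfig config = new ScanConfigBuilder()
            .pushFilter("day = 2018-08-15")
            .pushFilter("id > 10")
            .pruneColumns(List.of("id", "day"))
            .build();
        System.out.println(config.pushedFilters());   // [day = 2018-08-15]
        System.out.println(config.residualFilters()); // [id > 10]
    }
}
```

Because the result is computed in {{build()}}, the order in which {{pushFilter}} and {{pruneColumns}} are called does not change what Spark sees.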