[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21308 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21308 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5236/
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21308 @tigerquoll, I'm talking about the DataSourceV2 API in general. I'm not sure whether there is value in exposing partitions, but I'd be happy to hear why you think they are valuable and to think through how that would fit with the existing API. I think that partitions that aren't hidden make tables much harder for users to work with, which is why Iceberg hides partitioning and automatically translates row filters into partition filters. For Kudu, maybe it is different. Could you write up the use case with a bit more context about what empty partitions are used for, and send it to the dev list? If we think that the v2 API should expose a partition concept, then that would definitely include a way to add or drop partitions.
Github user tigerquoll commented on the issue: https://github.com/apache/spark/pull/21308 @rdblue, when you say "you don't think the API proposed here needs to support a first-class partition concept", are you referring to the DeleteSupport interface, or to DataSourceV2 in general? If you are referring to DeleteSupport, do you have the same objections to separate "DropPartition"/"AddPartition" interfaces? If you mean that you don't think DataSourceV2 needs to support partitions as a first-class concept, then how are users of Spark supposed to perform operations like 1. adding, 2. altering, 3. removing, and 4. listing partitions on data sources that are represented by particular instances of DataSourceV2?
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21308 @tigerquoll, I'm not debating whether we should or shouldn't expose partitions here. In general, I'm undecided. I don't think that the API proposed here needs to support a first-class partition concept for tables, largely because partitions aren't currently exposed in the v2 API. The issue you linked to, SPARK-22389, exposes Spark's view of partitioning -- as in `repartition(col)` -- which is to say, how data rows are grouped together. That's not the same thing as partitions in a data source, which can exist independent of data rows.
Github user tigerquoll commented on the issue: https://github.com/apache/spark/pull/21308 @rdblue I think our debate is whether we should expose an API to represent direct operations on partitions in the new datasource api.
Github user tigerquoll commented on the issue: https://github.com/apache/spark/pull/21308 @rdblue Actually: https://issues.apache.org/jira/browse/SPARK-22389.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21308 @tigerquoll, there is currently no support for exposing partitions through the v2 API. That would be a different operation. If you wanted to implement partition operations through this API, then you would need to follow the guarantees specified here: if you need to delete by partition, then the expression must match records at partition boundaries, and the source must reject the delete operation otherwise.
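The guarantee described above can be sketched in code. This is an illustrative model of the contract, not the actual API from this PR: the names `DeleteSupport`, `deleteWhere`, `EqualTo`, and `PartitionOnlySource` are all invented for the example, and real filter expressions are much richer than single-column equality.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

interface DeleteSupport {
    /**
     * Delete all rows matching the filters, or throw IllegalArgumentException
     * if the delete cannot be performed without rewriting data.
     */
    void deleteWhere(List<EqualTo> filters);
}

/** Equality filter on a single column; kept minimal for the sketch. */
class EqualTo {
    final String column;
    final String value;
    EqualTo(String column, String value) { this.column = column; this.value = value; }
}

/** A source partitioned by some columns that only supports metadata-level deletes. */
class PartitionOnlySource implements DeleteSupport {
    private final Set<String> partitionColumns;
    final Set<String> droppedPartitions = new HashSet<>();

    PartitionOnlySource(Set<String> partitionColumns) {
        this.partitionColumns = partitionColumns;
    }

    @Override
    public void deleteWhere(List<EqualTo> filters) {
        for (EqualTo f : filters) {
            if (!partitionColumns.contains(f.column)) {
                // The filter does not fall on a partition boundary: reject the
                // delete rather than silently rewriting row-level data.
                throw new IllegalArgumentException(
                    "Cannot delete by non-partition column: " + f.column);
            }
        }
        // All filters align with partitions, so this is a metadata-only drop.
        for (EqualTo f : filters) {
            droppedPartitions.add(f.column + "=" + f.value);
        }
    }
}
```

Under this contract, a filter on the partition column `ds` becomes a cheap partition drop, while a filter on any other column throws, exactly the "reject otherwise" behavior the comment describes.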
Github user tigerquoll commented on the issue: https://github.com/apache/spark/pull/21308 @rdblue, what about data sources that support record deletion and partition dropping as two semantically different operations -- Kudu and HBase being two examples? All systems that support partitions have a separate API for partition-level operations. Even file-based table storage systems support the two levels of manipulation (look at the SQL DDL that Impala supports for Parquet partitions, for example -- it uses a filter, but the command means "this partition op applies to the partition that is defined by this filter", not "apply this op to all records that match this filter"). The difference is subtle, but it is an important one, and every system that supports partitions enforces that difference for a reason.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21308 @tigerquoll, what we come up with needs to work across a variety of data sources, including those, like JDBC, that can delete at a finer granularity than a partition. For Hive tables, the partition columns are exposed directly, so users would supply a predicate that matches partition columns. A Hive table source would also be free to reject delete requests -- by throwing the documented exception -- that would require rewriting data. These constraints avoid the case you're describing: the predicate must match entire partitions, and the source can reject predicates on non-partition columns, or predicates that can't be cleanly satisfied with a metadata operation.
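The check a Hive-style source might run before accepting a delete could look like the following. This is an assumed helper for illustration, not actual Spark or Hive code: a delete request is a cheap metadata operation only if every column the predicate references is a partition column.

```java
import java.util.Set;

class MetadataDeleteCheck {
    /**
     * Returns true when a delete predicate can be satisfied purely with
     * partition-level metadata operations: every column it references must
     * be a partition column. Otherwise the source may reject the request
     * rather than rewrite row-level data.
     */
    static boolean isMetadataOnly(Set<String> referencedColumns,
                                  Set<String> partitionColumns) {
        return partitionColumns.containsAll(referencedColumns);
    }
}
```

So a predicate on `ds` alone passes for a table partitioned by `(ds, region)`, while a predicate that also touches `userId` would be rejected or routed to a row-rewriting path.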
Github user tigerquoll commented on the issue: https://github.com/apache/spark/pull/21308 I am assuming this API was intended to support the "drop partition" use case. I'm arguing that adding and deleting partitions deal with a concept slightly higher-level than just a bunch of records that match a filter. Backing this up is the fact that partitions are defined independently of any records they may or may not contain -- you can add an empty partition and the underlying state of the system will change.

Also, as an end user I would be very upset if I meant to drop a partition but, because of a transcription error, accidentally started a delete with a filter that didn't exactly match a partition definition and took a million times as long to execute.

Partitions are an implementation optimisation that has leaked into higher-level APIs because they are an extremely useful and performant optimisation. I am wondering if we should represent them in this API as something slightly higher-level than just a filter definition.
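If partitions were made first-class in the way argued above, the mix-in might look something like this. Every name here is hypothetical (this interface does not exist in Spark): the point is that adding and dropping partitions are distinct metadata-level operations, that an empty partition still changes table state, and that dropping an unknown partition fails fast rather than degenerating into a slow row-level delete.

```java
import java.util.ArrayList;
import java.util.List;

interface PartitionSupport {
    void addPartition(String spec);   // e.g. "ds=2018-05-01"; may contain no rows
    void dropPartition(String spec);  // metadata-only; fails if spec names no partition
    List<String> listPartitions();
}

class InMemoryPartitionedTable implements PartitionSupport {
    private final List<String> partitions = new ArrayList<>();

    @Override
    public void addPartition(String spec) {
        // An empty partition still changes table state, independent of any rows.
        if (!partitions.contains(spec)) partitions.add(spec);
    }

    @Override
    public void dropPartition(String spec) {
        if (!partitions.remove(spec)) {
            // Fail fast instead of falling back to a filter-based record delete.
            throw new IllegalArgumentException("No such partition: " + spec);
        }
    }

    @Override
    public List<String> listPartitions() {
        return List.copyOf(partitions);
    }
}
```

This covers the four operations raised earlier in the thread (add, remove, list; alter could be modeled as drop-then-add) without forcing partition ops through a record filter.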