[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21308
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21308
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5236/


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-10 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21308
  
@tigerquoll, I'm talking about the DataSourceV2 API in general. I'm not 
sure there is value in exposing partitions, but I'd be happy to hear why you 
think they are valuable and to think through how they would fit with the 
existing API.

I think that partitions that aren't hidden make tables much harder for 
users to work with, which is why Iceberg hides partitioning and automatically 
translates from row filters to partition filters. For Kudu, maybe it is 
different. Could you write up the use case with a bit more context about what 
empty partitions are used for, and send it to the dev list?

If we think that the v2 API should expose a partition concept, then that 
would definitely include a way to add or drop partitions.
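The hidden-partitioning translation mentioned above can be sketched roughly as follows. This is a minimal Python sketch with hypothetical names (`day`, `to_partition_filter`), not Iceberg's actual API: a table partitioned by a derived value such as day(ts) can turn a row-level predicate on ts into a predicate on the partition value, so users never reference partitions directly.

```python
from datetime import datetime

def day(ts):
    """Partition transform: derive the day partition value from a timestamp."""
    return ts.strftime("%Y-%m-%d")

def to_partition_filter(ts_lower, ts_upper):
    """Translate a row filter `ts_lower <= ts < ts_upper` into an inclusive
    range of day partitions; every day touching the row range must be scanned."""
    return day(ts_lower), day(ts_upper)

# A row filter over two and a bit days selects exactly the three day partitions
# it touches; all other partitions are pruned without reading data.
lo, hi = to_partition_filter(datetime(2018, 9, 4, 13, 0),
                             datetime(2018, 9, 6, 2, 0))
```

The point of the design is that the user writes the row filter and the source derives the partition filter, rather than the user manipulating partitions directly.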


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-10 Thread tigerquoll
Github user tigerquoll commented on the issue:

https://github.com/apache/spark/pull/21308
  
@rdblue when you say "you don't think the API proposed here needs to 
support a first-class partition concept", are you referring to the 
"DeleteSupport" Interface, or to DataSourceV2 in general?
If you are referring to DeleteSupport, then do you have the same objections 
to a separate "DropPartition"/"AddPartition" interface?  
If you mean that you don't think DataSourceV2 requires supporting 
partitions as a first-class concept, then how are users of Spark supposed to 
perform operations like 
1. adding,  
2. altering,
3. removing, and
4. listing
partitions on those data sources that are represented by particular 
instances of DataSourceV2?  
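The kind of first-class partition operations being asked about might look roughly like this. This is a hypothetical Python sketch; `SupportsPartitionManagement` and its methods do not exist in Spark's DataSourceV2 API.

```python
class SupportsPartitionManagement:
    """Hypothetical mix-in sketch for sources with first-class partitions."""

    def __init__(self):
        self._partitions = {}  # partition spec -> partition properties

    def add_partition(self, spec, properties=None):
        # Adding an empty partition is a real state change, even with no rows.
        self._partitions[spec] = dict(properties or {})

    def alter_partition(self, spec, properties):
        self._partitions[spec].update(properties)

    def drop_partition(self, spec):
        self._partitions.pop(spec, None)

    def list_partitions(self):
        return sorted(self._partitions)

src = SupportsPartitionManagement()
src.add_partition("date=2018-09-04")   # empty partition, no rows involved
src.add_partition("date=2018-09-05")
src.drop_partition("date=2018-09-04")
```

Such a mix-in would sit alongside read/write support, since these operations act on partition metadata rather than on rows.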


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-07 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21308
  
@tigerquoll, I'm not debating whether we should or shouldn't expose 
partitions here. In general, I'm undecided. I don't think that the API proposed 
here needs to support a first-class partition concept for tables, largely 
because partitions aren't currently exposed in the v2 API.

The issue you linked to, SPARK-22389, exposes Spark's view of partitioning 
-- as in `repartition(col)` -- which describes how data rows are grouped 
together. That's not the same thing as partitions in a data source, which can 
exist independent of data rows.
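The distinction can be illustrated with plain Python stand-ins (not Spark APIs): Spark's `repartition(col)` notion is a grouping of the rows that exist, while a data source partition is declared metadata that can exist with no rows at all.

```python
rows = [{"date": "2018-09-04", "id": 1},
        {"date": "2018-09-05", "id": 2},
        {"date": "2018-09-04", "id": 3}]

# Spark-style "partitioning": group the existing rows by a column value.
grouped = {}
for row in rows:
    grouped.setdefault(row["date"], []).append(row)

# Data-source-style partitions: declared independently of any rows.
declared_partitions = {"date=2018-09-04", "date=2018-09-05", "date=2018-09-06"}

# "date=2018-09-06" is a valid, empty partition; it has no counterpart in
# the row grouping above, because no row carries that date.
```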


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-06 Thread tigerquoll
Github user tigerquoll commented on the issue:

https://github.com/apache/spark/pull/21308
  
@rdblue I think our debate is whether we should expose an API to represent 
direct operations on partitions in the new data source API. 



---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-06 Thread tigerquoll
Github user tigerquoll commented on the issue:

https://github.com/apache/spark/pull/21308
  
@rdblue Actually:  https://issues.apache.org/jira/browse/SPARK-22389.




---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-06 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21308
  
@tigerquoll, there is currently no support for exposing partitions through the 
v2 API; that would be a different operation. If you wanted to implement 
partition operations through this API, then you would need to follow the 
guarantees specified here: if you need to delete by partition, then the 
expression must match records at partition boundaries, and the source must 
reject the delete operation otherwise.
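That guarantee can be sketched as follows. This is a hedged Python sketch with hypothetical names (`PartitionedSource`, `delete_where`), not the actual DeleteSupport interface: a delete predicate is honored as a metadata operation only when it aligns with partition boundaries, and is rejected otherwise.

```python
class PartitionedSource:
    """Sketch of a source that only accepts partition-aligned deletes."""

    def __init__(self, partition_columns, partitions):
        self.partition_columns = set(partition_columns)
        self.partitions = partitions  # e.g. {"date": {"2018-09-04", ...}}

    def delete_where(self, column, value):
        """Delete by equality predicate. Metadata-only when the predicate
        matches whole partitions; rejected when it would rewrite data."""
        if column not in self.partition_columns:
            raise ValueError(
                "Cannot delete by '%s': predicate does not match partition "
                "boundaries, so the delete would require rewriting data" % column)
        self.partitions[column].discard(value)

src = PartitionedSource(["date"], {"date": {"2018-09-04", "2018-09-05"}})
src.delete_where("date", "2018-09-04")   # whole-partition delete: allowed
```

A predicate on a non-partition column, such as `delete_where("id", 5)`, would raise rather than silently fall back to a row-by-row delete.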


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-06 Thread tigerquoll
Github user tigerquoll commented on the issue:

https://github.com/apache/spark/pull/21308
  
@rdblue what about those data sources that support record deletion and 
partition dropping as two semantically different operations -- Kudu and HBase 
being two examples. 

All systems that support partitions have a different API for dealing with 
partition-level ops. Even file-based table storage systems support the 
different levels of manipulation. (Look at the SQL DDL that Impala supports for 
Parquet partitions for an example -- they use a filter, but the command is 
“this partition op applies to the partition that is defined by this filter”, 
not “apply this op to all records that match this filter”.)

The difference is subtle, but it is an important one, and every system that 
supports partitions enforces that difference for a reason.


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-04 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21308
  
@tigerquoll, what we come up with needs to work across a variety of data 
sources, including those, like JDBC, that can delete at a lower granularity 
than a partition.

For Hive tables, the partition columns are exposed directly, so users would 
supply a predicate that matches partition columns. A Hive table source would 
also be free to reject delete requests that would require rewriting data, by 
throwing the documented exception. This avoids the case you're talking about: 
the predicate must match entire partitions, and the source can reject 
predicates on non-partition columns, or predicates that can't be cleanly 
deleted with a metadata operation.


---




[GitHub] spark issue #21308: [SPARK-24253][SQL] Add DeleteSupport mix-in for DataSour...

2018-09-04 Thread tigerquoll
Github user tigerquoll commented on the issue:

https://github.com/apache/spark/pull/21308
  
I am assuming this API was intended to support the "drop partition" 
use case. I'm arguing that adding and deleting partitions deal with a concept 
that is slightly higher level than just a bunch of records that match a 
filter. Backing this up is the fact that partitions are defined independently 
of any records they may or may not contain -- you can add an empty partition 
and the underlying state of the system will change.

Also, as an end user I would be very upset if I meant to drop a partition, 
but because of a transcription error accidentally started a delete process, 
with a filter that didn't exactly match a partition definition, that takes a 
million times as long to execute.  

Partitions are an implementation optimisation that has leaked into higher-level 
APIs because they are an extremely useful and performant optimisation. I am 
wondering if we should represent them in this API as something slightly higher 
level than just a filter definition.


---
