Re: [discuss] Data Source V2 write path

2017-10-02 Thread Ryan Blue
As far as changes to the public API go, I’d prefer deprecating the API that mixes data and metadata operations. But I don’t think that requires that we go with your proposal #1, where the current write API can’t use data source v2 writers. I think we can separate the metadata operations for Hadoop …

Re: [discuss] Data Source V2 write path

2017-10-01 Thread Wenchen Fan
The main entry points to data sources inside Spark are the SQL API and `DataFrameReader/Writer`. For the SQL API, I think the semantics are well defined: the data and metadata operations are separated. E.g., INSERT INTO means write data into an existing table, and CREATE TABLE means only create the metadata. But …
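
A quick illustration of that separation through the SparkSession SQL entry point (table name and values are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-semantics").getOrCreate()

    // Metadata-only operation: defines the table, writes no data.
    spark.sql("CREATE TABLE logs (id BIGINT, msg STRING) USING parquet")

    // Data-only operation: writes into the already-defined table.
    spark.sql("INSERT INTO logs VALUES (1, 'hello')")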

Re: [discuss] Data Source V2 write path

2017-09-29 Thread Ryan Blue
> Spark doesn't know how to create a table in external systems like Cassandra, and that's why it's currently done inside the data source writer. This isn't a valid argument for doing this task in the writer for v2. If we want to fix the problems with v1, we shouldn't continue to mix write …

Re: [discuss] Data Source V2 write path

2017-09-28 Thread Wenchen Fan
> When this CTAS logical node is turned into a physical plan, the relation gets turned into a `DataSourceV2` instance and then Spark gets a writer and configures it with the proposed API. The main point of this is to pass the logical relation (with all of the user's options) through to the data …
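
A rough sketch of the flow being described, with simplified stand-ins for Spark's internal plan nodes (all names here are illustrative, not the real classes):

    import org.apache.spark.sql.types.StructType

    // Simplified stand-in for the CTAS logical node: it carries the source
    // and every option the user specified.
    case class CreateTableAsSelectNode(
        sourceClass: String,             // the DataSourceV2 implementation
        options: Map[String, String],    // all of the user's options
        querySchema: StructType)         // schema of the SELECT child

    // At physical planning time, roughly:
    //   val source = instantiate(node.sourceClass)                      // a DataSourceV2
    //   val writer = source.createWriter(node.querySchema, node.options)
    //   run the child query and push its rows into the writer's tasks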

Re: [discuss] Data Source V2 write path

2017-09-27 Thread Russell Spitzer
On an unrelated note, is there any appetite for making the write path also include an option to return elements that could not be processed for some reason? Usage might be like saveAndIgnoreFailures(): Dataset, so that if some records cannot be parsed by the data source for writing, or …
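
saveAndIgnoreFailures() does not exist in Spark; as a shape for the idea, it might look something like:

    import org.apache.spark.sql.{Dataset, Row}

    // Hypothetical sketch of the proposal above, not a real Spark API:
    // attempt to write every row and hand back the rejects instead of
    // failing the whole job.
    trait FailureTolerantWriteSupport {
      def saveAndIgnoreFailures(input: Dataset[Row]): Dataset[Row]
    }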

Re: [discuss] Data Source V2 write path

2017-09-27 Thread Ryan Blue
Comments inline. I've written up what I'm proposing with a bit more detail. On Tue, Sep 26, 2017 at 11:17 AM, Wenchen Fan wrote: > I'm trying to give a summary: Ideally the data source API should only deal with data, not metadata. But one key problem is, Spark still needs …

Re: [discuss] Data Source V2 write path

2017-09-26 Thread Wenchen Fan
I'm trying to give a summary: Ideally the data source API should only deal with data, not metadata. But one key problem is, Spark still needs to support data sources without a metastore, e.g. file format data sources. For this kind of data source, users have to pass metadata information like …
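
For those metastore-less file sources, this is what passing the metadata in-band looks like today (paths are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("file-source").getOrCreate()

    // The user, not a catalog, supplies the schema and partition columns.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("date", StringType)))

    val events = spark.read.schema(schema).json("/tmp/events")
    events.write.partitionBy("date").parquet("/tmp/events_by_date")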

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
> I think it is a bad idea to let this problem leak into the new storage API. Well, I think using data source options is a good compromise for this. We can't avoid this problem until catalog federation is done, and that may not happen within Spark 2.3, but we definitely need the data source write API …

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Ryan Blue
I think it is a bad idea to let this problem leak into the new storage API. By not setting the expectation that metadata for a table will exist, this will needlessly complicate writers just to support the existing problematic design. Why can't we use an in-memory catalog to store the configuration …
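
A minimal sketch of that in-memory catalog idea (names are made up): register the configuration Spark was given, so writers can always assume table metadata exists.

    import scala.collection.concurrent.TrieMap

    // Hypothetical in-memory catalog: holds the table configuration handed
    // to Spark so the v2 writer never has to cope with missing metadata.
    object InMemoryTableCatalog {
      private val tables = TrieMap.empty[String, Map[String, String]]

      def register(table: String, config: Map[String, String]): Unit =
        tables.update(table, config)

      def lookup(table: String): Option[Map[String, String]] =
        tables.get(table)
    }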

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
Catalog federation is to publish the Spark catalog API (kind of a data source API for metadata), so that Spark is able to read/write metadata from external systems (SPARK-15777). Currently Spark can only read/write the Hive metastore, which means for other systems like Cassandra, we can only …
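
Very roughly, a federated catalog plugin in the SPARK-15777 direction might expose something like this (purely illustrative, not a committed design):

    import org.apache.spark.sql.types.StructType

    // Illustrative sketch only: an external-system catalog that Spark
    // could call for metadata operations, leaving writers to handle data.
    trait ExternalCatalogPlugin {
      def tableExists(db: String, table: String): Boolean
      def createTable(
          db: String,
          table: String,
          schema: StructType,
          properties: Map[String, String]): Unit
      def dropTable(db: String, table: String): Unit
    }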

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Ryan Blue
However, without catalog federation, Spark doesn’t have an API to ask an external system (like Cassandra) to create a table. Currently it’s all done by the data source write API. Data source implementations are responsible for creating or inserting into a table according to the save mode. What’s catalog …

Re: [discuss] Data Source V2 write path

2017-09-25 Thread Wenchen Fan
We still need to support low-level data sources like pure Parquet files, which do not have a metastore. BTW I think we should leave metadata management to the catalog API after catalog federation. The data source API should only care about data. On Mon, Sep 25, 2017 at 11:14 AM, Reynold Xin …

Re: [discuss] Data Source V2 write path

2017-09-24 Thread Reynold Xin
Can there be an explicit create function? On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan wrote: > I agree it would be a clean approach if the data source is only responsible for writing into an already-configured table. However, without catalog federation, Spark doesn't have an …
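
One shape the "explicit create function" could take, keeping the metadata step separate from the write (names are hypothetical):

    import org.apache.spark.sql.types.StructType

    // Hypothetical mix-in for sources that can create their own tables;
    // Spark would call this once, before asking for any writer.
    trait TableCreationSupport {
      def createTable(
          name: String,
          schema: StructType,
          options: Map[String, String]): Unit
    }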

Re: [discuss] Data Source V2 write path

2017-09-24 Thread Wenchen Fan
I agree it would be a clean approach if the data source is only responsible for writing into an already-configured table. However, without catalog federation, Spark doesn't have an API to ask an external system (like Cassandra) to create a table. Currently it's all done by the data source write API. Data …

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Ryan Blue
> input data requirement. Clustering and sorting within partitions are a good start. We can always add more later when they are needed. The primary use case I'm thinking of for this is partitioning and bucketing. If I'm implementing a partitioned table format, I need to tell Spark to cluster by …
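
A sketch of how a writer might declare those requirements (interface and method names are invented for illustration):

    // Invented names, for illustration: the writer tells Spark how its
    // input rows must be clustered and sorted before they reach it.
    trait SupportsInputRequirements {
      // Rows with equal values in these columns land in the same task,
      // e.g. so each task writes exactly one table partition/bucket.
      def requiredClustering: Seq[String]
      // Sort order Spark should apply within each task.
      def requiredOrdering: Seq[String]
    }

    // A partitioned, bucketed format might require:
    class PartitionedTableWriter extends SupportsInputRequirements {
      override def requiredClustering: Seq[String] = Seq("date", "bucket")
      override def requiredOrdering: Seq[String] = Seq("date", "bucket")
    }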

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Reynold Xin
Ah yes I agree. I was just saying it should be options (rather than specific constructs). Having them at creation time makes a lot of sense. Although one tricky thing is what happens if they need to change, but we can probably just special-case that. On Thu, Sep 21, 2017 at 6:28 PM Ryan Blue …

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Ryan Blue
> I’d just pass them [partitioning/bucketing] as options, until there are clear (and strong) use cases to do them otherwise. I don’t think it makes sense to pass partitioning and bucketing information *into* this API. The writer should already know the table structure and should pass relevant …
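
For contrast, the quoted "just pass them as options" route would look like today's DataFrameWriter usage, with the table structure flowing through free-form options (the format name and option keys below are made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("writer-options").getOrCreate()
    val df = spark.range(100).selectExpr("id", "current_date() AS date")

    // Made-up source and option keys, showing the options-based route:
    df.write
      .format("com.example.v2source")   // hypothetical data source
      .option("partitionedBy", "date")
      .option("bucketedBy", "id")
      .save()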

Re: [discuss] Data Source V2 write path

2017-09-20 Thread Reynold Xin
On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan wrote: > Hi all, I want to have some discussion about the Data Source V2 write path before starting a vote. The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful: 1. …

[discuss] Data Source V2 write path

2017-09-20 Thread Wenchen Fan
Hi all, I want to have some discussion about the Data Source V2 write path before starting a vote. The Data Source V1 write path asks implementations to write a DataFrame directly, which is painful: 1. Exposing an upper-level API like DataFrame to the Data Source API is not good for maintenance. 2. Data …
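
As a rough sketch of the direction the thread points toward, a decomposed write path could hand the source only small per-task writers while Spark drives the rows (all names are illustrative, not the final API):

    import org.apache.spark.sql.Row

    // Illustrative sketch only. Driver-side writer: creates task factories
    // and makes the final job-level commit/abort decision.
    trait DataSourceV2Writer {
      def createWriterFactory(): DataWriterFactory
      def commit(messages: Seq[WriterCommitMessage]): Unit
      def abort(messages: Seq[WriterCommitMessage]): Unit
    }

    // Shipped to executors; creates one writer per task attempt.
    trait DataWriterFactory extends Serializable {
      def createWriter(partitionId: Int, attemptNumber: Int): DataWriter
    }

    // Executor-side: Spark pushes rows in, then commits or aborts the task.
    trait DataWriter {
      def write(row: Row): Unit
      def commit(): WriterCommitMessage
      def abort(): Unit
    }

    trait WriterCommitMessage extends Serializable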