[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92623/
Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #92623 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92623/testReport)**
 for PR 21306 at commit 
[`023995d`](https://github.com/apache/spark/commit/023995d15b4293fac1530da6bd966b6ab6823980).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92621/
Test FAILed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #92621 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92621/testReport)**
 for PR 21306 at commit 
[`42ed4a4`](https://github.com/apache/spark/commit/42ed4a4a138e5c5f681755d871fd9d9030a4619a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/686/
Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #92623 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92623/testReport)**
 for PR 21306 at commit 
[`023995d`](https://github.com/apache/spark/commit/023995d15b4293fac1530da6bd966b6ab6823980).





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21306
  
@cloud-fan, I've updated this to address your comments. Thanks for the 
reviews!





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #92621 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92621/testReport)**
 for PR 21306 at commit 
[`42ed4a4`](https://github.com/apache/spark/commit/42ed4a4a138e5c5f681755d871fd9d9030a4619a).





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/684/
Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-07-03 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21306
  
@cloud-fan, thanks for the thorough feedback!

> What catalog operations do we want to forward to the data source catalog? 
Currently it's create/drop/alter table, I think it's good enough for now.

This PR introduces create, drop, and alter. We can always add more later. 
These are the ones that we need to implement DataSourceV2 operations and DDL 
support.

> Spark provides an API so that end-users can do it directly. e.g. 
`spark.catalog("iceberge").createTable(...)`, or SQL API `CREATE TABLE 
iceberge.db1.tbl1 . . .`

These two are the easiest and least intrusive ways to start because the data 
source catalog interaction is explicitly tied to a catalog. They also match the 
behavior used by other systems for multiple catalogs. I think this is what we 
should start with and then tackle ideas like your second point.

> When creating/dropping/altering Spark tables, also forward it to the data 
source catalog. . .

For this and a couple other questions, I don't think we need to decide 
right now. This PR is about getting the interface for other sources in Spark. 
We don't necessarily need to know all of the ways that users will call it or 
interact with it, like how `DESC TABLE` will work.

To your question here, I'm not sure whether the `CREATE TABLE ... USING 
source` syntax should use the default catalog or defer to the catalog for 
`source` or forward to both, but that doesn't need to block adding this API 
because I think we can decide it later. In addition, we should probably discuss 
this on the dev list to make sure we get the behavior right.

> How to look up the table metadata from the data source catalog?

The SPIP proposes two catalog interfaces that return `Table`. One that uses 
table identifiers and one that uses paths. Data sources can implement support 
for both or just one. This PR includes just the support for table identifiers. 
We would add a similar API for path-based tables in another PR.

> How to define table metadata? Maybe we can forward `DESC TABLE` . . .

That sounds like a reasonable idea to me. Like the behavior of `USING`, I 
don't think this is something that we have to decide right now. We can add 
support later as we implement table DDL. Maybe `Table` should return a DF that 
is its `DESCRIBE` output.

> How is the table metadata involved in data reading/writing?

This is another example of something we don't need to decide yet. We have a 
couple different options for the behavior and will want to think them through 
and discuss them on the dev list. But I don't think that the behavior 
necessarily needs to be decided before we add this API to sources.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-06-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21306
  
There are several things we need to discuss here:

- What catalog operations do we want to forward to the data source catalog? 
Currently it's create/drop/alter table, I think it's good enough for now.
- How does Spark forward these catalog operations? IMO there are 2 ways.
  - Spark provides an API so that end-users can do it directly, e.g. 
`spark.catalog("iceberge").createTable(...)`, or the SQL API `CREATE TABLE 
iceberge.db1.tbl1 ...`.
  - When creating/dropping/altering Spark tables, also forward the operation 
to the data source catalog. For example, users create a table in Spark via 
`CREATE TABLE t(...) USING iceberg`, which creates a table entry in the Hive 
metastore as well as an iceberg meta file. When dropping this table, Spark 
should notify iceberg to remove the meta file. It's arguable whether we need 
this feature: if users are willing to always add the catalog prefix, they can 
just write `CREATE TABLE iceberge.db1.tbl1 ...` and `SELECT ... FROM 
iceberge.db1.tbl1`, and totally bypass the Spark catalog.
- How to look up the table metadata from the data source catalog? I think 
database name + table name is a common way (e.g. `iceberge.db1.tbl1`), but we 
should also consider other ways like a path (e.g. `` delta.`/a/path` ``). Maybe 
we can treat the path as a table name without a database, and leave the data 
source to interpret it.
- How to define table metadata? It seems that Spark only needs to know the 
table schema for analysis. Maybe we can forward `DESC TABLE` to the data source 
so that Spark doesn't need to standardize the table metadata.
- How is the table metadata involved in data reading/writing? When reading 
data without a catalog, e.g. 
`spark.read.format("my_data_source").option("table", "my_table").load()`, the 
data source needs to get the metadata of the given table. When reading data 
with a catalog, e.g. `spark.table("my_data_source.my_table")`, the data source 
also needs to get the metadata of the given table, but needs to implement it in 
a different API (`CatalogSupport`). It's OK to say that data source 
implementations are responsible for eliminating the code duplication themselves.
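
To make the lookup question concrete, here is a minimal sketch in Java of how a multi-part name might be split into catalog, database, and table parts, with a backtick-quoted path treated as a table name without a database. `TableRef` and its methods are hypothetical and not part of Spark or this PR. The sketch also assumes the first name part is always a catalog, which is not settled in this discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Spark class.
final class TableRef {
    final String catalog;   // first name part, assumed to be the catalog
    final String database;  // null when the name has no database part
    final String table;     // table name, or a path left for the source to interpret

    TableRef(String catalog, String database, String table) {
        this.catalog = catalog;
        this.database = database;
        this.table = table;
    }

    // Accepts catalog.table (or catalog.`path`) and catalog.database.table.
    static TableRef parse(String name) {
        List<String> parts = splitName(name);
        switch (parts.size()) {
            case 2:
                return new TableRef(parts.get(0), null, parts.get(1));
            case 3:
                return new TableRef(parts.get(0), parts.get(1), parts.get(2));
            default:
                throw new IllegalArgumentException("Expected 2 or 3 name parts: " + name);
        }
    }

    // Splits on dots that are outside backticks; the backticks are dropped.
    static List<String> splitName(String name) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : name.toCharArray()) {
            if (c == '`') {
                quoted = !quoted;
            } else if (c == '.' && !quoted) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (quoted) {
            throw new IllegalArgumentException("Unbalanced backtick in: " + name);
        }
        parts.add(current.toString());
        return parts;
    }
}
```

With this sketch, `TableRef.parse("iceberge.db1.tbl1")` yields a database of `db1` and a table of `tbl1`, while ``TableRef.parse("delta.`/a/path`")`` yields no database and passes `/a/path` through unchanged for the data source to interpret.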







[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-06-26 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21306
  
@cloud-fan, what needs to change to get this in? I'd like to start making 
more PRs based on these changes.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90532/
Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #90532 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90532/testReport)**
 for PR 21306 at commit 
[`7130d13`](https://github.com/apache/spark/commit/7130d13de27c99480189cdce3b7f00749a801e9c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3159/
Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #90532 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90532/testReport)**
 for PR 21306 at commit 
[`7130d13`](https://github.com/apache/spark/commit/7130d13de27c99480189cdce3b7f00749a801e9c).





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3158/
Test FAILed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21306
  
@henryr, @cloud-fan, @marmbrus, here's a first pass at adding a catalog 
mix-in to the v2 API. Please have a look and leave comments on what you'd like 
to change.

One thing that I don't think we need right away is the `alterTable` 
operation. We could easily remove that and add it later. For CTAS and other 
operations, we do need `loadTable`, `createTable`, and `dropTable` soon.
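
As a rough illustration of the shape of these four operations, here is a minimal in-memory sketch in Java. It is not the interface from this PR: `SketchCatalog`, `InMemoryCatalog`, the string identifiers, and the column-name-only `Table` are all assumptions made for the example. The `TableChange`, `AddColumn`, and `DeleteColumn` names match the classes SparkQA reports for this patch, but their fields and the `apply` method here are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Assumed shape: a change maps the current column list to a new one.
interface TableChange {
    List<String> apply(List<String> columns);
}

final class AddColumn implements TableChange {
    private final String name;
    AddColumn(String name) { this.name = name; }
    public List<String> apply(List<String> columns) {
        List<String> result = new ArrayList<>(columns);
        result.add(name);
        return result;
    }
}

final class DeleteColumn implements TableChange {
    private final String name;
    DeleteColumn(String name) { this.name = name; }
    public List<String> apply(List<String> columns) {
        List<String> result = new ArrayList<>(columns);
        result.remove(name);
        return result;
    }
}

// Stand-in for table metadata: just an identifier and column names.
final class Table {
    final String ident;
    final List<String> columns;
    Table(String ident, List<String> columns) {
        this.ident = ident;
        this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
    }
}

// Hypothetical mix-in with the four operations named in this comment.
interface SketchCatalog {
    Table loadTable(String ident);
    Table createTable(String ident, List<String> columns);
    Table alterTable(String ident, List<TableChange> changes);
    boolean dropTable(String ident);
}

final class InMemoryCatalog implements SketchCatalog {
    private final Map<String, Table> tables = new HashMap<>();

    public Table loadTable(String ident) {
        Table table = tables.get(ident);
        if (table == null) {
            throw new IllegalArgumentException("No such table: " + ident);
        }
        return table;
    }

    public Table createTable(String ident, List<String> columns) {
        if (tables.containsKey(ident)) {
            throw new IllegalStateException("Table already exists: " + ident);
        }
        Table table = new Table(ident, columns);
        tables.put(ident, table);
        return table;
    }

    // Applies the changes in order and stores the resulting metadata.
    public Table alterTable(String ident, List<TableChange> changes) {
        List<String> columns = loadTable(ident).columns;
        for (TableChange change : changes) {
            columns = change.apply(columns);
        }
        Table updated = new Table(ident, columns);
        tables.put(ident, updated);
        return updated;
    }

    // Returns true only if a table was actually removed.
    public boolean dropTable(String ident) {
        return tables.remove(ident) != null;
    }
}
```

In this sketch `alterTable` is the only method that depends on `TableChange`, so dropping it would leave the other three operations untouched.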





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90531/
Test FAILed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #90531 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90531/testReport)**
 for PR 21306 at commit 
[`34f91c5`](https://github.com/apache/spark/commit/34f91c58ca23b81a7ee6f9270f30001e8885733e).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  final class AddColumn implements TableChange `
  * `  final class RenameColumn implements TableChange `
  * `  final class UpdateColumn implements TableChange `
  * `  final class DeleteColumn implements TableChange `





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #90531 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90531/testReport)**
 for PR 21306 at commit 
[`34f91c5`](https://github.com/apache/spark/commit/34f91c58ca23b81a7ee6f9270f30001e8885733e).

