Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE LIKE), so it's possible that some commands still go through the v1 session catalog although you configured a custom v2 session catalog. Can you create JIRA tickets if you hit any DDL commands that don't support v2 catalog? We

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
If you just want to save typing the catalog name when writing table names, you can set your custom catalog as the default catalog (See SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is used to extend the v1 session catalog, not replace it. On Wed, Oct 7, 2020 at 5:36 PM
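To make the distinction concrete, here is a minimal sketch of the two settings Wenchen refers to. The catalog name my_cat and the class names com.example.MyCatalog / com.example.MyExtendedSessionCatalog are placeholders, not something taken from this thread.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: register a custom v2 catalog under its own name and make it
    // the default catalog (SQLConf.DEFAULT_CATALOG), so unqualified table names
    // resolve against it without typing the catalog name.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.catalog.my_cat", "com.example.MyCatalog")
      .config("spark.sql.defaultCatalog", "my_cat")
      // By contrast, SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION plugs a class in as
      // "spark_catalog" to extend the built-in v1 session catalog, not replace it:
      // .config("spark.sql.catalog.spark_catalog", "com.example.MyExtendedSessionCatalog")
      .getOrCreate()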

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Wenchen Fan
Ah, this is by design. V1 tables should still go through the v1 session catalog. I think we can remove this restriction when we are confident about the new v2 DDL commands that work with v2 catalog APIs. On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim wrote: > My case is DROP TABLE and DROP TABLE

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
> As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we are discussing are: 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
My case is DROP TABLE, and DROP TABLE supports both v1 and v2 (as it simply works when I use a custom catalog without replacing the default catalog). It just fails on v2 when the "default catalog" is replaced (say I replace 'spark_catalog'), because TempViewOrV1Table is providing a value even with v2
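Roughly what that difference looks like (a sketch with placeholder catalog/table names, assuming the SparkSession configured in the earlier sketch):

    // Custom catalog registered under its own name: DROP TABLE on a qualified
    // identifier goes through the v2 catalog and works as expected.
    spark.sql("DROP TABLE my_cat.db.t")

    // Same implementation plugged in as the default session catalog
    // ("spark_catalog"): an unqualified identifier is still matched by
    // TempViewOrV1Table and routed to the v1 path, which is the failure
    // described here.
    spark.sql("DROP TABLE db.t")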

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
If it's by design and not yet fully prepared, then IMHO replacing the default session catalog is better restricted until things are sorted out, as it causes a lot of confusion and has known bugs. Actually, there's another bug/limitation in the default session catalog regarding the length of identifiers, so

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Ryan Blue
I disagree that this is “by design”. An operation like DROP TABLE should use a v2 drop plan if the table is v2. If a v2 table is loaded or created using a v2 catalog it should also be dropped that way. Otherwise, the v2 catalog is not notified when the table is dropped and can’t perform other

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Holden Karau
On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan wrote: > I don't think Hive compatibility itself is a "use case". > Ok let's add on top of this: I have some hive queries that I want to run on Spark. I believe that makes it a use case. > The Nessie example you

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
Wenchen, why are you ignoring Hive as a “reasonable use case”? The keyword came from Hive and we all agree that a Hive catalog with Hive behavior can’t be implemented if Spark chooses to couple this with LOCATION. Why is this use case not a justification? Also, the option to keep behavior the

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
I don't think Hive compatibility itself is a "use case". The Nessie example you mentioned is a reasonable use case to me: some frameworks/applications want to create external tables without user-specified location, so that they can manage the table directory

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
how about LOCATION without EXTERNAL? Currently Spark treats it as an external table. I think there is some confusion about what Spark has to handle. Regardless of what Spark allows as DDL, these tables can exist in a Hive MetaStore that Spark connects to, and the general expectation is that Spark
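For reference, the DDL variants under discussion look roughly like this (a sketch; table names and paths are placeholders, and the EXTERNAL/STORED AS forms assume Hive support is enabled):

    // 1. EXTERNAL together with LOCATION: an external (unmanaged) table whose
    //    data Spark does not delete on DROP TABLE.
    spark.sql("CREATE EXTERNAL TABLE t1 (id INT) STORED AS PARQUET LOCATION '/tmp/t1'")

    // 2. LOCATION without EXTERNAL: currently also treated as an external table.
    spark.sql("CREATE TABLE t2 (id INT) USING parquet LOCATION '/tmp/t2'")

    // 3. EXTERNAL without LOCATION: accepted by Hive (data goes under the default
    //    warehouse directory) but historically rejected by Spark - the case being
    //    debated in this thread.
    spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS PARQUET")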

Re: [FYI] Removing `spark-3.1.0-bin-hadoop2.7-hive1.2.tgz` from Apache Spark 3.1 distribution

2020-10-07 Thread Koert Kuipers
i am a little confused about this. i assumed spark would no longer make a distribution with hive 1.x, but that the hive-1.2 profile would remain. yet i see the hive-1.2 profile has been removed from pom.xml? On Wed, Sep 23, 2020 at 6:58 PM Dongjoon Hyun wrote: > Hi, All. > > Since Apache Spark 3.0.0,

[Spark Docs]: Question about building docs including docs of internal packages

2020-10-07 Thread Omer Ozarslan
Hello, I'm trying to build docs with univoc such that it includes internal components of spark on master branch. I looked at latest docs [1][2], but I couldn't figure out instructions for doing so. Specifically, I'm trying to build docs for `org.apache.spark.sql.execution` package to access data

Re: [Spark Docs]: Question about building docs including docs of internal packages

2020-10-07 Thread Omer Ozarslan
ps. I meant unidoc, not univoc On Wed, Oct 7, 2020 at 2:15 PM Omer Ozarslan wrote: > > Hello, > > I'm trying to build docs with univoc such that it includes internal > components of spark on master branch. I looked at latest docs [1][2], > but I couldn't figure out instructions for doing so.

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
I don’t think Spark ever claims to be 100% Hive compatible. By accepting the EXTERNAL keyword in some circumstances, Spark is providing compatibility with Hive DDL. Yes, there are places where it breaks. The question is whether we should deliberately break what a Hive catalog could implement,

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
> I have some hive queries that I want to run on Spark. Spark is not compatible with Hive in many places. Decoupling EXTERNAL and LOCATION can't help you too much here. If you do have this use case, we need a much wider discussion about how to achieve it. For this particular topic, we need

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
> If you just want to save typing the catalog name when writing table names, you can set your custom catalog as the default catalog (See SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is used to extend the v1 session catalog, not replace it. I'm sorry, but I don't get this.

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Dongjoon Hyun
Thank you so much for your feedback, Koert. Yes, SPARK-20202 was created in April 2017 and has been targeted for 3.1.0 since Nov 2019. However, I believe Apache Spark 3.1.0 (Hadoop 3.2/Hive 2.3 distribution) will work with old Hadoop 2.x clusters if you isolate the classpath via SPARK-31960.
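A hedged sketch of what that isolation could look like when submitting to YARN; the application jar and main class below are placeholders, and spark.yarn.populateHadoopClasspath is the flag associated with SPARK-31960:

    import org.apache.spark.launcher.SparkLauncher

    // Sketch only: launch a Spark 3.1 (Hadoop 3.2/Hive 2.3) build on an older
    // Hadoop 2.x YARN cluster while keeping the cluster's Hadoop jars off the
    // application classpath, relying on the jars bundled with the distribution.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // placeholder
      .setMainClass("com.example.MyApp")       // placeholder
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.yarn.populateHadoopClasspath", "false")
      .startApplication()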

Re: [Spark Docs]: Question about building docs including docs of internal packages

2020-10-07 Thread Omer Ozarslan
Hello, Thanks. It worked well for me. I ended up with the small patch below. https://gist.github.com/ozars/2b01c9647bc34f16ab3c18eef3579147 Best, Omer On Wed, Oct 7, 2020 at 6:09 PM gemelen wrote: > > Hello, > > by default, some packages (that are treated as internal) are excluded from >

Re: [Spark Docs]: Question about building docs including docs of internal packages

2020-10-07 Thread gemelen
Hello, by default, some packages (that are treated as internal) are excluded from the documentation generation task. To generate Javadoc/Scaladoc for the classes in them, you would need to comment out the relevant line in the build definition file. For example, the package `org/apache/spark/sql/execution` is mentioned
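As a rough illustration (not the exact build code), the unidoc exclusion in project/SparkBuild.scala looks something like the filter sketched below; commenting out the line that filters org/apache/spark/sql/execution and rerunning something like ./build/sbt unidoc should then include those classes in the generated docs:

    import java.io.File

    // Rough sketch of the kind of unidoc source filter found in
    // project/SparkBuild.scala; the real code may differ between branches.
    object DocFilterSketch {
      def ignoreUndocumentedPackages(packages: Seq[Seq[File]]): Seq[Seq[File]] = {
        packages
          .map(_.filterNot(_.getName.contains("$")))
          // Commenting out this line includes org.apache.spark.sql.execution
          // classes in the generated Javadoc/Scaladoc.
          .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/execution")))
      }
    }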

Re: [FYI] Removing `spark-3.1.0-bin-hadoop2.7-hive1.2.tgz` from Apache Spark 3.1 distribution

2020-10-07 Thread Dongjoon Hyun
There is a new thread here. https://lists.apache.org/thread.html/ra2418b75ac276861a598e7ec943750e2b038c2f8ba49f41db57e5ae9%40%3Cdev.spark.apache.org%3E Could you share your use case of Hive 1.2 here, Koert? Bests, Dongjoon. On Wed, Oct 7, 2020 at 1:04 PM Koert Kuipers wrote: > i am a little

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Koert Kuipers
it seems to me with SPARK-20202 we are no longer planning to support hadoop2 + hive 1.2. is that correct? so basically spark 3.1 will no longer run on say CDH 5.x or HDP2.x with hive? my use case is building spark 3.1 and launching on these existing clusters that are not managed by me. e.g. i do

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
I don’t think Spark ever claims to be 100% Hive compatible. I just found some relevant documentation on this, where Databricks claims that “Apache Spark SQL in Databricks is designed to be