Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
Thanks Nicholas for the side comment; you'll need to interpret "CREATE TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes, it may add to the confusion. Ryan, thanks for the detailed analysis and proposal. That's what I would like to see in a discussion thread. I'm open to solutions

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Nicholas Chammas
Side comment: The current docs for CREATE TABLE add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Ryan Blue
Jungtaek, it sounds like you consider the two rules to be separate syntaxes with their own consistency rules. For example, if I am using the Hive syntax rule, then the PARTITIONED BY clause adds new (partition) columns and requires types for those columns; if I’m using the Spark syntax rule with
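To make the contrast in the snippet above concrete, here is a hedged sketch of the two rules being discussed (table and column names are hypothetical, not taken from the thread):

```sql
-- Hive syntax rule: PARTITIONED BY introduces new partition columns,
-- and each one requires its own type.
CREATE TABLE events (id INT, payload STRING)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

-- Spark native syntax rule: PARTITIONED BY references columns that are
-- already declared in the column list, so no types are repeated.
CREATE TABLE events (id INT, payload STRING, day STRING)
USING parquet
PARTITIONED BY (day);
```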

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
I'm trying to understand why you have been suggesting to keep the actual behavior unchanged but change the docs instead. Could you please elaborate? End users would blame us when they hit a case where their query doesn't work as intended (1), find that it's undocumented (2), and it's hard to

Re: Spark 2.4.x and 3.x datasourcev2 api documentation & references

2020-03-18 Thread MadDoxX
Thanks for the link! I was able to code my first data sources in microbatch and continuous mode using the v2 API... I was wondering if there is some equivalent to SparkListener for microbatch & continuous streams (where one could use spark.extraListeners) for monitoring :)

Re: Scala vs PySpark Inconsistency: SQLContext/SparkSession access from DataFrame/DataSet

2020-03-18 Thread Maciej Szymkiewicz
Hi Ben, please note that `_sc` is not a SQLContext. It is a SparkContext, which is used primarily for internal calls. SQLContext is exposed through `sql_ctx` (https://github.com/apache/spark/blob/8bfaa62f2fcc942dd99a63b20366167277bce2a1/python/pyspark/sql/dataframe.py#L80).

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
The fact that we have two CREATE TABLE syntaxes is already confusing many users. Shall we only document the native syntax? Then users wouldn't need to worry about which rule their query fits, and they wouldn't need to spend a lot of time understanding the subtle differences between these two syntaxes.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
A small correction: the example I provided for the vice-versa case is not really a correct case for vice versa. It's actually the same case (intended to use rule 2, which is not the default) but with a different result.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
My concern is that although we think of the change as simply marking "USING provider" optional in rule 1, in reality the change most likely swaps the default rule for CREATE TABLE, which was "rule 2" and would now be "rule 1" (it would be the happy case for the migration doc if the swap happens

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both the native and the Hive syntax. I'm fine with some changes to make it less confusing, as long as the user-facing behavior doesn't change. For example, define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config is false.
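A rough sketch of the direction suggested above, assuming the legacy flag is the `spark.sql.legacy.createHiveTableByDefault.enabled` config discussed around this change (the exact config name is an assumption and should be verified against the release):

```sql
-- Hypothetical: with the legacy config set to false, a bare CREATE TABLE
-- would resolve to the native rule, and Hive-specific behavior would
-- require an explicit clause such as STORED AS.
SET spark.sql.legacy.createHiveTableByDefault.enabled=false;

CREATE TABLE t1 (id INT);                  -- native rule by default
CREATE TABLE t2 (id INT) STORED AS ORC;    -- Hive rule, made explicit
```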

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
Thanks for sharing your view. I agree with you that it's good for Spark to promote its own CREATE TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax unchanged - it's described as a "convenience", but I'm not sure I can agree with that. I'll quote my comments from SPARK-31136 here again to

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Wenchen Fan
I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously, these two rules were mutually exclusive because the native syntax requires the USING clause while the Hive syntax makes the ROW FORMAT or STORED AS clause optional. It's a good move to make
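The mutual exclusivity described above, and why dropping it creates ambiguity, can be sketched as:

```sql
-- Rule 1 (native): USING is required, so this can only match rule 1.
CREATE TABLE a (id INT) USING parquet;

-- Rule 2 (Hive): ROW FORMAT / STORED AS are optional, so this bare
-- statement matches rule 2. Once USING becomes optional in rule 1,
-- it matches both rules, which is the ambiguity this thread is about.
CREATE TABLE b (id INT);
```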

Re: Spark 2.4.x and 3.x datasourcev2 api documentation & references

2020-03-18 Thread Wenchen Fan
For now you can take a look at `DataSourceV2Suite`, which contains both Java and Scala implementations. There is also an ongoing PR to implement catalog APIs for JDBC: https://github.com/apache/spark/pull/27345. We are still working on the user guide.