Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Ryan Blue
…String instead, then the UDF wouldn’t work. What then? Does Spark detect that the wrong type was used? It would need to, or else it would be difficult for a UDF developer to tell what is wrong. And this is a runtime issue, so it is caught late. -- Ryan Blue

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Ryan Blue
…the community for a long period of time. I especially appreciate how the design is focused on a minimal useful component, with future optimizations considered from a point of view of making sure it's flexible, but actual…

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ryan Blue
…I find both of the proposed UDF APIs to be sufficiently user-friendly and extensible. I generally think Wenchen's proposal is easier for a user to work with in the common case, but has greater…

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Ryan Blue
…This proposal looks very interesting. Would future goals for this functionality include both support for aggregation functions, as well as support for processing ColumnarBatches (instead of Row/InternalRow)? Thanks, Andrew. On Mon, Feb 15, 2021 at 12:44 PM Ryan Bl…

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Ryan Blue
…conclude this thread and have at least one implementation in the `master` branch this month (February). If you need more time (one month or longer), why don't we have Ryan's suggestion in the `master` branch first and benchmark with your PR later…

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Ryan Blue
…individual-parameters version or the row-parameter version? To move forward, how about we implement the function loading and binding first? Then we can have PRs for both the individual-parameters (I can take it) and row-parameter approaches, if we still can't reach a consensus at…

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-19 Thread Ryan Blue
…merge two Arrays (of generic types) to a Map. Also, to address Wenchen's InternalRow question, can we create a number of Function classes, each corresponding to a given number of input parameters (e.g., ScalarFunction1, ScalarFunction2, etc.)?…
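The fixed-arity idea raised in this message can be sketched outside of Spark. The interfaces below are hypothetical stand-ins for the shape being proposed (one interface per parameter count), not Spark's actual API:

```java
// Hypothetical sketch of the fixed-arity proposal (not Spark's real API):
// one interface per parameter count gives compile-time type safety, at
// the cost of a family of interfaces and no varargs support.
interface ScalarFunction1<A, R> {
    R call(A a);
}

interface ScalarFunction2<A, B, R> {
    R call(A a, B b);
}

public class FixedArityExample {
    public static void main(String[] args) {
        // e.g. a two-argument function combining a key and a value
        ScalarFunction2<String, Integer, String> pair = (k, v) -> k + "=" + v;
        System.out.println(pair.call("a", 1)); // prints a=1
    }
}
```

The trade-off echoed later in the thread: this shape cannot express varargs functions, which is one motivation for the row-parameter alternative.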

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-01 Thread Ryan Blue
…safety guarantees only if you need just one set of types for each number of arguments and are using the non-codegen path. Since varargs is one of the primary reasons to use this API, I don’t think that it is a good idea to use Object[] instead of InternalRow. -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
…only be Object and cause boxing issues. I agree that Object[] is worse than InternalRow. But I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types. On Tue, Mar 2, 2021 at 3:36 AM Ryan…

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
…throw new UnsupportedOperationException(); } By providing the default implementation, it will not technically force users to implement it. And we can provide a document about our expected usage properly. What do you think? Bests, Dongjoon.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Ryan Blue
…the "magical methods", then we can have a single ScalarFunction interface which has the row-parameter API (with a default implementation to fail) and documentation to describe the "magical methods" (which can be done later)…
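The compromise design discussed across these messages can be illustrated with a self-contained sketch. All names here (Row, ScalarFunction, invoke, strlen) are illustrative stand-ins, not Spark's exact API: a single interface whose row-parameter method has a default implementation that fails, so an implementation may instead supply a typed "magic" method that avoids boxing:

```java
// Hypothetical sketch of the single-interface design (not Spark's exact
// API). Row stands in for InternalRow.
interface Row {
    Object get(int ordinal);
}

interface ScalarFunction<R> {
    String name();

    // Fallback row-parameter API; the default implementation fails, so a
    // function that only provides a typed "magic" method is still legal.
    default R produceResult(Row input) {
        throw new UnsupportedOperationException(
            name() + " does not implement produceResult");
    }
}

public class HybridFunctionExample {
    // A function that relies on the "magic" typed method only.
    static class StrLen implements ScalarFunction<Integer> {
        public String name() { return "strlen"; }
        // Magic method: concrete parameter type, no boxing through Row.
        public int invoke(String s) { return s.length(); }
    }

    public static void main(String[] args) {
        StrLen f = new StrLen();
        System.out.println(f.invoke("spark")); // prints 5
        try {
            f.produceResult(ordinal -> "spark"); // falls back and fails
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real system, the engine would look up the magic method by reflection or codegen and fall back to the row-parameter method when none is found; that lookup machinery is omitted here.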

[VOTE] SPIP: Add FunctionCatalog

2021-03-08 Thread Ryan Blue
…Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll do a final update of the PR and we can merge the API. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don’t think this is a good idea because … -- Ryan Blue

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
…On Tue, Mar 9, 2021 at 9:27 AM huaxin gao <huaxin.gao11@…> wrote: +1 (non-binding) -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org -- Takeshi Yamamuro -- Ryan Blue Software Engineer Netflix

[RESULT] [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
This SPIP is adopted with the following +1 votes and no -1 or +0 votes: Holden Karau* John Zhuge Chao Sun Dongjoon Hyun* Russell Spitzer DB Tsai* Wenchen Fan* Kent Yao Huaxin Gao Liang-Chi Hsieh Jungtaek Lim Hyukjin Kwon* Gengliang Wang kordex Takeshi Yamamuro Ryan Blue * = binding On Mon, Mar

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread Ryan Blue
…Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll update the PR for review. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don’t think this is a good idea because … -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-24 Thread Ryan Blue
…distribution properties reported by data sources and eliminate shuffle whenever possible. Design doc: https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE (includes a POC link at the end). We'd like to start a discussion on the doc, and any feedback is welcome! Thanks, Chao -- Ryan Blue

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Ryan Blue
;>>> >>>>> +1 for this SPIP. >>>>> >>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao >>>>> wrote: >>>>> >>>>>> +1. Thanks for lifting the current restrictions on bucket join and >>>>>

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
…(SPARK-19256 <https://issues.apache.org/jira/browse/SPARK-19256> has details). 1. Would aggregate work automatically after the SPIP? Another major benefit of having bucketed tables is to avoid shuffle…

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
…hash function. Or we can clearly define the bucket hash function of the builtin `BucketTransform` in the doc. On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue wrote: Two v2 sources may return different bucket IDs for the same value, and this breaks the phase 1 s…
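The concern here can be made concrete with a toy sketch (illustrative only, not Spark code, and `floorMod` of `hashCode` is a stand-in for whatever hash a spec would actually define): a storage-partitioned join can only align partitions from two sources if both compute bucket IDs with the same documented function.

```java
// Toy illustration of why the bucket hash function must be shared and
// documented: two sources produce matching bucket IDs only because they
// call the exact same function with the same modulo convention.
public class BucketTransform {
    public static int bucketId(Object value, int numBuckets) {
        // floorMod keeps the result non-negative; if one source used `%`
        // and another used floorMod, IDs would diverge for negative hashes.
        return Math.floorMod(value.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        // Two "sources" agree only because they share this function.
        System.out.println(
            bucketId("user-42", 16) == bucketId("user-42", 16)); // prints true
    }
}
```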

Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread Ryan Blue
…[ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don’t think this is a good idea because … -- Ryan Blue Tabular

Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-14 Thread Ryan Blue
…rg/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv> - JIRA: SPARK-35801 <https://issues.apache.org/jira/browse/SPARK-35801> - PR for handling DELETE statements: <https://github.com/apache/spark/pull/33008> - Design doc <https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/> Please vote on the SPIP for the next 72 hours: [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don’t think this is a good idea because … -- Ryan Blue Tabular

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
…Hi dev, We are discussing Support Dynamic Table Options for Spark SQL (https://github.com/apache/spark/pull/34072). We are currently not sure if the syntax makes sense, and would like to know if there is other feedback or opinion on this. I would appreciate any feedback on this. Thanks. -- Ryan Blue Tabular

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
…On Mon, 15 Nov 2021 at 17:02, Russell Spitzer wrote: I think since we probably will end up using this same syntax on write, this makes a lot of sense. Unless there…

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Ryan Blue
…we can extract options from runtime session configurations, e.g., SessionConfigSupport. On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: Side note about time travel: There is a PR…

Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-12 Thread Ryan Blue
Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes Schedulers Proposal - JIRA: SPARK-36057…

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Ryan Blue
…to add a ViewCatalog interface that can be used to load, create, alter, and drop views in DataSourceV2. Please vote on the SPIP until Feb. 9th (Wednesday). [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don’t think this is a good idea because … Thanks! -- Ryan Blue Tabular

Re: Data Contracts

2023-06-12 Thread Ryan Blue
…schema metadata that are enforced in the implementation of a FileFormatDataWriter? Just throwing it out there and wondering what other people think. It's an area that interests me, as it seems that over half my problems at the day job are because of dodgy data. Regards, Phillip -- Ryan Blue Tabular

Re: Query hints visible to DSV2 connectors?

2023-08-03 Thread Ryan Blue
…hint system [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html or sql("select 1").hint("foo").show()] aren't visible from the TableCatalog/Table/ScanBuilder. I guess I could set a config parameter, but I'd rather do this on a per-query basis. Any tips? Thanks! -0xe1a -- Ryan Blue Tabular

Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Ryan Blue
…netes operator, making it a part of the Apache Flink project (https://github.com/apache/flink-kubernetes-operator). This move has gained wide industry adoption and contributions from the community. In a mere year, the Flink operator has garnered more than 600 stars and has attracted contributions from over 80 contributors. This showcases the level of community interest and collaborative momentum that can be achieved in similar scenarios. More details can be found in the SPIP doc: Spark Kubernetes Operator https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE Thanks, -- Zhou JIANG -- Ryan Blue Tabular

Re: Manually reading parquet files.

2019-03-21 Thread Ryan Blue
…tate.newHadoopConfWithOptions(relation.options)) )

import scala.collection.JavaConverters._

val rows = readFile(pFile).flatMap(_ match {
  case r: InternalRow => Seq(r)
  // This doesn't work. vector mode is doing something screwy
  case b: ColumnarBatch => b.rowIterator().asScala
}).toList

println(rows)
// List([0,1,5b,24,66647351])
// ?? this is wrong, I think

Has anyone attempted something similar? Cheers Andrew -- Ryan Blue Software Engineer Netflix
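The pattern in that snippet (a read result that is either a single row or a columnar batch that must be flattened through its row iterator) can be shown with a minimal self-contained Java analogue. `Row` and `Batch` are toy stand-ins for InternalRow and ColumnarBatch, not Spark classes:

```java
// Minimal analogue of the Scala snippet above: mixed results of single
// rows and batches are flattened into one row list via the batch's
// row iterator. Row and Batch are illustrative stand-ins only.
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class FlattenBatches {
    record Row(int value) {}
    record Batch(List<Row> rows) {
        Iterator<Row> rowIterator() { return rows.iterator(); }
    }

    static List<Row> flatten(List<Object> results) {
        return results.stream().flatMap(r -> {
            if (r instanceof Row row) return Stream.of(row);
            // Batch case: expand through the row iterator.
            Batch b = (Batch) r;
            Iterable<Row> it = b::rowIterator;
            return StreamSupport.stream(it.spliterator(), false);
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Object> mixed = Arrays.asList(
            new Row(1), new Batch(List.of(new Row(2), new Row(3))));
        System.out.println(flatten(mixed)); // [Row[value=1], Row[value=2], Row[value=3]]
    }
}
```

This only demonstrates the flattening shape; it says nothing about why Spark's vectorized reader produced unexpected row contents in the original question.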

Re: Closing a SparkSession stops the SparkContext

2019-04-02 Thread Ryan Blue
…I’m trying to understand why this is the intended behavior – anyone have any knowledge of why this is the case? Thanks, Vinoo -- Ryan Blue Software Engineer Netflix

Re: Closing a SparkSession stops the SparkContext

2019-04-03 Thread Ryan Blue
…it happens to be in the SparkContext but is state only needed by one SparkSession and that there isn't any way to clean up now, that's a compelling reason to change the API. Is that the situation? The only downside is making the user sepa…

DataSourceV2 sync 3 April 2019

2019-04-04 Thread Ryan Blue
Ryan Blue John Zhuge Russel Spitzer Gengliang Wang Yuanjian Li Matt Cheah Yifei Huang Felix Cheung Dilip Biswal Wenchen Fan -- Ryan Blue Software Engineer Netflix

Re: Dataset schema incompatibility bug when reading column partitioned data

2019-04-11 Thread Ryan Blue
…That is, when reading column-partitioned Parquet files the explicitly specified schema is not adhered to; instead, the partitioning columns are appended to the end of the column list. This is a quite severe issue, as some operations, such as union, fail if columns are in…

Re: Thoughts on dataframe cogroup?

2019-04-15 Thread Ryan Blue
…wrote: Hello, I fail to see how an equi-join on the key columns is different than the cogroup you propose…

Re: Spark 2.4.2

2019-04-16 Thread Ryan Blue
…'d like to start the process shortly. Michael -- Ryan Blue Software Engineer Netflix

Re: Spark 2.4.2

2019-04-16 Thread Ryan Blue
…by this behavior. Do you have a different proposal about how this should be handled? On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue wrote: Is this a bug fix? It looks like a new feature to me. On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust…

DataSourceV2 sync, 17 April 2019

2019-04-19 Thread Ryan Blue
*: - TableCatalog PR #24246: https://github.com/apache/spark/pull/24246 - Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233 - Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129 *Attendees*: Ryan Blue John Zhuge Matt Cheah Yifei Huang Bruce Robbins Jamison

Re: DataSourceV2 sync, 17 April 2019

2019-04-29 Thread Ryan Blue
…Jean Georges Perrin j...@jgp.net On Apr 19, 2019, at 10:10, Ryan Blue wrote: Here are my notes from the last DSv2 sync. As always: - If you’d like to attend the sync, send me an email and I’ll add you to the invite. Everyone is welcome…

Re: Bucketing and catalyst

2019-05-02 Thread Ryan Blue
…ng is Catalyst? I’ve been trying to piece together how Catalyst knows that it can remove a sort and shuffle given that both tables are bucketed and sorted the same way. Are there any classes in particular I should look at? Cheers Andrew -- Ryan Blue Software Engineer Netflix

DataSourceV2 community sync notes - 1 May 2019

2019-05-06 Thread Ryan Blue
Here are my notes for the latest DSv2 community sync. As usual, if you have comments or corrections, please reply. If you’d like to be invited to the next sync, email me directly. Everyone is welcome to attend. *Attendees*: Ryan Blue John Zhuge Andrew Long Bruce Robbins Dilip Biswal Gengliang

Re: DataSourceV2Reader Q

2019-05-21 Thread Ryan Blue
…at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:89)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:41)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:541)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:763)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:463)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:209)]
-- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes - 15 May 2019

2019-05-29 Thread Ryan Blue
Sorry these notes are so late, I didn’t get to the write up until now. As usual, if anyone has corrections or comments, please reply. *Attendees*: John Zhuge Ryan Blue Andrew Long Wenchen Fan Gengliang Wang Russell Spitzer Yuanjian Li Yifei Huang Matt Cheah Amardeep Singh Dhilon Zhilmil Dhion

DataSourceV2 sync notes - 29 May 2019

2019-05-30 Thread Ryan Blue
Here are my notes from last night’s sync. I had to leave early, so there may be more discussion. Others can fill in the details for those topics. *Attendees*: John Zhuge Ryan Blue Yifei Huang Matt Cheah Yuanjian Li Russell Spitzer Kevin Yu *Topics*: - Atomic extensions for the TableCatalog

DataSourceV2 sync notes - 12 June 2019

2019-06-14 Thread Ryan Blue
Here are the latest DSv2 sync notes. Please reply with updates or corrections. *Attendees*: Ryan Blue Michael Armbrust Gengliang Wang Matt Cheah John Zhuge *Topics*: Wenchen’s reorganization proposal Problems with TableProvider - property map isn’t sufficient New PRs: - ReplaceTable

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-17 Thread Ryan Blue
Hi everyone, I would like to call a vote for the SPIP for SPARK-25299 <https://issues.apache.org/jira/browse/SPARK-25299>, which proposes to introduce a pluggable storage API for temporary shuffle data. You may find the SPIP document here <https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit>. The discussion thread for the SPIP was conducted here <https://lists.apache.org/thread.html/2fe82b6b86daadb1d2edaef66a2d1c4dd2f45449656098ee38c50079@%3Cdev.spark.apache.org%3E>. Please vote on whether or not this proposal is agreeable to you. Thanks! -Matt Cheah -- Ryan Blue Software Engineer Netflix

Re: DSv1 removal

2019-06-20 Thread Ryan Blue
…DSv1 sources have to be removed (in case the DSv2 replacement is implemented). After some digging I've found DSv1 sources which are already removed, but in some cases v1 and v2 still exist in parallel. Can somebody please tell me what's the overall plan in this area? BR, G -- Ryan Blue Software Engineer Netflix

Re: Timeline for Spark 3.0

2019-06-28 Thread Ryan Blue
…Is there a timeline for Spark 3.0 in terms of the first RC and final release? Cheers Andrew -- Ryan Blue Software Engineer Netflix

DSv2 sync notes - 26 June 2019

2019-06-28 Thread Ryan Blue
Here are my notes from this week’s sync. *Attendees*: Ryan Blue John Zhuge Dale Richardson Gabor Somogyi Matt Cheah Yifei Huang Xin Ren Jose Torres Gengliang Wang Kevin Yu *Topics*: - Metadata columns or function push-down for Kafka v2 source - Open PRs - REPLACE TABLE

Re: JDBC connector for DataSourceV2

2019-07-12 Thread Ryan Blue
…in Master, but can't find a JDBC implementation or related JIRA. The DataSourceV2 APIs look to me in good shape to attempt a JDBC connector for the READ/WRITE path. Thanks & Regards, Shiv -- Ryan Blue Software Engineer Netflix

Re: JDBC connector for DataSourceV2

2019-07-12 Thread Ryan Blue
Sounds great! Ping me on the review, I think this will be really valuable. On Fri, Jul 12, 2019 at 6:51 PM Xianyin Xin wrote: If there’s nobody working on that, I’d like to contribute. Loop in @Gengliang Wang. Xianyin…

DataSourceV2 sync notes - 10 July 2019

2019-07-19 Thread Ryan Blue
Here are my notes from the last sync. If you’d like to be added to the invite or have topics, please let me know. *Attendees*: Ryan Blue Matt Cheah Yifei Huang Jose Torres Burak Yavuz Gengliang Wang Michael Artz Russel Spitzer *Topics*: - Existing PRs - V2 session catalog: https

Re: DataSourceV2 sync notes - 10 July 2019

2019-07-23 Thread Ryan Blue
…may hurt data source v2 performance a lot, and we'd better fix it sooner rather than later. On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue wrote: Here are my notes from the last sync. If you’d like to be added to the invite or have topics, please let…

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-26 Thread Ryan Blue
…following ANSI SQL is a better idea. For more information, please read the Discuss: Follow ANSI SQL on table insertion <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing> Please let me know if you have any thoughts on this. Regards, Gengliang -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Ryan Blue
…warranted to do so. -Matt Cheah From: Reynold Xin Date: Wednesday, July 31, 2019 at 9:58 AM To: Matt Cheah Cc: Russell Spitzer, Takeshi Yamamuro <linguin@gmail.com>, Ge…

Re: DataSourceV2 : Transactional Write support

2019-08-03 Thread Ryan Blue
…Thanks in advance for your help. Regards, Shiv -- Name: Jungtaek Lim Blog: http://medium.com/@heartsavior Twitter: http://twitter.com/heartsavior LinkedIn: http://www.linkedin.com/in/heartsavior -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Follow ANSI SQL on table insertion

2019-08-05 Thread Ryan Blue
…id results. My intuition is yes, because different users have different levels of tolerance for different kinds of errors. I’d expect these sorts of configurations to be set up at an infrastructure level, e.g. to maintain consistent standards throughout a who…

DataSourceV2 sync notes - 24 July 2019

2019-08-06 Thread Ryan Blue
Here are my notes from the last DSv2 sync. Sorry it's a bit late! *Attendees*: Ryan Blue John Zhuge Raynmond McCollum Terry Kim Gengliang Wang Jose Torres Wenchen Fan Priyanka Gomatam Matt Cheah Russel Spitzer Burak Yavuz *Topics*: - Check in on blockers - Remove Sav

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread Ryan Blue
…columns - owner - createTime - softwareVersion - options (map). ViewColumn interface: - name - type. Thanks, John Zhuge -- Ryan Blue Software Engineer Netflix
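The metadata fields listed in this message can be sketched as a self-contained interface. This is a hypothetical shape for illustration, assembled only from the fields named above; it is not the final Spark ViewCatalog API:

```java
// Hypothetical sketch of the view metadata discussed above (illustrative,
// not Spark's final interface): a view exposes the listed metadata
// fields, and columns are (name, type) pairs.
import java.util.List;
import java.util.Map;

public class ViewSketch {
    record ViewColumn(String name, String type) {}

    interface View {
        List<ViewColumn> columns();
        String owner();
        long createTime();
        String softwareVersion();
        Map<String, String> options();
    }

    public static void main(String[] args) {
        ViewColumn c = new ViewColumn("id", "bigint");
        System.out.println(c.name() + ": " + c.type()); // prints id: bigint
    }
}
```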

DSv2 sync notes - 21 August 2019

2019-08-30 Thread Ryan Blue
Sorry these notes were delayed. Here’s what we talked about in the last DSv2 sync. *Attendees*: Ryan Blue John Zhuge Burak Yavuz Gengliang Wang Terry Kim Wenchen Fan Xin Ren Srabasti Banerjee Priyanka Gomatam *Topics*: - Follow up on renaming append to insert in v2 API - Changes to

DSv2 sync - 4 September 2019

2019-09-06 Thread Ryan Blue
Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything. *Attendees*: Ryan Blue John Zhuge Russell Spitzer Matt Cheah Gengliang Wang Priyanka Gomatam Holden Karau *Topics*: - DataFrameWriterV2 insert vs append (recap) - ANSI and strict modes

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-06 Thread Ryan Blue
…Dataset encoder. As far as I know, no mainstream DBMS is using this policy by default. Currently, the V1 data source uses the "Legacy" policy by default, while V2 uses "Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 in Spark 3.0. There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the dev mailing list. This vote is open until next Thurs (Sept. 12th). [ ] +1: Accept the proposal [ ] +0 [ ] -1: I don't think this is a good idea because ... Thank you! Gengliang -- Ryan Blue Software Engineer Netflix

Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Ryan Blue
…[ ] +0 [ ] -1: I don't think this is a good idea because ... I'll start with my +1. Thanks, Tom -- Ryan Blue Software Engineer Netflix

Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Ryan Blue
…SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
SPARK-24941 Add RDDBarrier.coalesce() function
SPARK-25017 Add test suite for ContextBarrierState
SPARK-25083 remove the type erasure hack in data source scan
SPARK-25383 Image data source supports sample pushdown
SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
SPARK-26731 remove EOLed spark jobs from jenkins
SPARK-26664 Make DecimalType's minimum adjusted scale configurable
SPARK-21559 Remove Mesos fine-grained mode
SPARK-24942 Improve cluster resource management with jobs containing barrier stage
SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
SPARK-26022 PySpark Comparison with Pandas
SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
SPARK-26221 Improve Spark SQL instrumentation and metrics
SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
SPARK-25843 Redesign rangeBetween API
SPARK-25841 Redesign window function rangeBetween API
SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
SPARK-23210 Introduce the concept of default value to schema
SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
SPARK-25531 new write APIs for data source v2
SPARK-25547 Pluggable jdbc connection factory
SPARK-20845 Support specification of column names in INSERT INTO
SPARK-24417 Build and Run Spark on JDK11
SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-25186 Stabilize Data Source V2 API
SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
SPARK-25390 data source V2 API refactoring
SPARK-7768 Make user-defined type (UDT) API public
SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
SPARK-15691 Refactor and improve Hive support
SPARK-15694 Implement ScriptTransformation in sql/core
SPARK-16217 Support SELECT INTO statement
SPARK-16452 basic INFORMATION_SCHEMA support
SPARK-18134 SQL: MapType in Group BY and Joins not working
SPARK-18245 Improving support for bucketed table
SPARK-19842 Informational Referential Integrity Constraints Support in Spark
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
SPARK-22386 Data Source V2 improvements
SPARK-24723 Discuss necessary info and access in barrier mode + YARN
-- Name: Jungtaek Lim Blog: http://medium.com/@heartsavior Twitter: http://twitter.com/heartsavior LinkedIn: http://www.linkedin.com/in/heartsavior -- John Zhuge -- Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau -- Ryan Blue Software Engineer Netflix

DSv2 sync notes - 28 September 2019

2019-09-20 Thread Ryan Blue
Here are my notes from this week’s DSv2 sync. *Attendees*: Ryan Blue Holden Karau Russell Spitzer Terry Kim Wenchen Fan Shiv Prashant Sood Joseph Torres Gengliang Wang Matt Cheah Burak Yavuz *Topics*: - Driver-side Hadoop conf - SHOW DATABASES/NAMESPACES behavior - Review outstanding

[DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
patibility, to keep the scope of the release small. The purpose is to assist people moving to 3.0 and not distract from the 3.0 release. Would a Spark 2.5 release help anyone else? Are there any concerns about this plan? rb -- Ryan Blue Software Engineer Netflix

Re: Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Ryan Blue
…SPARK-20568) Here, I am proposing to cut the branch on October 15th. If the features are targeting the 3.0 preview release, please prioritize the work and finish it before that date. Note, Oct. 15th is not the code freeze of Spark 3.0. That means the community will still work on features for the upcoming Spark 3.0 release, even if they are not included in the preview release. The goal of the preview release is to collect more feedback from the community regarding the new 3.0 features/behavior changes. Thanks! -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
…stable. On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue wrote: Hi everyone, In the DSv2 sync this week, we talked about a possible Spark 2.5 release based on the latest Spark 2.4, but with DSv2 and Java 11 support added…

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
…release, I think. I'm still not convinced there is a burning need to use Java 11 but stay on 2.4, after 3.0 is out, and at least the wheels are in motion there. Java 8 is still free and being updated. On Fri, Sep 20, 2019 at 12:48 PM Ryan Blue wrote…

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
…temporarily. You can't just make a bunch of internal APIs tightly coupled with other internal pieces public and stable and call it a day, just because it happens to satisfy some use cases temporarily, assuming the rest of Spark doesn't change. On Fri, Sep 20, 2…

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
…this way, you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes. On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue wrote: When you created the PR to make InternalRow public… This i…

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
…policy.html). We just won’t add any breaking changes before 3.1. Bests, Dongjoon. On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue wrote: I don’t think we need to gate a 3.0 release on making a more stable version of InternalRow…

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
…migrate to Spark 3.0 if they are prepared to migrate to the new DSv2. On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun wrote: Do you mean you want to have a breaking API change between 3.0 and 3.1? I believe we follow Semantic Versioning (https://spark.apach…

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
. On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin wrote: > How would you not make incompatible changes in 3.x? As discussed the > InternalRow API is not stable and needs to change. > > On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue wrote: > >> > Making downstream to diverge thei

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
ine ... as > suggested by others in the thread, DSv2 would be one of the main reasons > people upgrade to 3.0. What's so special about DSv2 that we are doing this? > Why not abandoning 3.0 entirely and backport all the features to 2.x? > > > > On Sat, Sep 21, 2019 at

Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Ryan Blue
to try the DSv2 API and build DSv2 data >>>> sources, can we recommend the 3.0-preview release for this? That would get >>>> people shifting to 3.0 faster, which is probably better overall compared to >>>> maintaining two major versions. There’s not that much

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
t; >> I would personally love to see us provide a gentle migration path to >> Spark 3 especially if much of the work is already going to happen anyways. >> >> Maybe giving it a different name (eg something like >> Spark-2-to-3-transitional) would make it more clear about i

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
view has > advantage here (assuming we provide maven artifacts as well as official > announcement), as it can give us expectation that there're bunch of changes > given it's a new major version. It also provides bunch of time to try > adopting it before the version is officially

[DISCUSS] Out of order optimizer rules?

2019-09-28 Thread Ryan Blue
addressed. rb -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
ite sure. Seems to me it's better > to run it before join reorder. > > On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue > wrote: > >> Hi everyone, >> >> I have been working on a PR that moves filter and projection pushdown >> into the optimizer for DSv2, instead of

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
ues.apache.org/jira/browse/HIVE-9152). Henry R's description >> was also correct. >> >> >> >> >> >> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue >> wrote: >> >>> Where can I find a design doc for dynamic partition pruning tha

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Ryan Blue
rules are originally for Dataset >>>> encoder. As far as I know, no mainstream DBMS is using this policy by >>>> default. >>>> >>>> Currently, the V1 data source uses "Legacy" policy by default, while V2 >>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1 >>>> and V2 in Spark 3.0. >>>> >>>> This vote is open until Friday (Oct. 11). >>>> >>>> [ ] +1: Accept the proposal >>>> [ ] +0 >>>> [ ] -1: I don't think this is a good idea because ... >>>> >>>> Thank you! >>>> >>>> Gengliang >>>> >>> >>> >>> -- >>> --- >>> Takeshi Yamamuro >>> >> -- > [image: Databricks Summit - Watch the talks] > <https://databricks.com/sparkaisummit/north-america> > -- Ryan Blue Software Engineer Netflix
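The vote above contrasts three store assignment policies. As a rough mental model only — this is a minimal Python sketch with simplified type names, not Spark's actual cast analysis — the difference between "Legacy", "ANSI", and "Strict" on table insertion can be illustrated like this:

```python
# Illustrative sketch (not Spark code) of the three store assignment
# policies discussed in the vote, using simplified type names.
# "Legacy" permits any cast (values may silently become null at runtime),
# "ANSI" rejects clearly invalid ones such as string -> int on insert,
# and "Strict" allows only safe widening casts.

WIDENING = {("int", "long"), ("int", "double"), ("float", "double")}

def can_store(source: str, target: str, policy: str) -> bool:
    if source == target:
        return True
    if policy == "Legacy":
        return True  # anything goes; invalid values surface at runtime
    if policy == "Strict":
        return (source, target) in WIDENING
    if policy == "ANSI":
        # allow numeric-to-numeric casts, but not string -> numeric
        numeric = {"int", "long", "float", "double"}
        return source in numeric and target in numeric
    raise ValueError(f"unknown policy: {policy}")
```

For example, inserting a string column into an int column is accepted under "Legacy" but rejected under "ANSI", while a long-to-int narrowing cast is accepted under "ANSI" but rejected under "Strict".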

DataSourceV2 sync notes - 2 October 2019

2019-10-10 Thread Ryan Blue
Here are my notes from last week's DSv2 sync. *Attendees*: Ryan Blue, Terry Kim, Wenchen Fan *Topics*: - SchemaPruning only supports Parquet and ORC? - Out of order optimizer rules - 3.0 work - Rename session catalog to spark_catalog - Finish TableProvider update to


Cancel DSv2 sync this week

2019-10-15 Thread Ryan Blue
Hi everyone, I can't make it to the DSv2 sync tomorrow, so let's skip it. If anyone would prefer to have one and is willing to take notes, I can send out the invite. Just let me know, otherwise let's consider it cancelled. Thanks, rb -- Ryan Blue Software Engineer Netflix

DSv2 sync notes - 30 October 2019

2019-11-01 Thread Ryan Blue
*Attendees*: Ryan Blue, Terry Kim, Wenchen Fan, Jose Torres, Jacky Lee, Gengliang Wang *Topics*: - DROP NAMESPACE cascade behavior - 3.0 tasks - TableProvider API changes - V1 and V2 table resolution rules - Separate logical and physical write (for streaming) - Bucketing support

Re: DSv2 reader lifecycle

2019-11-06 Thread Ryan Blue
at, it's quite > expensive to deserialize all the various metadata, so I was holding the > deserialized version in the DataSourceReader, but if Spark is repeatedly > constructing new ones, then that doesn't help. If this is the expected > behavior, how should I handle this as a consumer of the API? > > Thanks! > Andrew > -- Ryan Blue Software Engineer Netflix

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread Ryan Blue
lows >> arbitrary metadata/framing data to be wrapped around individual objects >> cheaply. Right now, that’s only possible at the stream level. (There are >> hacks around this, but this would enable more idiomatic use in efficient >> shuffle implementations.) >> >> >> Have serializers indicate whether they are deterministic. This provides >> much of the value of a shuffle service because it means that reducers do >> not need to spill to disk when reading/merging/combining inputs--the data >> can be grouped by the service, even without the service understanding data >> types or byte representations. Alternative (less preferable since it would >> break Java serialization, for example): require all serializers to be >> deterministic. >> >> >> >> -- >> >> - Ben >> > -- Ryan Blue Software Engineer Netflix
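The snippet above proposes letting serializers declare whether they are deterministic, so a shuffle service can group records by their serialized bytes without understanding the data types. A minimal Python sketch of that idea follows; all names here (`Serializer`, `is_deterministic`, `group_by_serialized_key`) are hypothetical and only illustrate the contract, not any Spark API:

```python
import pickle

class Serializer:
    """Hypothetical serializer interface. `is_deterministic` signals that
    equal objects always produce identical bytes, so a shuffle service can
    group records by raw byte key without understanding the value type."""
    def serialize(self, obj) -> bytes:
        raise NotImplementedError
    @property
    def is_deterministic(self) -> bool:
        return False  # conservative default

class PickleSerializer(Serializer):
    def serialize(self, obj) -> bytes:
        return pickle.dumps(obj)
    @property
    def is_deterministic(self) -> bool:
        return True  # holds for the simple value types used in this sketch

def group_by_serialized_key(serializer, records):
    """Group (key, value) pairs by serialized key bytes -- only safe when
    the serializer has declared itself deterministic."""
    if not serializer.is_deterministic:
        raise ValueError("cannot group: serializer is not deterministic")
    groups = {}
    for key, value in records:
        groups.setdefault(serializer.serialize(key), []).append(value)
    return groups
```

This shows why the flag matters: without the determinism guarantee, two equal keys could serialize to different byte strings and end up in different groups, which is exactly the failure the proposal guards against.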

Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Ryan Blue
://mail-archives.apache.org/mod_mbox/parquet-dev/201911.mbox/%3c8357699c-9295-4eb0-a39e-b3538d717...@gmail.com%3E> > ). > > Might there be any desire to cut a Spark 2.4.5 release so that users can > pick up these changes independently of all the other changes in Spark 3.0? > &g

Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-05 Thread Ryan Blue
d be >> minimal since this applies only when there are temp views and tables with >> the same name. >> >> Any feedback will be appreciated. >> >> I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon >> Hyun for guidance and suggestion. >> >> Regards, >> Terry >> >> >> <https://issues.apache.org/jira/browse/SPARK-29900> >> > -- Ryan Blue Software Engineer Netflix

Next DSv2 sync date

2019-12-08 Thread Ryan Blue
Hi everyone, I have a conflict with the normal DSv2 sync time this Wednesday and I'd like to attend to talk about the TableProvider API. Would it work for everyone to have the sync at 6PM PST on Tuesday, 10 December instead? I could also make it at the normal time on Thursday. Thanks, --

Re: Next DSv2 sync date

2019-12-09 Thread Ryan Blue
Actually, my conflict was cancelled so I'll send out the usual invite for Wednesday. Sorry for the noise. On Sun, Dec 8, 2019 at 3:15 PM Ryan Blue wrote: > Hi everyone, > > I have a conflict with the normal DSv2 sync time this Wednesday and I'd > like to attend to talk a

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Ryan Blue
ng something. >>> >>> What do you think? It would bring backward incompatible change, but >>> given the interface is marked as Evolving and we're making backward >>> incompatible changes in Spark 3.0, so I feel it may not matter. >>> >>> Would love to hear your thoughts. >>> >>> Thanks in advance, >>> Jungtaek Lim (HeartSaVioR) >>> >>> >>> -- Ryan Blue Software Engineer Netflix

DSv2 sync notes - 11 December 2019

2019-12-19 Thread Ryan Blue
Hi everyone, here are my notes for the DSv2 sync last week. Sorry they’re late! Feel free to add more details or corrections. Thanks! rb *Attendees*: Ryan Blue, John Zhuge, Dongjoon Hyun, Joseph Torres, Kevin Yu, Russell Spitzer, Terry Kim, Wenchen Fan, Hyukjin Kwon, Jacky Lee *Topics*: - Relation

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Ryan Blue
s are going to be >>> allowed together (e.g., `concat(years(col) + days(col))`); >>> however, it looks impossible to extend with the current design. It just >>> directly maps transformName to implementation class, >>> and just pass arguments: >>> >>> transform >>> ... >>> | transformName=identifier >>> '(' argument+=transformArgument (',' argument+=transformArgument)* >>> ')' #applyTransform >>> ; >>> >>> It looks regular expressions are supported; however, it's not. >>> - If we should support, the design had to consider that. >>> - if we should not support, different syntax might have to be used >>> instead. >>> >>> *Limited Compatibility Management* >>> The name can be arbitrary. For instance, if "transform" is supported in >>> Spark side, the name is preempted by Spark. >>> If every the datasource supported such name, it becomes not compatible. >>> >>> >>> >>> -- Ryan Blue Software Engineer Netflix
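The limitation described above is that the grammar maps a transform name directly to an implementation and passes flat arguments, so nested expressions like `years(col)` inside another transform cannot be expressed. A tiny Python sketch of a recursive transform representation shows the extensibility point being asked for; the names (`Ref`, `Transform`, `describe`) are hypothetical and not Spark's actual Transform API:

```python
# Minimal sketch of a recursive transform tree. If a transform's
# arguments can themselves be transforms (not only column references),
# nested expressions like concat(years(ts), days(ts)) become
# representable -- which a flat name -> arguments mapping cannot do.

class Ref:
    """A plain column reference."""
    def __init__(self, name):
        self.name = name
    def describe(self):
        return self.name

class Transform:
    """A named transform whose args are Refs or nested Transforms."""
    def __init__(self, name, *args):
        self.name = name
        self.args = args
    def describe(self):
        inner = ", ".join(a.describe() for a in self.args)
        return f"{self.name}({inner})"

# concat(years(ts), days(ts)) expressed as a nested tree
expr = Transform("concat",
                 Transform("years", Ref("ts")),
                 Transform("days", Ref("ts")))
```

Supporting this shape is a grammar and API design decision, which is exactly why the thread argues the current flat design is hard to extend later.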

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Ryan Blue
support metrics. >> >> So it will be easy to collect the metrics if FilePartitionReaderFactory >> implements ReportMetrics >> >> >> Please let me know the views, or even if we want to have new solution or >> design. >> > -- Ryan Blue Software Engineer Netflix

Re: [Discuss] Metrics Support for DS V2

2020-01-20 Thread Ryan Blue
st. > > On Fri, 17 Jan 2020 at 10:33 PM, Ryan Blue wrote: > >> We've implemented these metrics in the RDD (for input metrics) and in the >> v2 DataWritingSparkTask. That approach gives you the same metrics in the >> stage views that you get with v1 sources, regardl

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Ryan Blue
;>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads >>>>>>> to >>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a >>>>>>> serious >>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW >>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the >>>>>>> parser rules conflict and we pick the native syntax depending on the >>>>>>> rule >>>>>>> order. But the user-facing behavior looks fine. >>>>>>> >>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in >>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE >>>>>>> syntax? >>>>>>> Then CREATE EXTERNAL TABLE creates Hive table like 2.4. >>>>>>> >>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim < >>>>>>> kabhwan.opensou...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi devs, >>>>>>>> >>>>>>>> I'd like to initiate discussion and hear the voices for resolving >>>>>>>> ambiguous parser rule between two "create table"s being brought by >>>>>>>> SPARK-30098 [1]. >>>>>>>> >>>>>>>> Previously, "create table" parser rules were clearly distinguished >>>>>>>> via "USING provider", which was very intuitive and deterministic. Say, >>>>>>>> DDL >>>>>>>> query creates "Hive" table unless "USING provider" is specified, >>>>>>>> (Please refer the parser rule in branch-2.4 [2]) >>>>>>>> >>>>>>>> After SPARK-30098, "create table" parser rules became ambiguous >>>>>>>> (please refer the parser rule in branch-3.0 [3]) - the factors >>>>>>>> differentiating two rules are only "ROW FORMAT" and "STORED AS" which >>>>>>>> are >>>>>>>> all defined as "optional". Now it relies on the "order" of parser rule >>>>>>>> which end users would have no idea to reason about, and very >>>>>>>> unintuitive. >>>>>>>> >>>>>>>> Furthermore, undocumented rule of EXTERNAL (added in the first rule >>>>>>>> to provide better message) brought more confusion (I've described the >>>>>>>> broken existing query via SPARK-30436 [4]). 
>>>>>>>> >>>>>>>> Personally I'd like to see two rules mutually exclusive, instead of >>>>>>>> trying to document the difference and talk end users to be careful >>>>>>>> about >>>>>>>> their query. I'm seeing two ways to make rules be mutually exclusive: >>>>>>>> >>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ... >>>>>>>> "HIVE" TABLE ...`. >>>>>>>> >>>>>>>> pros. This is the simplest way to distinguish between two rules. >>>>>>>> cons. This would lead end users to change their query if they >>>>>>>> intend to create Hive table. (Given we will also provide legacy option >>>>>>>> I'm >>>>>>>> feeling this is acceptable.) >>>>>>>> >>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory one. >>>>>>>> >>>>>>>> pros. Less invasive for existing queries. >>>>>>>> cons. Less intuitive, because they have been optional and now >>>>>>>> become mandatory to fall into the second rule. >>>>>>>> >>>>>>>> Would like to hear everyone's voices; better ideas are welcome! >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Jungtaek Lim (HeartSaVioR) >>>>>>>> >>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE >>>>>>>> syntax >>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098 >>>>>>>> 2. >>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 >>>>>>>> 3. >>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 >>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436 >>>>>>>> >>>>>>>> -- Ryan Blue Software Engineer Netflix
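The ambiguity Jungtaek describes can be made concrete with a toy classifier — this is an illustration only, not the real ANTLR grammar: before SPARK-30098 the `USING` clause cleanly separated the two rules, but afterwards a plain `CREATE TABLE` with neither `USING` nor `ROW FORMAT` / `STORED AS` matches both rules and is resolved only by rule order.

```python
# Toy classifier (not the real parser) illustrating the CREATE TABLE
# ambiguity: with no distinguishing clause present, the statement
# matches both the native and the Hive rule, and the outcome depends
# on parser rule order rather than anything the user wrote.

def classify_create_table(ddl: str) -> str:
    up = ddl.upper()
    if " USING " in up:
        return "native"
    if " ROW FORMAT " in up or " STORED AS " in up:
        return "hive"
    return "ambiguous (resolved by parser rule order)"
```

Both proposals in the snippet — a mandatory distinguishing keyword, or making `ROW FORMAT` / `STORED AS` mandatory for the Hive rule — would eliminate the third branch above by making the rules mutually exclusive.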

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Ryan Blue
rk/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description> >>> add to the confusion by describing the Hive-compatible command as "CREATE >>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are

Re: [DISCUSS] Supporting hive on DataSourceV2

2020-03-23 Thread Ryan Blue
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > ----- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Ryan Blue Software Engineer Netflix

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Ryan Blue
e >>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED >>>> BY transformList. >>>> >>> >>> Another side note: Perhaps as part of (or after) unifying the CREATE >>> TABLE syntax, we can also update Catalog.createTable() to support >>> creating partitioned tables >>> <https://issues.apache.org/jira/browse/SPARK-31001>. >>> >> -- Ryan Blue Software Engineer Netflix
