Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Ryan Blue
...law of diminishing returns, I would not advise that either. You can of course use gzip for compression, which may be more suitable for your needs. HTH, Mich Talebzadeh, Technologist | Solutions Architect | Da...

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
...atm?, as we have a use case where we need Parquet V2: one of our components uses Parquet V2. On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue wrote: Hi Prem, Parquet v1 is the default because v2 has not been finalized and adopted by the com...
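
The writer format version used by Spark's built-in Parquet support is controlled through parquet-mr's Hadoop configuration rather than a dedicated Spark option. A minimal sketch, assuming a Spark 3.x session (the output path is illustrative):

    // Ask parquet-mr to write format-v2 data pages. As noted in the thread,
    // v1 stays the default because v2 has not been finalized by the community.
    spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

    // Any subsequent Parquet write picks up the setting.
    spark.range(10).write.parquet("/tmp/parquet-v2-sample")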

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
...https://github.com/apache/parquet-mr/blob/master/CHANGES.md HTH, Mich Talebzadeh. On Mon, 15 Apr 2024 at 18:59, Prem Sahoo wrote: Hello Team, may I know how to check which version of Parquet is supported by parquet-mr 1.2.1? Which version of parquet-mr supports Parquet version 2 (V2)? Which version of Spark supports Parquet version 2? May I get the release notes where Parquet versions are mentioned?

Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Ryan Blue
...Kubernetes operator, making it part of the Apache Flink project (https://github.com/apache/flink-kubernetes-operator). This move has gained wide industry adoption and contributions from the community. In a mere year, the Flink operator has garnered more than 600 stars and attracted contributions from over 80 contributors. This showcases the level of community interest and collaborative momentum that can be achieved in similar scenarios. More details can be found in the SPIP doc: Spark Kubernetes Operator, https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE Thanks, Zhou JIANG

Re: Query hints visible to DSV2 connectors?

2023-08-03 Thread Ryan Blue
...hints (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html, or sql("select 1").hint("foo").show()) aren't visible from the TableCatalog/Table/ScanBuilder. I guess I could set a config parameter but I'd rather do this on a per-query basis. Any tips? Thanks! -0xe1a
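
For per-query settings, DSv2 read options do reach the connector even though hints do not: options set on the DataFrameReader arrive at the Table and ScanBuilder as a CaseInsensitiveStringMap. A sketch, where the source name and option key are hypothetical:

    // Per-query: the option is visible in Table.newScanBuilder(options).
    val df = spark.read
      .format("myformat")          // hypothetical DSv2 source
      .option("scan-hint", "foo")  // hypothetical option key
      .load()

    // Session-level alternative: a source implementing SessionConfigSupport
    // receives all "spark.datasource.<shortName>.*" confs as options.
    spark.conf.set("spark.datasource.myformat.scan-hint", "foo")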

Re: Data Contracts

2023-06-12 Thread Ryan Blue
...a that are enforced in the implementation of a FileFormatDataWriter? Just throwing it out there and wondering what other people think. It's an area that interests me, as it seems that over half my problems at the day job are because of dodgy data. Regards, Phillip

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread Ryan Blue
...a ViewCatalog interface that can be used to load, create, alter, and drop views in DataSourceV2. Please vote on the SPIP until Feb. 9th (Wednesday). [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ... Thanks!
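
For readers unfamiliar with the proposal, its shape follows the existing TableCatalog plugin pattern. The sketch below is illustrative of the SPIP only, not a merged Spark interface; the View trait and method signatures here are assumptions:

    import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}

    // Hypothetical view representation: the view's SQL text.
    trait View { def sql: String }

    // Sketch of the proposed catalog surface for view metadata.
    trait ViewCatalog extends CatalogPlugin {
      def loadView(ident: Identifier): View
      def createView(ident: Identifier, sql: String): View
      def alterView(ident: Identifier, newSql: String): View
      def dropView(ident: Identifier): Boolean
    }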

Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-12 Thread Ryan Blue
...ous discussion in the dev mailing list: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal. Design doc: [SPIP] SPARK-36057 Support Customized Kubernetes Schedulers Proposal. JIRA: SPARK-36057

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Ryan Blue
...gurations, e.g., SessionConfigSupport. On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas wrote: Side note about time travel: there is a PR (https://github.com/apache/spark/pull/34497) to add VERSION/TIME...
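
For reference, the VERSION/TIMESTAMP AS OF syntax from that PR later landed in Spark SQL for v2 tables that support time travel. A sketch; the table name and version values are illustrative:

    spark.sql("SELECT * FROM cat.db.events VERSION AS OF 42")
    spark.sql("SELECT * FROM cat.db.events TIMESTAMP AS OF '2021-11-01 00:00:00'")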

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
...such loss, damage or destruction. On Mon, 15 Nov 2021 at 17:02, Russell Spitzer wrote: I think since we probably will end up using this same syntax on write, this makes a lot of sense. Unless there is another good way to express a sim...

Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Ryan Blue
...dev, we are discussing Support Dynamic Table Options for Spark SQL (https://github.com/apache/spark/pull/34072). It is currently not certain whether the syntax makes sense, and we would like to know if there is other feedback or opinion on this. I would appreciate any feedback. Thanks.

Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-14 Thread Ryan Blue
...JIRA: SPARK-35801 (https://issues.apache.org/jira/browse/SPARK-35801). PR for handling DELETE statements: https://github.com/apache/spark/pull/33008. Design doc: https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/ Please vote on the SPIP for the next 72 hours: [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ...

Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-29 Thread Ryan Blue
...[ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ...

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
...we can clearly define the bucket hash function of the builtin `BucketTransform` in the doc. On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue wrote: Two v2 sources may return different bucket IDs for the same value, and this breaks the phase 1 split-wise join.

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Ryan Blue
...https://issues.apache.org/jira/browse/SPARK-19256 has details). 1. Would aggregate work automatically after the SPIP? Another major benefit of having bucketed tables is avoiding the shuffle before agg...

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Ryan Blue
...wrote: +1 for this SPIP. On Sun, Oct 24, 2021 at 9:59 AM huaxin gao wrote: +1. Thanks for lifting the current restrictions on bucket join and...

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-24 Thread Ryan Blue
...distribution properties reported by data sources and eliminate shuffles whenever possible. Design doc: https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE (includes a POC link at the end). We'd like to start a discussion on the doc, and any feedback is welcome! Thanks, Chao
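
The storage layout the SPIP reasons about can be declared with DataFrameWriterV2 transforms. A minimal sketch, assuming a DataFrame df and hypothetical catalog/table names; two tables written with compatible bucket transforms are what the proposal would let Spark join without a shuffle:

    import org.apache.spark.sql.functions.{bucket, col}

    df.writeTo("cat.db.events")                   // catalog/table names hypothetical
      .partitionedBy(bucket(16, col("user_id")))  // storage-partition transform
      .create()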

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread Ryan Blue
...Please vote on the SPIP in the next 72 hours. Once it is approved, I'll update the PR for review. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ...

[RESULT] [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
This SPIP is adopted with the following +1 votes and no -1 or +0 votes: Holden Karau*, John Zhuge, Chao Sun, Dongjoon Hyun*, Russell Spitzer, DB Tsai*, Wenchen Fan*, Kent Yao, Huaxin Gao, Liang-Chi Hsieh, Jungtaek Lim, Hyukjin Kwon*, Gengliang Wang, kordex, Takeshi Yamamuro, Ryan Blue (* = binding). On Mon, Mar...
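
The adopted design pairs an UnboundFunction with a bound ScalarFunction that has a row-based produceResult plus an optional reflection-resolved "magic" invoke method. A minimal sketch of a function a FunctionCatalog could return, against the Spark 3.2 connector API:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, ScalarFunction, UnboundFunction}
    import org.apache.spark.sql.types.{DataType, IntegerType, StructType}

    object IntAdd extends UnboundFunction {
      override def name(): String = "int_add"
      override def description(): String = "int_add(a, b): adds two ints"

      override def bind(inputType: StructType): BoundFunction = new ScalarFunction[Int] {
        override def name(): String = "int_add"
        override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
        override def resultType(): DataType = IntegerType

        // Row-parameter fallback path discussed in the thread.
        override def produceResult(input: InternalRow): Int =
          input.getInt(0) + input.getInt(1)

        // Optional "magic" method, found by reflection and compiled via Invoke.
        def invoke(a: Int, b: Int): Int = a + b
      }
    }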

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-15 Thread Ryan Blue
...On Tue, Mar 9, 2021 at 9:27 AM huaxin gao wrote: +1 (non-binding) ...

[VOTE] SPIP: Add FunctionCatalog

2021-03-08 Thread Ryan Blue
...Please vote on the SPIP in the next 72 hours. Once it is approved, I'll do a final update of the PR and we can merge the API. [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don't think this is a good idea because ...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-04 Thread Ryan Blue
...then we can have a single ScalarFunction interface which has the row-parameter API (with a default implementation that fails) and documentation describing the "magical methods" (which can be done later). I'll start the PR review th...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
...throw new UnsupportedOperationException(); } By providing the default implementation, it will not technically force users to implement it, and we can provide proper documentation about the expected usage. What do you think? Bests, Dongjoon.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Ryan Blue
...object and cause boxing issues. I agree that Object[] is worse than InternalRow. But I can't think of real use cases that would force the individual-parameters approach to use Object instead of concrete types. On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue wrote: ...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-01 Thread Ryan Blue
...safety guarantees only if you need just one set of types for each number of arguments and are using the non-codegen path. Since varargs is one of the primary reasons to use this API, I don't think it is a good idea to use Object[] instead of InternalRow.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-19 Thread Ryan Blue
...merge two Arrays (of generic types) into a Map. Also, to address Wenchen's InternalRow question, can we create a number of Function classes, each corresponding to an input parameter count (e.g., ScalarFunction1, ScalarFunction2, etc.)? Thanks...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Ryan Blue
...-parameter version? To move forward, how about we implement the function loading and binding first? Then we can have PRs for both the individual-parameters (I can take it) and row-parameter approaches, if we still can't reach a consensus at that time and need to see all the details.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Ryan Blue
...at least one implementation in the `master` branch this month (February). If you need more time (one month or longer), why don't we have Ryan's suggestion in the `master` branch first and benchmark with your PR later during the Apache Spark 3.2 timeframe.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Ryan Blue
...Ryan, this proposal looks very interesting. Would future goals for this functionality include both support for aggregation functions and support for processing ColumnarBatch-es (instead of Row/InternalRow)? Thanks, Andrew. On Mon, Feb 15, 2021 at 12:44 PM Ryan...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ryan Blue
...be sufficiently user-friendly and extensible. I generally think Wenchen's proposal is easier for a user to work with in the common case, but has greater potential for confusing and...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-10 Thread Ryan Blue
...for a long period of time. I especially appreciate how the design is focused on a minimal useful component, with future optimizations considered from the point of view of making sure it's flexible, but actual concrete decisions left fo...

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Ryan Blue
...stead, then the UDF wouldn't work. What then? Does Spark detect that the wrong type was used? It would need to, or else it would be difficult for a UDF developer to tell what is wrong. And this is a runtime issue, so it is caught late.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Ryan Blue
...UDF will report input data types and result data type, so the analyzer can check whether the call method is valid via reflection, and we still have query-compile-time type safety. It also simplifies development, as we can just use the Invoke expression to invoke UDFs. On Tue, Feb 9, 2021...

[DISCUSS] SPIP: FunctionCatalog

2021-02-08 Thread Ryan Blue
...https://github.com/apache/spark/pull/24559/files Let's discuss the proposal here rather than on that PR, to get better visibility. Also, please take the time to read the proposal first. That really helps clear up misconceptions.

Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-12-01 Thread Ryan Blue
...SPARK-30098 Use default datasource as provider for CREATE TABLE syntax. This was merged today, and now Spark's `CREATE TABLE` uses Spark's default data source instead of the `hive` provider. This is a good and big...
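
The behavior change is visible directly in DDL. A sketch with illustrative table names; the native default provider is whatever spark.sql.sources.default is set to (parquet out of the box):

    // After SPARK-30098, a plain CREATE TABLE uses Spark's default source.
    spark.sql("CREATE TABLE t1 (id BIGINT)")                // native, parquet by default
    spark.sql("CREATE TABLE t2 (id BIGINT) STORED AS ORC")  // still a Hive serde table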

Re: Seeking committers' help to review on SS PR

2020-11-30 Thread Ryan Blue
...from an SS contributor is enough to go ahead? https://github.com/apache/spark/pull/27649 https://github.com/apache/spark/pull/28363 These are under 100 lines of changes each, and not invasive.

Re: Seeking committers' help to review on SS PR

2020-11-23 Thread Ryan Blue
...including it in Spark 3.2 (another half a year) doesn't make sense to me. In addition, is there a way to unblock me to work on meaningful features instead of being stuck with small improvements? I have something in my backlog, but I'd rather not contin...

Re: Spark 3.1 branch cut 4th Dec?

2020-11-20 Thread Ryan Blue
...Thank you for volunteering! Since the previous branch cuts were always soft code freezes that allowed committers to merge to the new branches for a while, I believe 1st December will be better for stabilization. Bests, Dongjoon.

On Thu, Nov 19, 2020 at 3:50 PM Hyukjin Kwon wrote: Hi all, I think we haven't decided the exact branch cut, code freeze, and release manager yet. As we planned in https://spark.apache.org/versioning-policy.html ("Early Dec 2020: Code freeze. Release branch cut."), code freeze and branch cutting are coming. Therefore, we should finish any remaining work for Spark 3.1 and switch to QA mode soon. I think it's time to keep this on track, and I would like to volunteer to help drive the process. I am currently thinking of 4th Dec as the branch-cut date. Any thoughts? Thanks all.

Re: SPIP: Catalog API for view metadata

2020-11-10 Thread Ryan Blue
...On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan wrote: Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Ryan Blue
...disable the check with the new config. In the PR there is currently no objection, but a suggestion to hear more voices. Please let me know if you have some thoughts. Thanks, Liang-Chi Hsieh
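
The escape hatch that came out of this discussion is a SQL conf. A sketch, assuming Spark 3.1+ where the check is enabled by default:

    // Only disable the check if you understand why the plan was flagged;
    // with the default (true), queries with possible correctness issues fail.
    spark.conf.set("spark.sql.streaming.statefulOperator.checkCorrectness.enabled", "false")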

Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-23 Thread Ryan Blue
...error log message at least. Would like to hear the voices. Thanks, Jungtaek Lim (HeartSaVioR)
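
For context, the v2 session catalog under discussion is configured by class name and instantiated lazily, which is where the failure being debated occurs. A sketch with hypothetical implementation classes:

    val spark = org.apache.spark.sql.SparkSession.builder()
      // Replaces the built-in session catalog; class name is hypothetical.
      .config("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")
      // A separately named custom catalog, for comparison.
      .config("spark.sql.catalog.my_cat", "com.example.MyCatalog")
      .getOrCreate()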

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue wrote: "I don't think Spark ever claims to be 100% Hive compatible." By accepting the EXTERNAL keyword in some circumstances, Spark is providing compatibility with Hive DDL. Yes, there are places where it breaks. The question...

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
...atible. I don't think Spark ever claims to be 100% Hive compatible. In fact, we diverged from Hive intentionally in several places, where we think the Hive behavior was not reasonable and we shouldn't follow it. On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue wrote: ho...

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
...file-based. BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an external table. Hive gives a warning when you create managed tables with a custom location, which means this behavior is not recommended. Shall we "infer" EXTERNAL...

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Ryan Blue
...catalog (Hive), so replacing the default session catalog with a custom one and trying to use it as if it were an external catalog doesn't work, which defeats the purpose of replacing the default session catalog.

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
...catalogs on how to handle this makes sense. On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue wrote: I would summarize both the problem and the current state differently. Currently, Spark parses the EXTERNAL keyword for c...

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Ryan Blue
...g with a custom one. Am I missing something? Thanks, Jungtaek Lim (HeartSaVioR)

Re: Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Ryan Blue
...makes sense for file sources, as the table directory can be managed. I'm not sure how to interpret EXTERNAL in catalogs like JDBC, Cassandra, etc. For more details, please refer to the long discussion in https://github.com/apache/spark/pull/28026. Thanks, Wenchen
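
The case under debate looks like this in DDL. A sketch with illustrative names and paths:

    // EXTERNAL pairs naturally with an explicit LOCATION for file sources:
    // dropping the table leaves the underlying files in place.
    spark.sql("""
      CREATE EXTERNAL TABLE db.events (id BIGINT, ts TIMESTAMP)
      STORED AS PARQUET
      LOCATION '/warehouse/db/events'
    """)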

Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Ryan Blue
...while (valueIndex < this.currentCount) {
      // values are bit packed 8 at a time, so reading bitWidth will always work
      ByteBuffer buffer = in.slice(bitWidth);
      this.packer.unpack8Values(buffer, buffer.position(), this.currentBuffer, valueIndex);
      valueIndex += 8;
    }

Per my profile, the code spends 30% of readNextGroup() time in slice(); why can't we call slice() outside the loop?

Re: SPIP: Catalog API for view metadata

2020-08-19 Thread Ryan Blue
...desc = metadata, output = metadata.schema.toAttributes, child = parser.parsePlan(viewText)). So it is a validation (here) or cache (in DESCRIBE), nice to have but not "required" or "should be frozen". Thanks Ryan and Burak for pointing that out in the SPIP; I will add a new paragraph accordingly.

Re: SPIP: Catalog API for view metadata

2020-08-13 Thread Ryan Blue
...catalog API to load, create, alter, and drop views. Document: https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing JIRA: https://issues.apache.org/jira/browse/SPARK-31357 WIP PR: https://github.com/apache/spark/pull/28147 As part of a project to support common views across query engines like Spark and Presto, my team used the view catalog API in the Spark implementation. The project has been in production for over three months. Thanks, John Zhuge

Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread Ryan Blue

Re: Removing references to slave (and maybe in the future master)

2020-06-19 Thread Ryan Blue
...to spend at least 15-20 minutes explaining that a worker will not actually do work, and the master won't run their application. Thanks Holden for doing all the legwork on this!

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Ryan Blue
...compatible. I think we'd all like to have as smooth an upgrade experience to Spark 3 as possible, and I believe that having a Spark 2 release with some of the new functionality, while continuing to support the older APIs and current Scala version, would make the upgrade...

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Ryan Blue
...On Jun 9, 2020 at 6:15 PM, Dr. Kent Yao wrote: +1 (non-binding)

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-23 Thread Ryan Blue
...The current list of open tickets targeted at 3.0.0 can be found at https://issues.apache.org/jira/projects/SPARK by searching for "Target Version/s" = 3.0.0. Committers should look at...

Re: [DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Ryan Blue
...ordingly, but this seems to no longer be the case. Was this intentional? I feel like if we could have the default be based on the source, then upgrading code from DSV1 to DSV2 would be much easier for users. I'm currently testing this on RC2. Any thoughts? Thanks for your time as usual, Russ

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Ryan Blue
...voting.html) to commit without waiting for a review. On Wed, May 20, 2020 at 10:00 AM Ryan Blue wrote: Why was https://github.com/apache/spark/pull/28523 merged with a -1? We discussed this months ago and concluded that it was a bad idea to introduce a new v2 API that canno...

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Ryan Blue

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-14 Thread Ryan Blue
...I would really appreciate that. I'm probably going to just write a planner rule for now which matches up my table schema with the query output if they are valid, and fails analysis otherwise. This approach is how I got metadata columns in, so I be...

Re: [DatasourceV2] Allowing Partial Writes to DSV2 Tables

2020-05-13 Thread Ryan Blue
...insert, as well as those which are not required. Please let me know if I've misread this. Thanks for your time again, Russ

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Ryan Blue
...conflicts and fix tests. I don't see why it's still a problem if a feature is disabled and hidden from end users (it's undocumented, and the config is internal). The related code will be replaced in the master branch sooner or later, when we unify...

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Ryan Blue
...Spark 3.0 with SPARK-30098. Otherwise, we will have to deal with this problem for years to come. On Mon, May 11, 2020 at 1:06 AM JackyLee wrote: +1. Agree with Xiao Li and Jungtaek Lim. This seems to be controversial and cannot be done in a short time. It is necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Ryan Blue
...necessary to choose option 1 to unblock Spark 3.0 and support it in 3.1.

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Ryan Blue
...every time specifically for Java APIs. But yes, it gives you Java/Scala-friendly instances.

For 4., having one API that returns a Java instance makes you able to use it on both the Scala and Java API sides, although it makes you call asScala specifically on the Scala side. But you don't have to search for a variant of the API, and it gives you consistent API usage across languages. Also, note that calling Java from Scala is legitimate, but the opposite is not, to the best of my knowledge. In addition, you should have a method that returns a Java instance for PySpark or SparkR to support.

Proposal: I would like to have general guidance on this that the Spark dev community agrees upon: do approach 4. If not possible, do 3. Avoid 1 at almost all cost. Note that this isn't a hard requirement but general guidance; the decision might be up to the specific context. For example, when there are strong arguments to have a separate Java-specific API, that's fine. Of course, we won't change the existing methods, given Michael's rubric added before. I am talking about new methods in unreleased branches. Any concern or opinion on this?

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Ryan Blue
I think this functionality will be useful as DSv2 continues to evolve; please let me know your thoughts. Thanks, Andrew

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-01 Thread Ryan Blue

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Ryan Blue
...with PARTITIONED BY transformList. Another side note: perhaps as part of (or after) unifying the CREATE TABLE syntax, we can also update Catalog.createTable() to support creating partitioned tables (https://issues.apache.org/jira/browse/SPARK-31001).

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Ryan Blue
...unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList. Another side note: perhaps as part of (or after) unifying the CREATE TABLE syntax, we can also update Catalog.createTable() to support creating partitioned tables (https://issues.apache.org/jira/browse/SPARK-31001).

Re: [DISCUSS] Supporting hive on DataSourceV2

2020-03-23 Thread Ryan Blue

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Ryan Blue
...docs/sql-ref-syntax-ddl-create-table.md#description add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are actually p...

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Ryan Blue
...serious problem here. If a user just writes CREATE TABLE without USING, ROW FORMAT, or STORED AS, does it matter what table we create? Internally the parser rules conflict and we pick the native syntax depending on the rule order, but the user-facing behavior looks fine. CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then CREATE EXTERNAL TABLE creates a Hive table, like 2.4.

On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim wrote: Hi devs, I'd like to initiate discussion and hear your voices on resolving the ambiguous parser rules between the two "create table"s brought in by SPARK-30098 [1]. Previously, the "create table" parser rules were clearly distinguished via "USING provider", which was very intuitive and deterministic: a DDL query creates a "Hive" table unless "USING provider" is specified (please refer to the parser rule in branch-2.4 [2]). After SPARK-30098, the "create table" parser rules became ambiguous (please refer to the parser rule in branch-3.0 [3]): the only factors differentiating the two rules are "ROW FORMAT" and "STORED AS", which are all defined as optional. Now it relies on the "order" of parser rules, which end users have no way to reason about, and which is very unintuitive. Furthermore, the undocumented rule for EXTERNAL (added to the first rule to provide a better message) brought more confusion (I've described the broken existing query in SPARK-30436 [4]).

Personally, I'd like to see the two rules made mutually exclusive, instead of trying to document the difference and telling end users to be careful about their queries. I see two ways to make the rules mutually exclusive:

1. Add some identifier to the create-Hive-table rule, like `CREATE ... "HIVE" TABLE ...`. Pros: this is the simplest way to distinguish between the two rules. Cons: this would lead end users to change their queries if they intend to create Hive tables. (Given we will also provide a legacy option, I feel this is acceptable.)

2. Define "ROW FORMAT" or "STORED AS" as mandatory. Pros: less invasive for existing queries. Cons: less intuitive, because they have been optional and would now become mandatory to fall into the second rule.

Would like to hear everyone's voices; better ideas are welcome! Thanks, Jungtaek Lim (HeartSaVioR)

1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax: https://issues.apache.org/jira/browse/SPARK-30098
2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
4. https://issues.apache.org/jira/browse/SPARK-30436

Re: [Discuss] Metrics Support for DS V2

2020-01-20 Thread Ryan Blue
...On Fri, 17 Jan 2020 at 10:33 PM, Ryan Blue wrote: We've implemented these metrics in the RDD (for input metrics) and in the v2 DataWritingSparkTask. That approach gives you the same metrics in the stage views that you get with v1 sources, regardless of...

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Ryan Blue
...So it will be easy to collect the metrics if FilePartitionReaderFactory implements ReportMetrics. Please let me know your views, or even whether we want to have a new solution or design.

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-16 Thread Ryan Blue
...concat(years(col) + days(col))`); however, it looks impossible to extend with the current design. It just directly maps transformName to an implementation class and passes the arguments:

transform
    ...
    | transformName=identifier
        '(' argument+=transformArgument (',' argument+=transformArgument)* ')' #applyTransform
    ;

It looks like regular expressions are supported; however, they are not. If we should support them, the design had to consider that; if we should not, a different syntax might have to be used instead.

Limited compatibility management: the name can be arbitrary. For instance, if "transform" is supported on the Spark side, the name is preempted by Spark. If a datasource ever supported such a name, it becomes incompatible.

DSv2 sync notes - 11 December 2019

2019-12-19 Thread Ryan Blue
Hi everyone, here are my notes for the DSv2 sync last week. Sorry they're late! Feel free to add more details or corrections. Thanks! rb. *Attendees*: Ryan Blue, John Zhuge, Dongjoon Hyun, Joseph Torres, Kevin Yu, Russell Spitzer, Terry Kim, Wenchen Fan, Hyukjin Kwon, Jacky Lee. *Topics*: Relation...

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Ryan Blue
...What do you think? It would bring a backward-incompatible change, but given the interface is marked as Evolving and we're making backward-incompatible changes in Spark 3.0, I feel it may not matter. Would love to hear your thoughts. Thanks in advance, Jungtaek Lim (HeartSaVioR)

Re: Next DSv2 sync date

2019-12-09 Thread Ryan Blue
Actually, my conflict was cancelled so I'll send out the usual invite for Wednesday. Sorry for the noise. On Sun, Dec 8, 2019 at 3:15 PM Ryan Blue wrote: > Hi everyone, > > I have a conflict with the normal DSv2 sync time this Wednesday and I'd > like to attend to talk about the T

Next DSv2 sync date

2019-12-08 Thread Ryan Blue
Hi everyone, I have a conflict with the normal DSv2 sync time this Wednesday and I'd like to attend to talk about the TableProvider API. Would it work for everyone to have the sync at 6PM PST on Tuesday, 10 December instead? I could also make it at the normal time on Thursday. Thanks, -- Ryan

Re: [DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-05 Thread Ryan Blue
...minimal, since this applies only when there are temp views and tables with the same name. Any feedback will be appreciated. I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon Hyun for guidance and suggestions. Regards, Terry (https://issues.apache.org/jira/browse/SPARK-29900)

Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Ryan Blue
...https://lists.apache.org/mod_mbox/parquet-dev/201911.mbox/%3c8357699c-9295-4eb0-a39e-b3538d717...@gmail.com%3E). Might there be any desire to cut a Spark 2.4.5 release so that users can pick up these changes independently of all the other changes in Spark 3.0? Thank you in...

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread Ryan Blue
...ary metadata/framing data to be wrapped around individual objects cheaply. Right now, that's only possible at the stream level. (There are hacks around this, but this would enable more idiomatic use in efficient shuffle implementations.)

Have serializers indicate whether they are deterministic. This provides much of the value of a shuffle service, because it means that reducers do not need to spill to disk when reading/merging/combining inputs: the data can be grouped by the service, even without the service understanding data types or byte representations. Alternative (less preferable since it would break Java serialization, for example): require all serializers to be deterministic. -- Ben

Re: DSv2 reader lifecycle

2019-11-06 Thread Ryan Blue
...format, it's quite expensive to deserialize all the various metadata, so I was holding the deserialized version in the DataSourceReader, but if Spark is repeatedly constructing new ones, that doesn't help. If this is the expected behavior, how should I handle it as a consumer of the API? Thanks! Andrew

DSv2 sync notes - 30 October 2019

2019-11-01 Thread Ryan Blue
*Attendees*: Ryan Blue, Terry Kim, Wenchen Fan, Jose Torres, Jacky Lee, Gengliang Wang. *Topics*: DROP NAMESPACE cascade behavior; 3.0 tasks; TableProvider API changes; V1 and V2 table resolution rules; separate logical and physical write (for streaming); bucketing support

Cancel DSv2 sync this week

2019-10-15 Thread Ryan Blue
Hi everyone, I can't make it to the DSv2 sync tomorrow, so let's skip it. If anyone would prefer to have one and is willing to take notes, I can send out the invite. Just let me know, otherwise let's consider it cancelled. Thanks, rb -- Ryan Blue Software Engineer Netflix

DataSourceV2 sync notes - 2 October 2019

2019-10-10 Thread Ryan Blue
Here are my notes from last week's DSv2 sync. *Attendees*: Ryan Blue, Terry Kim, Wenchen Fan. *Topics*: SchemaPruning only supports Parquet and ORC?; out-of-order optimizer rules; 3.0 work; rename session catalog to spark_catalog; finish TableProvider update to avoid...

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Ryan Blue
...originally for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by default. Currently, the V1 data source uses the "Legacy" policy by default, while V2 uses "Strict". This proposal is to use the "ANSI" policy by default for both V1 and V2 in Spark 3.0. This vote is open until Friday (Oct. 11). [ ] +1: Accept the proposal [ ] +0 [ ] -1: I don't think this is a good idea because ... Thank you! Gengliang
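
The policy being voted on is exposed as a SQL conf. A sketch of the practical difference; the table name is illustrative:

    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

    // Under ANSI, an insert that needs an unreasonable cast (e.g. string to int)
    // fails at analysis time; under the Legacy policy it would be cast anyway,
    // possibly producing null at runtime.
    spark.sql("INSERT INTO t VALUES ('not a number')")  // fails under ANSI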

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
...https://issues.apache.org/jira/browse/HIVE-9152). Henry R's description was also correct. On Wed, Oct 2, 2019 at 9:18 AM, Ryan Blue wrote: Where can I find a design doc for dynamic partition pruning that expla...

Re: [DISCUSS] Out of order optimizer rules?

2019-10-02 Thread Ryan Blue
...to run it before join reorder. On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue wrote: Hi everyone, I have been working on a PR that moves filter and projection pushdown into the optimizer for DSv2, instead of when converting to physical plan. Th...

[DISCUSS] Out of order optimizer rules?

2019-09-28 Thread Ryan Blue
...to be addressed. rb

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
...as well as an official announcement), as it can set the expectation that there are a bunch of changes, given it's a new major version. It also provides plenty of time to try adopting it before the version is officially released. On Wed, Sep 25, 2019 at 4:56 AM Ryan Blue wrote: ...

Re: [DISCUSS] Spark 2.5 release

2019-09-24 Thread Ryan Blue
...le migration path to Spark 3, especially if much of the work is already going to happen anyway. Maybe giving it a different name (e.g. something like Spark-2-to-3-transitional) would make its intended purpose clearer and encourage folks to move to...

Re: [DISCUSS] Spark 2.5 release

2019-09-23 Thread Ryan Blue
...DSv2 data sources, can we recommend the 3.0-preview release for this? That would get people shifting to 3.0 faster, which is probably better overall compared to maintaining two major versions. There's not that much else changing in 3.0...

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
...DSv2 would be one of the main reasons people upgrade to 3.0. What's so special about DSv2 that we are doing this? Why not abandon 3.0 entirely and backport all the features to 2.x? On Sat, Sep 21, 2019 at 2:31 PM Ryan Blue wrote: Why would t...

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
...On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin wrote: How would you not make incompatible changes in 3.x? As discussed, the InternalRow API is not stable and needs to change. On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue wrote: Making downstream diverge thei...

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
...Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun wrote: Do you mean you want to have a breaking API change between 3.0 and 3.1? I believe we follow Semantic Versioning (https://spark.apache.org/versioning-policy.html). "We just won't add any breaking..."

Re: [DISCUSS] Spark 2.5 release

2019-09-21 Thread Ryan Blue
...versioning-policy.html). "We just won't add any breaking changes before 3.1." Bests, Dongjoon. On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue wrote: I don't think we need to gate a 3.0 release on making a more stable version of InternalRow...

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
...you might as well argue we should make the entire catalyst package public to be pragmatic and not allow any changes. On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue wrote: "When you created the PR to make InternalRow public..." This isn't qu...

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
...temporarily. You can't just make a bunch of internal APIs, tightly coupled with other internal pieces, public and stable and call it a day, just because it happens to satisfy some use cases temporarily, assuming the rest of Spark doesn't change. On Fri, Sep 20, 2019 at 11:...

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Ryan Blue
...think. I'm still not convinced there is a burning need to use Java 11 but stay on 2.4 after 3.0 is out, and at least the wheels are in motion there. Java 8 is still free and being updated. On Fri, Sep 20, 2019 at 12:48 PM Ryan Blue wrote: H...
