String instead,
then the UDF wouldn’t work. What then? Does Spark detect that the wrong
type was used? It would need to, or else it would be difficult for a UDF
developer to tell what is wrong. And this is a runtime issue, so it is
caught late.
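To make the failure mode concrete, here is a minimal sketch in Scala, assuming
the ScalarFunction interface roughly as proposed in the SPIP (it later landed
in org.apache.spark.sql.connector.catalog.functions). The "magic method"
invoke is located by reflection against Spark's internal types, so a String
parameter instead of UTF8String simply never matches, and the failure only
surfaces at runtime:

import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, DataTypes}
import org.apache.spark.unsafe.types.UTF8String

class StrLen extends ScalarFunction[Integer] {
  override def inputTypes(): Array[DataType] = Array(DataTypes.StringType)
  override def resultType(): DataType = DataTypes.IntegerType
  override def name(): String = "strlen"

  // Found by reflection because UTF8String is the internal string type.
  def invoke(s: UTF8String): Integer = s.numChars()
  // def invoke(s: String): Integer = s.length  // would never be matched
}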
--
Ryan Blue
>>>>> the community for a long period of time. I especially appreciate how the
>>>>> design is focused on a minimal useful component, with future optimizations
>>>>> considered from a point of view of making sure it's flexible, but actual
>>> >>> I find both of the proposed UDF APIs to be sufficiently user-friendly
>>> >>> and
>>> >>> extensible. I generally think Wenchen's proposal is easier for a
>>> user to
>>> >>> work with in the common case, but has greater
>
> This proposal looks very interesting. Would future goals for this
> functionality include both support for aggregation functions, as well
> as support for processing ColumnarBatch-es (instead of Row/InternalRow)?
>
> Thanks
> Andrew
>
> On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue wrote:
conclude this thread
>> and have at least one implementation in the `master` branch this month
>> (February).
>> If you need more time (one month or longer), why don't we have Ryan's
>> suggestion in the `master` branch first and benchmark with your PR later
>>
individual-parameters version or the row-parameter version?
>
> To move forward, how about we implement the function loading and binding
> first? Then we can have PRs for both the individual-parameters (I can take
> it) and row-parameter approaches, if we still can't reach a consensus at
> merge two Arrays (of generic types) into a Map.
>>
>> Also, to address Wenchen's InternalRow question, can we create a number
>> of Function classes, each corresponding to a specific number of input
>> parameters (e.g., ScalarFunction1, ScalarFunction2, etc.)?
>>
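For reference, a hypothetical sketch of what such arity-specific interfaces
might look like (these names were discussed here, not merged):

// One interface per arity gives compile-time type checking, but only for a
// single set of parameter types per arity, and it cannot cover varargs.
trait ScalarFunction1[A, R] { def call(a: A): R }
trait ScalarFunction2[A, B, R] { def call(a: A, b: B): R }
// ... and so on up to some fixed maximum arity.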
safety guarantees only if you need just one set of types for each number of
arguments and are using the non-codegen path. Since varargs is one of the
primary reasons to use this API, I don’t think that it is a good idea
to use Object[] instead of InternalRow.
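For the varargs case, a sketch of why the row parameter matters, assuming
produceResult(InternalRow) as proposed: the typed getters stay unboxed, where
an Object[] signature would box every argument.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, DataTypes}

// A variable-arity sum over long columns.
class SumAll extends ScalarFunction[java.lang.Long] {
  override def inputTypes(): Array[DataType] = Array(DataTypes.LongType)
  override def resultType(): DataType = DataTypes.LongType
  override def name(): String = "sum_all"

  override def produceResult(input: InternalRow): java.lang.Long = {
    var total = 0L
    var i = 0
    while (i < input.numFields) {
      total += input.getLong(i) // primitive access, no boxing
      i += 1
    }
    total
  }
}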
--
Ryan Blue
Software Engineer
Netflix
only be Object and cause boxing issues.
>
> I agree that Object[] is worse than InternalRow. But I can't think of
> real use cases that will force the individual-parameters approach to use
> Object instead of concrete types.
>
>
> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue wrote:
throw new UnsupportedOperationException();
> + }
>
> By providing the default implementation, it will not be *forcing users to
> implement it*, technically.
> And we can properly document our expected usage.
> What do you think?
>
> Bests,
> Dongjoon.
>
the "magical methods", then we can have a single
>>> ScalarFunction interface which has the row-parameter API (with a
>>> default implementation to fail) and documents to describe the "magical
>>> methods" (which can be done later).
>>>
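A sketch of that compromise shape, with names following the discussion rather
than any merged API:

import org.apache.spark.sql.catalyst.InternalRow

// A single interface: the row-parameter method has a default that fails, so
// an implementor may instead supply a documented "magic" method with concrete
// parameter types that Spark finds by reflection.
trait ScalarFunction[R] {
  def name: String
  def produceResult(input: InternalRow): R =
    throw new UnsupportedOperationException(
      s"$name: implement produceResult or define a magic method")
}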
Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll do
a final update of the PR and we can merge the API.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …
--
Ryan Blue
>> > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao <
>> >>>>
>> >>>> > huaxin.gao11@
>> >>>>
>> >>>> > > wrote:
>> >>>> >
>> >>>> >> +1 (non-binding)
> --
> ---
> Takeshi Yamamuro
>
--
Ryan Blue
Software Engineer
Netflix
This SPIP is adopted with the following +1 votes and no -1 or +0 votes:
Holden Karau*
John Zhuge
Chao Sun
Dongjoon Hyun*
Russell Spitzer
DB Tsai*
Wenchen Fan*
Kent Yao
Huaxin Gao
Liang-Chi Hsieh
Jungtaek Lim
Hyukjin Kwon*
Gengliang Wang
kordex
Takeshi Yamamuro
Ryan Blue
* = binding
On Mon, Mar
>
> Please vote on the SPIP in the next 72 hours. Once it is approved, I’ll
> update the PR for review.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
--
Ryan Blue
Software Engineer
Netflix
distribution properties
> reported by data sources and eliminate shuffle whenever possible.
> >
> > Design doc:
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao
>
--
Ryan Blue
>>>>
>>>>> +1 for this SPIP.
>>>>>
>>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao
>>>>> wrote:
>>>>>
>>>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>>>
<https://issues.apache.org/jira/browse/SPARK-19256> has
>>> details).
>>>
>>>
>>>
>>> 1. Would aggregate work automatically after the SPIP?
>>>
>>>
>>>
>>> Another major benefit of having bucketed tables is to avoid shuffle
>>>
hash function. Or we can
> clearly define the bucket hash function of the builtin `BucketTransform` in
> the doc.
>
> On Thu, Oct 28, 2021 at 12:25 AM Ryan Blue wrote:
>
>> Two v2 sources may return different bucket IDs for the same value, and
>> this breaks the phase 1 s
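A small illustration of why the bucket hash function must be pinned down;
MurmurHash3 here stands in for whatever a second source might use:

import scala.util.hashing.MurmurHash3

// Two sources that disagree on the bucket hash put the same key in different
// buckets, so a join that skips the shuffle based on bucket IDs would
// silently drop matches.
def bucketJvm(v: String, n: Int): Int    = Math.floorMod(v.hashCode, n)
def bucketMurmur(v: String, n: Int): Int = Math.floorMod(MurmurHash3.stringHash(v), n)

// bucketJvm("spark", 16) and bucketMurmur("spark", 16) generally differ.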
>> > > >
>> > > > [ ] +1: Accept the proposal as an official SPIP
>> > > > [ ] +0
>> > > > [ ] -1: I don’t think this is a good idea because …
>> > > >
--
Ryan Blue
Tabular
<https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
>>>>> >
>>>>> > - JIRA: SPARK-35801 <
>>>>> https://issues.apache.org/jira/browse/SPARK-35801>
>>>>> > - PR for handling DELETE statements:
>>>>> > <https://github.com/apache/spark/pull/33008>
>>>>> >
>>>>> > - Design doc
>>>>> > <
>>>>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>>>> >
>>>>> >
>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>> >
>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>> > [ ] +0
>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>> >
--
Ryan Blue
Tabular
>> Hi dev,
>>
>> We are discussing Support Dynamic Table Options for Spark SQL (
>> https://github.com/apache/spark/pull/34072). We are currently not sure if
>> the syntax makes sense, and would like to know if there is any other
>> feedback or opinion on this.
>>
>> I would appreciate any feedback on this.
>>
>> Thanks.
>>
>
--
Ryan Blue
Tabular
>
>
>
>
> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer
> wrote:
>
>> I think since we probably will end up using this same syntax on write,
>> this makes a lot of sense. Unless there
? we can extract options from runtime session
>> configurations e.g., SessionConfigSupport.
>>
>> On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Side note about time travel: There is a PR
>>> <https:
Previous discussion in dev mailing list: [DISCUSSION] SPIP: Support
>>> Volcano/Alternative Schedulers Proposal
>>> > - Design doc: [SPIP] Spark-36057 Support Customized Kubernetes
>>> Schedulers Proposal
>>> > - JIRA: SPARK-36057
>>> >
>>
to add a ViewCatalog interface that can be used to load,
> create, alter, and drop views in DataSourceV2.
>
> Please vote on the SPIP until Feb. 9th (Wednesday).
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
>
--
Ryan Blue
Tabular
schema metadata that are
> enforced in the implementation of a FileFormatDataWriter?
>
> Just throwing it out there and wondering what other people think. It's an
> area that interests me as it seems that over half my problems at the day
> job are because of dodgy data.
>
> Regards,
>
> Phillip
>
>
--
Ryan Blue
Tabular
hint system [
> https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html
> or sql("select 1").hint("foo").show()] aren't visible from the
> TableCatalog/Table/ScanBuilder.
>
> I guess I could set a config parameter but I'd rather do this on a
> per-query basis. Any tips?
>
> Thanks!
>
> -0xe1a
>
--
Ryan Blue
Tabular
Kubernetes operator, making it a part of the Apache Flink project (
>>> https://github.com/apache/flink-kubernetes-operator). This move has
>>> gained wide industry adoption and contributions from the community. In a
>>> mere year, the Flink operator has garnered more than 600 stars and has
>>> attracted contributions from over 80 contributors. This showcases the level
>>> of community interest and collaborative momentum that can be achieved in
>>> similar scenarios.
>>> More details can be found at SPIP doc : Spark Kubernetes Operator
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>>
>>> Thanks,
>>> --
>>> *Zhou JIANG*
>>>
>>>
>>>
--
Ryan Blue
Tabular
tate.newHadoopConfWithOptions(relation.options))
> )
>
> import scala.collection.JavaConverters._
>
> val rows = readFile(pFile).flatMap(_ match {
>   case r: InternalRow => Seq(r)
>   // This doesn't work. vector mode is doing something screwy
>   case b: ColumnarBatch => b.rowIterator().asScala
> }).toList
>
> println(rows)
> // List([0,1,5b,24,66647351])
> // ?? this is wrong I think
>
>
>
> Has anyone attempted something similar?
>
>
>
> Cheers Andrew
>
>
>
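One likely cause, offered as a guess: ColumnarBatch.rowIterator() reuses a
single mutable row across next() calls, so collecting rows without copying
them yields garbage. A sketch of the fix, reusing readFile and pFile from the
quoted snippet:

import scala.collection.JavaConverters._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Copy each row before materializing, because the batch iterator reuses one
// mutable InternalRow object.
val rows = readFile(pFile).flatMap {
  case r: InternalRow   => Seq(r.copy())
  case b: ColumnarBatch => b.rowIterator().asScala.map(_.copy())
}.toList

println(rows)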
--
Ryan Blue
Software Engineer
Netflix
> >
> >
> >
> > I’m trying to understand why this is the intended behavior – anyone
> have any knowledge of why this is the case?
> >
> >
> >
> > Thanks,
> >
> > Vinoo
>
>
>
--
Ryan Blue
Software Engineer
Netflix
> it happens to be in the SparkContext but is state only needed by one
> SparkSession and that there isn't any way to clean up now, that's a
> compelling reason to change the API. Is that the situation? The only
> downside is making the user sepa
Ryan Blue
John Zhuge
Russel Spitzer
Gengliang Wang
Yuanjian Li
Matt Cheah
Yifei Huang
Felix Cheung
Dilip Biswal
Wenchen Fan
--
Ryan Blue
Software Engineer
Netflix
> That is, when reading column-partitioned Parquet files the explicitly
>> specified schema is not adhered to; instead the partitioning columns are
>> appended to the end of the column list. This is a quite severe issue, as some
>> operations, such as union, fail if columns are in
wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>> I fail to see how an equi-join on the key columns is different
>>>>>>>>> than the cogroup you propose.
>>>>>>>>>
>>
I'd like to start the process shortly.
>
> Michael
>
--
Ryan Blue
Software Engineer
Netflix
ed
> by this behavior. Do you have a different proposal about how this should
> be handled?
>
> On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue wrote:
>
>> Is this a bug fix? It looks like a new feature to me.
>>
>> On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust
*Topics*:
- TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
- Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
- Streaming capabilities PR #24129:
https://github.com/apache/spark/pull/24129
*Attendees*:
Ryan Blue
John Zhuge
Matt Cheah
Yifei Huang
Bruce Robbins
Jamison
> Jean Georges Perrin
> j...@jgp.net
>
>
>
> On Apr 19, 2019, at 10:10, Ryan Blue wrote:
>
> Here are my notes from the last DSv2 sync. As always:
>
>- If you’d like to attend the sync, send me an email and I’ll add you
>to the invite. Everyone is welcome.
ng is Catalyst? I’ve been trying to piece together how
> Catalyst knows that it can remove a sort and shuffle given that both tables
> are bucketed and sorted the same way. Are there any classes in particular I
> should look at?
>
>
>
> Cheers Andrew
>
--
Ryan Blue
Software Engineer
Netflix
Here are my notes for the latest DSv2 community sync. As usual, if you have
comments or corrections, please reply. If you’d like to be invited to the
next sync, email me directly. Everyone is welcome to attend.
*Attendees*:
Ryan Blue
John Zhuge
Andrew Long
Bruce Robbins
Dilip Biswal
Gengliang
> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:89)
> at
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:41)
> at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:541)
> at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:763)
> at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:463)
> at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:209)]
>
--
Ryan Blue
Software Engineer
Netflix
Sorry these notes are so late, I didn’t get to the write up until now. As
usual, if anyone has corrections or comments, please reply.
*Attendees*:
John Zhuge
Ryan Blue
Andrew Long
Wenchen Fan
Gengliang Wang
Russell Spitzer
Yuanjian Li
Yifei Huang
Matt Cheah
Amardeep Singh Dhilon
Zhilmil Dhion
Here are my notes from last night’s sync. I had to leave early, so there
may be more discussion. Others can fill in the details for those topics.
*Attendees*:
John Zhuge
Ryan Blue
Yifei Huang
Matt Cheah
Yuanjian Li
Russell Spitzer
Kevin Yu
*Topics*:
- Atomic extensions for the TableCatalog
Here are the latest DSv2 sync notes. Please reply with updates or
corrections.
*Attendees*:
Ryan Blue
Michael Armbrust
Gengliang Wang
Matt Cheah
John Zhuge
*Topics*:
Wenchen’s reorganization proposal
Problems with TableProvider - property map isn’t sufficient
New PRs:
- ReplaceTable
Hi everyone,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I would like to call a vote for the SPIP for SPARK-25299
>>>>>> <https://issues.apache.org/jira/browse/SPARK-25299>, which proposes
>>>>>> to introduce a pluggable storage API for temporary shuffle data.
>>>>>>
>>>>>>
>>>>>>
>>>>>> You may find the SPIP document here
>>>>>> <https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit>
>>>>>> .
>>>>>>
>>>>>>
>>>>>>
>>>>>> The discussion thread for the SPIP was conducted here
>>>>>> <https://lists.apache.org/thread.html/2fe82b6b86daadb1d2edaef66a2d1c4dd2f45449656098ee38c50079@%3Cdev.spark.apache.org%3E>
>>>>>> .
>>>>>>
>>>>>>
>>>>>>
>>>>>> Please vote on whether or not this proposal is agreeable to you.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Matt Cheah
>>>>>>
>>>>>
--
Ryan Blue
Software Engineer
Netflix
DSv1 sources have
> to be removed (in case a DSv2 replacement is implemented). After some
> digging I've found DSv1 sources which are already removed, but in some cases
> v1 and v2 still exist in parallel.
>
> Can somebody please tell me what's the overall plan in this area?
>
> BR,
> G
>
>
--
Ryan Blue
Software Engineer
Netflix
>
> > Is there a timeline for spark 3.0 in terms of the first RC and final
> release?
> >
> >
> >
> > Cheers Andrew
>
--
Ryan Blue
Software Engineer
Netflix
Here are my notes from this week’s sync.
*Attendees*:
Ryan Blue
John Zhuge
Dale Richardson
Gabor Somogyi
Matt Cheah
Yifei Huang
Xin Ren
Jose Torres
Gengliang Wang
Kevin Yu
*Topics*:
- Metadata columns or function push-down for Kafka v2 source
- Open PRs
- REPLACE TABLE
s in Master,
> but can't find a JDBC implementation or related JIRA.
>
> The DataSourceV2 APIs look to me to be in good shape to attempt a JDBC
> connector for the READ/WRITE path.
>
> Thanks & Regards,
> Shiv
>
--
Ryan Blue
Software Engineer
Netflix
Sounds great! Ping me on the review, I think this will be really valuable.
On Fri, Jul 12, 2019 at 6:51 PM Xianyin Xin
wrote:
> If there’s nobody working on that, I’d like to contribute.
>
>
>
> Loop in @Gengliang Wang.
>
>
>
> Xianyin
>
>
>
> *F
Here are my notes from the last sync. If you’d like to be added to the
invite or have topics, please let me know.
*Attendees*:
Ryan Blue
Matt Cheah
Yifei Huang
Jose Torres
Burak Yavuz
Gengliang Wang
Michael Artz
Russel Spitzer
*Topics*:
- Existing PRs
- V2 session catalog: https
may hurt data source v2 performance a lot and we'd better
> fix it sooner rather than later.
>
>
> On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue
> wrote:
>
>> Here are my notes from the last sync. If you’d like to be added to the
>> invite or have topics, please let
>> following ANSI SQL is a better idea.
>> For more information, please read the Discuss: Follow ANSI SQL on table
>> insertion
>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>> Please let me know if you have any thoughts on this.
>>
>> Regards,
>> Gengliang
>>
>
--
Ryan Blue
Software Engineer
Netflix
warranted to do so.
>>
>>
>>
>> -Matt Cheah
>>
>>
>>
>> *From: *Reynold Xin
>> *Date: *Wednesday, July 31, 2019 at 9:58 AM
>> *To: *Matt Cheah
>> *Cc: *Russell Spitzer , Takeshi Yamamuro <
>> linguin@gmail.com>, Ge
>> Thanks in advance for your help.
>>
>> Regards,
>> Shiv
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>
--
Ryan Blue
Software Engineer
Netflix
id results. My intuition is yes, because
>> different users have different levels of tolerance for different kinds of
>> errors. I’d expect these sorts of configurations to be set up at an
>> infrastructure level, e.g. to maintain consistent standards throughout a
>> who
Here are my notes from the last DSv2 sync. Sorry it's a bit late!
*Attendees*:
Ryan Blue
John Zhuge
Raymond McCollum
Terry Kim
Gengliang Wang
Jose Torres
Wenchen Fan
Priyanka Gomatam
Matt Cheah
Russel Spitzer
Burak Yavuz
*Topics*:
- Check in on blockers
- Remove Sav
>- columns
>- owner
>- createTime
>- softwareVersion
>- options (map)
>
> ViewColumn interface:
>
>- name
>- type
>
>
> Thanks,
> John Zhuge
>
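A sketch of those interfaces based on the fields listed above; member names
here are assumptions, not the merged API:

import org.apache.spark.sql.types.DataType

trait ViewColumn {
  def name: String
  def dataType: DataType
}

trait View {
  def columns: Array[ViewColumn]
  def owner: String
  def createTime: Long
  def softwareVersion: String
  def options: java.util.Map[String, String]
}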
--
Ryan Blue
Software Engineer
Netflix
Sorry these notes were delayed. Here’s what we talked about in the last
DSv2 sync.
*Attendees*:
Ryan Blue
John Zhuge
Burak Yavuz
Gengliang Wang
Terry Kim
Wenchen Fan
Xin Ren
Srabasti Banerjee
Priyanka Gomatam
*Topics*:
- Follow up on renaming append to insert in v2 API
- Changes to
Here are my notes from the latest sync. Feel free to reply with
clarifications if I’ve missed anything.
*Attendees*:
Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau
*Topics*:
- DataFrameWriterV2 insert vs append (recap)
- ANSI and strict modes
Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>>
>> Currently, the V1 data source uses "Legacy" policy by default, while V2 uses
>> "Strict". This proposal is to use "ANSI" policy by default for both V1 and
>> V2 in Spark 3.0.
>>
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in the
>> dev mailing list.
>>
>> This vote is open until next Thurs (Sept. 12th).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Gengliang
>>
>>
>
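For reference, the policy under vote is selected through a SQL conf; a minimal
usage sketch, assuming a SparkSession named spark (the key
spark.sql.storeAssignmentPolicy exists in Spark 3.x):

// Select the store assignment policy at runtime.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI") // or LEGACY / STRICT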
--
Ryan Blue
Software Engineer
Netflix
[ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> I'll start with my +1
>
> Thanks,
> Tom
>
--
Ryan Blue
Software Engineer
Netflix
>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more
>>> operators
>>> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>> >>>>>>> SPARK-25383 Image data source supports sample pushdown
>>> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch
>>> failures by default
>>> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a
>>> major
>>> >>>>>>> efficiency problem
>>> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s
>>> backend
>>> >>>>>>> cause driver pods to hang
>>> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale
>>> configurable
>>> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>> >>>>>>> SPARK-24942 Improve cluster resource management with jobs
>>> containing
>>> >>>>>>> barrier stage
>>> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate in
>>> logical Aggregate
>>> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>> >>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL
>>> standard
>>> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>> >>>>>>> SPARK-26425 Add more constraint checks in file streaming source
>>> to
>>> >>>>>>> avoid checkpoint corruption
>>> >>>>>>> SPARK-25843 Redesign rangeBetween API
>>> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>> >>>>>>> produce named output from CleanupAliases
>>> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and
>>> window aggregate
>>> >>>>>>> SPARK-25531 new write APIs for data source v2
>>> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>> >>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>> Kubernetes
>>> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode +
>>> Mesos
>>> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>> >>>>>>> MesosFineGrainedSchedulerBackend
>>> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for
>>> barrier
>>> >>>>>>> execution mode
>>> >>>>>>> SPARK-25390 data source V2 API refactoring
>>> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based
>>> Partition Spec
>>> >>>>>>> SPARK-15691 Refactor and improve Hive support
>>> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> >>>>>>> SPARK-16217 Support SELECT INTO statement
>>> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> >>>>>>> SPARK-18245 Improving support for bucketed table
>>> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints
>>> Support in Spark
>>> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in
>>> nested
>>> >>>>>>> list of structures
>>> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's
>>> DataFrame to
>>> >>>>>>> respect session timezone
>>> >>>>>>> SPARK-22386 Data Source V2 improvements
>>> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode +
>>> YARN
>>> >>>>>>>
>>> >>>>>>>
>>> >>>> --
>>> >>>> Name : Jungtaek Lim
>>> >>>> Blog : http://medium.com/@heartsavior
>>> >>>> Twitter : http://twitter.com/heartsavior
>>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> John Zhuge
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Twitter: https://twitter.com/holdenkarau
>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >
>>> >
>>>
--
Ryan Blue
Software Engineer
Netflix
Here are my notes from this week’s DSv2 sync.
*Attendees*:
Ryan Blue
Holden Karau
Russell Spitzer
Terry Kim
Wenchen Fan
Shiv Prashant Sood
Joseph Torres
Gengliang Wang
Matt Cheah
Burak Yavuz
*Topics*:
- Driver-side Hadoop conf
- SHOW DATABASES/NAMESPACES behavior
- Review outstanding
compatibility, to keep the scope of the release small. The
purpose is to assist people moving to 3.0 and not distract from the 3.0
release.
Would a Spark 2.5 release help anyone else? Are there any concerns about
this plan?
rb
--
Ryan Blue
Software Engineer
Netflix
SPARK-20568)
>> >
>> > Here, I am proposing to cut the branch on October 15th. If the features
>> are targeting the 3.0 preview release, please prioritize the work and finish
>> it before the date. Note, Oct. 15th is not the code freeze of Spark 3.0.
>> That means, the community will still work on the features for the upcoming
>> Spark 3.0 release, even if they are not included in the preview release.
>> The goal of preview release is to collect more feedback from the community
>> regarding the new 3.0 features/behavior changes.
>> >
>> > Thanks!
>>
--
Ryan Blue
Software Engineer
Netflix
stable.
>
>
>
> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
>> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>
release, I
> think. I'm still not convinced there is a burning need to use Java 11
> but stay on 2.4, after 3.0 is out, and at least the wheels are in
> motion there. Java 8 is still free and being updated.
>
> On Fri, Sep 20, 2019 at 12:48 PM Ryan Blue
> wrote:
> >
temporarily. You can't just make a bunch of internal APIs
> tightly coupled with other internal pieces public and stable and call it a
> day, just because it happen to satisfy some use cases temporarily assuming
> the rest of Spark doesn't change.
>
>
>
> On Fri, Sep 20, 2
this way, you might as well argue we should make the
> entire catalyst package public to be pragmatic and not allow any changes.
>
>
>
>
> On Fri, Sep 20, 2019 at 11:32 AM, Ryan Blue wrote:
>
>> When you created the PR to make InternalRow public
>>
>> This i
licy.html ).
>
> > We just won’t add any breaking changes before 3.1.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue
> wrote:
>
>> I don’t think we need to gate a 3.0 release on making a more stable
>> version of InternalRow
migrate to Spark
> 3.0 if they are prepared to migrate to new DSv2.
>
> On Sat, Sep 21, 2019 at 12:46 PM Dongjoon Hyun
> wrote:
>
>> Do you mean you want to have a breaking API change between 3.0 and 3.1?
>> I believe we follow Semantic Versioning (
>> https://spark.apach
.
On Sat, Sep 21, 2019 at 2:28 PM Reynold Xin wrote:
> How would you not make incompatible changes in 3.x? As discussed the
> InternalRow API is not stable and needs to change.
>
> On Sat, Sep 21, 2019 at 2:27 PM Ryan Blue wrote:
>
>> > Making downstream to diverge thei
ine ... as
> suggested by others in the thread, DSv2 would be one of the main reasons
> people upgrade to 3.0. What's so special about DSv2 that we are doing this?
> Why not abandoning 3.0 entirely and backport all the features to 2.x?
>
>
>
> On Sat, Sep 21, 2019 at
to try the DSv2 API and build DSv2 data
>>>> sources, can we recommend the 3.0-preview release for this? That would get
>>>> people shifting to 3.0 faster, which is probably better overall compared to
>>>> maintaining two major versions. There’s not that much
>> I would personally love to see us provide a gentle migration path to
>> Spark 3 especially if much of the work is already going to happen anyways.
>>
>> Maybe giving it a different name (eg something like
>> Spark-2-to-3-transitional) would make it more clear about i
preview has
> advantage here (assuming we provide maven artifacts as well as official
> announcement), as it can give us the expectation that there are a bunch of
> changes given it's a new major version. It also provides a bunch of time to
> try adopting it before the version is officially
addressed.
rb
--
Ryan Blue
Software Engineer
Netflix
quite sure. Seems to me it's better
> to run it before join reorder.
>
> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue
> wrote:
>
>> Hi everyone,
>>
>> I have been working on a PR that moves filter and projection pushdown
>> into the optimizer for DSv2, instead of
>> (https://issues.apache.org/jira/browse/HIVE-9152). Henry R's description
>> was also correct.
>>
>>
>>
>>
>>
>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue
>> wrote:
>>
>>> Where can I find a design doc for dynamic partition pruning tha
rules are originally for Dataset
>>>> encoder. As far as I know, no mainstream DBMS is using this policy by
>>>> default.
>>>>
>>>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>>>> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
>>>> and V2 in Spark 3.0.
>>>>
>>>> This vote is open until Friday (Oct. 11).
>>>>
>>>> [ ] +1: Accept the proposal
>>>> [ ] +0
>>>> [ ] -1: I don't think this is a good idea because ...
>>>>
>>>> Thank you!
>>>>
>>>> Gengliang
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>> --
--
Ryan Blue
Software Engineer
Netflix
Here are my notes from last week's DSv2 sync.
*Attendees*:
Ryan Blue
Terry Kim
Wenchen Fan
*Topics*:
- SchemaPruning only supports Parquet and ORC?
- Out of order optimizer rules
- 3.0 work
- Rename session catalog to spark_catalog
- Finish TableProvider update to
Hi everyone,
I can't make it to the DSv2 sync tomorrow, so let's skip it. If anyone
would prefer to have one and is willing to take notes, I can send out the
invite. Just let me know, otherwise let's consider it cancelled.
Thanks,
rb
--
Ryan Blue
Software Engineer
Netflix
*Attendees*:
Ryan Blue
Terry Kim
Wenchen Fan
Jose Torres
Jacky Lee
Gengliang Wang
*Topics*:
- DROP NAMESPACE cascade behavior
- 3.0 tasks
- TableProvider API changes
- V1 and V2 table resolution rules
- Separate logical and physical write (for streaming)
- Bucketing support
at, it's quite
> expensive to deserialize all the various metadata, so I was holding the
> deserialized version in the DataSourceReader, but if Spark is repeatedly
> constructing new ones, then that doesn't help. If this is the expected
> behavior, how should I handle this as a consumer of the API?
>
> Thanks!
> Andrew
>
--
Ryan Blue
Software Engineer
Netflix
allows
>> arbitrary metadata/framing data to be wrapped around individual objects
>> cheaply. Right now, that’s only possible at the stream level. (There are
>> hacks around this, but this would enable more idiomatic use in efficient
>> shuffle implementations.)
>>
>>
>> Have serializers indicate whether they are deterministic. This provides
>> much of the value of a shuffle service because it means that reducers do
>> not need to spill to disk when reading/merging/combining inputs--the data
>> can be grouped by the service, even without the service understanding data
>> types or byte representations. Alternative (less preferable since it would
>> break Java serialization, for example): require all serializers to be
>> deterministic.
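A sketch of what the determinism flag might look like as a mix-in; the name is
an assumption, not a Spark API:

// A serializer advertising that equal objects always produce identical bytes,
// so a shuffle service can group records by serialized form without
// understanding the data types.
trait DeterministicSerializer {
  def serializationIsDeterministic: Boolean
}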
>>
>>
>>
>> --
>>
>> - Ben
>>
>
--
Ryan Blue
Software Engineer
Netflix
<https://mail-archives.apache.org/mod_mbox/parquet-dev/201911.mbox/%3c8357699c-9295-4eb0-a39e-b3538d717...@gmail.com%3E>
> ).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
would be
>> minimal since this applies only when there are temp views and tables with
>> the same name.
>>
>> Any feedback will be appreciated.
>>
>> I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon
>> Hyun for guidance and suggestion.
>>
>> Regards,
>> Terry
>>
>>
>> <https://issues.apache.org/jira/browse/SPARK-29900>
>>
>
--
Ryan Blue
Software Engineer
Netflix
Hi everyone,
I have a conflict with the normal DSv2 sync time this Wednesday and I'd
like to attend to talk about the TableProvider API.
Would it work for everyone to have the sync at 6PM PST on Tuesday, 10
December instead? I could also make it at the normal time on Thursday.
Thanks,
--
Actually, my conflict was cancelled so I'll send out the usual invite for
Wednesday. Sorry for the noise.
On Sun, Dec 8, 2019 at 3:15 PM Ryan Blue wrote:
> Hi everyone,
>
> I have a conflict with the normal DSv2 sync time this Wednesday and I'd
> like to attend to talk a
ng something.
>>>
>>> What do you think? It would bring backward incompatible change, but
>>> given the interface is marked as Evolving and we're making backward
>>> incompatible changes in Spark 3.0, so I feel it may not matter.
>>>
>>> Would love to hear your thoughts.
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
Hi everyone, here are my notes for the DSv2 sync last week. Sorry they’re
late! Feel free to add more details or corrections. Thanks!
rb
*Attendees*:
Ryan Blue
John Zhuge
Dongjoon Hyun
Joseph Torres
Kevin Yu
Russel Spitzer
Terry Kim
Wenchen Fan
Hyukjin Kwon
Jacky Lee
*Topics*:
- Relation
s are going to be
>>> allowed together (e.g., `concat(years(col) + days(col))`);
>>> however, it looks impossible to extend with the current design. It just
>>> directly maps transformName to an implementation class,
>>> and just passes arguments:
>>>
>>> transform
>>>     ...
>>>     | transformName=identifier
>>>       '(' argument+=transformArgument (',' argument+=transformArgument)* ')'
>>>       #applyTransform
>>>     ;
>>>
>>> It looks like regular expressions are supported; however, they're not.
>>> - If we should support them, the design has to consider that.
>>> - If we should not support them, a different syntax might have to be used
>>> instead.
>>>
>>> *Limited Compatibility Management*
>>> The name can be arbitrary. For instance, if "transform" is supported on the
>>> Spark side, the name is preempted by Spark.
>>> If ever a datasource supported such a name, it becomes incompatible.
>>>
>>>
>>>
>>>
--
Ryan Blue
Software Engineer
Netflix
support metrics.
>>
>> So it will be easy to collect the metrics if FilePartitionReaderFactory
>> implements ReportMetrics
>>
>>
>> Please let me know your views, or even whether we want to have a new solution
>> or design.
>>
>
--
Ryan Blue
Software Engineer
Netflix
st.
>
> On Fri, 17 Jan 2020 at 10:33 PM, Ryan Blue wrote:
>
>> We've implemented these metrics in the RDD (for input metrics) and in the
>> v2 DataWritingSparkTask. That approach gives you the same metrics in the
>> stage views that you get with v1 sources, regardl
;>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads
>>>>>>> to
>>>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a
>>>>>>> serious
>>>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>>>> parser rules conflict and we pick the native syntax depending on the
>>>>>>> rule
>>>>>>> order. But the user-facing behavior looks fine.
>>>>>>>
>>>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in
>>>>>>> 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE
>>>>>>> syntax?
>>>>>>> Then CREATE EXTERNAL TABLE creates a Hive table, as in 2.4.
>>>>>>>
>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi devs,
>>>>>>>>
>>>>>>>> I'd like to initiate discussion and hear the voices for resolving
>>>>>>>> ambiguous parser rule between two "create table"s being brought by
>>>>>>>> SPARK-30098 [1].
>>>>>>>>
>>>>>>>> Previously, "create table" parser rules were clearly distinguished
>>>>>>>> via "USING provider", which was very intuitive and deterministic. Say,
>>>>>>>> DDL
>>>>>>>> query creates "Hive" table unless "USING provider" is specified,
>>>>>>>> (Please refer the parser rule in branch-2.4 [2])
>>>>>>>>
>>>>>>>> After SPARK-30098, "create table" parser rules became ambiguous
>>>>>>>> (please refer to the parser rule in branch-3.0 [3]): the only factors
>>>>>>>> differentiating the two rules are "ROW FORMAT" and "STORED AS", which
>>>>>>>> are both defined as optional. Now it relies on the "order" of parser
>>>>>>>> rules, which end users would have no way to reason about, and is very
>>>>>>>> unintuitive.
>>>>>>>>
>>>>>>>> Furthermore, the undocumented rule for EXTERNAL (added in the first rule
>>>>>>>> to provide a better message) brought more confusion (I've described the
>>>>>>>> existing query it breaks in SPARK-30436 [4]).
>>>>>>>>
>>>>>>>> Personally I'd like to see the two rules made mutually exclusive, instead
>>>>>>>> of trying to document the difference and telling end users to be careful
>>>>>>>> with their queries. I'm seeing two ways to make the rules mutually
>>>>>>>> exclusive:
>>>>>>>>
>>>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ...
>>>>>>>> "HIVE" TABLE ...`.
>>>>>>>>
>>>>>>>> pros. This is the simplest way to distinguish between two rules.
>>>>>>>> cons. This would lead end users to change their query if they
>>>>>>>> intend to create a Hive table. (Given we will also provide a legacy
>>>>>>>> option, I feel this is acceptable.)
>>>>>>>>
>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>
>>>>>>>> pros. Less invasive for existing queries.
>>>>>>>> cons. Less intuitive, because they have been optional and now
>>>>>>>> become mandatory to fall into the second rule.
>>>>>>>>
>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> syntax
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>> 2.
>>>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 3.
>>>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>>>
>>>>>>>>
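To make the ambiguity concrete, an illustration of the three cases under the
branch-3.0 behavior described above (assuming a SparkSession named spark):

spark.sql("CREATE TABLE a (id INT) USING parquet")     // native rule, unambiguous
spark.sql("CREATE TABLE b (id INT) STORED AS parquet") // Hive rule, via optional STORED AS
spark.sql("CREATE TABLE c (id INT)")                   // matches both rules; rule order picks native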
--
Ryan Blue
Software Engineer
Netflix
rk/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description>
>>> add to the confusion by describing the Hive-compatible command as "CREATE
>>> TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" are
--
Ryan Blue
Software Engineer
Netflix
e
>>>> unified syntax. Just make sure it doesn't appear together with PARTITIONED
>>>> BY transformList.
>>>>
>>>
>>> Another side note: Perhaps as part of (or after) unifying the CREATE
>>> TABLE syntax, we can also update Catalog.createTable() to support
>>> creating partitioned tables
>>> <https://issues.apache.org/jira/browse/SPARK-31001>.
>>>
>>
--
Ryan Blue
Software Engineer
Netflix