+1
> > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun
> wrote:
> >
> > FYI, there is a proposal to drop Python 3.8 because its EOL is October
> 2024.
> >
> >
> > https://github.com/apache/spark/pull/46228
> > [SPARK-47993][PYTHON] Drop Python 3.8
> >
> >
> >
> > Since Python 3.8 is still alive and its lifecycle will overlap with
> that of Apache Spark 4.0.0, please give us your feedback on the PR if you
> have any concerns.
> >
> >
> >
> > From my side, I agree with this decision.
> >
> >
> >
> > Thanks,
> >
> > Dongjoon.
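The version floor under discussion can be expressed as a small interpreter guard. A minimal sketch, assuming a hypothetical 3.9 minimum once 3.8 support is removed (the actual floor is whatever the PR above settles on):

```python
import sys

# Hypothetical minimum once Python 3.8 support is dropped; the real
# floor is decided in the linked PR, not here.
MIN_PYTHON = (3, 9)

def meets_minimum(version_info=sys.version_info):
    """Return True if the interpreter satisfies the assumed floor."""
    return tuple(version_info[:2]) >= MIN_PYTHON
```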
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
John Zhuge
apache/spark/pull/46013
>>
>> The vote is open until April 17th, 1 AM (PST), and passes
>> if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Use ANSI SQL mode by default
>> [ ] -1 Do not use ANSI SQL mode by default because ...
>>
>> Thank you in advance.
>>
>> Dongjoon
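For context, the behavior being voted on is governed by the `spark.sql.ansi.enabled` configuration. A minimal sketch of how a user would pin the setting either way after the default flips; the helper below is illustrative and uses a plain dict so it runs without a Spark install:

```python
# spark.sql.ansi.enabled is the conf the vote concerns; session_conf is
# just an illustrative helper for building session settings as strings.
def session_conf(ansi_enabled=True):
    return {"spark.sql.ansi.enabled": str(ansi_enabled).lower()}

# Users who need the legacy (non-ANSI) behavior would pass:
legacy = session_conf(ansi_enabled=False)
```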
>>
>
>>>>>>> References:
>>>>>>>
>>>>>>>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-47240>
>>>>>>>- SPIP doc
>>>>>>>
>>>>>>> <https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing>
>>>>>>>- Discussion thread
>>>>>>><https://lists.apache.org/thread/gocslhbfv1r84kbcq3xt04nx827ljpxq>
>>>>>>>
>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>
>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>> [ ] +0
>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Gengliang Wang
>>>>>>>
>>>>>>
>>>
eleases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> P.S. Yikun is helping us release the official Docker image for
>> Spark 3.5.1 (thanks, Yikun!). It may take some time to become generally available.
>>
>>
https://github.com/apache/arrow-datafusion-comet for more details if
>> you are interested. We'd love to collaborate with people from the open
>> source community who share similar goals.
>>
>> Thanks,
>> Chao
>>
+1
John Zhuge
On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale
wrote:
> +1
>
> On Sun, Feb 4, 2024, 8:18 PM Xiao Li
> wrote:
>
>> +1
>>
>> On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote:
>>
>>> +1
>>>
>>>
>>>
>
Congratulations!
On Fri, Oct 6, 2023 at 6:41 PM Yi Wu wrote:
> Congrats!
>
> On Sat, Oct 7, 2023 at 9:24 AM XiDuo You wrote:
>
>> Congratulations!
>>
>> Prashant Sharma wrote on Fri, Oct 6, 2023, at 00:26:
>> >
>> > Congratulations
>> >
>> > On Wed, 4 Oct, 2023, 8:52 pm huaxin gao,
>> wrote:
>> >>
>> >>
this file:
> > >>> > >>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >>> > >>> >>
> > >>> > >>> >> The staging repository for this release can be found at:
> > >>> > >>> >>
> > >>> > >>>
> > >>>
> https://repository.apache.org/content/repositories/orgapachespark-1439
> > >>> > >>> >>
> > >>> > >>> >> The documentation corresponding to this release can be
> found at:
> > >>> > >>> >>
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
> > >>> > >>> >>
> > >>> > >>> >> The list of bug fixes going into 3.4.0 can be found at the
> > >>> following
> > >>> > >>> URL:
> > >>> > >>> >>
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
> > >>> > >>> >>
> > >>> > >>> >> This release is using the release script of the tag
> v3.4.0-rc5.
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> FAQ
> > >>> > >>> >>
> > >>> > >>> >> =
> > >>> > >>> >> How can I help test this release?
> > >>> > >>> >> =
> > >>> > >>> >> If you are a Spark user, you can help us test this release
> by
> > >>> taking
> > >>> > >>> >> an existing Spark workload and running on this release
> > >>> candidate, then
> > >>> > >>> >> reporting any regressions.
> > >>> > >>> >>
> > >>> > >>> >> If you're working in PySpark you can set up a virtual env
> and
> > >>> install
> > >>> > >>> >> the current RC and see if anything important breaks, in the
> > >>> Java/Scala
> > >>> > >>> >> you can add the staging repository to your projects
> resolvers
> > >>> and test
> > >>> > >>> >> with the RC (make sure to clean up the artifact cache
> > >>> before/after so
> > >>> > >>> >> you don't end up building with an out of date RC going
> forward).
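The PySpark smoke-test flow described above can be sketched as a short command sequence; `RC_URL` below is a placeholder for the actual staging location of the release candidate artifacts:

```python
# Sketch of the RC smoke-test flow; RC_URL and the version string are
# placeholders for the actual staging artifacts of the RC being voted on.
RC_URL = "https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-bin"  # placeholder

def venv_test_commands(version="3.4.0"):
    """Build the shell commands that install the RC into a fresh
    virtual env and run a trivial import check."""
    return [
        "python -m venv rc-test",
        "source rc-test/bin/activate",
        f"pip install {RC_URL}/pyspark-{version}.tar.gz",
        'python -c "import pyspark; print(pyspark.__version__)"',
    ]
```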
> > >>> > >>> >>
> > >>> > >>> >> ===
> > >>> > >>> >> What should happen to JIRA tickets still targeting 3.4.0?
> > >>> > >>> >> ===
> > >>> > >>> >> The current list of open tickets targeted at 3.4.0 can be
> found
> > >>> at:
> > >>> > >>> >> https://issues.apache.org/jira/projects/SPARK and search
> for
> > >>> "Target
> > >>> > >>> Version/s" = 3.4.0
> > >>> > >>> >>
> > >>> > >>> >> Committers should look at those and triage. Extremely
> important
> > >>> bug
> > >>> > >>> >> fixes, documentation, and API tweaks that impact
> compatibility
> > >>> should
> > >>> > >>> >> be worked on immediately. Everything else please retarget
> to an
> > >>> > >>> >> appropriate release.
> > >>> > >>> >>
> > >>> > >>> >> ==
> > >>> > >>> >> But my bug isn't fixed?
> > >>> > >>> >> ==
> > >>> > >>> >> In order to make timely releases, we will typically not
> hold the
> > >>> > >>> >> release unless the bug in question is a regression from the
> > >>> previous
> > >>> > >>> >> release. That being said, if there is something which is a
> > >>> regression
> > >>> > >>> >> that has not been correctly targeted please ping me or a
> > >>> committer to
> > >>> > >>> >> help target the issue.
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >>
> > >>> > >>> >> Thanks,
> > >>> > >>> >>
> > >>> > >>> >> Xinrong Meng
i, Mar 24, 2023 at 1:46 PM John Zhuge wrote:
>
>> Have you checked out SparkCatalog
>> <https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java>
>> in
>> Apache Iceberg project? More docs at
> org.apache.spark.* packages in addition to their own; presumably this isn't
> by accident. Is this practice necessary to get around package-private
> visibility or something?
>
> Thanks!
>
> -0xe1a
>
--
John Zhuge
>>
>>>>>> On Wed, Mar 22, 2023 at 6:50 PM Herman van Hovell
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> For Spark Connect Scala Client we are working on making the REPL
>>>>>>> experience a bit nicer <https://github.com/apache/spark/pull/40515>.
>>>>>>> In a nutshell, we want to give users a turnkey Scala REPL that works
>>>>>>> even
>>>>>>> if you don't have a Spark distribution on your machine (through
>>>>>>> coursier <https://get-coursier.io/>). We are using Ammonite
>>>>>>> <https://ammonite.io/> instead of the standard scala REPL for this,
>>>>>>> the main reason for going with Ammonite is that it is easier to
>>>>>>> customize,
>>>>>>> and IMO has a superior user experience.
>>>>>>>
>>>>>>> Does anyone object to doing this?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Herman
>>>>>>>
>>>>>>>
--
John Zhuge
>> If you are a Spark user, you can help us test this release by taking
>> >> an existing Spark workload and running on this release candidate, then
>> >> reporting any regressions.
>> >>
>> >> If you're working in PySpark you can set up a virtual env and install
>> >> the current RC and see if anything important breaks, in the Java/Scala
>> >> you can add the staging repository to your projects resolvers and test
>> >> with the RC (make sure to clean up the artifact cache before/after so
>> >> you don't end up building with an out-of-date RC going forward).
>> >>
>> >> ===
>> >> What should happen to JIRA tickets still targeting 3.3.2?
>> >> ===
>> >>
>> >> The current list of open tickets targeted at 3.3.2 can be found at:
>> >> https://issues.apache.org/jira/projects/SPARK
>> and search for "Target
>> >> Version/s" = 3.3.2
>> >>
>> >> Committers should look at those and triage. Extremely important bug
>> >> fixes, documentation, and API tweaks that impact compatibility should
>> >> be worked on immediately. Everything else please retarget to an
>> >> appropriate release.
>> >>
>> >> ==
>> >> But my bug isn't fixed?
>> >> ==
>> >>
>> >> In order to make timely releases, we will typically not hold the
>> >> release unless the bug in question is a regression from the previous
>> >> release. That being said, if there is something which is a regression
>> >> that has not been correctly targeted please ping me or a committer to
>> >> help target the issue.
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Bjørn Jørgensen
>> > Vestre Aspehaug 4, 6010 Ålesund
>> > Norge
>> >
>> > +47 480 94 297
>> > Twitter: https://twitter.com/holdenkarau
> >> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> >> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
ance release,
>> > i.e. Spark 3.3.2.
>> >
>> > I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
>> >
>> > What do you think?
>> >
>> > I am willing to volunteer for Spark 3.3.2 if there is consensus about
>> > this maintenance release.
>> >
>> > Thank you.
-- Original --
>>>>>>>>> *From:* "Martin Grigorov" ;
>>>>>>>>> *Date:* Sun, Oct 9, 2022 05:01 AM
>>>>>>>>> *To:* "Hyukjin Kwon";
>>>>>>>>> *Cc:* "dev";"Yikun Jiang"<
>>>>>>>>> yikunk...@gmail.com>;
>>>>>>>>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>>>>>>>>
>>>>>>>>> Congratulations, Yikun!
>>>>>>>>>
>>>>>>>>> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> The Spark PMC recently added Yikun Jiang as a committer on the
>>>>>>>>>> project.
>>>>>>>>>> Yikun is the major contributor to the infrastructure and GitHub
>>>>>>>>>> Actions in Apache Spark, as well as Kubernetes and PySpark.
>>>>>>>>>> He has put a lot of effort into stabilizing and optimizing the
>>>>>>>>>> builds so we all can work together in Apache Spark more
>>>>>>>>>> efficiently and effectively. He's also driving the SPIP for the
>>>>>>>>>> Docker official image in Apache Spark for users and
>>>>>>>>>> developers.
>>>>>>>>>> Please join me in welcoming Yikun!
e of overlapping partition and data columns
> >>
> >> SPARK-39061: Set nullable correctly for Inline output attributes
> >>
> >> SPARK-39887: RemoveRedundantAliases should keep aliases that make the
> output of projection nodes unique
> >>
> >> SPARK-38614: Don't push down limit through window that's using
> percent_rank
esults or NPE when using Inline function
>>> against an array of dynamically created structs
>>> SPARK-39107 Silent change in regexp_replace's handling of empty
>>> strings
>>> SPARK-39259 Timestamps returned by now() and equivalent functions
>>> are not consistent in subqueries
>>> SPARK-39293 The accumulator of ArrayAggregate should copy the
>>> intermediate result if string, struct, array, or map
>>>
>>> Best,
>>> Dongjoon.
Holden has graciously agreed to shepherd the SPIP. Thanks!
On Thu, Feb 10, 2022 at 9:19 AM John Zhuge wrote:
> The vote is now closed and the vote passes. Thank you to everyone who took
> the time to review and vote on this SPIP. I’m looking forward to adding
> this feature to the n
iate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something that is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> > Note: I added an extra day to the vote since I know some folks are
>> likely busy on the 14th with partner(s).
>> >
>> >
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun wrot
Hi Spark community,
I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.
Please vote on the SPIP until Feb. 9th (Wednesday).
[ ] +1: Accept the proposal
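To make the shape of the proposal concrete, here is a rough Python sketch of the four operations a ViewCatalog would expose. The real interface is a Java API in DataSourceV2; the method names and signatures below are illustrative assumptions, not the actual API:

```python
from abc import ABC, abstractmethod

class ViewCatalog(ABC):
    """Illustrative sketch of the load/create/alter/drop operations
    the SPIP describes; names here are hypothetical."""

    @abstractmethod
    def load_view(self, ident):
        """Return the view metadata for ident, or raise if absent."""

    @abstractmethod
    def create_view(self, ident, sql, properties=None):
        """Register a view defined by a SQL text."""

    @abstractmethod
    def alter_view(self, ident, **changes):
        """Apply property changes to an existing view."""

    @abstractmethod
    def drop_view(self, ident):
        """Remove the view; return True if it existed."""

class InMemoryViewCatalog(ViewCatalog):
    """Toy implementation used only to show the call pattern."""
    def __init__(self):
        self._views = {}
    def load_view(self, ident):
        return self._views[ident]
    def create_view(self, ident, sql, properties=None):
        self._views[ident] = {"sql": sql, "properties": properties or {}}
    def alter_view(self, ident, **changes):
        self._views[ident].update(changes)
    def drop_view(self, ident):
        return self._views.pop(ident, None) is not None
```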
sign and believe this will provide a robust and flexible solution to
>>> this problem faced by various large-scale Spark users.
>>>
>>> Thanks John!
>>>
>>> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com>
> Thanks,
> Walaa.
>
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge wrote:
>
>> Looks like we are running in circles. Should we have an online meeting to
>> get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:0
should be worked on immediately. Everything else please
> retarget to an appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will
> typically not hold the release unless the bug in question is a regression
> from the previous release. That being said, if there is something which is
> a regression that has not been correctly targeted please ping me or a
> committer to help target the issue.
rs Proposal
>>>>>> <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>
>>>>>> - JIRA: SPARK-36057
>>>>>> <https://issues.apache.org/jira/browse/SPARK-36057>
>>>>>>
>>>>>> Please vote on the SPIP:
>>>>>>
>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>> [ ] +0
>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>
>>>>>> Regards,
>>>>>> Yikun
>>>>>>
>>>>>>
>>>>>>
AM Sean Owen wrote:
>>>>>>>>>
>>>>>>>>>> Always fine by me if someone wants to roll a release.
>>>>>>>>>>
>>>>>>>>>> It's been ~6 months since the last 3.0.x and 3.1.x releases, too;
>>>>>>>>>> a new release of those wouldn't hurt either, if any of our release
>>>>>>>>>> managers
>>>>>>>>>> have the time or inclination. 3.0.x is reaching unofficial
>>>>>>>>>> end-of-life
>>>>>>>>>> around now anyway.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> It's been two months since Spark 3.2.0 release, and we have
>>>>>>>>>>> resolved many bug fixes and regressions. What do you guys think
>>>>>>>>>>> about
>>>>>>>>>>> rolling Spark 3.2.1 release?
>>>>>>>>>>>
>>>>>>>>>>> cc @huaxin gao FYI who I happened to
>>>>>>>>>>> overhear that is interested in rolling the maintenance release :-).
ces https://github.com/apache/spark/pull/34599
>>> Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
>>>
>>> Regards,
>>> Yikun
>>>
>>> this effort
>>>>>>> > is to come up with a flexible and easy-to-use API that will work
>>>>>>> across
>>>>>>> > data sources.
>>>>>>> >
>>>>>>> > Please also refer to:
>>>>>>> >
>>>>>>> > - Previous discussion in dev mailing list: [DISCUSS] SPIP:
>>>>>>> > Row-level operations in Data Source V2
>>>>>>> > <
>>>>>>> https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
>>>>>>> >
>>>>>>> > - JIRA: SPARK-35801 <
>>>>>>> https://issues.apache.org/jira/browse/SPARK-35801>
>>>>>>> > - PR for handling DELETE statements:
>>>>>>> > <https://github.com/apache/spark/pull/33008>
>>>>>>> >
>>>>>>> > - Design doc
>>>>>>> > <
>>>>>>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>>>>>> >
>>>>>>> >
>>>>>>> > Please vote on the SPIP for the next 72 hours:
>>>>>>> >
>>>>>>> > [ ] +1: Accept the proposal as an official SPIP
>>>>>>> > [ ] +0
>>>>>>> > [ ] -1: I don’t think this is a good idea because …
>>>>>>> >
>>
>> --
>> Ryan Blue
>> Tabular
>>
> --
John Zhuge
>>>>>>> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>>>>>>>
>>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>>> storage partitioned join which covers bucket join support for
>>>>>>> DataSourceV2
>>>>>>> but is more general. The goal is to let Spark leverage distribution
>>>>>>> properties reported by data sources and eliminate shuffle whenever
>>>>>>> possible.
>>>>>>> >
>>>>>>> > Design doc:
>>>>>>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>>> (includes a POC link at the end)
>>>>>>> >
>>>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>>>> welcome!
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Chao
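The shuffle-elimination idea can be illustrated with a toy model: if both sides report the same storage partitioning on the join keys, matching rows can only co-occur inside the same partition, so each partition pair joins independently with no data movement. A sketch under that assumption (not the actual Spark implementation):

```python
def storage_partitioned_join(left, right):
    """Join two pre-partitioned datasets partition-by-partition.

    left/right: dict mapping partition key -> list of (join_key, payload).
    Assumes both sides are partitioned by the join key, so matches can
    only occur within the same partition -- i.e., no shuffle is needed.
    """
    joined = []
    for part in left.keys() & right.keys():
        for lkey, lval in left[part]:
            for rkey, rval in right[part]:
                if lkey == rkey:
                    joined.append((lkey, lval, rval))
    return joined
```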
>>>>>>>
permissions to the PMC to publish containers and
>> update the release steps but I think this could be useful for folks.
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>>
>>>>> On Fri, 18 Jun 2021 at 00:44, Holden Karau
>>>>> wrote:
>>>>>
>>>>>> Hi Folks,
>>>>>>
>>>>>> I'm continuing my adventures to make Spark on containers party and I
>>>>>> was wondering if folks have experience with the different batch
>>>>>> scheduler options that they prefer? I was thinking so that we can
>>>>>> better support dynamic allocation it might make sense for us to
>>>>>> support using different schedulers and I wanted to see if there are
>>>>>> any that the community is more interested in?
>>>>>>
>>>>>> I know that one of the Spark on Kube operators supports
>>>>>> volcano/kube-batch so I was thinking that might be a place I start
>>>>>> exploring but also want to be open to other schedulers that folks
>>>>>> might be interested in.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
g invalidation is always a tricky problem.
>
> On Tue, May 25, 2021 at 3:09 AM Ryan Blue
> wrote:
>
>> I don't think that it makes sense to discuss a different approach in the
>> PR rather than in the vote. Let's discuss this now since that's the purpose
>> of an SPIP.
> >>
> >>
> >> I ran the tests, checked the related jira tickets, and compared TPCDS
> >> performance differences between
> >>
> >> this v3.1.2 candidate and v3.1.1.
> >>
> >> Everything looks fine.
> >>
> >>
> >>
> >> Thank you, Dongjoon!
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
Hi everyone, I’d like to start a vote for the ViewCatalog design proposal
(SPIP).
The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.
The full SPIP doc is here:
Great! I will start a vote thread.
On Mon, May 24, 2021 at 10:54 AM Wenchen Fan wrote:
> Yea let's move forward first. We can discuss the caching approach
> and TableViewCatalog approach during the PR review.
>
> On Tue, May 25, 2021 at 1:48 AM John Zhuge wrote:
>
it only affects catalogs that support both table and
>> view, and it fits the hive catalog very well.
>>
>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge wrote:
>>
>>> SPIP
>>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66
>>>>> > SPARK-35382 Fix lambda variable name issues in nested
>>>>> DataFrame
>>>>> > functions in Python APIs
>>>>> >
>>>>> > # Notable K8s patches since K8s GA
>>>>> > SPARK-34674 Close SparkContext after the Main method has
>>>>> finished
>>>>> > SPARK-34948 Add ownerReference to executor configmap to fix
>>>>> leakages
>>>>> > SPARK-34820 Add apt-update before gnupg install
>>>>> > SPARK-34361 In case of downscaling avoid killing of executors
>>>>> already
>>>>> > known by the scheduler backend in the pod allocator
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
the same issue. But RC2 and RC3 don't.
>
> Does it affect the RC?
>
>
> John Zhuge wrote
> > Got this error when browsing the staging repository:
> >
> > 404 - Repository "orgapachespark-1383 (staging: open)"
> > [id=orgapachespark-1383] exists but is
the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
1: I don’t think this is a good idea because …
ng. The time needed to fix a problem goes up significantly
>>>>>> vs.
>>>>>> compile-time checks. And that is even worse if the UDF is maintained by
>>>>>> someone else.
>>>>>>
>>>>>> I think we also need to consider how common it would be that a use
>>>>>> case can have the query-compile-time checks. Going through this in more
>>>>>> detail below makes me think that it is unlikely that these checks would
>>>>>> be
>>>>>> used often because of the limitations of using an interface with type
>>>>>> erasure.
>>>>>>
>>>>>> I believe that Wenchen’s proposal will provide stronger
>>>>>> query-compile-time safety
>>>>>>
>>>>>> The proposal could have better safety for each argument, assuming
>>>>>> that we detect failures by looking at the parameter types using
>>>>>> reflection
>>>>>> in the analyzer. But we don’t do that for any of the similar UDFs today
>>>>>> so
>>>>>> I’m skeptical that this would actually be a high enough priority to
>>>>>> implement.
>>>>>>
>>>>>> As Erik pointed out, type erasure also limits the effectiveness. You
>>>>>> can’t implement ScalarFunction2<Integer, Integer> and
>>>>>> ScalarFunction2<Long, Long>. You can handle those cases using InternalRow or you can
>>>>>> handle them using VarargScalarFunction. That forces many use
>>>>>> cases into varargs with Object, where you don’t get any of the
>>>>>> proposed analyzer benefits and lose compile-time checks. The only time
>>>>>> the
>>>>>> additional checks (if implemented) would help is when only one set of
>>>>>> argument types is needed because implementing ScalarFunction<Object, Object> defeats the purpose.
>>>>>>
>>>>>> It’s worth noting that safety for the magic methods would be
>>>>>> identical between the two options, so the trade-off to consider is for
>>>>>> varargs and non-codegen cases. Combining the limitations discussed, this
>>>>>> has better safety guarantees only if you need just one set of types for
>>>>>> each number of arguments and are using the non-codegen path. Since
>>>>>> varargs
>>>>>> is one of the primary reasons to use this API, then I don’t think that it
>>>>>> is a good idea to use Object[] instead of InternalRow.
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structure Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
and see if anything important breaks.
>>>> In the Java/Scala, you can add the staging repository to your projects
>>>> resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with an out of date RC going forward).
>>>>
>>>> ===
>>>> What should happen to JIRA tickets still targeting 3.1.1?
>>>> ===
>>>>
>>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.1.1
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else please retarget to an
>>>> appropriate release.
>>>>
>>>> ==
>>>> But my bug isn't fixed?
>>>> ==
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted please ping me or a committer to
>>>> help target the issue.
reporting any regressions.
>>>>>>>
>>>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>>>> the current RC and see if anything important breaks, in the
>>>>>>> Java/Scala
>>>>>>> you can add the staging repository to your projects resolvers and
>>>>>>> test
>>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>>>
>>>>>>> ===
>>>>>>> What should happen to JIRA tickets still targeting 3.0.2?
>>>>>>> ===
>>>>>>>
>>>>>>> The current list of open tickets targeted at 3.0.2 can be found at:
>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>> "Target Version/s" = 3.0.2
>>>>>>>
>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>>> appropriate release.
>>>>>>>
>>>>>>> ==
>>>>>>> But my bug isn't fixed?
>>>>>>> ==
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>> release. That being said, if there is something which is a regression
>>>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>>> help target the issue.
caches
>>>>>>> SPARK-33591 NULL is recognized as the "null" string in partition
>>>>>>> specs
>>>>>>> SPARK-33593 Vector reader got incorrect data with binary partition
>>>>>>> value
>>>>>>> SPARK-33726 Duplicate field names causes wrong answers during
>>>>>>> aggregation
>>>>>>> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
>>>>>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
>>>>>>> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
>>>>>>> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
>>>>>>> SPARK-34187 Use available offset range obtained during polling when
>>>>>>> checking offset validation
>>>>>>> SPARK-34212 For parquet table, after changing the precision and
>>>>>>> scale of decimal type in hive, spark reads incorrect value
>>>>>>> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
>>>>>>> SPARK-34229 Avro should read decimal values with the file schema
>>>>>>> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table
>>>>>>> cache
>>>>>>>
>>>>>>
>>>
>>> --
>>>
>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
--
John Zhuge
>
>> Let's discuss the proposal here rather than on that PR, to get better
>> visibility. Also, please take the time to read the proposal first. That
>> really helps clear up misconceptions.
>>
>>
>>
>> --
>> Ryan Blue
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> --
>> Ryan Blue
>>
>>
--
John Zhuge
nsubscr...@spark.apache.org
>
>
--
John Zhuge
SPIP
<https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing>
has been updated. Please review.
On Thu, Sep 3, 2020 at 9:22 AM John Zhuge wrote:
> Wenchen, sorry for the delay, I will post an update shortly.
>
> On Thu, Sep 3, 2020 at 2:00
ces, so returning a
>>ViewOrTable is more difficult for implementations
>>- TableCatalog assumes that ViewCatalog will be added separately like
>>John proposes, so we would have to break or replace that API
>>
>> I understand the initial appeal of comb
> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the parsed plan from
> view, based on the stored schema, to make sure the view schema doesn't
> change.
>
Thanks Wenchen! I thought I forgot something :) Yes it is the validation
done in
hange.
>
> Can you update your doc to incorporate the cache idea? Let's make sure we
> don't have perf issues if we go with the new View API.
>
> On Tue, Aug 18, 2020 at 4:25 PM John Zhuge wrote:
>
>> Thanks Burak and Walaa for the feedback!
>>
>> Here are my pers
>>>> views. This way you avoid multiple RPCs to a catalog or data source or
>>>> metastore, and you avoid namespace/name conflicts. Also you make yourself
>>>> less susceptible to race conditions (which still inherently exist).
>>>>
>>>> In additi
note either the order in which resolution will happen
> (views are resolved first) or note that it is not allowed and behavior is
> not guaranteed. I prefer the first option.
>
> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge wrote:
>
>> Hi Wenchen,
>>
>> Thanks for the feed
> I think a new View API is more flexible. I'd vote for it if we can come up
> with a good mechanism to avoid name conflicts.
>
> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge wrote:
>
>> Hi Spark devs,
>>
>> I'd like to bring more attention to this SPIP. As Dongjoon
y.
The PR has conflicts; I will resolve them shortly.
Thanks,
On Wed, Apr 22, 2020 at 12:24 AM John Zhuge wrote:
> Hi everyone,
>
> In order to disassociate view metadata from Hive Metastore and support
> different storage backends, I am proposing a new view catalog API to
three months.
Thanks,
John Zhuge
k an API.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> Cost of Breaking an API
>>>>>>>>> >> >>
>>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the
>>>>>>>>> users of Sp
>>My guess is all users will blindly flip the flag to true (to keep using this
>>function), so you've only succeeded in annoying them.
>>-
>>
>>Cost to Maintain - These are two relatively isolated expressions,
>>there should be little cost to keeping them. Users can be confused by
>> their
>>semantics, so we probably should update the docs to point them to a best
>>practice (I learned only by complaining on the PR, that a good practice is
>>to parse timestamps including the timezone in the format expression, which
>>naturally shifts them to UTC).
>>
>>
>> Decision: Do not deprecate these two functions. We should update the
>> docs to talk about best practices for parsing timestamps, including how to
>> correctly shift them to UTC for storage.
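The best practice mentioned above — including the timezone in the format expression so parsed values shift naturally to UTC — can be illustrated with plain Python's datetime (an illustration of the principle only, not Spark's own parser):

```python
from datetime import datetime, timezone

# Including %z in the format makes the parsed value timezone-aware,
# so converting it to UTC for storage is unambiguous.
ts = datetime.strptime("2019-06-20 10:30:00 -0700", "%Y-%m-%d %H:%M:%S %z")
utc_ts = ts.astimezone(timezone.utc)
print(utc_ts.isoformat())  # 2019-06-20T17:30:00+00:00
```

Parsing the same string without an offset yields a naive timestamp whose UTC value depends on the session timezone — exactly the ambiguity the docs update should warn about.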
>>
>> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
>> <https://github.com/apache/spark/pull/24902>
>>
>>
>>-
>>
>>Cost to Break - The TRIM function takes two string parameters. If we
>>switch the parameter order, queries that use the TRIM function would
>>silently get different results on different versions of Spark. Users may
>>not notice it for a long time and wrong query results may cause serious
>>problems to users.
>>-
>>
>>Cost to Maintain - We will have some inconsistency inside Spark, as
>>the TRIM function in Scala API and in SQL have different parameter order.
>>
>>
>> Decision: Do not switch the parameter order. Promote the TRIM(trimStr
>> FROM srcStr) syntax in our SQL docs as it's the SQL standard. Deprecate
>> (with a warning, not by removing) the SQL TRIM function and move users to
>> the SQL standard TRIM syntax.
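The silent-swap hazard is easy to demonstrate in Python; `trim` below is a hypothetical stand-in mimicking Spark SQL's TRIM(trimStr, srcStr) parameter order, not Spark code:

```python
def trim(trim_str: str, src_str: str) -> str:
    """Hypothetical stand-in for SQL TRIM(trimStr, srcStr)."""
    return src_str.strip(trim_str)

print(trim("x", "xxhelloxx"))  # intended order: "hello"
# Swapping the arguments is accepted silently but returns "" —
# no error is raised, which is why a silent order change is dangerous.
print(trim("xxhelloxx", "x"))
```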
>>
>> Thanks for taking the time to read this! Happy to discuss the specifics
>> and amend this policy as the community sees fit.
>>
>> Michael
>>
>>
--
John Zhuge
ils there. Do you want to join?
>
> On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor wrote:
>
>> We at Qubole are also looking at disaggregating shuffle on Spark. Would
>> love to collaborate and share learnings.
>>
>> Regards,
>> Amogh
>>
>> On Tue,
support writing an arbitrary number of objects into an
>>> existing OutputStream or ByteBuffer. This enables objects to be serialized
>>> to direct buffers where doing so makes sense. More importantly, it allows
>>> arbitrary metadata/framing data to be wrapped around individual objects
>>> cheaply. Right now, that’s only possible at the stream level. (There are
>>> hacks around this, but this would enable more idiomatic use in efficient
>>> shuffle implementations.)
>>>
>>>
>>> Have serializers indicate whether they are deterministic. This provides
>>> much of the value of a shuffle service because it means that reducers do
>>> not need to spill to disk when reading/merging/combining inputs--the data
>>> can be grouped by the service, even without the service understanding data
>>> types or byte representations. Alternative (less preferable since it would
>>> break Java serialization, for example): require all serializers to be
>>> deterministic.
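A rough Python sketch of the two capabilities proposed above — writing framed objects into an existing, caller-owned stream, and advertising determinism — with all names hypothetical:

```python
import io
import pickle

class StreamSerializer:
    """Sketch only: writes length-prefixed objects into a caller-owned stream."""
    # A deterministic serializer promises equal inputs -> identical bytes,
    # letting a shuffle service group records without understanding them.
    deterministic = True

    def write(self, obj, out) -> None:
        payload = pickle.dumps(obj, protocol=4)
        out.write(len(payload).to_bytes(4, "big"))  # cheap per-object framing
        out.write(payload)

buf = io.BytesIO()
ser = StreamSerializer()
ser.write(("key", 1), buf)
ser.write(("key", 2), buf)

# Read back the first framed record
buf.seek(0)
n = int.from_bytes(buf.read(4), "big")
print(pickle.loads(buf.read(n)))  # ('key', 1)
```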
>>>
>>>
>>>
>>> --
>>>
>>> - Ben
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
--
John Zhuge
> spec, if there is a view named "a", we can't create a table named "a"
> anymore.
>
> We can add documents and ask the implementation to guarantee it, but it's
> better if this can be guaranteed by the API.
>
> On Wed, Aug 14, 2019 at 1:46 AM John Zhuge wrote:
>
>
845 Support specification of column names in INSERT INTO
>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>>> Kubernetes
>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> MesosFineGrainedSchedulerBackend
>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>> execution mode
>>>> SPARK-25390 data source V2 API refactoring
>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>> Spec
>>>> SPARK-15691 Refactor and improve Hive support
>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> SPARK-16217 Support SELECT INTO statement
>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> SPARK-18245 Improving support for bucketed table
>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>> Spark
>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>>> list of structures
>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>>> respect session timezone
>>>> SPARK-22386 Data Source V2 improvements
>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>
--
John Zhuge
> >
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>
>
--
John Zhuge
>>>> >> > If you're working in PySpark you can set up a virtual env and
>>>> install
>>>> >> > the current RC and see if anything important breaks, in the
>>>> Java/Scala
>>>> >> > you can add the staging repository to your project's resolvers and
>>>> >> > test
>>>> >> > with the RC (make sure to clean up the artifact cache before/after
>>>> so
>>>> >> > you don't end up building with an out-of-date RC going forward).
>>>> >> >
>>>> >> > ===
>>>> >> > What should happen to JIRA tickets still targeting 2.3.4?
>>>> >> > ===
>>>> >> >
>>>> >> > The current list of open tickets targeted at 2.3.4 can be found at:
>>>> >> > https://issues.apache.org/jira/projects/SPARK and search for
>>>> "Target Version/s" = 2.3.4
>>>> >> >
>>>> >> > Committers should look at those and triage. Extremely important bug
>>>> >> > fixes, documentation, and API tweaks that impact compatibility
>>>> should
>>>> >> > be worked on immediately. Everything else please retarget to an
>>>> >> > appropriate release.
>>>> >> >
>>>> >> > ==
>>>> >> > But my bug isn't fixed?
>>>> >> > ==
>>>> >> >
>>>> >> > In order to make timely releases, we will typically not hold the
>>>> >> > release unless the bug in question is a regression from the
>>>> previous
>>>> >> > release. That being said, if there is something which is a
>>>> regression
>>>> >> > that has not been correctly targeted please ping me or a committer
>>>> to
>>>> >> > help target the issue.
>>>> >> >
>>>> >>
>>>> >>
>>>>
>>>>
>>>>
--
John Zhuge
ithub from the last release:
>>>> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
>>>> > The 8 correctness issues resolved in branch-2.3:
>>>> >
>>>> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>>>> >
>>>> > Best Regards,
>>>> > Kazuaki Ishizaki
>>>>
>>>>
>>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
> --
>
--
John Zhuge
to know why you're
> proposing `softwareVersion` in the view definition.
>
> On Tue, Aug 13, 2019 at 8:56 AM John Zhuge wrote:
>
>> Catalog support has been added to DSv2 along with a table catalog
>> interface. Here I'd like to propose a view catalog interface, for the
>> f
View interface:
- name
- originalSql
- defaultCatalog
- defaultNamespace
- viewColumns
- owner
- createTime
- softwareVersion
- options (map)
ViewColumn interface:
- name
- type
Thanks,
John Zhuge
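The field lists above can be sketched as Python dataclasses for illustration (the actual proposal is a Java/Scala DSv2 interface; these names merely mirror the list):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ViewColumn:
    name: str
    type: str  # data type name, e.g. "bigint"

@dataclass
class View:
    name: str
    original_sql: str            # the view's defining SELECT text
    default_catalog: str         # catalog for resolving unqualified names
    default_namespace: List[str]
    view_columns: List[ViewColumn]
    owner: str
    create_time: int             # e.g. epoch millis (assumed representation)
    software_version: str        # engine version that created the view
    options: Dict[str, str] = field(default_factory=dict)

v = View("events_v", "SELECT * FROM events", "prod", ["db"],
         [ViewColumn("id", "bigint")], "john", 0, "3.0.0")
print(v.view_columns[0].name)  # id
```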
+1 (non-binding) Great work!
On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh wrote:
> +1 (non-binding).
>
>
>
> Thanks for pushing this forward, Matt and Yifei.
>
>
>
> *From: *Felix Cheung
> *Date: *Tuesday, June 18, 2019 at 00:01
> *To: *Yinan Li , "rb...@netflix.com" <
> rb...@netflix.com>
>
1, 2019 at 8:04 PM Maryann Xue
> wrote:
>
>> I believe in the SQL standard, the original name cannot be accessed once
>> it’s aliased.
>>
>> On Tue, Jun 11, 2019 at 7:54 PM John Zhuge wrote:
>>
>>> Yeah, it is a tough scenario.
>>>
>>> I actu
, b from s) t join (select a, b
> from t) s on t1.a = t2.b
>
> If we allowed the hint resolving to "cross" the scopes, we'd end up with a
> really confusing spec.
>
>
> Thanks,
> Maryann
>
> On Tue, Jun 11, 2019 at 5:26 PM John Zhuge wrote:
>
Hi Reynold and Maryann,
ResolveHints javadoc indicates the traversal does not go past subquery
alias. Is there any specific reason?
Thanks,
John Zhuge
> .
>
>
>
> Please vote in the next 3 days.
>
>
>
> [ ] +1: Accept the proposal as an official SPIP
>
> [ ] +0
>
> [ ] -1: I don't think this is a good idea because ...
>
>
>
>
>
> Thanks!
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
--
John Zhuge
> [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > > Thanks!
> > >
> > > rb
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
>
>
--
John Zhuge
and search for "Target
>>>> Version/s" = 2.3.3
>>>> >
>>>> > Committers should look at those and triage. Extremely important bug
>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>> > be worked on immediately. Everything else please retarget to an
>>>> > appropriate release.
>>>> >
>>>> > ==
>>>> > But my bug isn't fixed?
>>>> > ==
>>>> >
>>>> > In order to make timely releases, we will typically not hold the
>>>> > release unless the bug in question is a regression from the previous
>>>> > release. That being said, if there is something which is a regression
>>>> > that has not been correctly targeted please ping me or a committer to
>>>> > help target the issue.
>>>> >
>>>> > P.S.
>>>> > I checked all the tests passed in the Amazon Linux 2 AMI;
>>>> > $ java -version
>>>> > openjdk version "1.8.0_191"
>>>> > OpenJDK Runtime Environment (build 1.8.0_191-b12)
>>>> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
>>>> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos
>>>> -Psparkr test
>>>> >
>>>> > --
>>>> > ---
>>>> > Takeshi Yamamuro
>>>>
>>>>
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>>
>>>>
>
> --
> ---
> Takeshi Yamamuro
>
--
John Zhuge
>> >> > fixes, documentation, and API tweaks that impact compatibility should
>> >> > be worked on immediately. Everything else please retarget to an
>> >> > appropriate release.
>> >> >
>> >> > ==
>> >> > But my bug isn't fixed?
>> >> > ==
>> >> >
>> >> > In order to make timely releases, we will typically not hold the
>> >> > release unless the bug in question is a regression from the previous
>> >> > release. That being said, if there is something which is a regression
>> >> > that has not been correctly targeted please ping me or a committer to
>> >> > help target the issue.
>> >> >
>> >> > P.S.
>> >> > I checked all the tests passed in the Amazon Linux 2 AMI;
>> >> > $ java -version
>> >> > openjdk version "1.8.0_191"
>> >> > OpenJDK Runtime Environment (build 1.8.0_191-b12)
>> >> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
>> >> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos
>> -Psparkr test
>> >> >
>> >> > --
>> >> > ---
>> >> > Takeshi Yamamuro
>> >>
>> >>
>>
>>
>>
--
John Zhuge
Thx Xiao!
On Mon, Feb 4, 2019 at 9:04 AM Xiao Li wrote:
> Thank you, Imran!
>
> Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark".
>
> Cheers,
>
> Xiao
>
>
>
> John Zhuge wrote on Mon, Feb 4, 2019 at 8:59 AM:
>
>> Thanks Imran!
>>
even more that
> should be discussed, & mistakes I've made. All input welcome.
>
>
> https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing
>
--
John Zhuge
e it's not a regression from 2.2.2 either.
>>>>
>>>> On Thu, Jan 10, 2019 at 6:37 AM Takeshi Yamamuro
>>>> wrote:
>>>> >
>>>> > Hi, Dongjoon,
>>>> >
>>>> > We don't need to include https://github.com/apache/spark/pull/23456
>>>> in this release?
>>>> > The query there fails in v2.x while it passes in v1.6.
>>>> >
>>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
--
John Zhuge
> For the first one, I was thinking some day next week (time TBD by those
> interested) and starting off with a general roadmap discussion before
> diving into specific technical topics.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
--
John Zhuge
Yeah, the "-" operator does not seem to be supported; however, you can use
the "datediff" function:

select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP),
                CAST('2000-01-01 00:00:00' AS TIMESTAMP))
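datediff counts whole days between the date parts of its two arguments; the same arithmetic in plain Python (illustrative, not Spark itself):

```python
from datetime import date

# datediff(end, start): days from 2000-01-01 to 2000-02-01
print((date(2000, 2, 1) - date(2000, 1, 1)).days)  # 31
```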
usually error
>>>>> prone especially for quoted values and other special cases.
>>>>>
>>>>> The proposed in the PR methods should make a better user experience in
>>>>> parsing CSV-like columns. Please, share your thoughts.
>>>>>
>>>>> --
>>>>>
>>>>> Maxim Gekk
>>>>>
>>>>> Technical Solutions Lead
>>>>>
>>>>> Databricks Inc.
>>>>>
>>>>> maxim.g...@databricks.com
>>>>>
>>>>> databricks.com
>>>>>
>>>>> <http://databricks.com/>
>>>>>
>>>>
>>>
>
> --
> *Dongjin Lee*
>
> *A hitchhiker in the mathematical world.*
>
> *github: github.com/dongjinleekr
> <http://github.com/dongjinleekr>linkedin: kr.linkedin.com/in/dongjinleekr
> <http://kr.linkedin.com/in/dongjinleekr>slideshare:
> www.slideshare.net/dongjinleekr
> <http://www.slideshare.net/dongjinleekr>*
>
--
John Zhuge
+1 (non-binding)
Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
-Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
On
+1 on SPARK-25004. We have found it quite useful to diagnose PySpark OOM.
On Tue, Aug 7, 2018 at 1:21 PM Holden Karau wrote:
> I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon),
> but solving some of the consistent Python memory issues we've had for years
> would be
BlockMissingException typically indicates the HDFS file is corrupted. Might
be an HDFS issue; the Hadoop mailing list is a better bet:
u...@hadoop.apache.org.
Capture the full stack trace in the executor log.
If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693`
to determine
Great help from the community!
On Sun, Aug 5, 2018 at 6:17 PM Xiao Li wrote:
> FYI, the new hints have been merged. They will be available in the
> upcoming release (Spark 2.4).
>
> *John Zhuge*, thanks for your work! Really appreciate it! Please submit
> more PRs and help the co
hanism, or whether it is
>>>> possible, but I think it is worth considering such things at a fairly high
>>>> level of abstraction and try to unify and simplify before making things
>>>> more complex with multiple policy mechanisms.
>>>>
>>>
t a patch for this? If there is a
> coalesce hint, inject a coalesce logical node. Pretty simple.
>
>
> On Wed, Jul 25, 2018 at 2:48 PM John Zhuge wrote:
>
>> Thanks for the comment, Forest. What I am asking is to make whatever DF
>> repartition/coalesce functionalities available to SQL users.
>>
plex with multiple policy mechanisms.
>>>
>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin wrote:
>>>
>>>> Seems like a good idea in general. Do other systems have similar
>>>> concepts? In general it'd be easier if we can follow existing convention if
>>
is not the same as SPARK-6221 that asked for auto-merging
output files.
Thanks,
John Zhuge
taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>>> you can add the staging repository to your project's resolvers and test
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 2.3.2?
>>>>> ===
>>>>>
>>>>> The current list of open tickets targeted at 2.3.2 can be found at:
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 2.3.2
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==
>>>>> But my bug isn't fixed?
>>>>> ==
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something which is a regression
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>> help target the issue.
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
the next 72 hours:
>>>
>>> [+1]: Spark should adopt the SPIP
>>> [-1]: Spark should not adopt the SPIP because . . .
>>>
>>> Thanks for voting, everyone!
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>
>> --
>> Ryan Blue
>>
>> --
>> John Zhuge
>>
>
+1
On Sun, Jul 8, 2018 at 1:30 AM Saisai Shao wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.3.2.
>
> The vote is open until July 11th PST and passes if a majority +1 PMC votes
> are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as
>>>>>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>>>>>> stream-stream
>>>>>>> > join. Users can hit this bug if one of the join side is
>>>>>>> partitioned by a
>>>>>>> > subset of the join keys.
>>>>>>> >
>>>>>>> > SPARK-24552: Task attempt numbers are reused when stages are
>>>>>>> retried
>>>>>>> > This is a long-standing bug in the output committer that may
>>>>>>> introduce data
>>>>>>> > corruption.
>>>>>>> >
>>>>>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted
>>>>>>> XML to
>>>>>>> > access arbitrary files
>>>>>>> > This is a potential security issue if users build access control
>>>>>>> module upon
>>>>>>> > Spark.
>>>>>>> >
>>>>>>> > I think we need a Spark 2.3.2 to address these issues(especially
>>>>>>> the
>>>>>>> > correctness bugs) ASAP. Any thoughts?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Wenchen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Marcelo
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
> --
> John Zhuge
>
+1
On Sun, Jun 3, 2018 at 6:12 PM, Hyukjin Kwon wrote:
> +1
>
> On Sun, Jun 3, 2018 at 9:25 PM, Ricardo Almeida wrote:
>
>> +1 (non-binding)
>>
>> On 3 June 2018 at 09:23, Dongjoon Hyun wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee wrote:
>>>