Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-26 Thread John Zhuge
+1

On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote:
> FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024.
>
> https://github.com/apache/spark/pull/46228
> [SPARK-47993][PYTHON] Drop Python 3.8
>
> Since it's still alive and there will be an overlap between the lifecycle
> of Python 3.8 and Apache Spark 4.0.0, please give us your feedback on the
> PR, if you have any concerns.
>
> From my side, I agree with this decision.
>
> Thanks,
>
> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
John Zhuge

Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread John Zhuge
apache/spark/pull/46013

> The vote is open until April 17th 1AM (PST) and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Use ANSI SQL mode by default
> [ ] -1 Do not use ANSI SQL mode by default because ...
>
> Thank you in advance.
>
> Dongjoon

--
John Zhuge
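The behavioral change this vote is about can be sketched in a few lines. The sketch below is an illustrative Python model only, not Spark code: the real switch is the `spark.sql.ansi.enabled` configuration, and Spark's ANSI rules cover many more operators than integer addition. The point it models is that ANSI mode fails loudly on overflow where legacy mode silently wraps.

```python
# Toy model of one thing spark.sql.ansi.enabled toggles: 32-bit integer
# addition raises on overflow under ANSI mode, wraps under legacy mode.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def add_int(a: int, b: int, ansi: bool) -> int:
    result = a + b
    if INT_MIN <= result <= INT_MAX:
        return result
    if ansi:
        # ANSI mode: surface the error instead of returning a wrong value
        raise ArithmeticError("integer overflow")
    # Legacy mode: wrap around like Java int arithmetic
    return (result - INT_MIN) % 2**32 + INT_MIN

print(add_int(INT_MAX, 1, ansi=False))  # wraps to -2147483648
```

Under legacy semantics the query keeps running with a wrapped value; under ANSI semantics the same expression becomes a runtime error, which is the trade-off being voted on.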

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-12 Thread John Zhuge
> References:
>
> - JIRA ticket <https://issues.apache.org/jira/browse/SPARK-47240>
> - SPIP doc
>   <https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing>
> - Discussion thread
>   <https://lists.apache.org/thread/gocslhbfv1r84kbcq3xt04nx827ljpxq>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Gengliang Wang

--
John Zhuge
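As a rough illustration of what a structured logging framework buys: each log record becomes one machine-parseable JSON object carrying context fields, instead of free-form text. The sketch below uses Python's stdlib `logging` and is not Spark's implementation (the SPIP targets Spark's Log4j-based logging); the field names here are invented for the example.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs are machine-queryable."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # MDC-style context attached per record (hypothetical field name)
            "context": getattr(record, "context", {}),
        })

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# Context rides along as structured fields, not string interpolation:
logger.warning("task failed", extra={"context": {"executor_id": "7"}})
```

The payoff is that a log pipeline can filter on `context.executor_id` directly rather than regex-matching message text.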

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
eleases/spark-release-3-5-1.html

> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Jungtaek Lim
>
> ps. Yikun is helping us through releasing the official docker image for
> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
> available.

--
John Zhuge

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread John Zhuge
> https://github.com/apache/arrow-datafusion-comet for more details if you
> are interested. We'd love to collaborate with people from the open source
> community who share similar goals.
>
> Thanks,
> Chao

--
John Zhuge

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-04 Thread John Zhuge
+1

John Zhuge

On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale wrote:
> +1
>
> On Sun, Feb 4, 2024, 8:18 PM Xiao Li wrote:
>> +1
>>
>> On Sun, Feb 4, 2024 at 6:07 AM beliefer wrote:
>>> +1

Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-06 Thread John Zhuge
Congratulations!

On Fri, Oct 6, 2023 at 6:41 PM Yi Wu wrote:
> Congrats!
>
> On Sat, Oct 7, 2023 at 9:24 AM XiDuo You wrote:
>> Congratulations!
>>
>> Prashant Sharma wrote on Fri, Oct 6, 2023 at 00:26:
>>> Congratulations
>>>
>>> On Wed, 4 Oct, 2023, 8:52 pm huaxin gao wrote:

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-05 Thread John Zhuge
this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1439
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc5-docs/
>
> The list of bug fixes going into 3.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>
> This release is using the release script of the tag v3.4.0-rc5.
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks. On the Java/Scala side,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so you
> don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.4.0?
> ===
> The current list of open tickets targeted at 3.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.4.0
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked
> on immediately. Everything else please retarget to an appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something which is a regression that has not been
> correctly targeted, please ping me or a committer to help target the issue.
>
> Thanks,
>
> Xinrong Meng

--
John Zhuge

Re: Adding new connectors

2023-03-24 Thread John Zhuge
On Fri, Mar 24, 2023 at 1:46 PM John Zhuge wrote:
> Have you checked out SparkCatalog
> <https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java>
> in the Apache Iceberg project? More docs at

Re: Adding new connectors

2023-03-24 Thread John Zhuge
> org.apache.spark.* packages in addition to their own; presumably this
> isn't by accident. Is this practice necessary to get around
> package-private visibility or something?
>
> Thanks!
>
> -0xe1a

--
John Zhuge

Re: Ammonite as REPL for Spark Connect

2023-03-23 Thread John Zhuge
On Wed, Mar 22, 2023 at 6:50 PM Herman van Hovell wrote:
> Hi All,
>
> For the Spark Connect Scala Client we are working on making the REPL
> experience a bit nicer <https://github.com/apache/spark/pull/40515>. In a
> nutshell, we want to give users a turn-key Scala REPL that works even if
> you don't have a Spark distribution on your machine (through coursier
> <https://get-coursier.io/>). We are using Ammonite <https://ammonite.io/>
> instead of the standard Scala REPL for this; the main reason for going
> with Ammonite is that it is easier to customize, and IMO it has a superior
> user experience.
>
> Does anyone object to doing this?
>
> Kind regards,
> Herman

--
John Zhuge

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread John Zhuge
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks. On the Java/Scala side,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so you
> don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.2?
> ===
>
> The current list of open tickets targeted at 3.3.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.3.2
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked
> on immediately. Everything else please retarget to an appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something which is a regression that has not been
> correctly targeted, please ping me or a committer to help target the issue.
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297

--
John Zhuge

Re: Spark on Kube (virtua) coffee/tea/pop times

2023-02-07 Thread John Zhuge
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
John Zhuge

Re: Time for release v3.3.2

2023-01-30 Thread John Zhuge
> maintenance release, i.e. Spark 3.3.2.
>
> I'm thinking of the release of Spark 3.3.2 this Feb (2023/02).
>
> What do you think?
>
> I am willing to volunteer for Spark 3.3.2 if there is consensus about this
> maintenance release.
>
> Thank you.

--
John Zhuge

Re: Welcome Yikun Jiang as a Spark committer

2022-10-09 Thread John Zhuge
-- Original --
From: "Martin Grigorov"
Date: Sun, Oct 9, 2022 05:01 AM
To: "Hyukjin Kwon"
Cc: "dev"; "Yikun Jiang" <yikunk...@gmail.com>
Subject: Re: Welcome Yikun Jiang as a Spark committer

Congratulations, Yikun!

On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon wrote:
> Hi all,
>
> The Spark PMC recently added Yikun Jiang as a committer on the project.
> Yikun is the major contributor of the infrastructure and GitHub Actions in
> Apache Spark, as well as Kubernetes and PySpark. He has put a lot of
> effort into stabilizing and optimizing the builds so we all can work
> together in Apache Spark more efficiently and effectively. He's also
> driving the SPIP for the Docker official image in Apache Spark for users
> and developers.
>
> Please join me in welcoming Yikun!

--
John Zhuge

Re: Time for Spark 3.3.1 release?

2022-09-12 Thread John Zhuge
> ... of overlapping partition and data columns
>
> SPARK-39061: Set nullable correctly for Inline output attributes
>
> SPARK-39887: RemoveRedundantAliases should keep aliases that make the
> output of projection nodes unique
>
> SPARK-38614: Don't push down limit through window that's using
> percent_rank

--
John Zhuge

Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread John Zhuge
> ... results or NPE when using Inline function against an array of
> dynamically created structs
> SPARK-39107 Silent change in regexp_replace's handling of empty strings
> SPARK-39259 Timestamps returned by now() and equivalent functions are not
> consistent in subqueries
> SPARK-39293 The accumulator of ArrayAggregate should copy the intermediate
> result if string, struct, array, or map
>
> Best,
> Dongjoon.

--
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-23 Thread John Zhuge
Holden has graciously agreed to shepherd the SPIP. Thanks!

On Thu, Feb 10, 2022 at 9:19 AM John Zhuge wrote:
> The vote is now closed and the vote passes. Thank you to everyone who took
> the time to review and vote on this SPIP. I’m looking forward to adding
> this feature to the n

Re: [VOTE] Spark 3.1.3 RC4

2022-02-17 Thread John Zhuge
> ... appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something that is a regression that has not been
> correctly targeted, please ping me or a committer to help target the issue.
>
> Note: I added an extra day to the vote since I know some folks are likely
> busy on the 14th with partner(s).
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-10 Thread John Zhuge
> +1 (non-binding)
>
> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh wrote:
>> +1
>>
>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun wrote:

[VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
Hi Spark community,

I’d like to restart the vote for the ViewCatalog design proposal (SPIP).
The proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2.

Please vote on the SPIP until Feb. 9th (Wednesday).

[ ] +1: Accept the proposal
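For readers skimming the archive, here is a minimal sketch of the kind of interface the vote is about. The method names, the `View` fields, and the in-memory implementation are illustrative Python analogies only; the actual proposal is a Java interface specified in the SPIP doc.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class View:
    name: str
    sql: str          # the view's defining query text
    properties: dict

class ViewCatalog(ABC):
    """Shape of a pluggable view catalog: load/create/drop view metadata.
    Names are illustrative, not the SPIP's actual Java API."""
    @abstractmethod
    def load_view(self, name: str) -> View: ...
    @abstractmethod
    def create_view(self, name: str, sql: str, properties: dict) -> View: ...
    @abstractmethod
    def drop_view(self, name: str) -> bool: ...

class InMemoryViewCatalog(ViewCatalog):
    """Toy backend; a real catalog would talk to a metastore service."""
    def __init__(self):
        self._views = {}
    def load_view(self, name):
        return self._views[name]
    def create_view(self, name, sql, properties):
        view = View(name, sql, dict(properties))
        self._views[name] = view
        return view
    def drop_view(self, name):
        return self._views.pop(name, None) is not None
```

The point of the SPIP is that Spark can resolve a view through any catalog implementing this contract, rather than only through the built-in Hive metastore path.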

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
> ... design and believe this will provide a robust and flexible solution
> to this problem faced by various large-scale Spark users.
>
> Thanks John!
>
> On Thu, Feb 3, 2022 at 11:22 AM Walaa Eldin Moustafa
> <wa.moust...@gmail.com>

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-03 Thread John Zhuge
> Thanks,
> Walaa.
>
> On Wed, May 26, 2021 at 9:54 AM John Zhuge wrote:
>> Looks like we are running in circles. Should we have an online meeting to
>> get this sorted out?
>>
>> Thanks,
>> John
>>
>> On Wed, May 26, 2021 at 12:0

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread John Zhuge
> ... should be worked on immediately. Everything else please retarget to
> an appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something which is a regression that has not been
> correctly targeted, please ping me or a committer to help target the issue.

--
John Zhuge

Re: [VOTE][SPIP] Support Customized Kubernetes Schedulers Proposal

2022-01-06 Thread John Zhuge
> ... rs Proposal
> <https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg>
> - JIRA: SPARK-36057 <https://issues.apache.org/jira/browse/SPARK-36057>
>
> Please vote on the SPIP:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Regards,
> Yikun

--
John Zhuge

Re: Time for Spark 3.2.1?

2022-01-04 Thread John Zhuge
AM Sean Owen wrote:
> Always fine by me if someone wants to roll a release.
>
> It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new
> release of those wouldn't hurt either, if any of our release managers have
> the time or inclination. 3.0.x is reaching unofficial end-of-life around
> now anyway.
>
> On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon wrote:
>> Hi all,
>>
>> It's been two months since the Spark 3.2.0 release, and we have resolved
>> many bug fixes and regressions. What do you guys think about rolling a
>> Spark 3.2.1 release?
>>
>> cc @huaxin gao FYI, who I happened to overhear is interested in rolling
>> the maintenance release :-)

--
John Zhuge

Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread John Zhuge
> ... ces https://github.com/apache/spark/pull/34599
> Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
>
> Regards,
> Yikun

--
John Zhuge

Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-14 Thread John Zhuge
> ... this effort is to come up with a flexible and easy-to-use API that
> will work across data sources.
>
> Please also refer to:
>
> - Previous discussion in the dev mailing list: [DISCUSS] SPIP: Row-level
>   operations in Data Source V2
>   <https://lists.apache.org/thread/kd8qohrk5h3qx8d6y4lhrm67vnn8p6bv>
> - JIRA: SPARK-35801 <https://issues.apache.org/jira/browse/SPARK-35801>
> - PR for handling DELETE statements:
>   <https://github.com/apache/spark/pull/33008>
> - Design doc
>   <https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …

--
John Zhuge

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread John Zhuge
> DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
>
> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun wrote:
>> Hi,
>>
>> Ryan and I drafted a design doc to support a new type of join: storage
>> partitioned join, which covers bucket join support for DataSourceV2 but
>> is more general. The goal is to let Spark leverage distribution
>> properties reported by data sources and eliminate shuffle whenever
>> possible.
>>
>> Design doc:
>> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>> (includes a POC link at the end)
>>
>> We'd like to start a discussion on the doc, and any feedback is welcome!
>>
>> Thanks,
>> Chao

--
John Zhuge
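The core idea in the proposal — skipping the shuffle when both sides already report compatible partitioning — can be modeled in a few lines. This is a toy Python sketch under a simplifying assumption (both sources partition rows by the same `key % n` function), not Spark's implementation.

```python
# Toy model: when two sources report the same partitioning (here, key % n),
# matching partition pairs can be joined locally and no shuffle is needed.
def partition(rows, n):
    parts = [[] for _ in range(n)]
    for key, value in rows:
        parts[key % n].append((key, value))
    return parts

def partitioned_join(left, right, n=3):
    lp, rp = partition(left, n), partition(right, n)
    out = []
    # zip pairs up co-located partitions; no data crosses partitions
    for lpart, rpart in zip(lp, rp):
        lookup = {k: v for k, v in rpart}
        out.extend((k, lv, lookup[k]) for k, lv in lpart if k in lookup)
    return sorted(out)

print(partitioned_join([(1, "a"), (4, "b")], [(1, "x"), (2, "y")]))
# -> [(1, 'a', 'x')]
```

In Spark terms, the partitioning function is what a DataSourceV2 source would report as its distribution, and the skipped step is the exchange that a plain join would otherwise insert.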

Re: Time to start publishing Spark Docker Images?

2021-08-12 Thread John Zhuge
> ... permissions to the PMC to publish containers and update the release
> steps, but I think this could be useful for folks.
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
John Zhuge

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
On Fri, 18 Jun 2021 at 00:44, Holden Karau wrote:
> Hi Folks,
>
> I'm continuing my adventures to make Spark on containers party and I was
> wondering if folks have experience with the different batch scheduler
> options that they prefer? I was thinking, so that we can better support
> dynamic allocation, it might make sense for us to support using different
> schedulers, and I wanted to see if there are any that the community is
> more interested in.
>
> I know that one of the Spark on Kube operators supports volcano/kube-batch,
> so I was thinking that might be a place I start exploring, but I also want
> to be open to other schedulers that folks might be interested in.
>
> Cheers,
>
> Holden :)

--
John Zhuge

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
On Fri, 18 Jun 2021 at 00:44, Holden Karau wrote:
> Hi Folks,
>
> I'm continuing my adventures to make Spark on containers party and I was
> wondering if folks have experience with the different batch scheduler
> options that they prefer? I was thinking, so that we can better support
> dynamic allocation, it might make sense for us to support using different
> schedulers, and I wanted to see if there are any that the community is
> more interested in.
>
> I know that one of the Spark on Kube operators supports volcano/kube-batch,
> so I was thinking that might be a place I start exploring, but I also want
> to be open to other schedulers that folks might be interested in.
>
> Cheers,
>
> Holden :)

--
John Zhuge

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-26 Thread John Zhuge
Caching invalidation is always a tricky problem.

On Tue, May 25, 2021 at 3:09 AM Ryan Blue wrote:
> I don't think that it makes sense to discuss a different approach in the
> PR rather than in the vote. Let's discuss this now, since that's the
> purpose of an SPIP.

Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread John Zhuge
> I ran the tests, checked the related JIRA tickets, and compared TPCDS
> performance differences between this v3.1.2 candidate and v3.1.1.
> Everything looks fine.
>
> Thank you, Dongjoon!

--
John Zhuge

[VOTE] SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Hi everyone,

I’d like to start a vote for the ViewCatalog design proposal (SPIP). The
proposal is to add a ViewCatalog interface that can be used to load,
create, alter, and drop views in DataSourceV2. The full SPIP doc is here:

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
Great! I will start a vote thread.

On Mon, May 24, 2021 at 10:54 AM Wenchen Fan wrote:
> Yea let's move forward first. We can discuss the caching approach and the
> TableViewCatalog approach during the PR review.
>
> On Tue, May 25, 2021 at 1:48 AM John Zhuge wrote:

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread John Zhuge
> ... it only affects catalogs that support both table and view, and it
> fits the Hive catalog very well.
>
> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge wrote:
>> SPIP
>> <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66

Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread John Zhuge
> SPARK-35382 Fix lambda variable name issues in nested DataFrame functions
>   in Python APIs
>
> # Notable K8s patches since K8s GA
> SPARK-34674 Close SparkContext after the Main method has finished
> SPARK-34948 Add ownerReference to executor configmap to fix leakages
> SPARK-34820 Add apt-update before gnupg install
> SPARK-34361 In case of downscaling, avoid killing executors already known
>   by the scheduler backend in the pod allocator
>
> Bests,
> Dongjoon.

--
John Zhuge

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-10 Thread John Zhuge
> ... the same issue. But RC2 and RC3 don't.
>
> Does it affect the RC?
>
> John Zhuge wrote
>> Got this error when browsing the staging repository:
>>
>> 404 - Repository "orgapachespark-1383 (staging: open)"
>> [id=orgapachespark-1383] exists but is

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread John Zhuge
> ... a regression from the previous release. That being said, if there is
> something which is a regression that has not been correctly targeted,
> please ping me or a committer to help target the issue.

--
John Zhuge

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-08 Thread John Zhuge
> -1: I don’t think this is a good idea because …
>
> --
> Ryan Blue

--
John Zhuge

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread John Zhuge
ng. The time needed to fix a problem goes up significantly >>>>>> vs. >>>>>> compile-time checks. And that is even worse if the UDF is maintained by >>>>>> someone else. >>>>>> >>>>>> I think we also need to consider how common it would be that a use >>>>>> case can have the query-compile-time checks. Going through this in more >>>>>> detail below makes me think that it is unlikely that these checks would >>>>>> be >>>>>> used often because of the limitations of using an interface with type >>>>>> erasure. >>>>>> >>>>>> I believe that Wenchen’s proposal will provide stronger >>>>>> query-compile-time safety >>>>>> >>>>>> The proposal could have better safety for each argument, assuming >>>>>> that we detect failures by looking at the parameter types using >>>>>> reflection >>>>>> in the analyzer. But we don’t do that for any of the similar UDFs today >>>>>> so >>>>>> I’m skeptical that this would actually be a high enough priority to >>>>>> implement. >>>>>> >>>>>> As Erik pointed out, type erasure also limits the effectiveness. You >>>>>> can’t implement ScalarFunction2 and >>>>>> ScalarFunction2>>>>> Long>. You can handle those cases using InternalRow or you can >>>>>> handle them using VarargScalarFunction. That forces many use >>>>>> cases into varargs with Object, where you don’t get any of the >>>>>> proposed analyzer benefits and lose compile-time checks. The only time >>>>>> the >>>>>> additional checks (if implemented) would help is when only one set of >>>>>> argument types is needed because implementing ScalarFunction>>>>> Object> defeats the purpose. >>>>>> >>>>>> It’s worth noting that safety for the magic methods would be >>>>>> identical between the two options, so the trade-off to consider is for >>>>>> varargs and non-codegen cases. Combining the limitations discussed, this >>>>>> has better safety guarantees only if you need just one set of types for >>>>>> each number of arguments and are using the non-codegen path. 
Since >>>>>> varargs >>>>>> is one of the primary reasons to use this API, then I don’t think that it >>>>>> is a good idea to use Object[] instead of InternalRow. >>>>>> -- >>>>>> Ryan Blue >>>>>> Software Engineer >>>>>> Netflix >>>>>> >>>>> >>>> >>>> -- >>>> Ryan Blue >>>> Software Engineer >>>> Netflix >>>> >>> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > -- John Zhuge
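The trade-off discussed above — typed "magic method" entry points versus a generic row-based entry point — can be illustrated with a small analogy. This is a Python sketch of the two call shapes, not the actual Java API; Java's type erasure (which motivates the argument) has no direct Python equivalent, so only the difference in where type errors surface is modeled.

```python
class StrLen:
    # Typed path: the argument type is part of the method signature, so a
    # binder inspecting it can reject mismatched arguments before execution
    # (analogous to the typed `invoke` magic method in the proposal).
    def invoke(self, s: str) -> int:
        return len(s)

class StrLenGeneric:
    # Generic path (analogous to produceResult(InternalRow)): the function
    # receives an untyped row and must validate types itself, so mistakes
    # surface only when a query actually runs.
    def produce_result(self, row) -> int:
        (s,) = row
        if not isinstance(s, str):
            raise TypeError(f"expected str, got {type(s).__name__}")
        return len(s)

assert StrLen().invoke("spark") == 5
assert StrLenGeneric().produce_result(("spark",)) == 5
```

The generic path is what every varargs use case falls back to, which is why the thread argues the extra analyzer checks for typed interfaces would rarely apply in practice.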

Re: Apache Spark 3.2 Expectation

2021-03-03 Thread John Zhuge
> ... compression, 4) SPARK-34479 aims to support ZSTD at the Avro data
> source. Also, the upcoming Parquet 1.12 supports ZSTD (and supports a JNI
> buffer pool), too. I'm expecting more benefits.
>
> - Structured Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to the master branch in Spark
> 3.2.
>
> Please share your thoughts and let's build a better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.

--
John Zhuge

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-23 Thread John Zhuge
> ... and see if anything important breaks. On the Java/Scala side, you can
> add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.1?
> ===
>
> The current list of open tickets targeted at 3.1.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 3.1.1
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked
> on immediately. Everything else please retarget to an appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something which is a regression that has not been
> correctly targeted, please ping me or a committer to help target the issue.

--
John Zhuge

Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread John Zhuge
reporting any regressions. >>>>>>> >>>>>>> If you're working in PySpark you can set up a virtual env and install >>>>>>> the current RC and see if anything important breaks, in the >>>>>>> Java/Scala >>>>>>> you can add the staging repository to your projects resolvers and >>>>>>> test >>>>>>> with the RC (make sure to clean up the artifact cache before/after so >>>>>>> you don't end up building with a out of date RC going forward). >>>>>>> >>>>>>> === >>>>>>> What should happen to JIRA tickets still targeting 3.0.2? >>>>>>> === >>>>>>> >>>>>>> The current list of open tickets targeted at 3.0.2 can be found at: >>>>>>> https://issues.apache.org/jira/projects/SPARK and search for >>>>>>> "Target Version/s" = 3.0.2 >>>>>>> >>>>>>> Committers should look at those and triage. Extremely important bug >>>>>>> fixes, documentation, and API tweaks that impact compatibility should >>>>>>> be worked on immediately. Everything else please retarget to an >>>>>>> appropriate release. >>>>>>> >>>>>>> == >>>>>>> But my bug isn't fixed? >>>>>>> == >>>>>>> >>>>>>> In order to make timely releases, we will typically not hold the >>>>>>> release unless the bug in question is a regression from the previous >>>>>>> release. That being said, if there is something which is a regression >>>>>>> that has not been correctly targeted please ping me or a committer to >>>>>>> help target the issue. >>>>>>> >>>>>> -- John Zhuge

Re: Apache Spark 3.0.2 Release ?

2021-02-13 Thread John Zhuge
caches >>>>>>> SPARK-33591 NULL is recognized as the "null" string in partition >>>>>>> specs >>>>>>> SPARK-33593 Vector reader got incorrect data with binary partition >>>>>>> value >>>>>>> SPARK-33726 Duplicate field names causes wrong answers during >>>>>>> aggregation >>>>>>> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache >>>>>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache >>>>>>> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache >>>>>>> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache >>>>>>> SPARK-34187 Use available offset range obtained during polling when >>>>>>> checking offset validation >>>>>>> SPARK-34212 For parquet table, after changing the precision and >>>>>>> scale of decimal type in hive, spark reads incorrect value >>>>>>> SPARK-34213 LOAD DATA doesn't refresh v1 table cache >>>>>>> SPARK-34229 Avro should read decimal values with the file schema >>>>>>> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table >>>>>>> cache >>>>>>> >>>>>> >>> >>> -- >>> >>> -- > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > -- John Zhuge

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-12 Thread John Zhuge
> >> Let's discuss the proposal here rather than on that PR, to get better >> visibility. Also, please take the time to read the proposal first. That >> really helps clear up misconceptions. >> >> -- >> Ryan Blue >> >> -- >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9 >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau >> >> -- >> Ryan Blue >> >> -- John Zhuge

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-19 Thread John Zhuge
nsubscr...@spark.apache.org > > -- John Zhuge

Re: SPIP: Catalog API for view metadata

2020-09-04 Thread John Zhuge
SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review. On Thu, Sep 3, 2020 at 9:22 AM John Zhuge wrote: > Wenchen, sorry for the delay, I will post an update shortly. > > On Thu, Sep 3, 2020 at 2:00

Re: SPIP: Catalog API for view metadata

2020-09-03 Thread John Zhuge
ces, so returning a >>ViewOrTable is more difficult for implementations >>- TableCatalog assumes that ViewCatalog will be added separately like >>John proposes, so we would have to break or replace that API >> >> I understand the initial appeal of comb

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread John Zhuge
> > AFAIK view schema is only used by DESCRIBE. > > Correction: Spark adds a new Project at the top of the parsed plan from > view, based on the stored schema, to make sure the view schema doesn't > change. > Thanks Wenchen! I thought I forgot something :) Yes it is the validation done in
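The schema-pinning behavior described above can be sketched in a few lines. This is an illustrative model only, not Spark's actual analyzer code; all names here are hypothetical.

```python
# Illustrative model (NOT Spark's analyzer code; all names are hypothetical)
# of the behavior described above: when a view is read, a projection based
# on the stored schema is placed on top of the parsed plan, so the view's
# output schema stays fixed even if the underlying tables gain columns.

def project_to_stored_schema(stored_schema, plan_output):
    """Return the columns to project, or fail if the underlying plan no
    longer provides a column that the stored view schema expects."""
    available = set(plan_output)
    projection = []
    for column in stored_schema:
        if column not in available:
            raise ValueError(f"view column '{column}' missing from underlying plan")
        projection.append(column)
    return projection

# The underlying table gained "new_col"; the view still exposes only its
# stored columns, in the stored order:
print(project_to_stored_schema(["id", "name"], ["name", "id", "new_col"]))
# -> ['id', 'name']
```

If a stored column disappears from the underlying plan, the projection fails, which is the "make sure the view schema doesn't change" validation mentioned in the thread.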

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread John Zhuge
hange. > > Can you update your doc to incorporate the cache idea? Let's make sure we > don't have perf issues if we go with the new View API. > > On Tue, Aug 18, 2020 at 4:25 PM John Zhuge wrote: > >> Thanks Burak and Walaa for the feedback! >> >> Here are my pers

Re: SPIP: Catalog API for view metadata

2020-08-18 Thread John Zhuge
;>> views. This way you avoid multiple RPCs to a catalog or data source or >>>> metastore, and you avoid namespace/name conflits. Also you make yourself >>>> less susceptible to race conditions (which still inherently exist). >>>> >>>> In additi

Re: SPIP: Catalog API for view metadata

2020-08-13 Thread John Zhuge
note either the order in which resolution will happen > (views are resolved first) or note that it is not allowed and behavior is > not guaranteed. I prefer the first option. > > On Wed, Aug 12, 2020 at 5:14 PM John Zhuge wrote: > >> Hi Wenchen, >> >> Thanks for the feed

Re: SPIP: Catalog API for view metadata

2020-08-12 Thread John Zhuge
> I think a new View API is more flexible. I'd vote for it if we can come up > with a good mechanism to avoid name conflicts. > > On Wed, Aug 12, 2020 at 6:20 AM John Zhuge wrote: > >> Hi Spark devs, >> >> I'd like to bring more attention to this SPIP. As Dongjoon

Re: SPIP: Catalog API for view metadata

2020-08-11 Thread John Zhuge
y. The PR has conflicts that I will resolve them shortly. Thanks, On Wed, Apr 22, 2020 at 12:24 AM John Zhuge wrote: > Hi everyone, > > In order to disassociate view metadata from Hive Metastore and support > different storage backends, I am proposing a new view catalog API to

SPIP: Catalog API for view metadata

2020-04-22 Thread John Zhuge
three months. Thanks, John Zhuge

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread John Zhuge
k an API. >>>>>>>>> >> >> >>>>>>>>> >> >> >>>>>>>>> >> >> Cost of Breaking an API >>>>>>>>> >> >> >>>>>>>>> >> >> Breaking an API almost always has a non-trivial cost to the >>>>>>>>> users of Sp

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-26 Thread John Zhuge
t;>guess is all users will blindly flip the flag to true (to keep using this >>function), so you've only succeeded in annoying them. >>- >> >>Cost to Maintain - These are two relatively isolated expressions, >>there should be little cost to keeping them. Users can be confused by >> their >>semantics, so we probably should update the docs to point them to a best >>practice (I learned only by complaining on the PR, that a good practice is >>to parse timestamps including the timezone in the format expression, which >>naturally shifts them to UTC). >> >> >> Decision: Do not deprecate these two functions. We should update the >> docs to talk about best practices for parsing timestamps, including how to >> correctly shift them to UTC for storage. >> >> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902 >> <https://github.com/apache/spark/pull/24902> >> >> >>- >> >>Cost to Break - The TRIM function takes two string parameters. If we >>switch the parameter order, queries that use the TRIM function would >>silently get different results on different versions of Spark. Users may >>not notice it for a long time and wrong query results may cause serious >>problems to users. >>- >> >>Cost to Maintain - We will have some inconsistency inside Spark, as >>the TRIM function in Scala API and in SQL have different parameter order. >> >> >> Decision: Do not switch the parameter order. Promote the TRIM(trimStr >> FROM srcStr) syntax our SQL docs as it's the SQL standard. Deprecate >> (with a warning, not by removing) the SQL TRIM function and move users to >> the SQL standard TRIM syntax. >> >> Thanks for taking the time to read this! Happy to discuss the specifics >> and amend this policy as the community sees fit. >> >> Michael >> >> -- John Zhuge

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread John Zhuge
ils there. Do you want to join? > > On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor wrote: > >> We at Qubole are also looking at disaggregating shuffle on Spark. Would >> love to collaborate and share learnings. >> >> Regards, >> Amogh >> >> On Tue,

Re: Enabling fully disaggregated shuffle on Spark

2019-11-19 Thread John Zhuge
support writing an arbitrary number of objects into an >>> existing OutputStream or ByteBuffer. This enables objects to be serialized >>> to direct buffers where doing so makes sense. More importantly, it allows >>> arbitrary metadata/framing data to be wrapped around individual objects >>> cheaply. Right now, that’s only possible at the stream level. (There are >>> hacks around this, but this would enable more idiomatic use in efficient >>> shuffle implementations.) >>> >>> >>> Have serializers indicate whether they are deterministic. This provides >>> much of the value of a shuffle service because it means that reducers do >>> not need to spill to disk when reading/merging/combining inputs--the data >>> can be grouped by the service, even without the service understanding data >>> types or byte representations. Alternative (less preferable since it would >>> break Java serialization, for example): require all serializers to be >>> deterministic. >>> >>> >>> >>> -- >>> >>> - Ben >>> >> > > -- > Ryan Blue > Software Engineer > Netflix > -- John Zhuge
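The "deterministic serializer" idea quoted above can be sketched as follows. This is a hypothetical toy model of how a shuffle service could group records by raw key bytes without understanding data types; it is not a proposed Spark API.

```python
import pickle
from collections import defaultdict

# Hypothetical sketch of the idea above: if a serializer promises that equal
# inputs always produce identical bytes ("deterministic"), a shuffle service
# can group records by serialized key bytes alone, without deserializing or
# understanding the data types.

class DeterministicSerializer:
    deterministic = True  # promise: equal inputs -> identical bytes

    def dumps(self, obj):
        # pickle is NOT deterministic in general; it is adequate for this
        # toy example with small ints/strings in a single process.
        return pickle.dumps(obj)

def group_by_serialized_key(records, serializer):
    if not serializer.deterministic:
        raise ValueError("cannot group by bytes: serializer is not deterministic")
    groups = defaultdict(list)
    for key, value in records:
        groups[serializer.dumps(key)].append(value)
    return groups

ser = DeterministicSerializer()
groups = group_by_serialized_key([("a", 1), ("b", 2), ("a", 3)], ser)
print(sorted(len(v) for v in groups.values()))  # -> [1, 2]
```

Grouping by bytes is what lets the service merge/combine inputs so reducers avoid spilling, which is the value proposition described in the quoted email.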

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-10-14 Thread John Zhuge
> spec, if there is a view named "a", we can't create a table named "a" > anymore. > > We can add documents and ask the implementation to guarantee it, but it's > better if this can be guaranteed by the API. > > On Wed, Aug 14, 2019 at 1:46 AM John Zhuge wrote: > >

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread John Zhuge
845 Support specification of column names in INSERT INTO >>>> SPARK-24417 Build and Run Spark on JDK11 >>>> SPARK-24724 Discuss necessary info and access in barrier mode + >>>> Kubernetes >>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos >>>> SPARK-25074 Implement maxNumConcurrentTasks() in >>>> MesosFineGrainedSchedulerBackend >>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 >>>> SPARK-25186 Stabilize Data Source V2 API >>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier >>>> execution mode >>>> SPARK-25390 data source V2 API refactoring >>>> SPARK-7768 Make user-defined type (UDT) API public >>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition >>>> Spec >>>> SPARK-15691 Refactor and improve Hive support >>>> SPARK-15694 Implement ScriptTransformation in sql/core >>>> SPARK-16217 Support SELECT INTO statement >>>> SPARK-16452 basic INFORMATION_SCHEMA support >>>> SPARK-18134 SQL: MapType in Group BY and Joins not working >>>> SPARK-18245 Improving support for bucketed table >>>> SPARK-19842 Informational Referential Integrity Constraints Support in >>>> Spark >>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested >>>> list of structures >>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to >>>> respect session timezone >>>> SPARK-22386 Data Source V2 improvements >>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN >>>> >>>> - >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> >>>> >>>> > > -- > Name : Jungtaek Lim > Blog : http://medium.com/@heartsavior > Twitter : http://twitter.com/heartsavior > LinkedIn : http://www.linkedin.com/in/heartsavior > -- John Zhuge

Re: Welcoming some new committers and PMC members

2019-09-09 Thread John Zhuge
> > > > > -- > Shane Knapp > UC Berkeley EECS Research / RISELab Staff Technical Lead > https://rise.cs.berkeley.edu > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- John Zhuge

Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-28 Thread John Zhuge
gt; >>>> >> > If you're working in PySpark you can set up a virtual env and >>>> install >>>> >> > the current RC and see if anything important breaks, in the >>>> Java/Scala >>>> >> > you can add the staging repository to your projects resolvers and >>>> test >>>> >> > with the RC (make sure to clean up the artifact cache before/after >>>> so >>>> >> > you don't end up building with a out of date RC going forward). >>>> >> > >>>> >> > === >>>> >> > What should happen to JIRA tickets still targeting 2.3.4? >>>> >> > === >>>> >> > >>>> >> > The current list of open tickets targeted at 2.3.4 can be found at: >>>> >> > https://issues.apache.org/jira/projects/SPARKand search for >>>> "Target Version/s" = 2.3.4 >>>> >> > >>>> >> > Committers should look at those and triage. Extremely important bug >>>> >> > fixes, documentation, and API tweaks that impact compatibility >>>> should >>>> >> > be worked on immediately. Everything else please retarget to an >>>> >> > appropriate release. >>>> >> > >>>> >> > == >>>> >> > But my bug isn't fixed? >>>> >> > == >>>> >> > >>>> >> > In order to make timely releases, we will typically not hold the >>>> >> > release unless the bug in question is a regression from the >>>> previous >>>> >> > release. That being said, if there is something which is a >>>> regression >>>> >> > that has not been correctly targeted please ping me or a committer >>>> to >>>> >> > help target the issue. >>>> >> > >>>> >> >>>> >> - >>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >> >>>> >>>> - >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> -- John Zhuge

Re: Release Spark 2.3.4

2019-08-16 Thread John Zhuge
ithub from the last release: >>>> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3 >>>> > The 8 correctness issues resolved in branch-2.3: >>>> > >>>> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC >>>> > >>>> > Best Regards, >>>> > Kazuaki Ishizaki >>>> >>>> - >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> >> >> -- >> --- >> Takeshi Yamamuro >> > > > -- > [image: Databricks Summit - Watch the talks] > <https://databricks.com/sparkaisummit/north-america> > -- John Zhuge

Re: [DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread John Zhuge
to know why you're > proposing `softwareVersion` in the view definition. > > On Tue, Aug 13, 2019 at 8:56 AM John Zhuge wrote: > >> Catalog support has been added to DSv2 along with a table catalog >> interface. Here I'd like to propose a view catalog interface, for the >> f

[DISCUSS] ViewCatalog interface for DSv2

2019-08-13 Thread John Zhuge
: - name - originalSql - defaultCatalog - defaultNamespace - viewColumns - owner - createTime - softwareVersion - options (map) ViewColumn interface: - name - type Thanks, John Zhuge
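A rough stand-in for the proposed interfaces, using the field names listed above. The types, defaults, and example values are illustrative assumptions, and the actual proposal targets a Java/Scala interface rather than Python.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Python stand-in for the proposed View/ViewColumn interfaces; field names
# follow the proposal above, everything else (types, example values) is
# an illustrative assumption.

@dataclass
class ViewColumn:
    name: str
    type: str  # e.g. "bigint", "string"

@dataclass
class View:
    name: str
    originalSql: str
    defaultCatalog: str
    defaultNamespace: List[str]
    viewColumns: List[ViewColumn]
    owner: str
    createTime: int          # epoch millis (illustrative)
    softwareVersion: str     # engine version that created the view
    options: Dict[str, str] = field(default_factory=dict)

v = View(
    name="daily_events",
    originalSql="SELECT id, ts FROM events WHERE ts > date'2020-01-01'",
    defaultCatalog="prod",
    defaultNamespace=["analytics"],
    viewColumns=[ViewColumn("id", "bigint"), ViewColumn("ts", "timestamp")],
    owner="jzhuge",
    createTime=1597708800000,
    softwareVersion="spark-3.0.0",
)
print([c.name for c in v.viewColumns])  # -> ['id', 'ts']
```

Keeping `defaultCatalog`/`defaultNamespace` with the view lets the engine resolve the stored SQL the same way it was resolved at creation time.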

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-18 Thread John Zhuge
+1 (non-binding) Great work! On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh wrote: > +1 (non-binding). > > > > Thanks for pushing this forward, Matt and Yifei. > > > > *From: *Felix Cheung > *Date: *Tuesday, June 18, 2019 at 00:01 > *To: *Yinan Li , "rb...@netflix.com" < > rb...@netflix.com> >

Re: Why hint does not traverse down subquery alias

2019-06-11 Thread John Zhuge
1, 2019 at 8:04 PM Maryann Xue > wrote: > >> I believe in the SQL standard, the original name cannot be accessed once >> it’s aliased. >> >> On Tue, Jun 11, 2019 at 7:54 PM John Zhuge wrote: > >>> Yeah, it is a tough scenario. >>> >>> I actu

Re: Why hint does not traverse down subquery alias

2019-06-11 Thread John Zhuge
, b from s) t join (select a, b > from t) s on t1.a = t2.b > > If we allowed the hint resolving to "cross" the scopes, we'd end up with a > really confusing spec. > > > Thanks, > Maryann > > On Tue, Jun 11, 2019 at 5:26 PM John Zhuge wrote: > &g

Why hint does not traverse down subquery alias

2019-06-11 Thread John Zhuge
Hi Reynold and Maryann, ResolveHints javadoc indicates the traversal does not go past subquery alias. Is there any specific reason? Thanks, John Zhuge

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread John Zhuge
> > Please vote in the next 3 days. > > [ ] +1: Accept the proposal as an official SPIP > > [ ] +0 > > [ ] -1: I don't think this is a good idea because ... > > Thanks! > > -- > > Ryan Blue > > Software Engineer > > Netflix > > -- John Zhuge

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-18 Thread John Zhuge
> [ ] -1: I don't think this is a good idea because ... > > > > > > > > > Thanks! > > > > > > rb > > > > > > -- > > > Ryan Blue > > > Software Engineer > > > Netflix > > > > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- John Zhuge

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-09 Thread John Zhuge
and search for "Target >>>> Version/s" = 2.3.3 >>>> > >>>> > Committers should look at those and triage. Extremely important bug >>>> > fixes, documentation, and API tweaks that impact compatibility should >>>> > be worked on immediately. Everything else please retarget to an >>>> > appropriate release. >>>> > >>>> > == >>>> > But my bug isn't fixed? >>>> > == >>>> > >>>> > In order to make timely releases, we will typically not hold the >>>> > release unless the bug in question is a regression from the previous >>>> > release. That being said, if there is something which is a regression >>>> > that has not been correctly targeted please ping me or a committer to >>>> > help target the issue. >>>> > >>>> > P.S. >>>> > I checked all the tests passed in the Amazon Linux 2 AMI; >>>> > $ java -version >>>> > openjdk version "1.8.0_191" >>>> > OpenJDK Runtime Environment (build 1.8.0_191-b12) >>>> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode) >>>> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos >>>> -Psparkr test >>>> > >>>> > -- >>>> > --- >>>> > Takeshi Yamamuro >>>> >>>> >>>> >>>> -- >>>> Marcelo >>>> >>>> - >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>> >>>> > > -- > --- > Takeshi Yamamuro > -- John Zhuge

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-07 Thread John Zhuge
gt;> > fixes, documentation, and API tweaks that impact compatibility should >> >> > be worked on immediately. Everything else please retarget to an >> >> > appropriate release. >> >> > >> >> > == >> >> > But my bug isn't fixed? >> >> > == >> >> > >> >> > In order to make timely releases, we will typically not hold the >> >> > release unless the bug in question is a regression from the previous >> >> > release. That being said, if there is something which is a regression >> >> > that has not been correctly targeted please ping me or a committer to >> >> > help target the issue. >> >> > >> >> > P.S. >> >> > I checked all the tests passed in the Amazon Linux 2 AMI; >> >> > $ java -version >> >> > openjdk version "1.8.0_191" >> >> > OpenJDK Runtime Environment (build 1.8.0_191-b12) >> >> > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode) >> >> > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos >> -Psparkr test >> >> > >> >> > -- >> >> > --- >> >> > Takeshi Yamamuro >> >> >> >> - >> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> -- John Zhuge

Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
Thx Xiao! On Mon, Feb 4, 2019 at 9:04 AM Xiao Li wrote: > Thank you, Imran! > > Also, I attached the slides of "Deep Dive: Scheduler of Apache Spark". > > Cheers, > > Xiao > > > > John Zhuge wrote on Mon, Feb 4, 2019 at 8:59 AM: > >> Thanks Imran! >>

Re: scheduler braindump: architecture, gotchas, etc.

2019-02-04 Thread John Zhuge
even more that > should be discussed, & mistakes I've made. All input welcome. > > > https://docs.google.com/document/d/1oiE21t-8gXLXk5evo-t-BXpO5Hdcob5D-Ps40hogsp8/edit?usp=sharing > -- John Zhuge

Re: [VOTE] SPARK 2.2.3 (RC1)

2019-01-11 Thread John Zhuge
e it's not a regression from 2.2.2 either. >>>> >>>> On Thu, Jan 10, 2019 at 6:37 AM Takeshi Yamamuro >>>> wrote: >>>> > >>>> > Hi, Dongjoon, >>>> > >>>> > We don't need to include https://github.com/apache/spark/pull/23456 >>>> in this release? >>>> > The query there fails in v2.x while it passes in v1.6. >>>> > >>>> >>> >> >> -- >> --- >> Takeshi Yamamuro >> > -- John Zhuge

Re: DataSourceV2 hangouts sync

2018-10-25 Thread John Zhuge
> For the first one, I was thinking some day next week (time TBD by those > interested) and starting off with a general roadmap discussion before > diving into specific technical topics. > > Thanks, > > rb > > -- > Ryan Blue > Software Engineer > Netflix > -- John Zhuge

Re: Timestamp Difference/operations

2018-10-12 Thread John Zhuge
Yeah, the "-" operator does not seem to be supported; however, you can use the "datediff" function: In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP)) Out[9]:
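For reference, `datediff(end, start)` compares the date parts of its arguments (time-of-day is truncated) and returns a whole-day count, so the query above should return 31. The same arithmetic with Python's standard library, as a sketch of the semantics rather than a run through Spark:

```python
from datetime import date

# datediff(end, start) in Spark SQL counts whole days between the DATE
# parts of its arguments; the time-of-day components are dropped.
# Reproducing the example query's arithmetic with the standard library:
end = date(2000, 2, 1)    # date part of TIMESTAMP '2000-02-01 12:34:34'
start = date(2000, 1, 1)  # date part of TIMESTAMP '2000-01-01 00:00:00'
print((end - start).days)  # -> 31
```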

Re: from_csv

2018-09-19 Thread John Zhuge
usually error >>>>> prone especially for quoted values and other special cases. >>>>> >>>>> The methods proposed in the PR should make a better user experience in >>>>> parsing CSV-like columns. Please, share your thoughts. >>>>> >>>>> -- >>>>> >>>>> Maxim Gekk >>>>> >>>>> Technical Solutions Lead >>>>> >>>>> Databricks Inc. >>>>> >>>>> maxim.g...@databricks.com >>>>> >>>>> databricks.com >>>>> >>>> >>> > > -- > *Dongjin Lee* > > *A hitchhiker in the mathematical world.* > > *github: github.com/dongjinleekr > linkedin: kr.linkedin.com/in/dongjinleekr > slideshare: > www.slideshare.net/dongjinleekr* > -- John Zhuge

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread John Zhuge
+1 (non-binding) Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) On

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread John Zhuge
+1 on SPARK-25004. We have found it quite useful to diagnose PySpark OOM. On Tue, Aug 7, 2018 at 1:21 PM Holden Karau wrote: > I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon), > but solving some of the consistent Python memory issues we've had for years > would be

Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
BlockMissingException typically indicates the HDFS file is corrupted. This might be an HDFS issue; the Hadoop mailing list is a better bet: u...@hadoop.apache.org. Capture the full stack trace in the executor log. If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693` to determine
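When the missing block turns out to be transient (e.g. a datanode restarting) rather than real corruption, a bounded retry around the read is a common mitigation. A generic sketch only, not a PySpark API; all names here are illustrative:

```python
import time

# Generic retry sketch for reads that can fail transiently (e.g. an HDFS
# BlockMissingException while a datanode restarts). If the error persists
# across attempts, it is likely real corruption -- surface it and check the
# file with `hdfs fsck` as suggested above. All names are illustrative.

def read_with_retry(read_fn, attempts=3, backoff_s=1.0):
    last_err = None
    for i in range(attempts):
        try:
            return read_fn()
        except IOError as e:  # stand-in for the Hadoop-side exception
            last_err = e
            time.sleep(backoff_s * (2 ** i))  # exponential backoff
    raise last_err

# Simulated flaky read: fails twice, then succeeds.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("BlockMissingException: blk_1233169822_159765693")
    return "data"

print(read_with_retry(flaky_read, attempts=3, backoff_s=0.0))  # -> data
```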

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread John Zhuge
Great help from the community! On Sun, Aug 5, 2018 at 6:17 PM Xiao Li wrote: > FYI, the new hints have been merged. They will be available in the > upcoming release (Spark 2.4). > > *John Zhuge*, thanks for your work! Really appreciate it! Please submit > more PRs and help the co
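For context, the hints discussed in this thread landed in Spark 2.4 as the SQL `COALESCE` and `REPARTITION` hints, e.g. `SELECT /*+ COALESCE(3) */ * FROM t`. COALESCE can cut the number of output files cheaply because it packs existing partitions into contiguous groups instead of shuffling every row. A toy model of that packing (not Spark code; the grouping policy is a simplified assumption):

```python
# Sketch of why COALESCE reduces output files without a shuffle: it merges
# existing partitions into contiguous groups, while REPARTITION
# redistributes every row (a full shuffle). Toy model, not Spark code.

def coalesce_groups(num_partitions, target):
    """Assign each old partition index to one of `target` new partitions,
    keeping groups contiguous (roughly how a no-shuffle coalesce packs)."""
    base, rem = divmod(num_partitions, target)
    groups, start = [], 0
    for i in range(target):
        size = base + (1 if i < rem else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

# 10 input partitions -> 3 output files; no row crosses a group boundary:
print(coalesce_groups(10, 3))
# -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The trade-off is that coalesced tasks can be skewed if the input partitions were uneven, which is when the shuffle-based REPARTITION hint is the better choice.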

Re: [DISCUSS][SQL] Control the number of output files

2018-08-05 Thread John Zhuge
hanism, or whether it is >>>> possible, but I think it is worth considering such things at a fairly high >>>> level of abstraction and try to unify and simplify before making things >>>> more complex with multiple policy mechanisms. >>>> >>>

Re: [DISCUSS][SQL] Control the number of output files

2018-07-26 Thread John Zhuge
t a patch for this? If there is a > coalesce hint, inject a coalesce logical node. Pretty simple. > > > On Wed, Jul 25, 2018 at 2:48 PM John Zhuge wrote: > >> Thanks for the comment, Forest. What I am asking is to make whatever DF >> repartition/coalesce functionalities available to SQL users. >>

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
plex with multiple policy mechanisms. >>> >>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin wrote: >>> >>>> Seems like a good idea in general. Do other systems have similar >>>> concepts? In general it'd be easier if we can follow existing convention if >>

[DISCUSS][SQL] Control the number of output files

2018-07-25 Thread John Zhuge
is not the same as SPARK-6221 that asked for auto-merging output files. Thanks, John Zhuge

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-18 Thread John Zhuge
taking >>>>> an existing Spark workload and running on this release candidate, then >>>>> reporting any regressions. >>>>> >>>>> If you're working in PySpark you can set up a virtual env and install >>>>> the current RC and see if anything important breaks, in the Java/Scala >>>>> you can add the staging repository to your projects resolvers and test >>>>> with the RC (make sure to clean up the artifact cache before/after so >>>>> you don't end up building with a out of date RC going forward). >>>>> >>>>> === >>>>> What should happen to JIRA tickets still targeting 2.3.2? >>>>> === >>>>> >>>>> The current list of open tickets targeted at 2.3.2 can be found at: >>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target >>>>> Version/s" = 2.3.2 >>>>> >>>>> Committers should look at those and triage. Extremely important bug >>>>> fixes, documentation, and API tweaks that impact compatibility should >>>>> be worked on immediately. Everything else please retarget to an >>>>> appropriate release. >>>>> >>>>> == >>>>> But my bug isn't fixed? >>>>> == >>>>> >>>>> In order to make timely releases, we will typically not hold the >>>>> release unless the bug in question is a regression from the previous >>>>> release. That being said, if there is something which is a regression >>>>> that has not been correctly targeted please ping me or a committer to >>>>> help target the issue. >>>>> >>>>> -- >>>>> John Zhuge >>>>> >>>>

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-17 Thread John Zhuge
the next 72 hours: >>> >>> [+1]: Spark should adopt the SPIP >>> [-1]: Spark should not adopt the SPIP because . . . >>> >>> Thanks for voting, everyone! >>> >>> -- >>> Ryan Blue >>> >> >> >> -- >> Ryan Blue >> >> -- >> John Zhuge >> >

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-10 Thread John Zhuge
+1 On Sun, Jul 8, 2018 at 1:30 AM Saisai Shao wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.2. > > The vote is open until July 11th PST and passes if a majority +1 PMC votes > are cast, with a minimum of 3 +1 votes. > > [ ] +1 Release this package as

Re: Time for 2.3.2?

2018-06-29 Thread John Zhuge
;> > This is a correctness bug in a new feature of Spark 2.3: the >>>>>>> stream-stream >>>>>>> > join. Users can hit this bug if one of the join side is >>>>>>> partitioned by a >>>>>>> > subset of the join keys. >>>>>>> > >>>>>>> > SPARK-24552: Task attempt numbers are reused when stages are >>>>>>> retried >>>>>>> > This is a long-standing bug in the output committer that may >>>>>>> introduce data >>>>>>> > corruption. >>>>>>> > >>>>>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted >>>>>>> XML to >>>>>>> > access arbitrary files >>>>>>> > This is a potential security issue if users build access control >>>>>>> module upon >>>>>>> > Spark. >>>>>>> > >>>>>>> > I think we need a Spark 2.3.2 to address these issues(especially >>>>>>> the >>>>>>> > correctness bugs) ASAP. Any thoughts? >>>>>>> > >>>>>>> > Thanks, >>>>>>> > Wenchen >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Marcelo >>>>>>> >>>>>>> - >>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>> >>>>>>> >>> >>> -- >>> --- >>> Takeshi Yamamuro >>> >> >> > > -- > Ryan Blue > Software Engineer > Netflix > > -- > John Zhuge >

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread John Zhuge
+1 On Sun, Jun 3, 2018 at 6:12 PM, Hyukjin Kwon wrote: > +1 > > On Sun, Jun 3, 2018 at 9:25 PM, Ricardo Almeida > wrote: > >> +1 (non-binding) >> >> On 3 June 2018 at 09:23, Dongjoon Hyun wrote: >> >>> +1 >>> >>> Bests, >>> Dongjoon. >>> >>> On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee wrote: >>>