Re: [VOTE] SPIP: Monthly preview release

2025-07-04 Thread Reynold Xin
+1 On Thu, Jul 3, 2025 at 10:54 PM Peter Toth wrote: > +1 > > On Fri, Jul 4, 2025 at 6:30 AM Ruifeng Zheng wrote: > >> +1 >> >> On Fri, Jul 4, 2025 at 10:17 AM John Zhuge wrote: >> >>> +1 (non-binding) >>> >>> John Zhuge >>> >>> >>> On Thu, Jul 3, 2025 at 1:47 PM Jungtaek Lim < >>> kabhwan.op

Re: [VOTE] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-06-02 Thread Reynold Xin
+1 On Mon, Jun 2, 2025 at 7:10 PM Kent Yao wrote: > +1 > > On Mon, Jun 2, 2025 at 23:00, Sandy Ryza wrote: > >> +1 (non-binding) >> >> On Mon, Jun 2, 2025 at 7:34 AM Chao Sun wrote: >> >>> +1 >>> >>> On Mon, Jun 2, 2025 at 7:31 AM Jungtaek Lim < >>> kabhwan.opensou...@gmail.com> wrote: >>> +1 (non-bindi

Re: [VOTE] Release Apache Spark Connect Swift Client 0.3.0 (RC1)

2025-06-02 Thread Reynold Xin
+1 On Mon, Jun 2, 2025 at 5:45 PM Xinrong Meng wrote: > +1 > > Thank you Dongjoon! > > On Mon, Jun 2, 2025 at 4:29 PM Jungtaek Lim > wrote: > >> +1 (non-binding) >> >> On Tue, Jun 3, 2025 at 12:18 AM Denny Lee wrote: >> >>> +1 (non-binding) >>> >>> >>> On Mon, Jun 2, 2025 at 07:44 Sandy Ryza

Re: [VOTE] SPIP: Add geospatial types to Spark

2025-05-05 Thread Reynold Xin
+1 On Mon, May 5, 2025 at 12:37 PM Bjørn Jørgensen wrote: > +1 > > On Mon, May 5, 2025 at 21:28, Milan Stefanovic < > stefanovic.mila...@gmail.com> wrote: > >> +1 (non-binding) >> >> Thanks, >> Milan >> >> On Mon, 5 May 2025 at 21:25, Jia Yu wrote: >> >>> Thanks for putting this together. >>> >>>

Re: [VOTE] SPIP: Declarative Pipelines

2025-04-09 Thread Reynold Xin
+1 (binding) Super exciting to have this! On Wed, Apr 9, 2025 at 9:43 AM Jules Damji wrote: > +1 (non-binding) > > Excuse the thumb typos > > > On Wed, 09 Apr 2025 at 7:22 AM, Sandy Ryza wrote: > >> We started to get some votes on the discussion thread, so I'd like to >> move to a formal vote

Re: [DISCUSS] SPIP: Add geospatial types to Spark

2025-03-29 Thread Reynold Xin
While I don’t think Spark should become a super specialized geospatial processing engine, I don’t think it makes sense to focus *only* on reading and writing from storage. Geospatial is a pretty common and fundamental capability of analytics systems and virtually every mature and popular analytics

Re: Revert of [SPARK-51229][BUILD][CONNECT] Fix dependency:analyze goal on connect common

2025-03-26 Thread Reynold Xin
> > Vlad > > > On Mar 25, 2025, at 11:06 PM, Reynold Xin > wrote: > > Sorry Vlad - I disagree. Where is the simple fix? As a new contributor, > you should not be coming in guns blazing blaming committers who are trying > to keep the master branch sane and clean. > >

Re: Revert of [SPARK-51229][BUILD][CONNECT] Fix dependency:analyze goal on connect common

2025-03-25 Thread Reynold Xin
ucible steps. I assume that > steps on the PR is what needs to be fixed, but it will be better to avoid > guessing. > > Thank you, > > Vlad > > On Mar 25, 2025, at 10:05 PM, Reynold Xin > wrote: > > Is there a fix already available or a very simple fix a committe

Re: Revert of [SPARK-51229][BUILD][CONNECT] Fix dependency:analyze goal on connect common

2025-03-25 Thread Reynold Xin
Is there a fix already available or a very simple fix a committer can create quickly? If yes, we can merge the fix. If there isn't, for a major functionality-breaking change, we should just revert. That's fairly basic software engineering practice. On Tue, Mar 25, 2025 at 9:53 PM Hyukjin Kwon wro

Re: [DISCUSS] SPARK-51318: Remove `jar` files from Apache Spark repository and disable affected tests

2025-03-25 Thread Reynold Xin
While I'd love to resolve this issue, I still don't understand why we would block the release for this. On Tue, Mar 25, 2025 at 7:49 AM Rozov, Vlad wrote: > The difference is in how tests are disabled. > > - the approach encourages keeping jar files in the Apache Spark repo > - it is

Re: [Discuss] SPIP: Support NanoSecond Timestamps

2025-03-17 Thread Reynold Xin
Pretty much anything (say vs current timestamp operations in Spark). On Mon, Mar 17, 2025 at 2:51 PM serge rielau.com wrote: > What are you comparing performance against? > On Mar 17, 2025 at 11:54 AM -0700, Reynold Xin , > wrote: > > Any thoughts on how to deal with performance

Re: [Discuss] SPIP: Support NanoSecond Timestamps

2025-03-17 Thread Reynold Xin
Any thoughts on how to deal with performance here? Initially we didn't do nano level precision because of performance (would not be able to fit everything into a 64 bit int). On Mon, Mar 17, 2025 at 11:34 AM Sakthi wrote: > +1 (non-binding) > > On Mon, Mar 17, 2025 at 11:32 AM Zhou Jiang > wrot
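The 64-bit concern above can be checked with a little arithmetic. This is only a sketch: it assumes a timestamp is stored as a signed 64-bit count of ticks since the Unix epoch, and the helper name `years_representable` is made up for illustration, not anything in Spark.

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1                      # largest signed 64-bit value
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # average year length, good enough here


def years_representable(ticks_per_second: int) -> float:
    """Years on each side of the epoch that fit in a signed 64-bit tick count."""
    return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


micro_years = years_representable(10**6)   # microsecond ticks
nano_years = years_representable(10**9)    # nanosecond ticks

# Latest instant reachable with nanosecond ticks, rounded down to microseconds
# because datetime itself has no finer resolution.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
nano_upper_bound = epoch + timedelta(microseconds=INT64_MAX // 1000)

print(f"microseconds: about {micro_years:,.0f} years either side of 1970")
print(f"nanoseconds:  about {nano_years:,.0f} years either side of 1970")
print(f"nanosecond range ends at {nano_upper_bound.isoformat()}")
```

With nanosecond ticks the range shrinks by a factor of 1000, so a single 64-bit value no longer covers the full span of dates a SQL timestamp is expected to hold.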

Re: [VOTE] Technical Justification for the veto of the "Retain migration logic..." code change proposal is not valid

2025-03-16 Thread Reynold Xin
Thanks Mark for starting this. +1 and agree with your reasoning. Wearing the Apache Spark PMC hat, I think having a few lines of straightforward logic to ease users' migrations is a no-brainer to do. Imagine how confused a user would be when they upgraded to 4.0 and things stopped working in a way

Re: [VOTE] SPIP: Add the TIME data type

2025-02-23 Thread Reynold Xin
+1 On Sun, Feb 23, 2025 at 7:51 AM Max Gekk wrote: > Hi Spark devs, > > Following the discussion [1], I'd like to start the vote for the SPIP [2]. > The SPIP aims to add a new data type TIME to Spark SQL types. New type > should conform to TIME(n) WITHOUT TIME ZONE as defined by the SQL > standa

Re: [VOTE] Release Apache Spark 3.5.5 deprecating `spark.databricks.*` configuration

2025-02-19 Thread Reynold Xin
+1 On Wed, Feb 19, 2025 at 2:02 PM Zhou Jiang wrote: > > +1 (non-binding) > > On Feb 19, 2025, at 04:20, Peter Toth wrote: > >  > +1 > > On Wed, Feb 19, 2025 at 10:20 AM Max Gekk wrote: > >> +1 >> >> On Wed, Feb 19, 2025 at 9:15 AM L. C. Hsieh wrote: >> >>> +1 >>> >>> On Tue, Feb 18, 2025 at

Re: Spark Website Styling Issues Partially Resolved

2025-02-06 Thread Reynold Xin
Thanks for fixing these! On Thu, Feb 6, 2025 at 4:40 PM Gengliang Wang wrote: > Hi all, > > The Spark website styling was recently broken due to a violation of > Content Security Policy (CSP). I have fixed the main website ( > spark.apache.org) and the latest documentation site (Spark 3.5.4 docs

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-02-05 Thread Reynold Xin
There's already one here https://issues.apache.org/jira/browse/SPARK-46057 On Wed, Feb 5, 2025 at 5:16 PM Soumasish wrote: > Here I create one, https://issues.apache.org/jira/browse/SPARK-51102 > > Best Regards > Soumasish Goswami > in: www.linkedin.com/in/soumasish > # (415) 530-0405 > >-

Re: A documentation change is a user-facing change

2025-01-16 Thread Reynold Xin
Seems like we should fix the template if that's not the intent On Thu, Jan 16, 2025 at 1:52 PM Nicholas Chammas wrote: > The template says "including all aspects such as the documentation fix >

Re: [DISCUSS] Pythonic approach of setting Spark SQL configurations

2024-12-26 Thread Reynold Xin
I actually think this might be confusing (just in general adding too many different ways to do the same thing is also un-Pythonic). On Thu, Dec 26, 2024 at 4:58 PM Hyukjin Kwon wrote: > Hi all, > > I hope you guys are enjoying the holiday season. I just wanted to have > some quick feedback about

Re: Shuffle TTLs

2024-10-16 Thread Reynold Xin
Thanks for bringing this up. Wouldn't it be better for the notebooks to control when these DFs/RDDs expire so they can have fine-grained control? On Wed, Oct 16, 2024 at 7:51 AM Holden Karau wrote: > Hi Spark Devs, > > So back in Spark 1.X we had shuffle TTLs, but they did not take into > account

Re: Dev list policy on posting genAI hallucinations

2024-10-09 Thread Reynold Xin
FWIW - Mich - I've often found your responses "gpt"-like, and they can often be a distraction. Now I don't know if that's your actual writing style or you were indeed using genai tools to generate the responses on your behalf. I don't think we should sanction you if that's your writing style. But if you

Re: [VOTE] Single-pass Analyzer for Catalyst

2024-09-30 Thread Reynold Xin
> > Thanks, > Dongjoon. > > > On 2024/09/30 17:51:24 Herman van Hovell wrote: > > +1 > > > > On Mon, Sep 30, 2024 at 8:29 AM Reynold Xin > > > wrote: > > > > > +1 > > > > > > On Mon, Sep 30, 2024 at 6:47 AM Vladimir Go

Re: [VOTE] Single-pass Analyzer for Catalyst

2024-09-30 Thread Reynold Xin
+1 On Mon, Sep 30, 2024 at 6:47 AM Vladimir Golubev wrote: > Hi all, > > I’d like to start a vote for a single-pass Analyzer for the Catalyst > project. This project will introduce a new analysis framework to the > Catalyst, which will eventually replace the fixed-point one. > > Please refer to

Re: [DISCUSS] [Spark SQL] Single-pass Analyzer SPIP

2024-09-19 Thread Reynold Xin
Great document! Thanks for writing it up. On Tue, Sep 10, 2024 at 10:00 AM Vladimir Golubev wrote: > Hey folks, following up on the recent single-pass Analyzer discussion. I > made a high-level proposal document for this idea: > https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90

Re: [VOTE] Move Variant to Parquet

2024-09-02 Thread Reynold Xin
+1 On Mon, Sep 2, 2024 at 9:30 AM Gene Pang wrote: > Hi all, > > I’d like to start a vote for moving the Variant specification and library > to the Parquet project. This allows the Variant binary format and shredding > format to be more widely used by other interested projects and systems. > > P

Re: [VOTE] Deprecate SparkR

2024-08-21 Thread Reynold Xin
+1 On Wed, Aug 21, 2024 at 6:42 PM Shivaram Venkataraman < shivaram.venkatara...@gmail.com> wrote: > Hi all > > Based on the previous discussion thread [1], I hereby call a vote to > deprecate the SparkR module in Apache Spark with the upcoming Spark 4 > release and remove it in the next major re

Re: [DISCUSS] [Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

2024-08-20 Thread Reynold Xin
+1 on this too. When I implemented "group by all", I introduced at least two subtle bugs that many reviewers weren't able to catch, and those two bugs would not have been possible to introduce if we had a single-pass analyzer. Single pass can make the whole framework more robust. On Tue, Aug 2

Re: [DISCUSS] Move Variant to Parquet?

2024-08-19 Thread Reynold Xin
As I said on dev@iceberg, it'd be really unfortunate if we end up with two or even more diverging specs for storing variants. It just adds more work for everybody to interop. Parquet would be a great home for this spec as a neutral project that almost all the other important projects in this space

Re: Welcoming a new PMC member

2024-08-14 Thread Reynold Xin
Congratulations Kent! On Wed, Aug 14, 2024 at 10:10 AM Kent Yao wrote: > Thank you all very much! > > Kent > > On 2024/08/13 17:12:54 Matei Zaharia wrote: > > Congrats and welcome Kent! > > > > > On Aug 13, 2024, at 7:27 AM, Wenchen Fan wrote: > > > > > > Congratulations! > > > > > > On Tue, Au

Re: [DISCUSS] Deprecating SparkR

2024-08-13 Thread Reynold Xin
+1 It’s actually great that projects outside Spark’s repo can be more successful than the projects inside. A testament to both Spark itself and Spark Connect! On Tue, Aug 13, 2024 at 10:00 AM Martin Grund wrote: > +1 > > On Tue, Aug 13, 2024 at 7:26 AM Ruifeng Zheng wrote: > >> +1 >> >> On Tue

Re: [VOTE] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-12 Thread Reynold Xin
+1 On Mon, Aug 12, 2024 at 10:28 AM Mich Talebzadeh wrote: > +1 for me > > Mich Talebzadeh, > > Architect | Data Engineer | Data Science | Financial Crime > PhD Imperial College > London

Re: [DISCUSS] Using Github Issues for Spark-Connect-Go _only_ issues.

2024-08-08 Thread Reynold Xin
I'd love that too. But maybe we can start small and try it out with one project ... On Thu, Aug 8, 2024 at 7:16 AM Sean Owen wrote: > Oh nice if that has changed. I'd personally prefer switching all of Spark > to GitHub issues for simplicity but maybe that's a big lift. And a separate > question.

Re: [VOTE] Allow GitHub Actions runs for contributors' PRs without approvals in apache/spark-connect-go

2024-07-08 Thread Reynold Xin
+1 On Mon, Jul 8, 2024 at 7:44 PM haydn wrote: > +1 > > On Mon, Jul 8, 2024 at 7:41 PM haydn wrote: > >> +1 >> >> On Mon, Jul 8, 2024 at 19:41 Takuya UESHIN >> wrote: >> >>> +1 >>> >>> On Mon, Jul 8, 2024 at 6:05 PM Yuanjian Li >>> wrote: >>> +1 On Thu, Jul 4, 2024 at 16:54, Hyukjin Kwon

Re: [VOTE] Move Spark Connect server to builtin package (Client API layer stays external)

2024-07-03 Thread Reynold Xin
+1 On Wed, Jul 3, 2024 at 4:45 PM L. C. Hsieh wrote: > +1 > > On Wed, Jul 3, 2024 at 3:54 PM Dongjoon Hyun > wrote: > > > > +1 > > > > Dongjoon > > > > On Wed, Jul 3, 2024 at 10:58 Xinrong Meng wrote: > >> > >> +1 > >> > >> Thank you @Hyukjin Kwon ! > >> > >> On Wed, Jul 3, 2024 at 8:55 AM bo

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Reynold Xin
+1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: > +1 > > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun > wrote: > >> FYI, there is a proposal to drop Python 3.8 because its EOL is October >> 2024. >> >> https://github.com/apache/spark/pull/46228 >> [SPARK-47993][PYTHON] Drop Python 3.8

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that the ASF couldn't have officially blessed venues beyond the already approved ones. So that's something to look into. Now of course you are welcome to run unofficial things unblessed as long as they follow trademark r

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Reynold Xin
+1 On Mon, Mar 11 2024 at 7:38 PM, Jungtaek Lim < kabhwan.opensou...@gmail.com > wrote: > > +1 (non-binding), thanks Gengliang! > > > On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang < ltn...@gmail.com > wrote: > > > >> Hi all, >> >> I'd like to start the vote for SPIP: Structured Logging F

Re: [VOTE] SPIP: Testing Framework for Spark UI Javascript files

2023-11-24 Thread Reynold Xin
+1 On Fri, Nov 24, 2023 at 10:19 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > +1 > > > Thanks, > Dongjoon. > > On Fri, Nov 24, 2023 at 7:14 PM Ye Zhou < zhouye...@gmail.com > wrote: > > >> +1 (non-binding) >> >> On Fri, Nov 24, 2023 at 11:16 Mridul M

Re: [DISCUSS] SPIP: ShuffleManager short name registration via SparkPlugin

2023-11-04 Thread Reynold Xin
Why do we need this? The reason data source APIs need it is because it will be used by very unsophisticated end users and used all the time (for each connection / query). Shuffle is something you set up once, presumably by fairly sophisticated admins / engineers. On Sat, Nov 04, 2023 at 2:42 PM

Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Reynold Xin
It should be the same as SQL. Otherwise it takes away a lot of potential future optimization opportunities. On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com > wrote: > > I’ve always considered DataFrames to be logically equivalent to SQL tables > or queries. > >

Re: [VOTE][SPIP] Python Data Source API

2023-07-07 Thread Reynold Xin
+1! On Fri, Jul 7 2023 at 11:58 AM, Holden Karau < hol...@pigscanfly.ca > wrote: > > +1 > > > On Fri, Jul 7, 2023 at 9:55 AM huaxin gao < huaxin.ga...@gmail.com > wrote: > > > >> +1 >> >> >> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh < mich.talebza...@gmail.com >> > wrote: >> >> >>>

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-25 Thread Reynold Xin
Personally I'd love this, but I agree with some of the earlier comments that this should not be Python-specific (meaning I should be able to implement a data source in Python and then make it usable across all languages Spark supports). I think we should find a way to make this reusable beyond P

Re: [VOTE][SPIP] PySpark Test Framework

2023-06-21 Thread Reynold Xin
+1 This is a great idea. On Wed, Jun 21, 2023 at 8:29 AM, Holden Karau < hol...@pigscanfly.ca > wrote: > > I’d like to start with a +1, better Python testing tools integrated into > the project make sense. > > On Wed, Jun 21, 2023 at 8:11 AM Amanda Liu < amandastephanieliu@ gmail. com > ( aman

Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Reynold Xin
+1 On Thu, Jan 12, 2023 at 9:46 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > +1 for the proposal (guiding only without any code change). > > > Thanks, > Dongjoon. > > On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu < zsxw...@gmail.com > wrote: > > >> +1 >

Re: How can I get the same spark context in two different python processes

2022-12-12 Thread Reynold Xin
Spark Connect :) (It’s work in progress) On Mon, Dec 12 2022 at 2:29 PM, Kevin Su < pings...@gmail.com > wrote: > > Hey there, How can I get the same spark context in two different python > processes? > Let’s say I create a context in Process A, and then I want to use python > subprocess B to g

Re: Re: [VOTE][SPIP] Spark Connect

2022-06-15 Thread Reynold Xin
+1 super excited about this. I think it'd make Spark a lot more usable in application development and cloud setting: (1) Makes it easier to embed in applications with thinner client dependencies. (2) Easier to isolate user code vs system code in the driver. (3) Opens up the potential to upgrade

Re: Stickers and Swag

2022-06-14 Thread Reynold Xin
Nice! Going to order a few items myself ... On Tue, Jun 14, 2022 at 7:54 PM, Gengliang Wang < ltn...@gmail.com > wrote: > > FYI now you can find the shopping information on > https://spark.apache.org/community as well :) > > > > Gengliang > > > >

Re: Data correctness issue with Repartition + FetchFailure

2022-03-12 Thread Reynold Xin
This is why RoundRobinPartitioning shouldn't be used ... On Sat, Mar 12, 2022 at 12:08 PM, Jason Xu < jasonxu.sp...@gmail.com > wrote: > > Hi Spark community, > > I reported a data correctness issue in > https://issues.apache.org/jira/browse/SPARK-38388

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin
tl;dr: there's no easy way to implement aggregate expressions that'd require multiple passes over the data. It is simply not something that's supported, and doing so would be very costly. Would you be OK using approximate percentile? That's relatively cheap. On Mon, Dec 13, 2021 at 6:43 PM, Nichola
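To illustrate why an exact median is awkward for a single-pass aggregate, here is a plain-Python sketch. It is not Spark code: the `partitions` data and the helper names are invented for the example, and it only models the general idea of merging small per-partition states.

```python
from statistics import median

# Hypothetical data, split the way rows might be spread across tasks.
partitions = [[1, 2, 3], [4, 5, 6, 7, 100], [8]]


def partial_mean(rows):
    """Fixed-size partial state for a mean: (sum, count)."""
    return (sum(rows), len(rows))


def merge(a, b):
    """Combine two partial mean states without revisiting any rows."""
    return (a[0] + b[0], a[1] + b[1])


# One pass per partition, then a cheap merge of tiny states.
state = (0, 0)
for rows in partitions:
    state = merge(state, partial_mean(rows))
merged_mean = state[0] / state[1]

all_rows = [x for rows in partitions for x in rows]
exact_mean = sum(all_rows) / len(all_rows)

# The same trick does not work for a median: combining per-partition
# medians gives a different answer from the median of all rows.
median_of_medians = median(median(rows) for rows in partitions)
exact_median = median(all_rows)

print(merged_mean == exact_mean)          # the mean merges exactly
print(median_of_medians, exact_median)    # the two medians disagree
```

An exact median needs either all values held at once or another pass over the data, which is the cost that an approximate percentile avoids.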

Re: spark binary map

2021-10-16 Thread Reynold Xin
Read up on Unsafe here: https://mechanical-sympathy.blogspot.com/ On Sat, Oct 16, 2021 at 12:41 AM, Rohan Bajaj < rohanbaja...@gmail.com > wrote: > > In 2015 Reynold Xin made improvements to Spark and it was basically moving > some structures that were on the java heap and movin

Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-07 Thread Reynold Xin
+1 On Thu, Oct 07, 2021 at 11:54 PM, Yuming Wang < wgy...@gmail.com > wrote: > > +1 (non-binding). > > > On Fri, Oct 8, 2021 at 1:02 PM Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > >> +1 for Apache Spark 3.2.0 RC7. >> >> >> It looks good to me. I te

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Reynold Xin
+1. Would open up a huge persona for Spark. On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler < cutl...@gmail.com > wrote: > > +1 (non-binding) > > > On Fri, Mar 26, 2021 at 9:49 AM Maciej < mszymkiew...@gmail.com > wrote: > > >> +1 (nonbinding) >> >> >> >> On 3/26/21 3:52 PM, Hyukjin Kwon wr

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Reynold Xin
I don't think we should deprecate existing APIs. Spark's own Python API is relatively stable and not difficult to support. It has a pretty large number of users and existing code. It is also pretty easy for data engineers to learn. The pandas API is great for data science, but isn't that great for some

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-24 Thread Reynold Xin
+1 Correctness issues are serious! On Wed, Feb 24, 2021 at 11:08 AM, Mridul Muralidharan < mri...@gmail.com > wrote: > > That is indeed cause for concern. > +1 on extending the voting deadline until we finish investigation of this. > > > > > Regards, > Mridul > > > > On Wed, Feb 24, 2021

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Reynold Xin
Enrico - do feel free to reopen the PRs or email people directly, unless you are told otherwise. On Thu, Feb 18, 2021 at 9:09 AM, Nicholas Chammas < nicholas.cham...@gmail.com > wrote: > > On Thu, Feb 18, 2021 at 10:34 AM Sean Owen < sro...@gmail.com > wrote: > > >>

Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Reynold Xin
Late +1 On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh < vii...@gmail.com > wrote: > > > > Hi devs, > > > > Thanks for all the inputs. I think overall there are positive inputs in > Spark community about having RocksDB state store as external module. Then > let's go forward with this direc

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Reynold Xin
There's another thing that's not mentioned … it's primarily a problem for Scala. Due to static typing, we need a very large number of function overloads for the Scala version of each function, whereas in SQL/Python they are just one. There's a limit on how many functions we can add, and it also

Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-09 Thread Reynold Xin
Exciting & look forward to this! (And a late +1 vote that probably won't be counted) On Mon, Nov 09, 2020 at 2:37 PM, Allison Wang < allison.w...@databricks.com > wrote: > > > > Thanks everyone for voting! With 11 +1s and no -1s, this vote passes. > > > > +1s: > Mridul Muralidharan > Ange

Re: I'm going to be out starting Nov 5th

2020-10-31 Thread Reynold Xin
Take care Holden and best of luck with everything! On Sat, Oct 31 2020 at 10:21 AM, Holden Karau < hol...@pigscanfly.ca > wrote: > > Hi Folks, > > > Just a heads up so folks working on decommissioning or other areas I've > been active in don't block on me, I'm going to be out for at least a we

Re: Avoiding unnnecessary sort in FileFormatWriter/DynamicPartitionDataWriter

2020-09-04 Thread Reynold Xin
The issue is memory overhead. Writing files creates a lot of buffers (especially in columnar formats like Parquet/ORC). Even a few file handles and buffers per task can OOM the entire process easily. On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ < joaquin.guantergonzal...@telefonica.co

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Reynold Xin
Welcome all! On Tue, Jul 14, 2020 at 10:36 AM, Matei Zaharia < matei.zaha...@gmail.com > wrote: > > > > Hi all, > > > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new roles! The new committers are: > > > > - Huaxin Gao > - Jun

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Reynold Xin
+1 on doing a new patch release soon. I saw some of these issues when preparing the 3.0 release, and some of them are very serious. On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu > wrote: > > > > +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.

Re: Removing references to slave (and maybe in the future master)

2020-06-18 Thread Reynold Xin
Thanks for doing this. I think this is a great thing to do. But we gotta be careful with API compatibility. On Thu, Jun 18, 2020 at 11:32 AM, Holden Karau < hol...@pigscanfly.ca > wrote: > > Hi Folks, > > > I've started working on cleaning up the Spark code to remove references to > slave sin

[ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Reynold Xin
Hi all, Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of the innovations from Spark 2.x, bringing new ideas as well as continuing long-term projects that have been in development. This release resolves more than 3400 tickets. We'd like to thank our contributors and

Re: [vote] Apache Spark 3.0 RC3

2020-06-17 Thread Reynold Xin
...@yahoo.com > wrote: > > Reynold, > > > What's the plan on pushing the official release binaries and source tar? > It would be nice to have the release artifacts now that it's available on > maven. > > > thanks, > Tom > > > On Monday, J

Re: [vote] Apache Spark 3.0 RC3

2020-06-15 Thread Reynold Xin
> release. > > > Thanks, > Dongjoon. > > > > On Tue, Jun 9, 2020 at 9:41 PM Matei Zaharia < matei.zaha...@gmail.com > wrote: > > >> Congrats! Excited to see the release posted soon. >> >> >>>

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Reynold Xin
g into a release at the time we cut the > branch. > > On Fri, Jun 12, 2020 at 10:28 PM Reynold Xin < r...@databricks.com > wrote: > > >> I understand the argument to add JDK 11 support just to extend the EOL, >> but the other things seem ki

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Reynold Xin
I understand the argument to add JDK 11 support just to extend the EOL, but the other things seem kind of arbitrary and are not supported by your arguments, especially DSv2 which is a massive change. DSv2 IIUC is not api stable yet and will continue to evolve in the 3.x line. Spark is designed in

Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Reynold Xin
I waited another day to account for the weekend. This vote passes with the following +1 votes and no -1 votes! I'll start the release prep later this week. +1: Reynold Xin (binding) Prashant Sharma (binding) Gengliang Wang Sean Owen (binding) Mridul Muralidharan (binding) Takeshi Yam

Re: [vote] Apache Spark 3.0 RC3

2020-06-06 Thread Reynold Xin
Apologies for the mistake. The vote is open till 11:59pm Pacific time on Mon June 9th. On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 3.0.0. > > The vote is open until [DUE DAY] and passes if a majority

[vote] Apache Spark 3.0 RC3

2020-06-06 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0. The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.0.0 [ ] -1 Do not release this package because ... To lea

[VOTE] Apache Spark 3.0 RC2

2020-05-18 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0. The vote is open until Thu May 21 11:59pm Pacific time and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.0.0 [ ] -1 Do not release this packa

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread Reynold Xin
The con is much more than just more effort to maintain a parallel API. It puts the burden on all libraries and library developers to maintain a parallel API as well. That’s one of the primary reasons we moved away from this RDD vs JavaRDD approach in the old RDD API. On Tue, Apr 28, 2020 at 12:3

Re: Spark DAG scheduler

2020-04-16 Thread Reynold Xin
bdi...@husky.neu.edu > wrote: > > Is it correct to say, the nodes in the DAG are RDDs and the edges are > computations? > > > On Thu, Apr 16, 2020 at 6:21 PM Reynold Xin < r...@databricks.com > wrote: > > >> The RDD is the DAG. >>

Re: Spark DAG scheduler

2020-04-16 Thread Reynold Xin
The RDD is the DAG. On Thu, Apr 16, 2020 at 3:16 PM, Mania Abdi < abdi...@husky.neu.edu > wrote: > > Hello everyone, > > I am implementing a caching mechanism for analytic workloads running on > top of Spark and I need to retrieve the Spark DAG right after it is > generated and the DAG schedule

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Reynold Xin
The Apache Software Foundation requires voting before any release can be published. On Tue, Mar 31, 2020 at 11:27 PM, Stephen Coy < s...@infomedia.com.au.invalid > wrote: > > >> On 1 Apr 2020, at 5:20 pm, Sean Owen < sro...@gmail.com > wrote: >> >> It can be publish

[VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0. The vote is open until 11:59pm Pacific time Fri Apr 3 , and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.0.0 [ ] -1 Do not release this pack
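The passing rule quoted in the vote email can be written down as a small check. This is a sketch of one reading of that sentence ("majority" taken as more binding +1 than binding -1 votes), not an official ASF or Spark tool; the `Vote` class, the function name and the voter names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str      # placeholder name, for illustration only
    value: int      # +1, 0 or -1
    binding: bool   # True when cast by a PMC member


def release_vote_passes(votes):
    """At least three binding +1 votes, and more binding +1 than binding -1."""
    plus = sum(1 for v in votes if v.binding and v.value == 1)
    minus = sum(1 for v in votes if v.binding and v.value == -1)
    return plus >= 3 and plus > minus


example = [
    Vote("pmc-a", +1, True),
    Vote("pmc-b", +1, True),
    Vote("pmc-c", +1, True),
    Vote("contributor-d", +1, False),   # non-binding, recorded but not counted
    Vote("pmc-e", -1, True),
]
print(release_vote_passes(example))
```

Non-binding votes are kept in the list because they are still reported in the result email, even though they do not affect the outcome under this reading.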

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Reynold Xin
ll have the blockers that will fail the >>> RCs.  >>> >>> >>> >>> Cheers, >>> >>> >>> Xiao >>> >>> >>> >>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun < dongjoon. hyun@ gmail. com >&

Re: results of taken(3) not appearing in console window

2020-03-26 Thread Reynold Xin
bcc dev, +user You need to print out the result. Take itself doesn't print. You only got the results printed to the console because the Scala REPL automatically prints the returned value from take. On Thu, Mar 26, 2020 at 12:15 PM, Zahid Rahman < zahidr1...@gmail.com > wrote: > > I am running
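The point about the REPL can be shown without Spark at all. The sketch below uses a made-up stand-in class, `TinyDataset`, whose `take` simply returns the first n rows; it is an assumption for illustration and not the Spark API. Run as a script, the bare call prints nothing, while the explicit `print` does.

```python
import contextlib
import io


class TinyDataset:
    """Hypothetical stand-in for an RDD/DataFrame; take() only returns rows."""

    def __init__(self, rows):
        self._rows = list(rows)

    def take(self, n):
        return self._rows[:n]


ds = TinyDataset(["a", "b", "c", "d"])


def script_without_print():
    ds.take(3)            # value is returned and then discarded


def script_with_print():
    print(ds.take(3))     # value is written to the console explicitly


def captured_output(fn):
    """Run fn and return whatever it wrote to standard output."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        fn()
    return buf.getvalue()


print(repr(captured_output(script_without_print)))
print(repr(captured_output(script_with_print)))
```

An interactive shell echoes the value of the last expression, which is why the same call appears to print when typed at a REPL prompt.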

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-24 Thread Reynold Xin
I actually think we should start cutting RCs. We can cut RCs even with blockers. On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > Hi, All. > > First of all, always "Community Over Code"! > I wish you the best health and happiness. > > As we know, we are s

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
default datasource as provider for CREATE TABLE > syntax", 2019/12/06 > > https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E

Re: FYI: The evolution on `CHAR` type behavior

2020-03-19 Thread Reynold Xin
You are joking when you said "informed widely and discussed in many ways twice", right? This thread doesn't even talk about char/varchar: https://lists.apache.org/thread.html/493f88c10169680191791f9f6962fd16cd0ffa3b06726e92ed04cbe1%40%3Cdev.spark.apache.org%3E (Yes it talked about changing the

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
periences with the non-negligible cases in on-prem. > > > > Bests, > Dongjoon. > > > On Mon, Mar 16, 2020 at 5:42 PM Reynold Xin < r...@databricks.com > wrote: > > >> −User >> >> >> >> char barely sh

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
te away > from the standard on this specific behavior. > > > Bests, > Dongjoon. > > On Mon, Mar 16, 2020 at 5:35 PM Reynold Xin < r...@databricks.com > wrote: > > >> BTW I'm not opposing us sticking to SQL standard (I

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
systems also deviate away from the standard on this specific behavior. On Mon, Mar 16, 2020 at 5:29 PM, Reynold Xin < r...@databricks.com > wrote: > > I looked up our usage logs (sorry I can't share this publicly) and trim > has at least four orders of magnitude higher usage

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
( >>> dongjoon.h...@gmail.com ) > wrote: >>> >>> Hi, Reynold. >>> (And +Michael Armbrust) >>> >>> >>> If you think so, do you think it's okay that we change the return value >>> silently? Then, I'm won

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Reynold Xin
, >> >> >> 100% agree with Reynold. >> >> >> >> >> Regards, >> Gourav Sengupta >> >> >> On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin < r...@databricks.com > wrote: >> >>

Re: FYI: The evolution on `CHAR` type behavior

2020-03-15 Thread Reynold Xin
the banning was > the proposed alternative to reduce the potential issue. > > > Please give us your opinion since it's still PR. > > > Bests, > Dongjoon. > > On Sat, Mar 14, 2020 at 17:54 Reynold Xin < r...@databricks.com >

Re: FYI: The evolution on `CHAR` type behavior

2020-03-14 Thread Reynold Xin
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of both new and old users? For old users, their old code that was working for char(3) would now stop working. For new users, depending on whether the underlying metastore char(3) is either supported but different from ansi S

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Reynold Xin
+1 On Mon, Mar 09, 2020 at 3:53 PM, John Zhuge < jzh...@apache.org > wrote: > > +1 (non-binding) > > > On Mon, Mar 9, 2020 at 1:32 PM Michael Heuer < heue...@gmail.com > wrote: > > >> +1 (non-binding) >> >> >> I am disappointed however that this only mentions API

Re: [DISCUSS] Shall we mark spark streaming component as deprecated.

2020-03-02 Thread Reynold Xin
It's a good discussion to have though: should we deprecate dstream, and what do we need to do to make that happen? My experience working with a lot of Spark users is that in general I recommend them staying away from dstream, due to a lot of design and architectural issues. On Mon, Mar 02, 2020

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Reynold Xin
This is really cool. We should also be more opinionated about how we specify time and intervals. On Wed, Feb 12, 2020 at 3:15 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > Thank you, Wenchen. > > > The new policy looks clear to me. +1 for the explicit policy. > > > So, are we go

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-01 Thread Reynold Xin
Note that branch-3.0 was cut. Please focus on testing, polish, and let's get the release out! On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin < r...@databricks.com > wrote: > > Just a reminder - code freeze is coming this Fri ! > > > > There can always be ex

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-01-29 Thread Reynold Xin
>>>> named output from CleanupAliases >>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window >>>>> aggregate >>>>> SPARK-25531 new write APIs for data source v2 >>>>> SPARK-25547 Pluggable jdbc connection factory >>>>> SPARK-2

Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-01-21 Thread Reynold Xin
If your UDF itself is very CPU intensive, it probably won't make that much of a difference, because the UDF itself will dwarf the serialization/deserialization overhead. If your UDF is cheap, it will help tremendously. On Mon, Jan 20, 2020 at 6:33 PM, < em...@yeikel.com > wrote: > > > > Hi, >

Re: Enabling push-based shuffle in Spark

2020-01-21 Thread Reynold Xin
Thanks for writing this up.  Usually when people talk about push-based shuffle, they are motivating it primarily to reduce the latency of short queries, by pipelining the map phase, shuffle phase, and the reduce phase (which this design isn't going to address). It's interesting you are targetin

Re: Adding Maven Central mirror from Google to the build?

2020-01-21 Thread Reynold Xin
This seems reasonable! On Tue, Jan 21, 2020 at 3:23 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote: > > +1, I'm supporting the following proposal. > > > > this mirror as the primary repo in the build, falling back to Central if > needed. > > > Thanks, > Dongjoon. > > > > On Tue, Jan

Re: [DISCUSS] Support year-month and day-time Intervals

2020-01-10 Thread Reynold Xin
Introducing a new data type has high overhead, both in terms of internal complexity and users' cognitive load. Introducing two data types would have even higher overhead. I looked quickly and it looks like both Redshift and Snowflake, two of the most recent SQL analytics successes, have only one i

Re: [SPARK-30296][SQL] Add Dataset diffing feature

2020-01-07 Thread Reynold Xin
Can this perhaps exist as a utility function outside Spark? On Tue, Jan 07, 2020 at 12:18 AM, Enrico Minack < m...@enrico.minack.dev > wrote: > > > > Hi Devs, > > > > I'd like to get your thoughts on this Dataset feature proposal. Comparing > datasets is a central operation when regressio

Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-23 Thread Reynold Xin
We've pushed out 3.0 multiple times. The latest release window documented on the website ( http://spark.apache.org/versioning-policy.html ) says we'd code freeze and cut branch-3.0 early Dec. It looks like we are suffering a bit from the tragedy of the commons, that nobody is pushing for getting
