Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Jungtaek Lim
bump for more visibility. On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim wrote: > Hi dev, > > I'd like to propose the deprecation of DStream in Spark 3.4, in favor of > promoting Structured Streaming. > (Sorry for the late proposal, if we don't make the change in 3.4, w

[DISCUSS] Deprecate DStream in 3.4

2023-01-10 Thread Jungtaek Lim
ic API. I don't intend to propose the target version for removal. The goal is to guide users to refrain from constructing a new workload with DStream. We might want to go with this in future, but it would require a new discussion thread at that time. What do you think? Thanks, Jungtaek Lim (HeartSaVioR)

[VOTE][RESULT][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-04 Thread Jungtaek Lim
The vote passes with 7 +1s (5 binding +1s). Thanks to all who reviewed the SPIP doc and voted! (* = binding) +1: - Jungtaek Lim - Xingbo Jiang - Mridul Muralidharan (*) - Hyukjin Kwon (*) - Shixiong Zhu (*) - Wenchen Fan (*) - Dongjoon Hyun (*) +0: None -1: None Thanks, Jungtaek Lim

Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Jungtaek Lim
Starting with +1 from me. On Thu, Dec 1, 2022 at 10:54 AM Jungtaek Lim wrote: > Hi all, > > I'd like to start the vote for SPIP: Asynchronous Offset Management in > Structured Streaming. > > The high level summary of the SPIP is that we propose a couple of > improvemen

[VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Jungtaek Lim
ts.apache.org/thread/yv8ffr56prjr16qh12lwjyjl1q8dl7lp> Please vote on the SPIP for the next 72 hours: [ ] +1: Accept the proposal as an official SPIP [ ] +0 [ ] -1: I don’t think this is a good idea because … Thanks! Jungtaek Lim (HeartSaVioR)

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Jungtaek Lim
>>> may serve as the "future" engine powering Spark Streaming. Improving the >>>>> "current" engine does not mean we cannot work on a "future" engine. These >>>>> two are not mutually exclusive. I would like to focus the discussion o

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Jungtaek Lim
Thanks Chao for driving the release! On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan wrote: > Thanks, Chao! > > On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > >> We are happy to announce the availability of Apache Spark 3.2.3! >> >> Spark 3.2.3 is a maintenance release containing stability fixes

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-23 Thread Jungtaek Lim
te: > >> Jungtaek, >> >> Thanks for taking up the role to shepherd this SPIP! Thank you for also >> chiming in on your thoughts concerning the continuous mode! >> >> Best, >> >> Jerry >> >> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Jungtaek Lim
Just FYI, I'm shepherding this SPIP project. I think the major meta question would be, "why don't we spend effort on continuous mode rather than initiating another feature aiming for the same workload?". Jerry already updated the doc to answer the question, but I can also share my thoughts about i

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Jungtaek Lim
+1 Nice to see the chance for driver to reduce resource usage and increase stability, especially the fact that the driver is SPOF. It's even promising to have a future plan to pre-bake the kvstore for SHS from the driver. Thanks for driving the effort, Gengliang! On Thu, Nov 17, 2022 at 5:32 AM

Re: [DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-18 Thread Jungtaek Lim
No further voice so far. I'm going to submit a PR. Thanks again for the feedback! On Mon, Oct 17, 2022 at 9:30 AM Jungtaek Lim wrote: > Thanks Gabor and Dongjoon for supporting this! > > Bump to reach more eyes. If there is no further voice on this in a couple > of days, I

Re: [DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-16 Thread Jungtaek Lim
>> >> BR, >> G >> >> >> On Thu, Oct 13, 2022 at 4:12 AM Jungtaek Lim < >> kabhwan.opensou...@gmail.com> wrote: >> >>> Hi all, >>> >>> I would like to propose flipping the default value of Kafka offset >>> fetching c

[DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-12 Thread Jungtaek Lim
e would be introduced inevitably (they can set topic based ACL rule), but most people will benefit. IMHO this is something we can deal with in the release/migration note. Would like to hear the voices on this. Thanks, Jungtaek Lim (HeartSaVioR)

Re: Welcome Yikun Jiang as a Spark committer

2022-10-07 Thread Jungtaek Lim
Congrats! On Sat, Oct 8, 2022 at 3:24 PM, huaxin gao wrote: > Congratulations! > > On Fri, Oct 7, 2022 at 11:22 PM Yang,Jie(INF) wrote: > >> Congratulations Yikun! >> >> Regards, >> Yang Jie >> -- >> *From:* Mridul Muralidharan >> *Sent:* 2022-10-08 14:16:02 >> *To:* Yumin

Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Jungtaek Lim
+1 On Thu, Oct 6, 2022 at 5:59 AM Chao Sun wrote: > +1 > > > and specifically may allow us to finally move off of the ancient version > of Guava (?) > > I think the Guava issue comes from Hive 2.3 dependency, not Hadoop. > > On Wed, Oct 5, 2022 at 1:55 PM Xinrong Meng > wrote: > >> +1. >> >> On

Re: [Structured Streaming + Kafka] Reduced support for alternative offset management

2022-09-01 Thread Jungtaek Lim
: https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Tue, Aug 30, 2022 at 5:05 PM Martin Andersson wrote: > I was looking around for some documentation regarding how checkpointing > (or rather, delivery semantics) is don

Re: Welcoming three new PMC members

2022-08-09 Thread Jungtaek Lim
Congrats everyone! On Wed, Aug 10, 2022 at 8:57 AM Hyukjin Kwon wrote: > Congrats everybody! > > On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan > wrote: > >> >> Congratulations ! >> Great to have you join the PMC !! >> >> Regards, >> Mridul >> >> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan

Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Jungtaek Lim
Congrats Xinrong! Well deserved. On Tue, Aug 9, 2022 at 5:13 PM, Hyukjin Kwon wrote: > Hi all, > > The Spark PMC recently added Xinrong Meng as a committer on the project. > Xinrong is the major contributor of PySpark especially Pandas API on Spark. > She has guided a lot of new contributors enthusiasti

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-18 Thread Jungtaek Lim
g all available data in a single microbatch. While this can behave the same with Trigger.Once on processing new available data (watermark advancement happens after processing all the data), this can also handle previous uncommitted batch(es) as well as no-data batch. On Tue, Jul 12, 2022 at 9:43 AM Jungtae

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-11 Thread Jungtaek Lim
Final reminder. I'll leave this thread for a couple of days to see further voices, and go forward if there is no outstanding comment. On Sat, Jul 9, 2022 at 9:54 PM Jungtaek Lim wrote: > It sounds like none of the approaches perfectly solve the issue of > backfill. > > 1. Tr

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-09 Thread Jungtaek Lim
e our batches are processed > in the correct event time order when starting from scratch. > > I'm not against deprecating Trigger.Once, just wanted to chime in that > someone was using it! I'm itching to upgrade and try out the new stuff. > > Adam > > On Fri, Jul 8

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Jungtaek Lim
orkaround. Backfill may warrant its own design to deal with.) > > Adam > > On Fri, Jul 8, 2022 at 3:24 AM Jungtaek Lim > wrote: > >> Bump to get a chance to expose the proposal to wider audiences. >> >> Given that there are not many active contributors/maint

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Jungtaek Lim
ove forward if there are no outstanding objections. On Wed, Jul 6, 2022 at 8:46 PM Jungtaek Lim wrote: > Hi dev, > > I would like to hear voices about deprecating Trigger.Once, and promoting > Trigger.AvailableNow as a replacement [1] in Structured Streaming. > (It doesn't

[DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-06 Thread Jungtaek Lim
n the next day. Thanks to the behavior of Trigger.AvailableNow, it handles no-data batch as well before termination of the query. Please review and let us know if you have any feedback or concerns on the proposal. Thanks! Jungtaek Lim 1. https://issues.apache.org/jira/browse/SPARK-36533

Observed consistent test failure in master (ParquetIOSuite)

2022-06-27 Thread Jungtaek Lim
ng context looks into this sooner. Thanks! Jungtaek Lim (HeartSaVioR)

Re: Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Jungtaek Lim
+1 (non-binding) Checked signature and checksum. Confirmed SPARK-39412 is resolved. Built source tgz with JDK 11. Thanks Max for driving the efforts of this huge release! On Tue, Jun 14, 2022 at 2:51 PM huaxin gao wrote: > +1 (non-binding) >

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-08 Thread Jungtaek Lim
Apologize for late participation. I'm sorry, but -1 (non-binding) from me. Unfortunately I found a major user-facing issue which hurts UX seriously on Kafka data source usage. In some cases, Kafka data source can throw IllegalStateException for the case of failOnDataLoss=true which condition is

Re: SIGMOD System Award for Apache Spark

2022-05-12 Thread Jungtaek Lim
Congrats Spark community! On Fri, May 13, 2022 at 10:40 AM Qian Sun wrote: > Congratulations !!! > > On May 13, 2022, at 3:44 AM, Matei Zaharia wrote: > > Hi all, > > We recently found out that Apache Spark received > the SIGMOD System Award this > year, given by SIGM

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Jungtaek Lim
ons (does it require a server-side > update or not?), and document the change itself for sure along with any > Spark-side migration notes. > > On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim > wrote: > >> The thing is, it is “us” who upgrades Kafka client and makes possible >>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Jungtaek Lim
Bump to try gathering more voices before taking action. For now, I see two voices as option 2 & 5 (similar to option 2 but not in the migration note but in the release note). On Fri, Mar 18, 2022 at 7:15 PM Jungtaek Lim wrote: > CORRECTION: in option 2, we enumerate KIPs which ma

Re: bazel and external/

2022-03-22 Thread Jungtaek Lim
ate steps. If there is consensus that connectors will move out, should >>>> the directory be named misc for everything else until there is some >>>> direction for the remaining modules? >>>> >>>> On Fri, 18 Mar 2022 at 03:03 Jungtaek Lim >>>> wrote: >

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
that would affect > Kafka usage itself; focus on the connector-related issues. > > On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim > wrote: > >> CORRECTION: in option 2, we enumerate KIPs which may bring >> incompatibility with older brokers (not all KIPs). >>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
tween releases because >> they've told what's important to check :) >> >> Seems like my Kafka Spark compatibility gist is out-of-date so maybe I >> need to invest some time to resurrect it: >> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9 >>

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility with older brokers (not all KIPs). On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim wrote: > Hi dev, > > I would like to initiate the discussion about how to deal with the > migration guide on upgrading Kafka

[DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
back. 4. Do not care. End users can indicate the upgrade in the release note, and we expect end users to actively check the notable changes (& KIPs) from Kafka doc. 5. Options not described above... Please take a look and provide your voice on this. Thanks, Jungtaek Lim (HeartSaVioR) p

Re: bazel and external/

2022-03-17 Thread Jungtaek Lim
re top level dirs. > > On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim > wrote: > >> We seem to just focus on how to avoid the conflict with the name >> "external" used in bazel. Since we consider the possibility of renaming, >> why not revisit the modules "exter

Re: bazel and external/

2022-03-17 Thread Jungtaek Lim
We seem to just focus on how to avoid the conflict with the name "external" used in bazel. Since we consider the possibility of renaming, why not revisit the modules "external" contains? Looks like kinds of the modules external directory contains are 1) Docker 2) Connectors 3) Sink on Dropwizard m

Re: Apache Spark 3.3 Release

2022-03-03 Thread Jungtaek Lim
Thanks Maxim for volunteering to drive the release! I support the plan (March 15th) to perform a release branch cut. Btw, would we be open for modification of critical/blocker issues after the release branch cut? I have a blocker JIRA ticket and the PR is open for reviewing, but need some time to

Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Jungtaek Lim
If ASF wants to do it, INFRA could probably deal with it for entire projects, like ASF code of conduct being exposed to the right side of the all ASF github repos recently. On Wed, Dec 15, 2021 at 11:49 PM Sean Owen wrote: > It might imply that this is a way to fund Spark alone, and it isn't. >

Re: [Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

2021-12-12 Thread Jungtaek Lim
Friendly reminder. I'll submit the proposed change if there is no objection observed this week. On Wed, Dec 8, 2021 at 4:16 PM Jungtaek Lim wrote: > Hi dev, > > I would like to hear voices about deprecating Trigger.Once, and replacing > it with Trigger.AvailableNow [1] in Str

[Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

2021-12-07 Thread Jungtaek Lim
Trigger.AvailableNow in migration guide - Replace all usages of Trigger.Once with Trigger.AvailableNow, except the test cases of Trigger.Once itself Please review the proposal and share your voice on this. Thanks! Jungtaek Lim 1. https://issues.apache.org/jira/browse/SPARK-36533

Re: Time for Spark 3.2.1?

2021-12-07 Thread Jungtaek Lim
+1 for both releases and the time! On Wed, Dec 8, 2021 at 3:46 PM Mridul Muralidharan wrote: > > +1 for maintenance release, and also +1 for doing this in Jan ! > > Thanks, > Mridul > > On Tue, Dec 7, 2021 at 11:41 PM Gengliang Wang wrote: > >> +1 for new maintenance releases for all 3.x branch

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Jungtaek Lim
Thanks for all the hard work you have been doing, Shane! On Tue, Dec 7, 2021 at 2:17 PM Nick Pentreath wrote: > Wow! end of an era > > Thanks so much to you Shane for all you work over 10 (!!) years. And to > Amplab also! > > Farewell Spark Jenkins! > > N > > On Tue, Dec 7, 2021 at 6:49 AM Nicho

Re: Update Spark 3.3 release window?

2021-10-28 Thread Jungtaek Lim
+1 for mid-March 2022. +1 for EOL 2.x as well. I guess we did it already according to Dongjoon's quote from the Spark website. On Fri, Oct 29, 2021 at 3:49 AM Dongjoon Hyun wrote: > +1 for mid March for Spark 3.3. > > For 2.4, our document already mentioned its EOL like > > " For example, 2.4.0

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Jungtaek Lim
Thanks to Gengliang for driving this huge release! On Wed, Oct 20, 2021 at 1:50 AM Dongjoon Hyun wrote: > Thank you so much, Gengliang and all! > > Dongjoon. > > On Tue, Oct 19, 2021 at 8:48 AM Xiao Li wrote: > >> Thank you, Gengliang! >> >> Congrats to our community and all the contributors! >

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread Jungtaek Lim
Meta question: this doesn't target Spark 3.2, right? Many folks have been working on branch cut for Spark 3.2, so might be less active to jump in new feature proposals right now. On Fri, Jun 25, 2021 at 9:00 AM Holden Karau wrote: > I took an initial look at the PRs this morning and I’ll go thro

Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-20 Thread Jungtaek Lim
+1 (non-binding) Thanks for your efforts! On Mon, Jun 21, 2021 at 2:40 PM Kent Yao wrote: > +1 (non-binding) > > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > i

Re: Apache Spark 3.0.3 Release?

2021-06-09 Thread Jungtaek Lim
Late +1 Thanks! On Thu, Jun 10, 2021 at 12:06 PM Yi Wu wrote: > Thanks all, I'll start the RC soon. > > On Wed, Jun 9, 2021 at 7:07 PM Gengliang Wang wrote: > >> +1, thanks Yi >> >> Gengliang Wang >> >> >> >> >> On Jun 9, 2021, at 6:03 PM, 郑瑞峰 wrote: >> >> +1, thanks Yi >> >> >>

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Jungtaek Lim
Nice! Thanks Dongjoon for your amazing efforts! On Wed, Jun 2, 2021 at 2:59 PM Liang-Chi Hsieh wrote: > Thank you, Dongjoon! > > > > Takeshi Yamamuro wrote > > Thank you, Dongjoon! > > > > On Wed, Jun 2, 2021 at 2:29 PM Xiao Li < > > > lixiao@ > > > > wrote: > > > >> Thank you! > >> > >> Xiao >

Re: Apache Spark 3.1.2 Release?

2021-05-18 Thread Jungtaek Lim
Late +1 here as well, thanks for volunteering! On Wed, May 19, 2021 at 11:24 AM, 郑瑞峰 wrote: > late +1. thanks Dongjoon! > > > -- Original Message -- > *From:* "Dongjoon Hyun" ; > *Sent:* Wednesday, May 19, 2021, 1:29 AM > *To:* "Wenchen Fan"; > *Cc:* "Xiao Li";"Kent Yao";"John > Zhuge";"Hyukji

Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Jungtaek Lim
Thanks for the huge efforts on driving the release! On Tue, May 18, 2021 at 4:53 PM Wenchen Fan wrote: > Thank you, Liang-Chi! > > On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun > wrote: > >> Finally! Thank you, Liang-Chi. >> >> Bests, >> Dongjoon. >> >> >> On Mon, May 17, 2021 at 10:14 PM Takes

Re: [DISCUSS] Add RocksDB StateStore

2021-04-27 Thread Jungtaek Lim
I think adding RocksDB state store to sql/core directly would be OK. Personally I also voted "either way is fine with me" against RocksDB state store implementation in Spark ecosystem. The overall stance hasn't changed, but I'd like to point out that the risk becomes quite lower than before, given

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-13 Thread Jungtaek Lim
+1 (non-binding) signature OK, extracting tgz files OK, build source without running tests OK. On Tue, Apr 13, 2021 at 5:02 PM Herman van Hovell wrote: > +1 > > On Tue, Apr 13, 2021 at 2:40 AM sarutak wrote: > >> +1 (non-binding) >> >> > +1 >> > >> > On Tue, 13 Apr 2021, 02:58 Sean Owen, wrot

Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Jungtaek Lim
Congrats all! On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote: > Congrats! Welcome! > > > Matei Zaharia wrote > > Hi all, > > > > The Spark PMC recently voted to add several new committers. Please join > me > > in welcoming them to their new role! Our new committers are: > > > > - Maciej Szymkiew

Re: Checkpointing in Spark Structured Streaming

2021-03-22 Thread Jungtaek Lim
e provider ? > > Rohit > > On Mon, Mar 22, 2021 at 4:09 PM Jungtaek Lim > wrote: > >> I see some points making async checkpoint be tricky to add in >> micro-batch; one example is "end to end exactly-once", as the commit phase >> in sink for the batch N can b

Re: Checkpointing in Spark Structured Streaming

2021-03-22 Thread Jungtaek Lim
I see some points making async checkpoint be tricky to add in micro-batch; one example is "end to end exactly-once", as the commit phase in sink for the batch N can be run "after" the batch N + 1 has been started and write for batch N + 1 can happen before committing batch N. state store checkpoint
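The ordering concern described in this message can be illustrated with a small toy model. This is only a sketch under stated assumptions: the event names ("write", "commit") and the helper find_ordering_violations are hypothetical and are not part of Spark. It simply flags any batch whose sink write starts before the previous batch has been committed, which is the situation the message says an asynchronous commit phase could allow.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (action, batch_id); action is "write" or "commit"


def find_ordering_violations(events: List[Event]) -> List[int]:
    """Return batch IDs whose write began before the previous batch was committed.

    Toy model only: it does not reflect how Spark schedules micro-batches.
    """
    committed = set()
    violations = []
    for action, batch_id in events:
        if action == "commit":
            committed.add(batch_id)
        elif action == "write":
            # Batch 0 has no predecessor; every later batch expects
            # batch_id - 1 to be committed before its own write starts.
            if batch_id > 0 and (batch_id - 1) not in committed:
                violations.append(batch_id)
        else:
            raise ValueError(f"unknown action: {action}")
    return violations


# Synchronous commits: each batch is committed before the next one writes.
sync_events = [("write", 0), ("commit", 0), ("write", 1), ("commit", 1)]

# Asynchronous commits: the write for batch 1 starts before batch 0 is committed.
async_events = [("write", 0), ("write", 1), ("commit", 0), ("commit", 1)]

print(find_ordering_violations(sync_events))
print(find_ordering_violations(async_events))
```

In this model the synchronous sequence yields no violations, while the asynchronous one flags batch 1.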

Re: Determine global watermark via StreamingQueryProgress eventTime watermark String

2021-03-16 Thread Jungtaek Lim
There was a similar question (but another approach) and I've explained the current status a bit. https://lists.apache.org/thread.html/r89a61a10df71ccac132ce5d50b8fe405635753db7fa2aeb79f82fb77%40%3Cuser.spark.apache.org%3E I guess this would also answer your question as well. At least for now, Spa

Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Jungtaek Lim
n and un-registration happens. I think this qualifies > as: "all the logic happens in the JVM". All that is transferred to Python > is a row's data. No listeners needed. > > Enrico > > > > Am 16.03.21 um 00:13 schrieb Jungtaek Lim: > > If I remember c

Re: Observable Metrics on Spark Datasets

2021-03-15 Thread Jungtaek Lim
If I remember correctly, the major audience of the "observe" API is Structured Streaming, micro-batch mode. From the example, the abstraction in 2 isn't something working with Structured Streaming. It could be still done with callback, but it remains the question how much complexity is hidden from

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-11 Thread Jungtaek Lim
+1 (non-binding) Excellent description on SPIP doc! Thanks for the amazing effort! On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh wrote: > > +1 (non-binding). > > Thanks for the work! > > > Erik Krogen wrote > > +1 from me (non-binding) > > > > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao < > > >

Re: Property spark.sql.streaming.minBatchesToRetain

2021-03-09 Thread Jungtaek Lim
That property decides how many log files (log file is created per batch per type - types are like offsets, commits, etc.) to retain on the checkpoint. Unless you're struggling with a small files problem on checkpoint, you wouldn't need to tune the value. I guess that's why the configuration is mar
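As a rough illustration of the retention described here, the following sketch models which per-batch log files would remain for a given setting. It is a simplified toy model, not Spark's actual purge logic: the function names are hypothetical, and it assumes one file per batch under each log type directory, named by batch ID.

```python
from typing import Dict, List, Sequence


def retained_batch_ids(latest_batch_id: int, min_batches_to_retain: int) -> List[int]:
    """Batch IDs whose log files are kept in this simplified model."""
    if min_batches_to_retain < 1:
        raise ValueError("min_batches_to_retain must be at least 1")
    oldest = max(0, latest_batch_id - min_batches_to_retain + 1)
    return list(range(oldest, latest_batch_id + 1))


def retained_log_files(
    latest_batch_id: int,
    min_batches_to_retain: int,
    log_types: Sequence[str] = ("offsets", "commits"),
) -> Dict[str, List[str]]:
    """Map each log type to the files kept, assuming one file per batch per type."""
    kept = retained_batch_ids(latest_batch_id, min_batches_to_retain)
    return {log_type: [f"{log_type}/{b}" for b in kept] for log_type in log_types}


# With the latest batch at 9 and 3 batches retained, batches 7..9 remain per type.
print(retained_log_files(latest_batch_id=9, min_batches_to_retain=3))
```

The point of the model is that the file count on the checkpoint grows with the number of log types times the retained batches, which is why the value mostly matters when small files are a problem.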

Re: using accumulators in (MicroBatch) InputPartitionReader

2021-03-07 Thread Jungtaek Lim
I'm not sure about the accumulator approach; one possible approach which might work (DISCLAIMER: a random thought) would be employing an RPC endpoint on the driver side which receives such information from executors and plays as a coordinator. Beware that Spark's RPC implementation is package priv

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Jungtaek Lim
Thanks Hyukjin for driving the huge release, and thanks everyone for contributing to the release! On Wed, Mar 3, 2021 at 6:54 PM angers zhu wrote: > Great work, Hyukjin ! > > Bests, > Angers > > On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan wrote: > >> Great work and congrats! >> >> On Wed, Mar 3, 2021 at 3:51 PM K

Re: Please take a look at the draft of the Spark 3.1.1 release notes

2021-02-27 Thread Jungtaek Lim
Thanks Hyukjin! I've only looked into the SS part, and added a comment. Otherwise it looks great! On Sat, Feb 27, 2021 at 7:12 PM Dongjoon Hyun wrote: > Thank you for sharing, Hyukjin! > > Dongjoon. > > On Sat, Feb 27, 2021 at 12:36 AM Hyukjin Kwon wrote: > >> Hi all, >> >> I am preparing to pu

Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-22 Thread Jungtaek Lim
+1 (non-binding) Verified signatures. Only a few commits added after RC2 which don't seem to change the SS behavior, so I'd carry over my +1 from RC2. On Mon, Feb 22, 2021 at 3:57 PM Hyukjin Kwon wrote: > Starting with my +1 (binding). > > On Mon, Feb 22, 2021 at 3:56 PM, Hyukjin Kwon wrote: > >> Plea

Re: Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Jungtaek Lim
Nice fix. Thanks! On Thu, Feb 18, 2021 at 7:13 PM Hyukjin Kwon wrote: > Thanks Attila for fixing and sharing this. > > On Thu, Feb 18, 2021 at 6:17 PM, Attila Zsolt Piros wrote: > >> Hello everybody, >> >> To pin the exact same version of Jekyll across all the contributors, Ruby >> Bundler is introd

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Jungtaek Lim
and you have a proposal, nothing wrong with just going ahead with > a proposal. There may be no disagreement. It might result in the > other person joining your PR. As I say, not sure if there's a deeper issue > than that if even this hasn't been tried? > > On Mon, Feb 15

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-15 Thread Jungtaek Lim
> etc. > It makes me think that the actual issue by setting an assignee happens > rarely, and it is an issue to several specific cases that would need a look > case-by-case. > Were there specific cases that made you concerned? > > > On Mon, Feb 15, 2021 at 8:58 AM, Jungtaek Lim

[DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-14 Thread Jungtaek Lim
doc proving that they really spent non-trivial effort already. My point is preempting JIRA issues with only sketched ideas or even just rationalizations.) Would like to hear everyone's voices. Thanks, Jungtaek Lim (HeartSaVioR) ps. better yet, probably it's better then to restrict something explicitly if we sincerely respect the underlying culture on the statement "In case several people contributed, prefer to assign to the more ‘junior’, non-committer contributor".

Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-09 Thread Jungtaek Lim
+1 (non-binding) * verified signatures * built custom distribution with enabling kubernetes & hadoop-cloud profile * built custom docker image from dist * ran applications "rate to kafka" & "kafka to kafka" on k8s cluster (local k3s) Thanks for driving the release

Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Jungtaek Lim
+1 to add, no matter to add under sql-core vs external module. Rationalization for myself: * The discussion thread and voices here show strong demand for adding RocksDB state store out of the box. * No workaround on huge state store problem out of the box. Direct competitors on streaming framewor

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-18 Thread Jungtaek Lim
+1 (non-binding) * verified signature and sha for all files (there's a glitch which I'll describe below) * built source (DISCLAIMER: didn't run tests) and made custom distribution, and built a docker image based on the distribution - used profiles: kubernetes, hadoop-3.2, hadoop-cloud * ran s

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-06 Thread Jungtaek Lim
No worries about the accident. We're human beings, and everyone can make a mistake. Let's wait and see the response of INFRA-21266. Just my 2 cents: I'm actually leaning toward skipping 3.1.0 and starting the release process for 3.1.1, as anyone could be some sort of "rushing" on verification on 3.1.0.

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-05 Thread Jungtaek Lim
There's an issue SPARK-33635 [1] reported due to performance regression on Kafka read between Spark 2.4 vs 3.0, which sounds like a blocker. I'll mark this as a blocker, unless anyone has different opinions. 1. https://issues.apache.org/jira/browse/SPARK-33635 On Wed, Jan 6, 2021 at 9:01 AM Hyukj

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-11-28 Thread Jungtaek Lim
mode for the >> complete mode support. That is to say, if we use "complete" mode for every >> aggregation operators, the wrong result will return. >> >> SPARK-26655 would be a good start, which only considers about "append" >> mode. Maybe we need mor

Re: Seeking committers' help to review on SS PR

2020-11-27 Thread Jungtaek Lim
spark/pull/27649 https://github.com/apache/spark/pull/28363 These are under 100 lines of changes per each, and not invasive. On Sat, Nov 28, 2020 at 11:34 AM Jungtaek Lim wrote: > Thanks for providing valuable feedback. Appreciate it. Sorry I haven't had > time to reply to this in ti

Re: Seeking committers' help to review on SS PR

2020-11-27 Thread Jungtaek Lim
your own PR" idea isn't hard-and-fast. I don't think >>> anyone needs to block for anything like this long if you have other capable >>> reviews and you are a committer, if you don't see that it impacts other >>> code meaningfully in a way that really demands

Re: [SS] full outer stream-stream join

2020-11-22 Thread Jungtaek Lim
Adding rationalization here, my request for raising the thread to dev mailing list is, to figure out possible reasons not having full outer join at the moment when adding left/right outer join. This is rather historical knowledge, so I have no idea about this. Most likely a limited number of folks

Seeking committers' help to review on SS PR

2020-11-22 Thread Jungtaek Lim
in my backlog but I'd rather not want to continue struggling with new PRs. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://github.com/apache/spark/pull/24173 2. https://issues.apache.org/jira/browse/SPARK-27237

Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
"@A @B I'll leave this a few days more to see if anyone has further comments. Otherwise I'll merge this.". I see both are used across various PRs, so it's not really something I want to blame. Just want to make us think about what would be the ideal approach we'd be

Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
ot > followed enough to know what they are. Can you point them out? I think that > is most productive for everyone to understand. > > On Fri, Nov 13, 2020 at 10:16 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Hi devs, >> >> I know this is a

[DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
e to finalize reviewing before merging. Again I know it's super hard to reconsider the ongoing practice while the project has gone for the long way (10 years), but just wanted to hear the voices about this. Thanks, Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-08 Thread Jungtaek Lim
, a better approach would be dropping global watermark and implementing operator-wise watermark properly. This is just a workaround, but fixing watermark would require major effort. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://issues.apache.org/jira/browse/SPARK-24634 On Sat, Nov 7, 2020 at 3:59 P

Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-25 Thread Jungtaek Lim
> requested is incorrect. > > On Fri, Oct 23, 2020 at 5:24 AM Russell Spitzer > wrote: > >> I was convinced that we should probably just fail, but if that is too >> much of a change, then logging the exception is also acceptable. >> >> On Thu, Oct 22, 2020, 10:32 PM J

[DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-22 Thread Jungtaek Lim
need to add the exception information in the error log message at least. Would like to hear the voices. Thanks, Jungtaek Lim (HeartSaVioR)

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
> If you just want to save typing the catalog name when writing table >> names, you can set your custom catalog as the default catalog (See >> SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is >> used to extend the v1 session catalog, not replace it. >

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
dent about > the new v2 DDL commands that work with v2 catalog APIs. > > On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim > wrote: > >> My case is DROP TABLE and DROP TABLE supports both v1 and v2 (as it >> simply works when I use custom catalog without replacing the default

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
APIs (e.g. CREATE TABLE LIKE), > so it's possible that some commands still go through the v1 session catalog > although you configured a custom v2 session catalog. > > Can you create JIRA tickets if you hit any DDL commands that don't support > v2 catalog? We should fix them.

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Jungtaek Lim
hit in our downstream test suites, but we haven't been > exploring the use of a session catalog for fallback. We use v2 for > everything now, which avoids the problem and comes with multi-catalog > support. > > On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim > wrote: > >>

SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Jungtaek Lim
e paths and different catalog interfaces. That sounds to me like being stuck, and the only "clear" approach seems to be disallowing replacing the default catalog with a custom one. Am I missing something? Thanks, Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-09-27 Thread Jungtaek Lim
Bump to see if anyone is interested in or concerned about this. On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim wrote: > Bump this again. > > On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Bump again. >> >> Unlike file st

Re: Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-27 Thread Jungtaek Lim
Bump to see if anyone is interested in or concerned about this. On Sun, Sep 20, 2020 at 1:59 PM Jungtaek Lim wrote: > Hi devs, > > We have a capability check in DSv2 defining which operations can be done > against the data source both read and write. The concept was brought in > DSv2, so

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-25 Thread Jungtaek Lim
, and I don't see willingness to correct them. On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot wrote: > Hi Jungtaek Lim, > > Nice to hear from you again since last time we talked :) and congrats on > becoming a Spark committer in the meantime ! (if I'm not mistaken you wer

Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-19 Thread Jungtaek Lim
he support on truncate if the data source is unable to truncate? (Foreach and Kafka output tables will be unable to apply complete mode afterwards.) Looking forward to hearing everyone's thoughts. Thanks, Jungtaek Lim (HeartSaVioR)

Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Jungtaek Lim
me logic you'd delete graphx >> long ago. >> >> Anecdotally, yes there are people using it that I know of at least, >> but I wouldn't know a lot of them. >> I think the question is, is it causing a problem, like a lot of >> maintenance? doesn't

Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Jungtaek Lim
If you're saying remove it, probably not? I don't see that it's > anywhere near deprecated, and not sure it's unmaintained - obviously > tests etc still have to keep passing. > > On Mon, Sep 14, 2020 at 11:34 PM Jungtaek Lim > wrote: > > > > Hi devs,

[DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-14 Thread Jungtaek Lim
out cost on maintenance. I know there's a tendency to avoid discontinuing support where possible, but it sounds weird to keep something "unmaintained", especially when it's still "experimental" and the main authors are no longer active enough to promise maintenance/improvement on the module. Thoughts? Thanks, Jungtaek Lim (HeartSaVioR)

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-03 Thread Jungtaek Lim
Unfortunately I don't see enough active committers working on Structured Streaming; I don't expect major features/improvements to be brought in under this situation. Technically I can review and merge PRs for major improvements in SS, but that depends on how large a change the proposal makes. If the prop

Re: [VOTE] Release Spark 3.0.1 (RC3)

2020-08-29 Thread Jungtaek Lim
all asc and sha512 files. - Checked no blocker issues exist on 3.0.1. Thanks, Jungtaek Lim (HeartSaVioR) On Sat, Aug 29, 2020 at 11:28 AM Sean Owen wrote: > +1 from me. Same result as the last RC. I did see this test failure > but I think it was transient; unless anyone else sees it. >

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-25 Thread Jungtaek Lim
Bump this again. On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim wrote: > Bump again. > > Unlike file stream sink which has lots of limitations and many of us have > been suggesting alternatives, file stream source is the only way if end > users want to read the data from files.
