Re: [DISCUSS] Multi-table transactions

2023-08-30 Thread Vinoth Chandar
+1 Reviewed the RFC. Looks like a promising direction to take. On Thu, Aug 24, 2023 at 9:26 AM sagar sumit wrote: > Hi devs, > > RFC-69 proposes some exciting features and in line with that vision, > I would like to propose support for multi-table transactions in Hudi. > > As the name suggests,

Re: [DISCUSS] Hudi Reverse Streamer

2023-08-21 Thread Pratyaksh Sharma
Hi Vinoth, I have raised a PR here - https://github.com/apache/hudi/pull/9492. Let us continue the discussion there. On Wed, Aug 16, 2023 at 4:43 PM Vinoth Chandar < mail.vinoth.chan...@gmail.com> wrote: > Hi Pratyaksh, > > Are you still actively driving this? > > On Tue, Jul 11, 2023 at 2:18 

Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread Vinoth Chandar
Awesome! That was easy. Let's go! On Wed, Aug 16, 2023 at 5:32 AM sagar sumit wrote: > Hi Vinoth, > > 1.0 seems to be packed with exciting features. > I would be glad to volunteer as the release manager. > > Regards, > Sagar > > On Wed, Aug 16, 2023 at 5:24 PM Vinoth Chandar wrote: > > > Hi

Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread sagar sumit
Hi Vinoth, 1.0 seems to be packed with exciting features. I would be glad to volunteer as the release manager. Regards, Sagar On Wed, Aug 16, 2023 at 5:24 PM Vinoth Chandar wrote: > Hi PMC/Committers, > > We are looking for a volunteer to act as release manager for the 1.0 > release. >

Re: DISCUSS Hudi 1.x plans

2023-08-16 Thread Vinoth Chandar
Hello everyone, We have been doing a lot of foundational design, prototyping work and I have outlined an execution plan here. https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning Look forward to contributions! On Wed, May 10, 2023 at 4:14 PM Sivabalan wrote: > Great! Left

Re: [DISCUSS] Hudi Reverse Streamer

2023-08-16 Thread Vinoth Chandar
Hi Pratyaksh, Are you still actively driving this? On Tue, Jul 11, 2023 at 2:18 PM Pratyaksh Sharma wrote: > Update: I will be raising the initial draft of RFC in the next couple of > days. > > On Thu, Jun 15, 2023 at 2:28 AM Rajesh Mahindra > wrote: > > > Great. We also need it for use cases

Re: [DISCUSS] Should we support a service to manage all deltastreamer jobs?

2023-08-16 Thread Vinoth Chandar
+1 there are RFCs on table management services, but not specific to deltastreamer itself. Are you proposing building something specific to that? On Wed, Jun 14, 2023 at 8:26 AM Pratyaksh Sharma wrote: > Hi, > > Personally I am in favour of creating such a UI where monitoring and > managing

Re: Discuss fast copy on write rfc-68

2023-07-21 Thread Nicolas Paris
I definitely can't see a benefit in using 30MB row groups over just creating 30MB parquet files. I would add that stats indexes are at the file level, which favors using row group size = file size. The only context where it would help is when clustering is set up and targets 1GB files, w/ 128MB

Re: Discuss fast copy on write rfc-68

2023-07-20 Thread Nicolas Paris
Splitting a parquet file into 5 row groups leads to the same benefit as creating 5 parquet files with 1 row group each instead. Also, the latter can involve more parallelism for writes. Am I missing something? On July 20, 2023 12:38:54 PM UTC, sagar sumit wrote: >Good questions! The idea is to be able to

Re: Discuss fast copy on write rfc-68

2023-07-20 Thread sagar sumit
Good questions! The idea is to be able to skip row groups based on the index. But if we have to do a full snapshot load, then our wrapper should actually be doing a batch GET on S3. Why incur 5x more calls? As for the update, I think this is in the context of COW. So, the footer will be recomputed

Re: [DISCUSS] Hudi Reverse Streamer

2023-07-11 Thread Pratyaksh Sharma
Update: I will be raising the initial draft of RFC in the next couple of days. On Thu, Jun 15, 2023 at 2:28 AM Rajesh Mahindra wrote: > Great. We also need it for use cases of loading data into warehouses, and > would love to help. > > On Wed, Jun 14, 2023 at 9:06 AM Pratyaksh Sharma > wrote:

Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Rajesh Mahindra
Great. We also need it for use cases of loading data into warehouses, and would love to help. On Wed, Jun 14, 2023 at 9:06 AM Pratyaksh Sharma wrote: > Hi, > > I missed this email earlier. Sure let me start an RFC this week and we can > take it from there. > > On Wed, Jun 14, 2023 at 9:20 PM

Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Pratyaksh Sharma
Hi, I missed this email earlier. Sure let me start an RFC this week and we can take it from there. On Wed, Jun 14, 2023 at 9:20 PM Nicolas Paris wrote: > Hi any rfc/ongoing efforts on the reverse delta streamer ? We have a use > case to do hudi => Kafka and would enjoy building a more general

Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Nicolas Paris
Hi, any RFC/ongoing efforts on the reverse delta streamer? We have a use case to do hudi => Kafka and would enjoy building a more general tool. However, we need an RFC as a basis to start some effort in the right way. On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar wrote: >Cool. lets draw up a RFC

Re: [DISCUSS] Should we support a service to manage all deltastreamer jobs?

2023-06-14 Thread Pratyaksh Sharma
Hi, Personally I am in favour of creating such a UI where monitoring and managing configurations is just a click away. That makes life a lot easier for users. So +1 on the proposal. I remember the work for it had started long back around 2019. You can check this RFC

Re: Re: [DISCUSS] should deltastreamer support configuration hot update?

2023-05-24 Thread Sivabalan
sure. sg. thanks! On Tue, 23 May 2023 at 23:22, 孔维 <18701146...@163.com> wrote: > > Hi, Sivabalan, > > > Great to hear from you. Then I will create a JIRA ticket to track this feature > > > Best Regards > > At 2023-05-24 02:22:36, "Sivabalan" wrote: > >I could not see the image you have

Re: [DISCUSS] should deltastreamer support configuration hot update?

2023-05-23 Thread Sivabalan
I could not see the image you have attached. But I do get your ask here. It will definitely benefit continuous deltastreamer use-cases. One possible option is to keep track of the last mod time of the property file that we feed in for deltastreamer top level config and before every batch, we can
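
The last-modified-time idea mentioned in this message could be sketched roughly as follows. This is a minimal Python illustration, not Hudi code: `ReloadableProps`, `maybe_reload`, and the simple `key=value` parsing are hypothetical names and assumptions made only for this example.

```python
import os


class ReloadableProps:
    """Hypothetical helper: re-reads a key=value properties file when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self.last_mtime_ns = None  # modification time seen at the last load
        self.props = {}
        self.maybe_reload()

    def _parse(self):
        props = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip blanks, comments, and lines without a key=value pair.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
        return props

    def maybe_reload(self):
        """Call before every batch; returns True if the file was (re)loaded."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns == self.last_mtime_ns:
            return False
        self.props = self._parse()
        self.last_mtime_ns = mtime_ns
        return True
```

A continuous job would call `maybe_reload()` at the start of each batch and rebuild its write config only when it returns `True`. Which properties are actually safe to change between batches is a separate question that this sketch does not address.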

Re: DISCUSS Hudi 1.x plans

2023-05-10 Thread Sivabalan
Great! Left some feedback. On Wed, 10 May 2023 at 06:56, Vinoth Chandar wrote: > > All - the RFC is up here. Please comment on the PR or use the dev list to > discuss ideas. > https://github.com/apache/hudi/pull/8679/ > > On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar wrote: > > > I have

Re: DISCUSS Hudi 1.x plans

2023-05-10 Thread Vinoth Chandar
All - the RFC is up here. Please comment on the PR or use the dev list to discuss ideas. https://github.com/apache/hudi/pull/8679/ On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar wrote: > I have claimed RFC-69, per our process. > > On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar wrote: > >> Hi

Re: DISCUSS Hudi 1.x plans

2023-05-08 Thread Vinoth Chandar
I have claimed RFC-69, per our process. On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar wrote: > Hi all, > > I have been consolidating all our progress on Hudi and putting together a > proposal for Hudi 1.x vision and a concrete plan for the first version 1.0. > > Will plan to open up the RFC to

Re:Re:Re: Re: Re: DISCUSS

2023-04-24 Thread 吕虎
Hi folks, I haven't received your reply for a long time. I think you must have something more important to do. Am I right? : ) At 2023-03-31 21:40:46, "吕虎" wrote: >Hi Vinoth, I'm glad to receive your letter. Here are some of my thoughts. >At 2023-03-31 10:17:52, "Vinoth Chandar"

Re: [DISCUSS] Hudi Reverse Streamer

2023-04-11 Thread Vinoth Chandar
Cool. Let's draw up an RFC for this? @pratyaksh - do you want to start one, given you expressed interest? On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi wrote: > +1 > This would be great! > > Cheers, > > On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma > wrote: > > > Hi Vinoth, > > > > I am aligned

Re: [DISCUSS] Hudi Reverse Streamer

2023-04-10 Thread Léo Biscassi
+1 This would be great! Cheers, On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma wrote: > Hi Vinoth, > > I am aligned with the first reason that you mentioned. Better to have a > separate tool to take care of this. > > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar < >

Re: Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-07 Thread Vinoth Chandar
Pulled in another reviewer as well. Left a comment. We can move the discussion to the PR? Thanks for the useful contribution! On Thu, Apr 6, 2023 at 12:34 AM 孔维 <18701146...@163.com> wrote: > Hi, vinoth, > > I created a PR(https://github.com/apache/hudi/pull/8376) for this > feature, could you

Re:Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-06 Thread 孔维
Hi, vinoth, I created a PR(https://github.com/apache/hudi/pull/8376) for this feature, could you help review it? BR, Kong At 2023-04-05 00:19:20, "Vinoth Chandar" wrote: >Look forward to this! could really help backfill/rebootstrap scenarios. > >On Tue, Apr 4, 2023 at 9:18 AM

Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Look forward to this! could really help backfill/rebootstrap scenarios. On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar wrote: > Thinking out loud. > > 1. For insert operations, it should not matter anyway. > 2. For upsert etc, the preCombine would handle the ordering problems. > > Is that what

Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Thinking out loud. 1. For insert operations, it should not matter anyway. 2. For upsert etc, the preCombine would handle the ordering problems. Is that what you are saying? I feel we don't want to leak any Kafka specific logic or force use of special payloads etc. thoughts? I assigned the jira

Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Pratyaksh Sharma
Hi Vinoth, I am aligned with the first reason that you mentioned. Better to have a separate tool to take care of this. On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar wrote: > +1 > > I was thinking that we add a new utility and NOT extend DeltaStreamer by > adding a Sink interface, for the

Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Vinoth Chandar
+1 I was thinking that we add a new utility and NOT extend DeltaStreamer by adding a Sink interface, for the following reasons - It will make it look like a generic Source => Sink ETL tool, which is actually not our intention to support on Hudi. There are plenty of good tools for that out there.

Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread Vinoth Chandar
Hi, Does your implementation read out offset ranges from Kafka partitions? Which means we can create multiple Spark input partitions per Kafka partition? If so, +1 for overall goals here. How does this affect ordering? Can you think about how/if Hudi write operations can handle potentially
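
To make the question above concrete, splitting one Kafka partition's offset range by count might look like the sketch below. It is plain Python for illustration only: `split_offset_range` and `max_events` are hypothetical names, and the half-open `[start, end)` convention is an assumption, not something taken from the PR.

```python
def split_offset_range(start, end, max_events):
    """Split the half-open offset range [start, end) into chunks of at most max_events."""
    if max_events <= 0:
        raise ValueError("max_events must be positive")
    # Each chunk starts where the previous one ended; the last chunk may be shorter.
    return [(lo, min(lo + max_events, end)) for lo in range(start, end, max_events)]
```

Each resulting chunk could back its own Spark input partition. Offsets stay ordered within a chunk, but chunks from the same Kafka partition may then be processed in parallel, which is where the ordering question comes from.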

Re: Re: [DISCUSS] Hudi data TTL

2023-03-31 Thread Sivabalan
left some comments. thanks! On Fri, 31 Mar 2023 at 00:59, 符其军 <18889897...@163.com> wrote: > Hi community, we have submitted RFC-65 Partition TTL Management in this > pr: https://github.com/apache/hudi/pull/8062. Let me know if you > have any questions or concerns with this proposal. > At

Re:Re: Re: Re: DISCUSS

2023-03-31 Thread 吕虎
Hi Vinoth, I'm glad to receive your letter. Here are some of my thoughts. At 2023-03-31 10:17:52, "Vinoth Chandar" wrote: >I think we can focus more on validating the hash index + bloom filter vs >consistent hash index more first. Have you looked at RFC-08, which is a >kind of hash index as well,

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-31 Thread Davidiam
Hello Vinoth, Can you please unsubscribe me? I have been trying to unsubscribe for months without success. Kind Regards, David Sent from Outlook for Android From: Vinoth Chandar Sent: Friday, March 31, 2023 5:09:52 AM To: dev Subject:

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-31 Thread Pratyaksh Sharma
+1 to this. I can help drive some of this work. On Fri, Mar 31, 2023 at 10:09 AM Prashant Wason wrote: > Could be useful. Also, may be useful for backup / replication scenario > (keeping a copy of data in alternate/cloud DC). > > HoodieDeltaStreamer already has the concept of "sources". This

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Prashant Wason
Could be useful. Also, may be useful for backup / replication scenario (keeping a copy of data in alternate/cloud DC). HoodieDeltaStreamer already has the concept of "sources". This can be implemented as a "sink" concept. On Thu, Mar 30, 2023 at 8:12 PM Vinoth Chandar wrote: > Essentially. > >

Re: Re: Re: DISCUSS

2023-03-30 Thread Vinoth Chandar
I think we can focus more on validating the hash index + bloom filter vs consistent hash index more first. Have you looked at RFC-08, which is a kind of hash index as well, except it stores the key => file group mapping externally. On Fri, Mar 24, 2023 at 2:14 AM 吕虎 wrote: > Hi Vinoth, I am

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Vinoth Chandar
Essentially. Old architecture :(operational database) ==> some tool ==> (data warehouse raw data) ==> SQL ETL ==> (data warehouse derived data) New architecture : (operational database) ==> Hudi delta Streamer ==> (Hudi raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi

Re:Re: Re: DISCUSS

2023-03-24 Thread 吕虎
Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts. At 2023-03-21 23:32:44, "Vinoth Chandar" wrote: >>but when it is used for data expansion, it still involves the need to >redistribute the data records of some data files, thus affecting the >performance. >but

Re: Re: DISCUSS

2023-03-21 Thread Vinoth Chandar
>but when it is used for data expansion, it still involves the need to redistribute the data records of some data files, thus affecting the performance. but expansion of the consistent hash index is an optional operation right? Sorry, still not fully understanding the differences here, >Because

Re: DISCUSS

2023-03-16 Thread Vinoth Chandar
Thanks for the proposal! Some first set of questions here. >You need to pre-select the number of buckets and use the hash function to determine which bucket a record belongs to. >when building the table according to the estimated amount of data, and it cannot be changed after building the table

Re: [DISCUSS] Build tool upgrade

2023-02-13 Thread Vinoth Chandar
This is cool! :) On Mon, Feb 13, 2023 at 2:02 PM Daniel Kaźmirski wrote: > Hi, > > I did try to add the mentioned extension to Hudi pom. Here are the results: > > Clean with cache extension disabled > mvn clean package -DskipTests -Dspark3.3 -Dscala-2.12 > -Dmaven.build.cache.enabled=false >

Re: [DISCUSS] Build tool upgrade

2023-02-13 Thread Daniel Kaźmirski
Hi, I did try to add the mentioned extension to Hudi pom. Here are the results: Clean with cache extension disabled mvn clean package -DskipTests -Dspark3.3 -Dscala-2.12 -Dmaven.build.cache.enabled=false [INFO] BUILD SUCCESS [INFO]

Re: [DISCUSS] Build tool upgrade

2023-02-10 Thread Daniel Kaźmirski
Hi all, Going back to this topic, Maven 3.9.0 has been released recently along with a new build cache extension that provides incremental builds: https://maven.apache.org/extensions/maven-build-cache-extension/ Might be worth considering. On Mon, Oct 24, 2022 at 19:59, Shiyan Xu wrote: > Thank

Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-18 Thread Bhavani Sudha
Thank you all. I'll update the calendar invites and the website for necessary changes. -Sudha On Thu, Nov 17, 2022 at 9:01 AM Pratyaksh Sharma wrote: > +1 as well. > > On Thu, Nov 17, 2022 at 9:57 PM sagar sumit > wrote: > > > +1 > > > > On Thu, Nov 17, 2022 at 9:44 AM Sivabalan wrote: > >

Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-17 Thread Pratyaksh Sharma
+1 as well. On Thu, Nov 17, 2022 at 9:57 PM sagar sumit wrote: > +1 > > On Thu, Nov 17, 2022 at 9:44 AM Sivabalan wrote: > > > +1 makes sense. > > > > On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo wrote: > > > > > +1 on having a single community sync call on Dec 14 during the holiday > > > season.

Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-17 Thread sagar sumit
+1 On Thu, Nov 17, 2022 at 9:44 AM Sivabalan wrote: > +1 makes sense. > > On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo wrote: > > > +1 on having a single community sync call on Dec 14 during the holiday > > season. > > > > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha > > wrote: > > > > > Hello

Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-17 Thread Shiyan Xu
+1 On Thu, Nov 17, 2022 at 12:15 PM Sivabalan wrote: > +1 makes sense. > > On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo wrote: > > > +1 on having a single community sync call on Dec 14 during the holiday > > season. > > > > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha > > wrote: > > > > > Hello

Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-16 Thread Sivabalan
+1 makes sense. On Wed, 16 Nov 2022 at 17:40, Y Ethan Guo wrote: > +1 on having a single community sync call on Dec 14 during the holiday > season. > > On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha > wrote: > > > Hello Hudi community, > > > > We have monthly community sync calls on the last

Re: [DISCUSS] Merging Nov and Dec community sync calls

2022-11-16 Thread Y Ethan Guo
+1 on having a single community sync call on Dec 14 during the holiday season. On Wed, Nov 16, 2022 at 5:12 PM Bhavani Sudha wrote: > Hello Hudi community, > > We have monthly community sync calls on the last Wednesday of every month. > For November and December months these collide with public

Re: [Discuss] SCD-2 Payload

2022-10-24 Thread 冯健
to Raymond: now combineAndGetUpdateValue can only return one IndexedRecord, but in the case of SCD-2, both old and new records need to be stored. to Alexey: yeah, this feature should be designed on top of RFC-46. Can HoodieRecordMerger return 2 HoodieRecord in this case? On Tue, 25 Oct 2022

Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Alexey Kudinkin
Hey, hey, Fengjian! With the landing of the RFC-46 we'll be kick-starting a process of phasing out HoodieRecordPayload as an abstraction and instead migrating to HoodieRecordMerger interface. I'd recommend to base your design considerations off the new HoodieRecordMerger interface instead of

Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Shiyan Xu
Interesting thoughts. Not sure if I fully understand this part: "generate 2 records in combineAndGetUpdateValue". the API is defined to return just 1 record? On Fri, Oct 21, 2022 at 1:07 AM 冯健 wrote: > Hi guys, > After reading this article with respect to how to implement SCD-2 with > Hudi

Re: [DISCUSS] [RFC] Hudi bundle standards

2022-10-24 Thread Shiyan Xu
Thanks Xinyao for raising the problem. Let's align more on the RFC to help clarify usage. Agree on the importance - the bundle artifacts are the user-facing components from this project. On Mon, Oct 10, 2022 at 4:44 PM 田昕峣 (Xinyao Tian) wrote: > Hi Shiyan, > > > Having carefully read the

Re: [DISCUSS] Build tool upgrade

2022-10-24 Thread Shiyan Xu
Thank you all for the valuable inputs! I think we can close this topic for now, given the majority is leaning towards continuing with maven. On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu wrote: > I have experienced some gradle development projects and want to share some > thoughts. > > The

Re: [DISCUSS] Hudi data TTL

2022-10-21 Thread stream2000
Yes we can have a talk about it. We will try our best to write the RFC, maybe publish it in a few weeks. > On Oct 21, 2022, at 10:18, JerryYue <272614...@qq.com.INVALID> wrote: > > Looking forward to the RFC > It's a good idea, we also need hudi data TTL in some case > Do we have any plan or

Re: [DISCUSS] Hudi data TTL

2022-10-20 Thread JerryYue
Looking forward to the RFC. It's a good idea, we also need Hudi data TTL in some cases. Do we have any plan or time to do this? We also had some simple designs to implement it. Maybe we can have a talk about it. On 2022/10/20 at 9:47 AM, "Bingeng Huang" wrote: Looking forward to the RFC. We can

Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Bingeng Huang
Looking forward to the RFC. We can propose an RFC about supporting TTL config using non-partition fields after. On Wed, Oct 19, 2022 at 14:42, sagar sumit wrote: > +1 Very nice idea. Looking forward to the RFC! > > On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu > wrote: > > > great proposal. Partition TTL is a good

Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread stream2000
___ > From: sagar sumit > Sent: Wednesday, October 19, 2022 2:42:36 PM > To: dev@hudi.apache.org > Subject: Re: [DISCUSS] Hudi data TTL > > +1 Very nice idea. Looking forward to the RFC! > > On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu > wrote: > >> gr

Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread Teng Huo
@hudi.apache.org Subject: Re: [DISCUSS] Hudi data TTL +1 Very nice idea. Looking forward to the RFC! On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu wrote: > great proposal. Partition TTL is a good starting point. we can extend it to > other TTL strategies like column-based, and make it custom

Re: [DISCUSS] Hudi data TTL

2022-10-19 Thread sagar sumit
+1 Very nice idea. Looking forward to the RFC! On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu wrote: > great proposal. Partition TTL is a good starting point. we can extend it to > other TTL strategies like column-based, and make it customizable and > pluggable. Looking forward to the RFC! > > On

Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Shiyan Xu
great proposal. Partition TTL is a good starting point. we can extend it to other TTL strategies like column-based, and make it customizable and pluggable. Looking forward to the RFC! On Wed, Oct 19, 2022 at 11:40 AM Jian Feng wrote: > Good idea, > this is definitely worth an RFC > btw should

Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Jian Feng
Good idea, this is definitely worth an RFC btw should it only depend on Hudi's partition? I feel it should be a more common feature since sometimes customers' data can not update across partitions On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com> wrote: > Hi all, we have

Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread stream2000
Hi all, we have implemented partition-based data TTL management, with which we can manage TTL for Hudi partitions by size, expiration time and sub-partition count. When a partition is detected as outdated, we use the delete partition interface to delete it, which will generate a replace commit to mark the
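
As a rough illustration of the expired-time policy described in this message (the size and sub-partition-count policies are not covered), detecting outdated partitions could be sketched as below. This is plain Python, not the proposed implementation: `find_expired_partitions` and the `last_modified` mapping are hypothetical, and the actual deletion through the delete partition interface is not shown.

```python
from datetime import datetime, timedelta


def find_expired_partitions(last_modified, ttl, now):
    """Return, sorted, the partition paths whose last modification is older than ttl.

    last_modified: dict mapping partition path -> datetime of its last modification
    ttl:           timedelta a partition may stay untouched before it counts as expired
    now:           datetime used as the reference point
    """
    return sorted(path for path, ts in last_modified.items() if now - ts > ttl)
```

The returned list would then be handed to whatever performs the partition deletion.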

Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Vinoth Chandar
+1 love to discuss this on a RFC proposal. On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin wrote: > That's a very interesting idea. > > Do you want to take a stab at writing a full proposal (in the form of RFC) > for it? > > On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang > wrote: > > > Hi all, >

Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Alexey Kudinkin
That's a very interesting idea. Do you want to take a stab at writing a full proposal (in the form of RFC) for it? On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang wrote: > Hi all, > > Do we have plan to integrate data TTL into HUDI, so we don't have to > schedule a offline spark job to delete

Re: [DISCUSS] Build tool upgrade

2022-10-17 Thread zhaojing yu
I have experienced some gradle development projects and want to share some thoughts. The flexibility and faster speed of gradle itself can certainly bring some advantages, but it will also greatly increase the troubleshooting time due to the bugs of gradle itself, and gradle DSL is very different

Re: [DISCUSS] Build tool upgrade

2022-10-17 Thread Gary Li
Hi folks, I'd like to share my thoughts as well. I personally won't build the whole project too often, only before pushing to the remote branch or making big changes in different modules. If I just make some changes and run a test, the IDE will only build the necessary modules I believe. In addition, each

Re: [DISCUSS] Build tool upgrade

2022-10-17 Thread Danny Chan
I have first-hand experience with how Apache Calcite switched from Maven to Gradle, and I want to share some thoughts. The gradle build is fast, but it relies heavily on its local cache, and it usually needs too much time to download these cache jars because gradle upgrades itself very frequently. The

Re: [DISCUSS] Diagnostic reporter

2022-10-14 Thread Forward Xu
+1, Thanks Shiyan Xu and Zhang Yue, This is a very useful function. Best, Forward On Mon, Sep 12, 2022 at 18:39, sagar sumit wrote: > Thanks Zhang Yue for drafting the RFC. > It's an interesting read! I have left some comments. > > While exposing certain info such as "sample_hoodie_key", > we have to

Re:[DISCUSS] [RFC] Hudi bundle standards

2022-10-10 Thread Xinyao Tian
Hi Shiyan, Having carefully read the RFC-63.md on the PR, I really think this feature is crucial for everyone who builds Hudi from source. For example, when I tried to compile Hudi 0.12.0 with flink1.15, I used command ‘mvn clean package -DskipTests -Dflink1.15 -Dscala-2.12’ but still get

Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Prashant Wason
+1 for incremental builds with build cache, which will be a huge productivity boost especially when working with multiple branches at the same time. Prashant On Mon, Oct 3, 2022 at 11:42 PM Alexey Kudinkin wrote: > I think full project build slowly gravitates towards 15min already (it’s > about

Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Alexey Kudinkin
I think full project build slowly gravitates towards 15min already (it’s about 12-14min on my 2021 Macbook). @Vinoth the most important aspect that Maven couldn’t provide us with are local incremental builds. Currently you have to build full dependency hierarchy of the project whenever you’re

Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Pratyaksh Sharma
My two cents. I have seen open source projects take more than 20-25 minutes for building on maven, so I guess we are fine for now. But we can definitely investigate and try to optimize if we can. On Sun, Oct 2, 2022 at 9:33 AM Shiyan Xu wrote: > Yes, Vinoth, agree on the efforts and impact

Re: [DISCUSS] Build tool upgrade

2022-10-01 Thread Shiyan Xu
Yes, Vinoth, agree on the efforts and impact being big. Some perf comparison on gradle vs maven can be found in https://gradle.org/gradle-vs-maven-performance/ where it claims multi-fold build time reduction. I'd estimate maybe 2-4 min for a full build and based on that. I mainly hope to collect

Re: [DISCUSS] Build tool upgrade

2022-09-30 Thread Vinoth Chandar
Hi Raymond. This would be a large undertaking and a big change for everyone. What does the build time look like if we switch to gradle or bazel? And do we know why it takes 10 min to build and why is that not okay? Given we all use IDEs mostly anyway Thanks Vinoth On Fri, Sep 30, 2022 at 22:48

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-16 Thread 冯健
Hi Sagar, HMS shouldn't be the core part, the external table location will depend on which metastore the user is using. I'm still working on it, will add more detail in this RFC pr. https://github.com/apache/hudi/pull/6576 On Fri, 16 Sept 2022 at 11:28, sagar sumit wrote: > Automatic

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-15 Thread sagar sumit
Automatic lifecycle management based on a few configurations would be very useful for the community. I read the description in https://issues.apache.org/jira/browse/HUDI-4677 May I ask the rationale for choosing Hive Metastore to manage the snapshots? Perhaps, RFC would have more details.

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-13 Thread 冯健
Hi Ethan, Yes, based on the current situation, we still need to do much extra work to provide the snapshot view feature for the users (or users do this by themselves). I plan to merge the COW part of this feature to 0.13.0 at least. Will consider your suggestion if time is tight. Thanks On

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-13 Thread Y Ethan Guo
Hi Feng Jian, Looking forward to the RFC! Is the snapshot view management more like managing commits / savepoints in the Hudi timeline and hiding Hudi internals from the users? Do you plan to merge the implementation of snapshot view and lifecycle management for the next major release (0.13.0)?

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-09-12 Thread Sivabalan
Sounds like a nice feature to have. Eagerly looking forward to the RFC. On Sat, 27 Aug 2022 at 20:51, 冯健 wrote: > I attached the image in this Jira Epic > https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP, will > create a pr in the next few days > Yeah, the basic idea is to

Re: [DISCUSS] Diagnostic reporter

2022-09-12 Thread sagar sumit
Thanks Zhang Yue for drafting the RFC. It's an interesting read! I have left some comments. While exposing certain info such as "sample_hoodie_key", we have to consider masking/obfuscation. Looking forward to the implementation. Regards, Sagar On Wed, Sep 7, 2022 at 1:49 PM Yue Zhang wrote:

Re: [DISCUSS] Diagnostic reporter

2022-09-07 Thread Yue Zhang
Hi Hudi, Just raised an RFC about this diagnostic reporter https://github.com/apache/hudi/pull/6600. Please feel free to leave any comments or concerns if you are interested! Yue Zhang, zhangyue921...@163.com On 08/4/2022 19:38, Yue Zhang wrote: Hi Shiyan and everyone, This is

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-08-27 Thread 冯健
I attached the image in this Jira Epic https://issues.apache.org/jira/browse/HUDI-4677, and the RFC is WIP, will create a pr in the next few days Yeah, the basic idea is to implement lifecycle management based on the savepoint and time travel features, providing new ways for the user to operate

Re: [DISCUSS] New RFC to support 'Snapshot view management'

2022-08-27 Thread Shiyan Xu
The dev email list does not support showing images unfortunately. you may want to put it behind a link. As for the idea itself, What I plan to do is to let Hudi support release a snapshot view and > lifecycle management out-of-box. Are you planning to extend the savepoint feature to have

Re: [DISCUSS]: Integrate column stats index with all query engines

2022-08-10 Thread Pratyaksh Sharma
孟涛 > mengtao0...@qq.com > > -- Original Message -- > From: "dev" <vin...@apache.org>; > Sent

Re: [DISCUSS]: Integrate column stats index with all query engines

2022-08-10 Thread Vinoth Chandar
+1 for this. Suggested new reviewers on the RFC. https://github.com/apache/hudi/pull/6345/files#r943073339 On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma wrote: > Hello community, > > With the introduction of multi modal index in Hudi, there is a lot of scope > for improvement on the

Re: [DISCUSS] Diagnostic reporter

2022-08-05 Thread Shiyan Xu
Sure, Zhang Yue, feel free to initiate the RFC! On Fri, Aug 5, 2022 at 4:57 AM 田昕峣 (Xinyao Tian) wrote: > Hi Shiyan and everyone, > > > Definitely this feature is very important. We really need to gather error > infos to fix bugs more efficiently. > > > If there’s any thing I could help please

Re: [DISCUSS] Diagnostic reporter

2022-08-05 Thread Xinyao Tian
Hi Shiyan and everyone, Definitely this feature is very important. We really need to gather error info to fix bugs more efficiently. If there’s anything I could help with, please feel free to let me know :) Regards, Xinyao Hi Shiyan and everyone, This is a great idea! As a Hudi user, I

Re: [DISCUSS] Diagnostic reporter

2022-08-04 Thread Yue Zhang
Hi Shiyan and everyone, This is a great idea! As a Hudi user, I also struggle with Hudi troubleshooting sometimes. With this feature, it will definitely be possible to reduce the burden. So I volunteer to draft a discussion and maybe raise an RFC about it, if you don't mind. Thanks :) | | Yue

Re: [DISCUSS] Diagnostic reporter

2022-08-02 Thread 冯健
Maybe we can start this with an audit feature? Since we need some sort of "images" to represent "facts", we can create a writer identity to link them. In this audit file, we can label each operation with IP, environment, platform, version, write config, etc. On Sun, 31 Jul 2022 at

Re: [DISCUSS] Diagnostic reporter

2022-07-30 Thread Shiyan Xu
To bubble this up On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar wrote: > +1 from me. > > It will be very useful if we can have something that can gather > troubleshooting info easily. > This part takes a while currently. > > On Mon, May 30, 2022 at 9:52 AM Shiyan Xu > wrote: > > > Hi all, >

Re: [DISCUSS] Diagnostic reporter

2022-06-15 Thread Vinoth Chandar
+1 from me. It will be very useful if we can have something that can gather troubleshooting info easily. This part takes a while currently. On Mon, May 30, 2022 at 9:52 AM Shiyan Xu wrote: > Hi all, > > When troubleshooting Hudi jobs in users' environments, we always ask users > to share

Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-06-15 Thread Shimin Yang
Hi all, the proposal for the Hudi sync meeting for the Chinese community is attached. The first sync meeting will be held online on Thursday, June 29 at 10:00 AM CST. Everyone is welcome to join! 大家好,以下是 Hudi 中文社区交流会议的提议文档。首次会议会在北京时间 6 月 29 日上午 10 点线上举行。欢迎大家加入! Hudi 中文社区交流会议提议

Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Vinoth Chandar
Great! Thanks for volunteering On Thu, May 26, 2022 at 02:09 Shiyan Xu wrote: > Awesome! looking forward to an initial proposal! > > On Thu, May 26, 2022 at 4:17 PM Shimin Yang wrote: > > > Hi Shiyan, I'm from bytedance data lake team, and our team would like to > > drive and host the hudi

Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Shiyan Xu
Awesome! Looking forward to an initial proposal! On Thu, May 26, 2022 at 4:17 PM Shimin Yang wrote: > Hi Shiyan, I'm from the ByteDance data lake team, and our team would like to > drive and host the Hudi sync meetings for the Chinese community. > > On Thu, May 26, 2022 at 4:14 PM Shiyan Xu wrote: > > > Related info:

Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Shimin Yang
Hi Shiyan, I'm from the ByteDance data lake team, and our team would like to drive and host the Hudi sync meetings for the Chinese community. On Thu, May 26, 2022 at 4:14 PM Shiyan Xu wrote: > Related info: we are noting down the current community sync info here > https://hudi.apache.org/community/syncs > > > On Thu,

Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Shiyan Xu
Related info: we are noting down the current community sync info here https://hudi.apache.org/community/syncs On Thu, May 26, 2022 at 3:44 PM Shiyan Xu wrote: > Hi all, > > This is a topic brought up previously, and also recently raised in this > issue

Re: [DISCUSS] Hudi community sync time

2022-05-17 Thread Bhavani Sudha
Sounds good. Thank you all for chiming in. Based on the responses we have had here, we can move the existing community sync to a later time. I will send a separate voting thread to finalize the exact time. Thanks, Sudha On Thu, Apr 28, 2022 at 1:55 AM Pratyaksh Sharma wrote: > I would propose 8

Re: [DISCUSS] Hudi community sync time

2022-04-28 Thread Pratyaksh Sharma
I would propose 8 AM or 8:30 AM PST, though, since 9 AM PST will clash with my other meetings. But I am happy to go with a time that suits most folks. On Thu, Apr 28, 2022 at 3:31 AM Vinoth Govindarajan < vinoth.govindara...@gmail.com> wrote: > +1 for 9 am PST call, the current time is super early

Re: [DISCUSS] Hudi community sync time

2022-04-27 Thread Vinoth Govindarajan
+1 for the 9 AM PST call; the current time is super early, hence I missed one of the meetings in the past. Best, Vinoth On Tue, Apr 26, 2022 at 8:01 PM Vinoth Chandar wrote: > +1 as well. Current PST times are pretty hard for many folks. > > On Sat, Apr 16, 2022 at 6:20 AM Gary Li wrote: > > > +1
