Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread leesf
+1, and we can discuss it further when design docs are available. Best, Leesf On Tue, Nov 12, 2019 at 4:17 PM Balaji Varadarajan wrote: > +1 on the exporter tool idea. > > On Mon, Nov 11, 2019 at 10:36 PM vino yang wrote: > > > Hi Shiyan, > > > > +1 for this proposal, Also, it looks like an exporter

Re: [DISCUSS] Simplification of terminologies

2019-11-12 Thread leesf
[1] +1. `views` indeed confused me a lot. [2] +1. `snapshot` is more reasonable. [3] I don't feel very strong to rename it, the current name `COPY_ON_WRITE` is reasonable considering the cost to rename and the behavior that new version parquet file will be created and seems to be copied from old

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread Balaji Varadarajan
+1 on the exporter tool idea. On Mon, Nov 11, 2019 at 10:36 PM vino yang wrote: > Hi Shiyan, > > +1 for this proposal, Also, it looks like an exporter tool. > > @Vinoth Chandar Any thoughts about where to place it? > > Best, > Vino > > On Tue, Nov 12, 2019 at 8:58 AM Vinoth Chandar wrote: > > > We can

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Balaji Varadarajan
Agree with all 3 changes. The naming now looks more consistent than earlier. +1 on them Depending on whether we are renaming Input formats for (1) and (2) - this could require some migration steps for Balaji.V On Mon, Nov 11, 2019 at 7:38 PM vino yang wrote: > Hi Vinoth, > > Thanks for

Re: DISCUSS RFC 7 - Point in time queries on Hudi table (Time-Travel)

2019-11-11 Thread Balaji Varadarajan
+1. This would be a powerful feature which would open up use-cases requiring repeatable query results. Balaji.V On Mon, Nov 11, 2019 at 8:12 AM nishith agarwal wrote: > Folks, > > Starting a discussion thread for enabling time-travel for Hudi datasets. > Please provide feedback on the RFC

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread vino yang
Hi Shiyan, +1 for this proposal. Also, it looks like an exporter tool. @Vinoth Chandar Any thoughts about where to place it? Best, Vino On Tue, Nov 12, 2019 at 8:58 AM Vinoth Chandar wrote: > We can wait for others to chime in as well. :) > > On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu > wrote: > > >

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread vino yang
Hi Vinoth, Thanks for bringing up these proposals. +1 on all three, and an especially big +1 on the third renaming proposal. When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. The word "copy" easily misleads users and makes them compare it with the `CopyOnWriteArrayList` data

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Shiyan Xu
[1] +1; "query" indeed sounds better [2] +1 on the term "snapshot"; so basically we follow the convention that when we say "snapshot", it means "give me the most up-to-date facts (lowest data latency) even if it takes some query time" [3] Though I agree with the renaming, I have a different

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Bhavani Sudha
+1 on all three rename proposals. I think this would make the concepts super easy to follow for new users. If changing [3] seems to be a stretch, we should definitely do [1] & [2] at the least IMO. I will be glad to help out on the renames to whatever extent possible should the Hudi community

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Vinoth Chandar
We can wait for others to chime in as well. :) On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu wrote: > Yes, Vinoth, you're right that it is more of an exporter, which exports a > snapshot from Hudi dataset. > > It should support MOR too; it shall just leverage on existing > SnapshotCopier logic to

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Shiyan Xu
Yes, Vinoth, you're right that it is more of an exporter, which exports a snapshot from Hudi dataset. It should support MOR too; it shall just leverage on existing SnapshotCopier logic to find the latest file slices. So is it good to create a RFC for further discussion? On Mon, Nov 11, 2019 at

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Vinoth Chandar
What you suggest sounds more like an `Exporter` tool? I imagine you will support MOR as well? +1 on the idea itself. It could be useful if plain parquet snapshot was generated as a backup. On Mon, Nov 11, 2019 at 4:21 PM Shiyan Xu wrote: > Hi All, > > The existing SnapshotCopier under Hudi

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread Vinoth Chandar
Yes, sounds good. As of now, it's just Kabeer. @kabeer wdyt? @nishith Personally, timing is an issue for me; if you are willing to drive, please go ahead! I'll try to make it if possible. On Mon, Nov 11, 2019 at 8:25 AM nishith agarwal wrote: > Vinoth, > > To meet mid way, how about once in 3

Re: [Discuss] Feedback on Hudi improvements

2019-11-11 Thread Scheller, Brandon
Yep, you are correct that it is throwing the exception because of the DataSourceUtils.getNestedFieldValAsString. I can take up the work to fix this behavior if it is not intended. I'd also like to add extra error messaging and validation because currently it is not clear to users what the error

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread nishith agarwal
Vinoth, To meet mid way, how about once in 3 weeks for Europe and other time zones ? That works fine for me. In the interest of making the meetings useful for everyone, we can see how productive the meetings are/% attendance for the meetings for the initial few ones, and then may be we can follow

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread Pratyaksh Sharma
That overlaps with my office hours. I will try to attend it in 9 PM to 10 PM PST slot only. :) On Mon, Nov 11, 2019 at 6:07 PM Vinoth Chandar wrote: > I can make early morning PST meetings.i.e before 6AM. > > On Sun, Nov 10, 2019 at 11:22 PM Pratyaksh Sharma > wrote: > > > @Vinoth Chandar

Re: [Discuss] Feedback on Hudi improvements

2019-11-10 Thread Jaimin Shah
Hi Brandon, I contributed to ComplexKeyGenerator some time back. I don't think it is intended behavior. If you are getting an exception, is it because of DataSourceUtils.getNestedFieldValAsString(record, recordKeyField)? I can't think of any other reason why it should throw an exception. I

Re: [Discuss] Feedback on Hudi improvements

2019-11-09 Thread Scheller, Brandon
Thanks for the quick response Balaji! I think there is a lot here to continue with: 1. I did see that recent pull request for the delete API. I think collaborating to support another delete API with just record key would be a great next step. I'll begin looking into it. Additionally, the

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-09 Thread Kabeer Ahmed
Dear Sudha It looks like it is going to be an early call for those in Europe or follow the weekly minutes of the meeting email. Looking at the poll it is quite obvious that 9pm to 10pm PST wins the choice. Thank you so much for running the poll and reporting the stats. Kabeer. On Nov 8 2019,

Re: [Discuss] Feedback on Hudi improvements

2019-11-08 Thread Balaji Varadarajan
Brandon, Great initiative and thoughts. Thanks for writing detailed description on what you are looking to achieve. Here are some of my comments/thoughts: 1. HUDI-326 : There is some work that is happening in this direction. But, we should be able to collaborate on this. Siva has opened

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-08 Thread Bhavani Sudha
Thank you all for the prompt response! I realized I didn't add my preferred times. These are the times that work for me: Mon, Tue, Thu - 9 pm - 11 pm PST; Mon-Thu - 5 am - 6:30 am PST. Here is the summary from responses: - From the 11 responses received so far, 9 of 11 people (including all

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-07 Thread Kabeer Ahmed
Dear Sudha Really appreciate the initiative to promptly start this thread. My preferences are as below: Any weekday: 10PM PST to 11PM PST OR 10AM PST TO 2PM PST thank you On Nov 7 2019, at 6:46 am, Pratyaksh Sharma wrote: > Interested. > > Timings: > Mon-Fri 6AM-7.30AM PST > > On Thu, Nov 7,

Re: [Discuss] Creation of database in Hive

2019-11-06 Thread Pratyaksh Sharma
Ok, that is a valid reason. On Thu, Nov 7, 2019 at 2:03 AM Bhavani Sudha wrote: > Ah okay. That is a valid concern. Dint think about admin management for > Hive dbs. > > Thanks, > Sudha > > On Wed, Nov 6, 2019 at 12:28 PM Balaji Varadarajan > wrote: > > > I have a different opinion on this.

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Pratyaksh Sharma
Interested. Timings: Mon-Fri 6AM-7.30AM PST On Thu, Nov 7, 2019 at 11:33 AM Gurudatt Kulkarni wrote: > Interested. > > Mon-Thu 5AM-6:30AM PST > Mon-Thu 9PM-10:30PM PST > > These timings work for me. > > > On Thu, Nov 7, 2019 at 10:20 AM Gary Li wrote: > > > Interested. > > Mon-Thu 8 PM-11

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Gurudatt Kulkarni
Interested. Mon-Thu 5AM-6:30AM PST Mon-Thu 9PM-10:30PM PST These timings work for me. On Thu, Nov 7, 2019 at 10:20 AM Gary Li wrote: > Interested. > Mon-Thu 8 PM-11 PM PST. > It's very difficult to cover America, Europe, and Asia in the same meeting. > Maybe we can have US and US two

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Gary Li
Interested. Mon-Thu 8 PM-11 PM PST. It's very difficult to cover America, Europe, and Asia in the same meeting. Maybe we can have US and US two sessions and make them biweekly? On Wed, Nov 6, 2019 at 7:12 PM Taher Koitawala wrote: > Hi All, >Mon-Thu 5AM-6:30AM PST >

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Taher Koitawala
Hi All, Mon-Thu 5AM-6:30AM PST Mon-Thu 9PM-10:30PM PST Works for me On Thu, Nov 7, 2019, 7:26 AM Nishith wrote: > Following times work for me > > Evening : Mon-Thu, 9pm - 1am > > Unfortunately, can’t do mornings. > > Sent from my iPhone > > > On Nov 6, 2019, at 4:51 PM,

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Nishith
Following times work for me Evening : Mon-Thu, 9pm - 1am Unfortunately, can’t do mornings. Sent from my iPhone > On Nov 6, 2019, at 4:51 PM, Y. Ethan Guo wrote: > > I'm interested in attending each weekly meeting. My preferred times: > > Morning: Wed, Fri, 5AM - 7:30AM PT > Evening: Mon -

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread leesf
Thanks Sudha. Interested. Tue - Thu, 8:30PM - 10:00PM PST Wed - Fri, 3:00AM - 4:30AM PST On Thu, Nov 7, 2019 at 8:52 AM Y. Ethan Guo wrote: > I'm interested in attending each weekly meeting. My preferred times: > > Morning: Wed, Fri, 5AM - 7:30AM PT > Evening: Mon - Thu, 8PM - 11PM PT > > > On Wed, Nov 6,

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Balaji Varadarajan
Thanks Sudha. The following times work for me : Mon, Tue, Thursday - 9 p.m to 12 a.m PST Wed - 5:00 to 6:00 am and 9:30 p.m to 12 a.m PST On Wed, Nov 6, 2019 at 12:31 PM Vinoth Chandar wrote: > Interested. > > Mon-Thu 5AM-6:30AM PST > Mon-Thu 9PM-10:30PM PST > > > On Wed, Nov 6, 2019 at

Re: [Discuss] Creation of database in Hive

2019-11-06 Thread Bhavani Sudha
Ah okay. That is a valid concern. Didn't think about admin management for Hive dbs. Thanks, Sudha On Wed, Nov 6, 2019 at 12:28 PM Balaji Varadarajan wrote: > I have a different opinion on this. Usually, in production deployments > (at least whatever I am aware of), database is generally managed

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-06 Thread Vinoth Chandar
Interested. Mon-Thu 5AM-6:30AM PST Mon-Thu 9PM-10:30PM PST On Wed, Nov 6, 2019 at 12:28 PM Bhavani Sudha wrote: > Hello all, > > Currently the weekly sync meeting is scheduled to run on Tuesdays from 9pm > PST to 10 pm PST. Given our users are from multiple time zones, we can try > to see

Re: [Discuss] Creation of database in Hive

2019-11-06 Thread Balaji Varadarajan
I have a different opinion on this. Usually, in production deployments (at least whatever I am aware of), the database is generally managed at the org/group level. Privacy policies like ACLs are usually applied at the database level and would need first-level management by admins. With such a setup, its

Re: [Discuss] Creation of database in Hive

2019-11-06 Thread Bhavani Sudha
+1 I think we should create db if it does not exist. On Tue, Nov 5, 2019 at 11:08 PM Pratyaksh Sharma wrote: > Hi, > > While doing hive sync using HiveSyncTool, we first check if the target > table exists in hive. If not, we try to create it. However in this flow, if > the database itself does
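The flow discussed in this thread (check the database, then the table, during Hive sync) can be sketched as follows. This is a minimal illustration only, not HiveSyncTool code: `FakeCatalog`, `sync_table`, and the `auto_create_db` flag are hypothetical names, and the flag exists to reflect the concern raised in the thread that databases are often managed by admins.

```python
class FakeCatalog:
    """Hypothetical in-memory stand-in for a metastore client (not a Hive API)."""

    def __init__(self):
        self.databases = {}  # database name -> set of table names

    def database_exists(self, db):
        return db in self.databases

    def create_database(self, db):
        self.databases.setdefault(db, set())

    def table_exists(self, db, table):
        return table in self.databases.get(db, set())

    def create_table(self, db, table):
        self.databases[db].add(table)


def sync_table(catalog, db, table, auto_create_db=False):
    """Ensure `db.table` exists; create the database only if explicitly allowed."""
    if not catalog.database_exists(db):
        if not auto_create_db:
            # Default: fail with a clear message and leave db creation to admins.
            raise RuntimeError(f"database {db!r} does not exist")
        catalog.create_database(db)
    if not catalog.table_exists(db, table):
        catalog.create_table(db, table)
```

Keeping the flag off by default preserves admin-managed setups, while deployments that want the convenience can opt in.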

Re: DISCUSS RFC 6 - Add indexing support to the log file

2019-10-30 Thread Nishith
Thanks for the detailed design write up Vinoth. I concur with the others on option 2, default indexing as off and enable it when we have enough confidence on stability & performance. Although, I do think practically it might be good to have the code in place for users who might revert to an

Re: DISCUSS RFC 6 - Add indexing support to the log file

2019-10-30 Thread Balaji Varadarajan
Thanks Vinoth for proposing a clean and extendable design. The overall design looks great. Another rollout option is to only use consolidated log index for index lookup if latest "valid" log block has been written in new format. If that is not the case, we can revert to scanning previous log
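The rollout option described above can be read as a simple decision rule. The sketch below is only an illustration under stated assumptions: the `(version, is_valid)` tuple layout, the `NEW_FORMAT_VERSION` constant, and the strategy names are hypothetical and do not come from the RFC.

```python
NEW_FORMAT_VERSION = 2  # hypothetical version number for log blocks that carry the index


def choose_lookup(blocks):
    """Pick a lookup strategy from an oldest-first list of (version, is_valid) tuples.

    The consolidated index is used only when the latest valid block was written
    in the new format; otherwise fall back to scanning the log blocks.
    """
    for version, is_valid in reversed(blocks):
        if is_valid:
            if version >= NEW_FORMAT_VERSION:
                return "consolidated_index"
            return "scan_blocks"
    return "scan_blocks"  # no valid block at all
```

Under this rule, a log that received new-format blocks and then old-format blocks after a code revert falls back to scanning, which is the mixed-version case raised elsewhere in the thread.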

Re: DISCUSS RFC 6 - Add indexing support to the log file

2019-10-29 Thread Bhavani Sudha
I vote for the second option. Also it can give time to analyze on how to deal with backwards compatibility. I ll take a look at the RFC later tonight and get back. On Sun, Oct 27, 2019 at 10:24 AM Vinoth Chandar wrote: > One issue I have some open questions myself > > Is it ok to assume log

Re: DISCUSS RFC 6 - Add indexing support to the log file

2019-10-27 Thread Vinoth Chandar
One issue I have some open questions on myself: is it OK to assume the log will have old data block versions followed by new data block versions? E.g., if we roll out new code and then revert, there could be an arbitrary mix of new and old data blocks. Handling this might make design/code fairly

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-10-24 Thread Vinoth Chandar
Great! On Wed, Oct 23, 2019 at 10:41 PM Vinay Patil wrote: > Thanks a lot Vinoth for opening this jira. > > Will start with the initial design and share the document. > > Regards, > Vinay Patil > > > On Mon, Oct 21, 2019 at 9:36 PM Balaji Varadarajan > wrote: > > > +1. This is a much needed

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-10-23 Thread Vinay Patil
Thanks a lot Vinoth for opening this jira. Will start with the initial design and share the document. Regards, Vinay Patil On Mon, Oct 21, 2019 at 9:36 PM Balaji Varadarajan wrote: > +1. This is a much needed and super useful feature for a lot of folks in > the community. > > Balaji.V

Re: [DISCUSS] Rename HIP process to RFC

2019-10-23 Thread Vinoth Chandar
Thanks all for the constructive comments! Will change the name in cWiki. On Tue, Oct 22, 2019 at 6:27 PM vino yang wrote: > agree Vinoth, +1 > > On Tue, Oct 22, 2019 at 8:31 PM Vinoth Chandar wrote: > > > Good point. Even for HIP we initially had gdoc as the starting point and > > once ratified we planned to

Re: [DISCUSS] Rename HIP process to RFC

2019-10-22 Thread vino yang
agree Vinoth, +1 On Tue, Oct 22, 2019 at 8:31 PM Vinoth Chandar wrote: > Good point. Even for HIP we initially had gdoc as the starting point and > once ratified we planned to move it to cwiki. But practical issues like > retaining formatting, porting over diagrams, version history between two > things made

Re: [DISCUSS] Rename HIP process to RFC

2019-10-22 Thread Vinoth Chandar
Good point. Even for HIP we initially had gdoc as the starting point and once ratified we planned to move it to cwiki. But practical issues like retaining formatting, porting over diagrams, version history between two things made it cumbersome. So IMO single place is actually good. Wdyt? On Tue,

Re: [DISCUSS] Rename HIP process to RFC

2019-10-22 Thread vino yang
+1 agree Thomas: For some general ideas, we can write gdoc and open a "DISCUSS" ML thread. Best, Vino On Tue, Oct 22, 2019 at 12:45 PM Thomas Weise wrote: > Just in case that wasn't considered: Not every document needs to be on > cwiki, it is perfectly fine to write up ideas that are not a formal "HIP" >

Re: [DISCUSS] Rename HIP process to RFC

2019-10-21 Thread Thomas Weise
Just in case that wasn't considered: Not every document needs to be on cwiki, it is perfectly fine to write up ideas that are not a formal "HIP" in gdocs or similar. Thomas On Mon, Oct 21, 2019 at 9:40 PM Nishith wrote: > +1 > > Encourages folks to read and write designs/ideas. > > Sent from

Re: [DISCUSS] Rename HIP process to RFC

2019-10-21 Thread Nishith
+1 Encourages folks to read and write designs/ideas. Sent from my iPhone > On Oct 21, 2019, at 6:30 PM, leesf wrote: > > +1 > > Best, > Leesf > > wrote on Tue, Oct 22, 2019 at 3:40 AM: > >> +1 >> >> Balaji.V On Monday, October 21, 2019, 11:38:01 AM PDT, Y. Ethan Guo >> wrote: >> >> +1 on RFC.

Re: [DISCUSS] Rename HIP process to RFC

2019-10-21 Thread leesf
+1 Best, Leesf wrote on Tue, Oct 22, 2019 at 3:40 AM: > +1 > > Balaji.V On Monday, October 21, 2019, 11:38:01 AM PDT, Y. Ethan Guo > wrote: > > +1 on RFC. It's good to have a few pages of RFC to get a quick look of an > idea. It doesn't have to be a full standard like some IETF RFCs. > > On Mon,

Re: [DISCUSS] Rename HIP process to RFC

2019-10-21 Thread vbalaji
+1 Balaji.V On Monday, October 21, 2019, 11:38:01 AM PDT, Y. Ethan Guo wrote: +1 on RFC.  It's good to have a few pages of RFC to get a quick look of an idea.  It doesn't have to be a full standard like some IETF RFCs. On Mon, Oct 21, 2019 at 5:31 AM Taher Koitawala wrote: > Agree

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-10-21 Thread Balaji Varadarajan
+1. This is a much needed and super useful feature for a lot of folks in the community. Balaji.V On Monday, October 21, 2019, 7:08:30 AM PDT, Vinoth Chandar wrote: https://issues.apache.org/jira/browse/HUDI-310 tracks this. Love to get this into the next release as much as possible

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-10-21 Thread Vinoth Chandar
https://issues.apache.org/jira/browse/HUDI-310 tracks this. Love to get this into the next release as much as possible :) On Thu, Oct 17, 2019 at 10:16 PM Vinoth Chandar wrote: > No problem. Having kinesis will get us a compelling story for cloud data > ingestion > > On Thu, Oct 17, 2019 at

Re: [DISCUSS] Rename HIP process to RFC

2019-10-21 Thread Taher Koitawala
Agree Vinoth +1 Regards, Taher Koitawala On Mon, Oct 21, 2019, 5:49 PM Bhavani Sudha wrote: > +1 on RFC. Makes sense to me. > > > On Sun, Oct 20, 2019 at 8:29 PM Vinoth Chandar wrote: > > > Someone asked me this and made me thinking about it. While HIP process > > covers concrete proposals to

Re: [DISCUSS] Rename HIP process to RFC

2019-10-21 Thread Bhavani Sudha
+1 on RFC. Makes sense to me. On Sun, Oct 20, 2019 at 8:29 PM Vinoth Chandar wrote: > Someone asked me this and made me thinking about it. While HIP process > covers concrete proposals to Hudi, sometimes we may need to just write up > some ideas and solicit comments (e.g HudiLink > >

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-10-17 Thread Vinoth Chandar
No problem. Having kinesis will get us a compelling story for cloud data ingestion On Thu, Oct 17, 2019 at 8:38 PM Vinay Patil wrote: > Hi Vinoth, > > Sry to miss these, busy with on-call issues for the last couple of weeks. > > Will create a ticket for tracking this , I will be actively

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-10-15 Thread Vinoth Chandar
Just wanted to bump this thread and see if anyone is actively working on kinesis support On Mon, Sep 23, 2019 at 11:51 AM Vinoth Chandar wrote: > I think we are on the same page. Thanks for clarifying! > Note on implementation: it would be great if we can reuse the spark > streaming connector

Re: [DISCUSS] cleaning up git history from Notice/License changes

2019-10-07 Thread Vinoth Chandar
https://issues.apache.org/jira/browse/HUDI-295 now tracks this. On Thu, Oct 3, 2019 at 5:45 PM leesf wrote: > +1 on cleanup. > > Best, > Leesf > > On Fri, Oct 4, 2019 at 5:53 AM Bhavani Sudha Saktheeswaran wrote: > > > +1 . Thats a good idea. > > > > > > > > On Thu, Oct 3, 2019 at 2:32 PM

Re: [DISCUSS] cleaning up git history from Notice/License changes

2019-10-03 Thread leesf
+1 on cleanup. Best, Leesf On Fri, Oct 4, 2019 at 5:53 AM Bhavani Sudha Saktheeswaran wrote: > +1 . Thats a good idea. > > > > On Thu, Oct 3, 2019 at 2:32 PM vbal...@apache.org > wrote: > > > > > +1 on both cleanup. This would keep the git history clean and consistent > > with contribution. > > Balaji.V

Re: [DISCUSS] cleaning up git history from Notice/License changes

2019-10-03 Thread Bhavani Sudha Saktheeswaran
+1 . Thats a good idea. On Thu, Oct 3, 2019 at 2:32 PM vbal...@apache.org wrote: > > +1 on both cleanup. This would keep the git history clean and consistent > with contribution. > Balaji.VOn Thursday, October 3, 2019, 09:53:46 AM PDT, Vinoth Chandar < > vin...@apache.org> wrote: > >

Re: [DISCUSS] cleaning up git history from Notice/License changes

2019-10-03 Thread vbal...@apache.org
+1 on both cleanup. This would keep the git history clean and consistent with contribution. Balaji.V On Thursday, October 3, 2019, 09:53:46 AM PDT, Vinoth Chandar wrote: Folks, As we iterate across the RCs, we have added and removed to the NOTICE/LICENSE files a lot. Does anyone feel

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-10-02 Thread Vinoth Chandar
Based on some conversations I had with Flink folks including Hudi's very own mentor Thomas, it seems future proof to look into supporting the Flink streaming APIs. The batch apis IIUC will move towards converging with Streaming APIs, which matches Hudi's model anyway. From Hudi's perspective,

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-26 Thread Taher Koitawala
Hi Vinoth, IMHO we should stick to Spark for micro-batching for 2 reasons. 1: Ease of use. 2: Performance. Flink batch is not as fast as Spark. Also, the rich library of functions and the ease of integration that Spark has with Hive etc. are not there in Flink batch. Regards, Taher

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-25 Thread Taher Koitawala
Hi Vino, Agree with your suggestion. We all know when thought Flink is streaming we can control how files get rolled out through checkpointing configurations. Bad config and small files get rolled out. Good config and files are properly sized. Also I understand the concern of

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-25 Thread vino yang
Hi, A simple example: in the Hudi project, you can find many code snippets like `spark.read().format().load()`. The load method accepts any path, especially DFS paths. If we only want to use Flink streaming, however, there is no good way to read HDFS now. In addition, we also need to consider other

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi Vino, This is not a design for Hudi on Flink. This was simply a mock up of tagLocations() spark cache to Flink state as Vinoth wanted to see. As per the Flink batch and Streaming I am well aware of the batch and Stream unification efforts of Flink. However I think that is still on

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread vino yang
Hi Taher, As I mentioned in the previous mail. Things may not be too easy by just using Flink state API. Copied here "Hudi can connect with many different Source/Sinks. Some file-based reads are not appropriate for Flink Streaming." Although, unify Batch and Streaming is Flink's goal. But, it

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi All, Sample code to see how records tagging will be handled in Flink is posted on [1]. The main class to run the same is MockHudi.java with a sample path for checkpointing. As of now this is just a sample to know we should be caching in Flink states with bare minimum configs.

Re: [DISCUSS] Hudi with Nifi

2019-09-24 Thread Vinoth Chandar
Sg, lets capture these discussions in the JIRA (link to the discussion thread should suffice) and we can revisit one by one.. On Mon, Sep 23, 2019 at 8:31 PM Taher Koitawala wrote: > Sure Vinoth, I think we need to try this out and check how it fits together > and how deployable it is. > > On

Re: [DISCUSS] Hudi with Nifi

2019-09-23 Thread Taher Koitawala
Sure Vinoth, I think we need to try this out and check how it fits together and how deployable it is. On Sun, Sep 22, 2019, 7:01 PM Vinoth Chandar wrote: > See a lot of Spark Streaming receiver based approach code there, which > makes me a bit worried about scalability. > > Nonetheless. API

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-09-22 Thread Vinoth Chandar
+1 For now we can keep this in hudi-utilities itself IMO. As for the connector, or DeltaStreamer Source to be specific, should we just integrate with Kinesis? If DynamoDB will pump its changes into Kinesis anyway, why should we be aware of DynamoDB directly? Also we may need to rethink how we are going

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-09-21 Thread Bhavani Sudha Saktheeswaran
+1 to adding more connectors to DeltaStreamer and making them pluggable modules as much as possible, like Vino Yang suggested. On Sat, Sep 21, 2019 at 7:12 PM vino yang wrote: > + 1 to introduce these connectors. It's nice to see that Hudi's ecosystem > is growing. As Hudi connects to more and

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-21 Thread Vinay Patil
Hi Taher, I agree with this , if the state is becoming too large we should have an option of storing it in external state like File System or RocksDb. @Vinoth Chandar can the state of HoodieBloomIndex go beyond 10-15 GB Regards, Vinay Patil On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala

Re: [DISCUSS] Hudi with Nifi

2019-09-21 Thread Taher Koitawala
Hi Vinoth, Nifi has the capability to pass data to a custom spark job. However that is done through a StreamingContext, not sure if we can build something on this. I'm trying to wrap my head around how to fit the StreamingContext in our existing code. Here is an example:

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-09-21 Thread Vinay Patil
Hi Taher, Basically this can be proposal to support Kinesis and DynamoDb stream support can be enabled by reusing this source code. Flink has provided support for DynamoDb Streams by reusing Kinesis Streams classes. Regards, Vinay Patil On Sat, Sep 21, 2019 at 4:26 PM Taher Koitawala wrote:

Re: [DISCUSS][VOTE] DyanamoDB Streams support in Hudi

2019-09-21 Thread Taher Koitawala
That would be a great addition Vinay. How about adding Kinesis as well? Regards, Taher Koitawala On Sat, Sep 21, 2019, 4:20 PM Vinay Patil wrote: > Hi Team, > > The DynamoDb streams contains the CDC data when enabled on a DynamoDb > table, we can add a source for DeltaStreamer which will

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-20 Thread Taher Koitawala
Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex with HeapState, RocksDBState and FsState but on Spark. On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala wrote: > Hi Vinoth, >Having seen the doc and code. I understand the > HoodieBloomIndex mainly caches key

Re: [DISCUSS] Hudi with Nifi

2019-09-18 Thread Taher Koitawala
I think we will have to make a Nifi Processor. The Nifi processor should host all what do with Spark to write data. We will have to scope out the work on this and compactions. Regards, Taher Koitawala On Wed, Sep 18, 2019, 8:30 PM Suneel Marthi wrote: > Adding Nifi dev@ to this thread. > > >

Re: [DISCUSS] Hudi with Nifi

2019-09-18 Thread Suneel Marthi
Adding Nifi dev@ to this thread. On Wed, Sep 18, 2019 at 10:57 AM Vinoth Chandar wrote: > Not too familiar wth Nifi myself. Would this still target an use-case like > what pratyaksh mentioned? > For delta streamer specifically, we are moving more and more towards > continuous mode, where >

Re: [DISCUSS] Hudi with Nifi

2019-09-18 Thread Vinoth Chandar
Not too familiar with Nifi myself. Would this still target a use-case like what Pratyaksh mentioned? For delta streamer specifically, we are moving more and more towards continuous mode, where Hudi writing and compaction are managed by a single long-running Spark application. Would Nifi

Re: [DISCUSS] Hudi with Nifi

2019-09-18 Thread Taher Koitawala
That's another way of doing things. I want to know if someone wrote something like PutParquet. Which directly can write data to Hudi. AFAIK I don't think anyone has. That will really be powerful. On Wed, Sep 18, 2019, 1:37 PM Pratyaksh Sharma wrote: > Hi Taher, > > In the initial phase of our

Re: [DISCUSS] Hudi with Nifi

2019-09-18 Thread Pratyaksh Sharma
Hi Taher, In the initial phase of our CDC pipeline, we were using Hudi with Nifi. Nifi was being used to read Binlog file of mysql and to push that data to some Kafka topic. This topic was then getting consumed by DeltaStreamer. So Nifi was indirectly involved in that flow. On Wed, Sep 18, 2019

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-17 Thread Taher Koitawala
Hi Vinoth, Having seen the doc and code. I understand the HoodieBloomIndex mainly caches key and partition path. Can we address how Flink does it? Like, have HeapState where the user chooses to cache the Index on heap, RockDBState where indexes are written to RocksDB and finally

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
Alright then. Happy to take the lead here. But please give me a week or so, to finish up the spark bundling and other jar issues.. Too much context switching :) On Mon, Sep 16, 2019 at 6:57 PM vino yang wrote: > Hi guys, > > Currently, I am busy with HUDI-203[1] and other things. > > I agree

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vino yang
Hi guys, Currently, I am busy with HUDI-203[1] and other things. I agree with Vinoth that we should try to find a new solution to decouple the dependency with the Spark RDD cache. It's an excellent way to start this big work. [1]: https://issues.apache.org/jira/browse/HUDI-203

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vbal...@apache.org
+1 This is a pretty large undertaking. While the community is getting their hands dirty and ramping up on Hudi internals, it would be productive if Vinoth shepherds this. Balaji.V On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar wrote: sg. :) I will wait for others on

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
sg. :) I will wait for others on this thread as well to chime in. On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala wrote: > Vinoth, I think right now given your experience with the project you should > be scoping out what needs to be done to take us there. So +1 for giving you > more work :) >

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Taher Koitawala
Vinoth, I think right now given your experience with the project you should be scoping out what needs to be done to take us there. So +1 for giving you more work :) We want to reach a point where we can start scoping out addition of Flink and Beam components within. Then I think will tremendous

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
I still feel the key thing here is reimplementing HoodieBloomIndex without needing spark caching. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global) documents the spark DAG in detail. If everyone feels, it's best for me to scope the work out, then happy

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-16 Thread Vinoth Chandar
It should work like any other source, and none of the others are aware of whether DeltaStreamer is running in continuous mode or not. Simplistically, it just needs a config to denote an incremental field - say `_last_modified_at` and we use that as a checkpoint to tail that table by including a
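The checkpoint idea in this message can be sketched as below. This is a minimal illustration using Python's built-in `sqlite3` in place of a real JDBC source; `pull_increment` and its parameters are hypothetical names, not DeltaStreamer APIs, and the strictly-greater-than filter is an assumption (the original message is truncated before it says how the checkpoint is applied).

```python
import sqlite3


def pull_increment(conn, table, incremental_field, checkpoint=None):
    """Return (rows, new_checkpoint) for rows changed after `checkpoint`.

    With no checkpoint, the whole table is read (the initial load).
    """
    # Identifiers cannot be bound as SQL parameters, so validate them instead.
    for name in (table, incremental_field):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT * FROM {table}"
    params = ()
    if checkpoint is not None:
        sql += f" WHERE {incremental_field} > ?"
        params = (checkpoint,)
    sql += f" ORDER BY {incremental_field}"
    cursor = conn.execute(sql, params)
    columns = [d[0] for d in cursor.description]
    rows = [dict(zip(columns, r)) for r in cursor.fetchall()]
    # The largest value seen becomes the checkpoint for the next pull.
    new_checkpoint = rows[-1][incremental_field] if rows else checkpoint
    return rows, new_checkpoint
```

Calling this repeatedly and feeding each returned checkpoint into the next call tails the table, whether the caller runs once or in a loop. A strict `>` filter can miss rows that share the checkpoint's timestamp and arrive later, so a real source would need to decide how to handle ties.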

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Taher Koitawala
Guys, I think we are slowing down on this again. We need to start planning small tasks towards this. VC, can you please help fast-track this? Regards, Taher Koitawala On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar wrote: > Look forward to the analysis. A key class to read would be >

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-16 Thread Taher Koitawala
Will this use the same implementation as session.read.jdbc(""), calling that code continuously, the way we run Hudi in continuous mode? On Mon, Sep 16, 2019 at 9:09 PM Vinoth Chandar wrote: > Thanks, Taher! Any takers for driving this? This is something I would be > very interested

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-16 Thread Vinoth Chandar
Thanks, Taher! Any takers for driving this? This is something I would be very interested in getting involved with. Don't have the bandwidth atm :/ On Sun, Sep 15, 2019 at 11:15 PM Taher Koitawala wrote: > Thank you all for your support. JIRA filed at >

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-16 Thread Taher Koitawala
Thank you all for your support. JIRA filed at https://issues.apache.org/jira/browse/HUDI-251 Regards, Taher Koitawala On Mon, Sep 16, 2019 at 11:34 AM Taher Koitawala wrote: > Since everyone is fully on board, I am creating a JIRA to track this. > > On Sun, Sep 15, 2019 at 9:47 AM

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-16 Thread Taher Koitawala
Since everyone is fully on board, I am creating a JIRA to track this. On Sun, Sep 15, 2019 at 9:47 AM vbal...@apache.org wrote: > > +1. Agree with everyone's point. Go for it, Taher!! > Balaji.V On Saturday, September 14, 2019, 07:44:04 PM PDT, Bhavani > Sudha Saktheeswaran wrote: > > +1 I

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-14 Thread vbal...@apache.org
+1. Agree with everyone's point. Go for it, Taher!! Balaji.V On Saturday, September 14, 2019, 07:44:04 PM PDT, Bhavani Sudha Saktheeswaran wrote: +1 I think adding new sources to DeltaStreamer is really valuable. Thanks, Sudha On Sat, Sep 14, 2019 at 7:52 AM vino yang wrote: > Hi

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-14 Thread Bhavani Sudha Saktheeswaran
+1 I think adding new sources to DeltaStreamer is really valuable. Thanks, Sudha On Sat, Sep 14, 2019 at 7:52 AM vino yang wrote: > Hi Taher, > > IMO, it's a good supplement to Hudi. > > So +1 from my side. > > Vinoth Chandar wrote on Sat, Sep 14, 2019 at 10:23 PM: > > > Hi Taher, > > > > I am fully

Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

2019-09-14 Thread Vinoth Chandar
Hi Taher, I am fully onboard on this. This is such a frequently asked question and having it all doable with a simple DeltaStreamer command would be really powerful. +1 - Vinoth On 2019/09/14 05:51:05, Taher Koitawala wrote: > Hi All, > Currently, we are trying to pull data

Re: [DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-24 Thread Vinoth Chandar
Thanks for driving this! :) On Sat, Aug 24, 2019 at 4:39 PM vino yang wrote: > Hi guys, > > Glad to see that Hudi's docs now support the Jekyll-multiple-languages plugin. > It's the precondition for contributing Chinese translations of the docs. > Thanks to Vinoth. > > Now, welcome to start our

Re: [DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-24 Thread vino yang
Hi guys, Glad to see that Hudi's docs now support the Jekyll-multiple-languages plugin. It's the precondition for contributing Chinese translations of the docs. Thanks to Vinoth. Now, welcome to start our contribution. Please note, if you want to contribute to Chinese documents, you should

Re: [DISCUSS] Suggestion for Docs UI

2019-08-22 Thread vino yang
+1, good idea. vbal...@apache.org wrote on Fri, Aug 23, 2019 at 3:46 AM: > > +1, I like the idea. It would also make the whole page modular. > Balaji.V On Thursday, August 22, 2019, 12:40:11 PM PDT, Vinoth Chandar < > vin...@apache.org> wrote: > > +1 I was thinking along similar lines for the demo page > >

Re: [DISCUSS] Suggestion for Docs UI

2019-08-22 Thread vbal...@apache.org
+1, I like the idea. It would also make the whole page modular. Balaji.V On Thursday, August 22, 2019, 12:40:11 PM PDT, Vinoth Chandar wrote: +1 I was thinking along similar lines for the demo page. Our doc theme should already support this

Re: [DISCUSS] Suggestion for Docs UI

2019-08-22 Thread Vinoth Chandar
+1 I was thinking along similar lines for the demo page. Our doc theme should already support this: https://idratherbewriting.com/documentation-theme-jekyll/mydoc_navtabs.html On Thu, Aug 22, 2019 at 12:04 PM Bhavani Sudha Saktheeswaran wrote: > Hi all, > > I was going through the
