Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Balaji Varadarajan
Agree with all 3 changes. The naming now looks more consistent than earlier. +1 on them. Depending on whether we are renaming Input formats for (1) and (2) - this could require some migration steps for Balaji.V On Mon, Nov 11, 2019 at 7:38 PM vino yang wrote: > Hi Vinoth, > > Thanks for

Re: DISCUSS RFC 7 - Point in time queries on Hudi table (Time-Travel)

2019-11-11 Thread Balaji Varadarajan
+1. This would be a powerful feature which would open up use-cases requiring repeatable query results. Balaji.V On Mon, Nov 11, 2019 at 8:12 AM nishith agarwal wrote: > Folks, > > Starting a discussion thread for enabling time-travel for Hudi datasets. > Please provide feedback on the RFC

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread vino yang
Hi Shiyan, +1 for this proposal. Also, it looks like an exporter tool. @Vinoth Chandar Any thoughts about where to place it? Best, Vino Vinoth Chandar wrote on Tue, Nov 12, 2019 at 8:58 AM: > We can wait for others to chime in as well. :) > > On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu > wrote: > > >

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread vino yang
Hi Vinoth, Thanks for bringing these proposals. +1 on all three. Especially, big +1 on the third renaming proposal. When I was a newbie, the "COPY_ON_WRITE" term confused me a lot. It easily misleads users on the "copy" term and makes them compare it with the `CopyOnWriteArrayList` data

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Shiyan Xu
[1] +1; "query" indeed sounds better [2] +1 on the term "snapshot"; so basically we follow the convention that when we say "snapshot", it means "give me the most up-to-date facts (lowest data latency) even if it takes some query time" [3] Though I agree with the renaming, I have a different

Re: [DISCUSS] Simplification of terminologies

2019-11-11 Thread Bhavani Sudha
+1 on all three rename proposals. I think this would make the concepts super easy to follow for new users. If changing [3] seems to be a stretch, we should definitely do [1] & [2] at the least IMO. I will be glad to help out on the renames to whatever extent possible should the Hudi community

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Vinoth Chandar
We can wait for others to chime in as well. :) On Mon, Nov 11, 2019 at 4:37 PM Shiyan Xu wrote: > Yes, Vinoth, you're right that it is more of an exporter, which exports a > snapshot from Hudi dataset. > > It should support MOR too; it shall just leverage on existing > SnapshotCopier logic to

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Shiyan Xu
Yes, Vinoth, you're right that it is more of an exporter, which exports a snapshot from a Hudi dataset. It should support MOR too; it shall just leverage the existing SnapshotCopier logic to find the latest file slices. So is it good to create an RFC for further discussion? On Mon, Nov 11, 2019 at

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Vinoth Chandar
What you suggest sounds more like an `Exporter` tool? I imagine you will support MOR as well? +1 on the idea itself. It could be useful if a plain parquet snapshot were generated as a backup. On Mon, Nov 11, 2019 at 4:21 PM Shiyan Xu wrote: > Hi All, > > The existing SnapshotCopier under Hudi

[DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-11 Thread Shiyan Xu
Hi All, The existing SnapshotCopier under Hudi Utilities is a Hudi-to-Hudi copy and primarily for backup purposes. I would like to start an RFC for a more generic Hudi snapshotter, which - Supports existing SnapshotCopier features - Adds an option to export a Hudi dataset to plain parquet files

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread Vinoth Chandar
Yes, sounds good. As of now, it's just Kabeer. @kabeer wdyt? @nishith Personally, timing is an issue for me; if you are willing to drive, please go ahead! I'll try to make it if possible. On Mon, Nov 11, 2019 at 8:25 AM nishith agarwal wrote: > Vinoth, > > To meet mid way, how about once in 3

[DISCUSS] Simplification of terminologies

2019-11-11 Thread Vinoth Chandar
Hello all, I wanted to raise an important topic with the community around whether we should rename some of our terminologies in code/docs to be more user-friendly and understandable. Let me also provide some context for each, since I am probably guilty of introducing most of them in the first

Re: Migrate Existing DataFrame to Hudi DataSet

2019-11-11 Thread Zhengxiang Pan
Hi, The snippet for the issue is here: https://gist.github.com/zxpan/c5e989958d7688026f1679e53d2fca44 1) The write script simulates migrating an existing data frame (saved in /tmp/hudi-testing/inserts parquet) 2) The update script simulates an incremental update (saved in /tmp/hudi-testing/updates

Re: [Discuss] Feedback on Hudi improvements

2019-11-11 Thread Scheller, Brandon
Yep, you are correct that it is throwing the exception because of the DataSourceUtils.getNestedFieldValAsString. I can take up the work to fix this behavior if it is not intended. I'd also like to add extra error messaging and validation because currently it is not clear to users what the error

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread nishith agarwal
Vinoth, To meet midway, how about once every 3 weeks for Europe and other time zones? That works fine for me. In the interest of making the meetings useful for everyone, we can see how productive the meetings are and the % attendance for the initial few, and then maybe we can follow

DISCUSS RFC 7 - Point in time queries on Hudi table (Time-Travel)

2019-11-11 Thread nishith agarwal
Folks, Starting a discussion thread for enabling time-travel for Hudi datasets. Please provide feedback on the RFC here. Thanks, Nishith

Re: Migrate Existing DataFrame to Hudi DataSet

2019-11-11 Thread Vinoth Chandar
Hi, On 1, I am wondering if it's related to https://issues.apache.org/jira/browse/HUDI-83 , i.e. support for timestamps. If you can give us a small snippet to reproduce the problem, that would be great. On 2, not sure what's going on. There are no size limitations. Please check if you precombine

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread Pratyaksh Sharma
That overlaps with my office hours. I will try to attend it in the 9 PM to 10 PM PST slot only. :) On Mon, Nov 11, 2019 at 6:07 PM Vinoth Chandar wrote: > I can make early morning PST meetings, i.e. before 6 AM. > > On Sun, Nov 10, 2019 at 11:22 PM Pratyaksh Sharma > wrote: > > > @Vinoth Chandar