Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-19 Thread Vinoth Chandar
Thanks all. Started a VOTE around this. If you can chime in with your +1s there as well, it will be great! On Fri, Jul 16, 2021 at 8:31 PM Gary Li wrote: > +1 for option B. > > On Sat, Jul 17, 2021 at 9:51 AM Udit Mehrotra wrote: > > > +1 for option B. For A, I will need more data points to

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-16 Thread Gary Li
+1 for option B. On Sat, Jul 17, 2021 at 9:51 AM Udit Mehrotra wrote: > +1 for option B. For A, I will need more data points to convince myself if > GitHub issues will provide all the issue tracking functionality that Jira > provides today. > > Thanks, > Udit > > On Fri, Jul 16, 2021 at 2:33 PM

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-16 Thread Udit Mehrotra
+1 for option B. For A, I will need more data points to convince myself if GitHub issues will provide all the issue tracking functionality that Jira provides today. Thanks, Udit On Fri, Jul 16, 2021 at 2:33 PM Vinoth Chandar wrote: > Looks like B has a lot of support, so we can start with it. > I

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-16 Thread Vinoth Chandar
Looks like B has a lot of support, so we can start with it. I will start a VOTE on B alone and we can proceed if the VOTE passes. On Fri, Jul 16, 2021 at 8:05 AM Nishith wrote: > +1 for option B. > > > On Jul 15, 2021, at 10:50 PM, Bhavani Sudha > wrote: > > > > Completely agree on B. On A I feel

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-16 Thread Nishith
+1 for option B. > On Jul 15, 2021, at 10:50 PM, Bhavani Sudha wrote: > > Completely agree on B. On A I feel the necessity to centralize everything > in one place but also without losing the capabilities of Jira. I think we > will have to explore tools either way. > > Thanks, > Sudha >

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-15 Thread Bhavani Sudha
Completely agree on B. On A I feel the necessity to centralize everything in one place but also without losing the capabilities of Jira. I think we will have to explore tools either way. Thanks, Sudha On Thu, Jul 15, 2021 at 10:42 PM vino yang wrote: > +1 for option B. > > Best, > Vino > >

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-15 Thread vino yang
+1 for option B. Best, Vino On Fri, Jul 16, 2021 at 10:35 AM, Sivabalan wrote: > +1 on B. Not sure on A though. I understand the intent to have all in > one place. but not very sure if we can get all functionality (version, > type, component, status, parent-child relation), etc ported over to > github. I

Re: [DISCUSS] Move to spark v2 datasource API

2021-07-15 Thread Sivabalan
I don't have much knowledge wrt catalog, but is there an option of exploring spark catalog based table to create a hudi table? I do know with spark3.2, you can add Distribution(a.k.a partitioning) and Sort order to your table. But still not sure on custom transformation for indexing, etc. Also,

Re: [DISCUSS] scenario-based quickstart demo

2021-07-15 Thread Sivabalan
+1 agree we don't have recipes for each feature as such. would benefit users who are interested in a particular feature. On Tue, Jul 6, 2021 at 2:17 AM Vinoth Chandar wrote: > Hi Raymond, > > Are you suggesting a fix to the dev workflow or general site/quickstart > docs? > > Agree, that the

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-15 Thread Sivabalan
+1 on B. Not sure on A though. I understand the intent to have everything in one place, but I am not sure we can get all the functionality (version, type, component, status, parent-child relation, etc.) ported over to GitHub. I assume labels are the only option we have to achieve these. Probably, we

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-09 Thread Vinoth Chandar
Based on this, I will start consolidating more of the cWiki content to github wiki and master branch? JIRA vs GH Issue still probably needs more feedback. I do see the tradeoffs there. On Fri, Jul 9, 2021 at 2:39 AM wei li wrote: > +1 > > On 2021/07/02 03:40:51, Vinoth Chandar wrote: > > Hi

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-09 Thread wei li
+1 On 2021/07/02 03:40:51, Vinoth Chandar wrote: > Hi all, > > When we incubated Hudi, we made some initial choices around collaboration > tools. I am wondering if they are still optimal, given the scale > of the community at this point. > > Specifically, two points. > > A) Our

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-08 Thread Vinoth Chandar
Any more strong opinions around these? On Mon, Jul 5, 2021 at 7:43 AM Vinoth Chandar wrote: > I had similar views on A actually. JIRA is pretty powerful, queryable. > But, I convinced myself on labelling and then building out dashboards using > SQL (for reports/analytics). > Still having one

Re: [DISCUSS] scenario-based quickstart demo

2021-07-06 Thread Vinoth Chandar
Hi Raymond, Are you suggesting a fix to the dev workflow or general site/quickstart docs? Agree, that the current doc is all-at-once and at least better docs on incrementally testing parts could be useful. It takes a while to learn what to skip and what not to. Thanks Vinoth On Sat, Jul 3,

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-05 Thread Vinoth Chandar
I had similar views on A actually. JIRA is pretty powerful, queryable. But, I convinced myself on labelling and then building out dashboards using SQL (for reports/analytics). Still having one place for issues/prs. For releases, we can directly leverage milestones. We can definitely prioritize B

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-03 Thread Raymond Xu
Just to mention, there are some GitHub plugins that bring JIRA features to GH issues. This one, for example, is free for open source: https://www.zenhub.com/pricing On Fri, Jul 2, 2021 at 8:58 PM Navi Brar wrote: > Hi, > > > +1 on B > > > But I have a slightly orthogonal view on A. I think jira

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-02 Thread Navi Brar
Hi, +1 on B. But I have a slightly orthogonal view on A. I think Jira should stay. It provides a lot more visibility into issue management. You can easily link PRs, wikis, releases, etc., whereas in GitHub everyone will have to dig through the comments, or every GitHub issue might end up having

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-02 Thread vbal...@apache.org
+1 for both A and B. Makes sense to centralize bug tracking and RFCs in github. Balaji.V  On Friday, July 2, 2021, 06:44:06 PM PDT, Vinoth Chandar wrote: Raymond - +1 on your thoughts. Once we have more voices and alignment, we can do one final RFC on cWiki covering everything. Can

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-02 Thread Raymond Xu
+1 for both A and B Also a related suggestion: we can put the release notes and new feature highlights in the release notes section in GitHub releases instead of separately writing them in the asf-site On Fri, Jul 2, 2021 at 11:25 AM Prashant Wason wrote: > +1 for complete Github migration.

Re: [DISCUSS] Hash Index for HUDI

2021-06-30 Thread Vinoth Chandar
I see that we already have a PR up. Will catch up on it and provide some initial comments. Thanks! On Wed, Jun 16, 2021 at 9:02 AM Shawy Geng wrote: > Combining bucket index and bloom filter is a great idea. There is no > conflict between the two in implementation, and the bloom filter info can

Re: [Discuss] Provide a Flag to choose between Flink or Spark

2021-06-27 Thread Vinay Patil
Thanks Danny and Vinoth for your comments. I have created https://issues.apache.org/jira/browse/HUDI-2082 for tracking this. Regards, Vinay Patil On Wed, Jun 16, 2021 at 12:53 PM Vinoth Chandar wrote: > +1 on this effort overall. It will be a little tricky, but doable. > > First thing is

Re: [DISCUSS] Hash Index for HUDI

2021-06-16 Thread Shawy Geng
Combining bucket index and bloom filter is a great idea. There is no conflict between the two in implementation, and the bloom filter info can still be stored in the file to locate records faster. Best, Shawy > On Jun 9, 2021, at 16:23, Thiru Malai wrote: > > Hi, > > This feature seems promising. If we are

Re: [DISCUSS] Hash Index for HUDI

2021-06-16 Thread Shawy Geng
Thank you for your questions and advice. Unlike RFC-08, this one doesn’t introduce an HFile to store the mapping between a record and its location. Having one file group per bucket is one of the options. For one file group per bucket, assigning the bucket id to the file group id is a great idea.

Re: [Discuss] Provide a Flag to choose between Flink or Spark

2021-06-16 Thread Vinoth Chandar
+1 on this effort overall. It will be a little tricky, but doable. First thing is to see how we can replace raw usages of Spark APIs with HoodieEngineContext. It will be cool if we can completely generify DeltaStreamer, but I suspect we will ultimately need Flink/Spark-specific modules. On Wed, Jun

Re: [Discuss] Provide a Flag to choose between Flink or Spark

2021-06-11 Thread Vinay Patil
Thank you Danny for your response. Can we have a JIRA story covering all the refactoring required for the Hudi-Flink code as well? I will create a task if we agree that a flag will be helpful to choose between different runners. Regards, Vinay Patil On Fri, Jun 11, 2021 at 12:34 PM Danny Chan wrote: >

Re: [Discuss] Provide a Flag to choose between Flink or Spark

2021-06-11 Thread Danny Chan
Basically agree with that, but before that we may need some refactoring to the existing code: Move the HoodieFlinkStreamer from the hudi-flink module into the hudi-utilities to be together with the HoodieDeltaStreamer. We are planning to add separate flink compaction programs too, which has the

Re: [DISCUSS] Hash Index for HUDI

2021-06-09 Thread Thiru Malai
Hi, This feature seems promising. If we are planning to assign the filegroupID as the hash mod value, then we can leverage this change in Bloom Index as well by pruning the files based on hash mod value before min/max record_key pruning. So that the exploded RDD will be comparatively smaller

Re: [DISCUSS] Hash Index for HUDI

2021-06-06 Thread Danny Chan
> number of buckets expanded by multiple is recommended The condition is too harsh and the bucket number would grow exponentially. > with hash index can be solved by using multiple file groups per bucket as mentioned in the RFC The relation of file groups and bucket would be too

Re: [DISCUSS] Hash Index for HUDI

2021-06-04 Thread Vinoth Chandar
Thanks for opening the RFC! At first glance, it seemed similar to RFC-08, but the proposal seems to be adding a bucket id to each file group ID? If I may suggest, we should call this BucketedIndex? Instead of changing the existing file name, can we simply assign the filegroupID as the hash mod

Re: [DISCUSS] Hash Index for HUDI

2021-06-04 Thread 耿筱喻
Thank you for your questions. For the first question, expanding the number of buckets by a multiple is recommended. Combine rehashing and clustering to re-distribute the data without shuffling. For example, 2 buckets expands to 4 by splitting the 1st bucket and rehashing data in it to two small
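The doubling idea in this message can be illustrated with a small sketch. It is not taken from the RFC or the PR: the CRC32 hash, the function names, and the choice to split every bucket at once are assumptions made for illustration. It shows why growing by a multiple keeps the move local: with `hash mod n` assignment, a record in bucket `b` can only land in `b` or `b + n` after doubling.

```python
import zlib
from typing import Dict, List


def bucket_of(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def double_buckets(buckets: Dict[int, List[str]], old_n: int) -> Dict[int, List[str]]:
    """Double the bucket count by rehashing each old bucket on its own.

    Keys from old bucket b end up only in b or b + old_n, so no data
    moves between unrelated buckets.
    """
    new_n = old_n * 2
    out: Dict[int, List[str]] = {b: [] for b in range(new_n)}
    for keys in buckets.values():
        for key in keys:
            out[bucket_of(key, new_n)].append(key)
    return out


# Hypothetical data: 1000 keys spread over 2 buckets, then expanded to 4.
keys = [f"key-{i}" for i in range(1000)]
two: Dict[int, List[str]] = {b: [] for b in range(2)}
for k in keys:
    two[bucket_of(k, 2)].append(k)
four = double_buckets(two, 2)
```

Growing from 100 to 120 buckets, as asked earlier in the thread, would not have this property under plain modulo assignment, since most keys would change bucket.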

Re: [DISCUSS] Hash Index for HUDI

2021-06-02 Thread Danny Chan
Thanks for the new feature, very promising ~ Some confusion about the *Scalability* and *Data Skew* part: How do we expand the number of existing buckets? Say we had 100 buckets before but 120 buckets now, what is the algorithm? About the data skew, did you mean there is no good

Re: [DISCUSS] Hash Index for HUDI

2021-06-02 Thread Gary Li
+1. Hash index is very efficient for CDC data with random updates. Also friendly for streaming ingestion. Looking forward to this feature! Best, Gary On Thu, Jun 3, 2021 at 1:51 AM Satish Kotha wrote: > +1. You may want to read this thread > < >

Re: [DISCUSS] Hash Index for HUDI

2021-06-02 Thread Satish Kotha
+1. You may want to read this thread as well. There are minor differences between these threads, but the high level idea is similar. On Wed, Jun 2, 2021

Re: [DISCUSS] Improving hudi user experience by providing more ways to configure hudi jobs

2021-05-24 Thread wangxianghu
Thanks for the reply. Ticket filed: https://issues.apache.org/jira/browse/HUDI-1928 > On May 24, 2021, at 6:41 PM, vino yang wrote: > > also +1, > > IMO, simplifying the complexity of configuration and reducing the cost of > entry for new users are very important for improving user experience. > > It is a

Re: [DISCUSS] Improving hudi user experience by providing more ways to configure hudi jobs

2021-05-24 Thread vino yang
also +1, IMO, simplifying the complexity of configuration and reducing the cost of entry for new users are very important for improving user experience. It is a good proposal to simplify the configuration complexity by introducing some built-in enumerations. But at the same time, it is

Re: [DISCUSS] Improving hudi user experience by providing more ways to configure hudi jobs

2021-05-22 Thread Pratyaksh Sharma
+1 from my side. Introducing new configs based on types definitely improves user experience as compared to supplying full class names. We just need to define the enums properly. On Sat, May 22, 2021 at 9:13 AM wangxianghu wrote: > Hi community: > > > > Here I want to start a discussion about
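The contrast between type-based configs and full class names can be sketched as follows. This is only an illustration of the idea in the message, not Hudi's configuration code: the enum, its members, and the `com.example` class names are hypothetical.

```python
from enum import Enum


class KeyGeneratorType(Enum):
    """Hypothetical short names mapped to hypothetical implementation classes."""
    SIMPLE = "com.example.keygen.SimpleKeyGenerator"
    COMPLEX = "com.example.keygen.ComplexKeyGenerator"


def resolve_class_name(value: str) -> str:
    """Accept a short type name (case-insensitive) or fall back to a full class name."""
    try:
        return KeyGeneratorType[value.upper()].value
    except KeyError:
        # Not a known type: treat the value as a user-supplied class name.
        return value
```

Keeping the fallback means existing jobs that already pass a full class name would continue to work while new users can write `simple`.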

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-28 Thread Vinoth Chandar
Hi Danny, Thanks, I will review this asap. Already, in the "review in progress" column. Thanks Vinoth On Thu, Apr 22, 2021 at 12:49 AM Danny Chan wrote: > > Should we throw together a PoC/test code for an example Flink pipeline > that > will use hudi cdc flags + stateful operators? > > I

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-22 Thread Danny Chan
> Should we throw together a PoC/test code for an example Flink pipeline that will use hudi cdc flags + stateful operators? I have updated the pr https://github.com/apache/hudi/pull/2854, see the test case HoodieDataSourceITCase#testStreamReadWithDeletes. A data source: change_flag | uuid |

Re: [DISCUSS] Hudi is the data lake platform

2021-04-21 Thread wei li
+1, cannot agree more. *aux metadata* and metatable can give Hudi large performance optimizations on the query side, and we can keep developing them. A cache service may be a necessary component in a cloud-native environment. On 2021/04/13 05:29:55, Vinoth Chandar wrote: > Hello all, > > Reading one

Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-21 Thread nishith agarwal
+1 from me as well. This will help in maintainability immensely. -Nishith On Mon, Apr 19, 2021 at 2:06 PM Vinoth Chandar wrote: > Biggest difference from PR 1094 and the current PR open, is the addition of > fallback support and that no moving around of configs in the same PR. > This would

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Keeping compatibility is a must. i.e users should be able to upgrade to the new release with the _hoodie_cdc_flag meta column, and be able to query new data (with this new meta col) alongside old data (without this new meta col). In fact, they should be able to downgrade back to previous versions

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Danny Chan
> Is it providing the ability to author continuous queries on Hudi source tables end-end, given Flink can use the flags to generate retract/upsert streams Yes, that's the key point: with these flags plus Flink stateful operators, we can have a real-time incremental ETL pipeline. For example, a

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Hi Danny, Read up on the Flink docs as well. If we don't actually publish data to the metacolumn, I think the overhead is pretty low w.r.t avro/parquet. Both are very good at encoding nulls. But, I feel it's worth adding a HoodieWriteConfig to control this and since addition of meta columns

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Danny Chan
Hi, i have created a PR here: https://github.com/apache/hudi/pull/2854/files In the PR i do these changes: 1. Add a metadata column: "_hoodie_cdc_operation", i did not add a config option because i can not find a good way to make the code clean, a metadata column is very primitive and a config

Re: [DISCUSS] Hudi is the data lake platform

2021-04-19 Thread Vinoth Chandar
Looks like we have consensus here! Will share the blog PR here once ready. Thanks all! On Fri, Apr 16, 2021 at 8:43 PM Sivabalan wrote: > totally +1 on clarifying Hudi's vision. > > On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal > wrote: > > > +1 > > > > I also believe Hudi is a Data

Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-19 Thread Vinoth Chandar
The biggest difference between PR 1094 and the currently open PR is the addition of fallback support, and that no configs are moved around in the same PR. This would make this effort straightforward IMO. >HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP in their client code, they need to either replace it

Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-19 Thread Vinoth Chandar
+1 from me. Long time coming. On Mon, Apr 19, 2021 at 12:02 PM Ding, Wenning wrote: > Hi, > I planned to refactor the current Hudi configuration framework. lamberken< > https://github.com/lamberken> did similar things before: > https://github.com/apache/hudi/pull/1094 and I’d like to continue

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-19 Thread Danny Chan
Thanks @Sivabalan ~ I agree that parquet and log files should keep their metadata columns in sync, to avoid confusion and special handling in some use cases like compaction. I also agree that adding a metadata column is easier to use for SQL connectors. We can add a metadata column named

Re: [DISCUSS] Hudi is the data lake platform

2021-04-16 Thread Sivabalan
totally +1 on clarifying Hudi's vision. On Wed, Apr 14, 2021 at 3:43 AM nishith agarwal wrote: > +1 > > I also believe Hudi is a Data Platform technology providing many different > functionalities to build modern data lakes, Hudi's table format being just > one of them. I've been using this

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-16 Thread Sivabalan
wrt changes if we plan to add this only to log files, compaction needs to be fixed to omit this column to the minimum. On Fri, Apr 16, 2021 at 9:07 PM Sivabalan wrote: > Just got a chance to read about dynamic tables. sounds interesting. > > some thoughts on your questions: > - yes, just MOR

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-16 Thread Sivabalan
Just got a chance to read about dynamic tables. sounds interesting. some thoughts on your questions: - yes, just MOR makes sense. - But adding this new meta column only to avro logs might incur some non trivial changes. Since as of today, schema of avro and base files are in sync. If this new col

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-15 Thread Danny Chan
Thanks Vinoth ~ Here is a document about the notion of Flink Dynamic Table [1]. Every operator that has accumulated state can handle retractions (UPDATE_BEFORE or DELETE) and then apply new changes (INSERT or UPDATE_AFTER), so that each operator can consume CDC-format messages in a streaming way.
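How a stateful operator consumes the four change flags can be sketched with a toy running sum. This is a plain-Python illustration under stated assumptions, not Flink or Hudi code: the short flag labels, the `(flag, key, amount)` row shape, and the function name are made up for the example.

```python
from collections import defaultdict

# Labels for the four change flags named in the message (chosen for this sketch).
INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE = "+I", "-U", "+U", "-D"


def apply_changelog(changes):
    """Maintain SUM(amount) per key from a stream of (flag, key, amount) rows."""
    totals = defaultdict(int)
    for flag, key, amount in changes:
        if flag in (INSERT, UPDATE_AFTER):
            totals[key] += amount      # apply the new value
        elif flag in (UPDATE_BEFORE, DELETE):
            totals[key] -= amount      # retract the old value
        else:
            raise ValueError(f"unknown change flag: {flag}")
    return dict(totals)


changelog = [
    (INSERT, "a", 10),
    (INSERT, "b", 5),
    (UPDATE_BEFORE, "a", 10),  # retract the old row for "a"
    (UPDATE_AFTER, "a", 7),    # apply the updated row for "a"
    (DELETE, "b", 5),          # retract the deleted row for "b"
]
result = apply_changelog(changelog)
```

Without the UPDATE_BEFORE row the operator would have to remember the previous value of every record itself, which is the cost the change flags are meant to avoid.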

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-15 Thread Vinoth Chandar
Hi, Is the intent of the flag to convey whether an insert, delete, or update changed the record? If so, I would imagine that we do this even for COW tables, since that also supports a logical notion of a change stream using the commit_time meta field. You may be right, but I am trying to understand the

Re: [DISCUSS] Hudi is the data lake platform

2021-04-14 Thread nishith agarwal
+1 I also believe Hudi is a Data Platform technology providing many different functionalities to build modern data lakes, Hudi's table format being just one of them. I've been using this perspective in some of the conference talks already ;) With this rebranding (and hopefully some code/package

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Vinoth Chandar
Thanks everyone for the feedback, so far! On the incremental aspects, that's actually Hudi's core design differentiation. While I believe the ETL today is still largely batch oriented, the way forward for everyone's benefit is indeed - incremental processing. We have already taken a giant step

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Danny Chan
+1 for the vision. Personally I find the incremental ETL part promising; with an engine like Apache Flink we can do intermediate aggregation in a streaming style. Best, Danny Chan On Wed, Apr 14, 2021 at 9:52 AM, leesf wrote: > +1. Cool and promising. > > On Wed, Apr 14, 2021 at 2:57 AM, Mehrotra, Udit wrote: > > > Agree with the

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread leesf
+1. Cool and promising. On Wed, Apr 14, 2021 at 2:57 AM, Mehrotra, Udit wrote: > Agree with the rebranding Vinoth. Hudi is not just a "table format" and we > need to do justice to all the cool auxiliary features/services we have > built. > > Also, timeline metadata service in particular would be a really big

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Mehrotra, Udit
Agree with the rebranding Vinoth. Hudi is not just a "table format" and we need to do justice to all the cool auxiliary features/services we have built. Also, timeline metadata service in particular would be a really big win if we move towards something like that. On 4/13/21, 11:01 AM,

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Pratyaksh Sharma
Definitely we are doing much more than only ingesting and managing data over DFS. +1 from my side as well. :) On Tue, Apr 13, 2021 at 10:02 PM Susu Dong wrote: > I love this rebranding. Totally agree. +1 > > On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu > wrote: > > > +1 The vision looks

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Susu Dong
I love this rebranding. Totally agree. +1 On Wed, Apr 14, 2021 at 1:25 AM Raymond Xu wrote: > +1 The vision looks fantastic. > > On Tue, Apr 13, 2021 at 7:45 AM Gary Li wrote: > > > Awesome summary of Hudi! +1 as well. > > > > Gary Li > > On 2021/04/13 14:13:24, Rubens Rodrigues > > wrote: >

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Raymond Xu
+1 The vision looks fantastic. On Tue, Apr 13, 2021 at 7:45 AM Gary Li wrote: > Awesome summary of Hudi! +1 as well. > > Gary Li > On 2021/04/13 14:13:24, Rubens Rodrigues > wrote: > > Excellent, I agree > > > > On Tue, Apr 13, 2021 at 07:23, vino yang > wrote: > > > > > +1 Excited by

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread vbal...@apache.org
++1. The rewording makes total sense Balaji.V On Tuesday, April 13, 2021, 07:45:16 AM PDT, Gary Li wrote: Awesome summary of Hudi! +1 as well. Gary Li On 2021/04/13 14:13:24, Rubens Rodrigues wrote: > Excellent, I agree > > On Tue, Apr 13, 2021 at 07:23, vino yang wrote: >

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Gary Li
Awesome summary of Hudi! +1 as well. Gary Li On 2021/04/13 14:13:24, Rubens Rodrigues wrote: > Excellent, I agree > > On Tue, Apr 13, 2021 at 07:23, vino yang wrote: > > > +1 Excited by this new vision! > > > > Best, > > Vino > > > > On Tue, Apr 13, 2021 at 3:53 PM, Dianjin Wang wrote: > > > > > +1

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Rubens Rodrigues
Excellent, I agree On Tue, Apr 13, 2021 at 07:23, vino yang wrote: > +1 Excited by this new vision! > > Best, > Vino > > On Tue, Apr 13, 2021 at 3:53 PM, Dianjin Wang wrote: > > > +1 The new brand is straightforward, a better description of Hudi. > > > > Best, > > Dianjin Wang > > > > > > On Tue, Apr

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread vino yang
+1 Excited by this new vision! Best, Vino On Tue, Apr 13, 2021 at 3:53 PM, Dianjin Wang wrote: > +1 The new brand is straightforward, a better description of Hudi. > > Best, > Dianjin Wang > > > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha > wrote: > > > +1 . Cannot agree more. I think this makes total

Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Dianjin Wang
+1 The new brand is straightforward, a better description of Hudi. Best, Dianjin Wang On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha wrote: > +1 . Cannot agree more. I think this makes total sense and will provide for > a much better representation of the project. > > On Mon, Apr 12, 2021 at

Re: [DISCUSS] Hudi is the data lake platform

2021-04-12 Thread Bhavani Sudha
+1 . Cannot agree more. I think this makes total sense and will provide for a much better representation of the project. On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar wrote: > Hello all, > > Reading one more article today, positioning Hudi, as just a table format, > made me wonder, if we have

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-08 Thread Danny Chan
I tried a POC for Flink locally and it works well. In the PR I add a new metadata column named "_hoodie_change_flag", but I found that only the log format needs this flag, and Spark may not yet be able to handle the flag for incremental processing. So should I add the

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-01 Thread Danny Chan
Thanks, cool. The remaining questions are: - where do we record these changes? Should we add a built-in meta field such as _change_flag_, like the other system columns, e.g. _hoodie_commit_time? - what kind of table should keep these flags? In my view, we should only add these flags for

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread vino yang
>> Oops, the image crushes, for "change flags", i mean: insert, update(before and after) and delete. Yes, the image I attached is also about these flags. [image: image (3).png] +1 for the idea. Best, Vino On Thu, Apr 1, 2021 at 10:03 AM, Danny Chan wrote: > Oops, the image crushes, for "change flags", i

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread Danny Chan
Oops, the image is broken. By "change flags" I mean: insert, update (before and after) and delete. The Flink engine can propagate the change flags internally between its operators; if HUDI can send the change flags to Flink, the incremental calculation of CDC would be very natural (almost

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread vino yang
Hi Danny, Thanks for kicking off this discussion thread. Yes, incremental query (or "incremental processing") has always been an important feature of the Hudi framework. If we can make this feature better, it will be even more exciting. In the data warehouse, in some complex calculations,

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-10 Thread vino yang
Hi, I configured the lgtm service to scan my hudi repository (a mirror of the official apache-hudi). It found 50 alerts in the project, and I exported them into a file (SARIF format, attached). We can use "sarif-web-component"[1] to view it. Generally speaking,

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-05 Thread vino yang
OK, let me try to learn more about it and test it via one PR. On Fri, Mar 5, 2021 at 2:20 AM, nishith agarwal wrote: > I see, thanks Vino! > > "*Prevent bugs from ever making it to your project' - *That's an > extremely bold statement for anyone to make :) > > Like it mentions, although it tries to reduce the

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-04 Thread nishith agarwal
I see, thanks Vino! "*Prevent bugs from ever making it to your project*" - That's an extremely bold statement for anyone to make :) Like it mentions, although it tries to reduce the false positive rate, we probably still will get some noise. Can we try it with one of the PRs to see if it's worth

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-03 Thread vino yang
Hi, It did not provide much public information, but gave a description on the official website: *“Prevent bugs from ever making it to your project by using automated reviews that let you know when your code changes would introduce alerts into your project. We support GitHub and Bitbucket. We

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-03 Thread nishith agarwal
This is a good idea @vino yang Have you looked into what the "automated code review" actually does ? -Nishith On Wed, Mar 3, 2021 at 7:38 AM vino yang wrote: > Hi guys, > > I want to introduce a code analysis service called lgtm[1] in the > community. Recently, in the Kylin community, I

Re: [DISCUSS] Support multiple ordering fields

2021-03-03 Thread Danny Chan
ble in that > case. > > So > > > > overall a +1 from my side as well. > > > > > > > > On Tue, Jan 19, 2021 at 1:58 PM 刘金辉 <965147...@qq.com> wrote: > > > > > > > > > +1,Currently we have encountered such scenarios and loo

Re: [DISCUSS] Improve data locality during ingestion

2021-02-18 Thread Satish Kotha
t 12:52 PM > To: dev@hudi.apache.org > Subject: Re: [DISCUSS] Improve data locality during ingestion > Caution: This e-mail originated from outside of Philips, be careful for > phishing. > > > Good discussion points. > Basically on a high level, we are looking to propose > sub-

Re: [DISCUSS] Improve data locality during ingestion

2021-02-17 Thread Kizhakkel Jose, Felix
by the given sort key, thereby we could get the range pruning much better. Regards, Felix K Jose From: Sivabalan Date: Monday, February 15, 2021 at 12:52 PM To: dev@hudi.apache.org Subject: Re: [DISCUSS] Improve data locality during ingestion Caution: This e-mail originated from outside of Philips

Re: [DISCUSS] Improve data locality during ingestion

2021-02-15 Thread Sivabalan
actly what I would like to have. Thank you for seeking > > clarification. > > > > Regards, > > Felix K Jose > > From: Vinoth Chandar > > Date: Tuesday, February 9, 2021 at 10:44 PM > > To: dev > > Subject: Re: [DISCUSS] Improve data locality during ingestio

Re: [DISCUSS] Improve data locality during ingestion

2021-02-12 Thread Vinoth Chandar
larification. > > Regards, > Felix K Jose > From: Vinoth Chandar > Date: Tuesday, February 9, 2021 at 10:44 PM > To: dev > Subject: Re: [DISCUSS] Improve data locality during ingestion > Caution: This e-mail originated from outside of Philips, be careful for > phishing

Re: [DISCUSS] Improve data locality during ingestion

2021-02-10 Thread Kizhakkel Jose, Felix
Hi Vinoth, Yes that’s exactly what I would like to have. Thank you for seeking clarification. Regards, Felix K Jose From: Vinoth Chandar Date: Tuesday, February 9, 2021 at 10:44 PM To: dev Subject: Re: [DISCUSS] Improve data locality during ingestion Caution: This e-mail originated from

Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Vinoth Chandar
e. > > Regards, > Felix K Jose > From: Rubens Rodrigues > Date: Tuesday, February 9, 2021 at 8:36 PM > To: dev@hudi.apache.org > Subject: Re: [DISCUSS] Improve data locality during ingestion > Caution: This e-mail originated from outside of Philips, be careful for > phis

Re: [DISCUSS] Measure latency by storing event time in WriteStatus

2021-02-09 Thread Vinoth Chandar
Yes, there is definitely intent to support stream-stream joins between Hudi tables in the future. This can be used to implement really good watermarks (Structured Streaming/Flink/Beam). At least for me, this is why I created the project to begin with :)

Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Kizhakkel Jose, Felix
:36 PM To: dev@hudi.apache.org Subject: Re: [DISCUSS] Improve data locality during ingestion Caution: This e-mail originated from outside of Philips, be careful for phishing. Hi guys, Talking about my use case... I have datasets that ordering data by date makes a lot sense or ordering by some

Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Vinoth Chandar
Hi, We already support a user-defined custom partitioner for bulk insert. So you can actually control it whichever way you like, for the initial load. Thanks Vinoth On Tue, Feb 9, 2021 at 5:36 PM Rubens Rodrigues wrote: > Hi guys, > > Talking about my use case... > > I have datasets that

Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Rubens Rodrigues
Hi guys, Talking about my use case... I have datasets where ordering the data by date makes a lot of sense, or ordering by some id so that fewer files are touched on merge operations. In my use of Delta Lake I always bootstrapped tables ordered by one of these fields, and it helps a lot with file pruning. Hudi

Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Vinoth Chandar
Hi Satish, Been meaning to respond to this. I think I like the idea overall. Here's my (hopefully correct) understanding; let me know if I am getting this right. Predominantly, we are just talking about the problem of: where do we send the "inserts" to. Today the upsert partitioner does the file

Re: [DISCUSS] Support multiple ordering fields

2021-02-05 Thread Raymond Xu
; > > > > > > > > > > > > > --Original Message-- > > > > From: > > > > "dev" > > > >

Re: [DISCUSS] Measure latency by storing event time in WriteStatus

2021-02-05 Thread Raymond Xu
liujinhui and vinoth, Thank you for the input! I have created https://issues.apache.org/jira/browse/HUDI-1587 Yes, I think the min and max timing mitigates the late-arrival issue, which also should not happen that frequently, like for every other commit. I think the histogram is a cool idea and the

Re: [DISCUSS] Improve data locality during ingestion

2021-02-03 Thread Satish Kotha
I got some feedback that this thread may be a bit complex to understand. So I tried to simplify the proposal as below: Users can already specify 'partitionpath' using this config when writing data. My proposal is we also

Re: [DISCUSS] Rethink the abstraction of current client

2021-02-02 Thread Vinoth Chandar
Sorry for the late reply. Standard excuse: 0.7.0 release. +1 on the need to rethink this. Some comments on issues in this thread IMO. 1. Agree that the hierarchy has gotten much taller now, and we need to immediately pull back more code into hudi-client-common. IMO what we lack is some kind of

Re: [DISCUSS] Measure latency by storing event time in WriteStatus

2021-02-02 Thread Vinoth Chandar
+1 I was involved in a very similar design at my previous job. We could actually track both min and max event times. We used to call the min "latency" and the max "freshness" (i.e., it indicates that some data for these later time intervals is flowing in). It does not solve the issue liujinhui

Re: [DISCUSS] Rethink the abstraction of current client

2021-02-02 Thread vino yang
Hi, > I think the proposed interfaces indeed look more intuitive and could simplify the code structures. My concern is mostly around the ROI of such refactoring work. Probably I lack some direct involvement in the flink client work but it looks like it's mainly about code restructuring and

Re: [DISCUSS] Support multiple ordering fields

2021-01-31 Thread Vinoth Chandar
untered such scenarios and look forward > to > > > supporting > > > > > > > > > > > > > > > --Original Message-- > > > From: > > > "dev" > >

Re: [DISCUSS] Support multiple ordering fields

2021-01-20 Thread Raymond Xu
< > > danny0...@apache.org; > > Sent: Tuesday, January 19, 2021, 4:25 PM > > To: "dev" > > > Subject: Re: [DISCUSS] Support multiple ordering fields > > > > > > > > Wondering if we should just take a b

Re: [DISCUSS] Rethink the abstraction of current client

2021-01-19 Thread vino yang
>> For the Spark client, this is true, because whether it is the Spark or the Spark Streaming engine, they write in batches; but things are different for pure streaming engines like Flink: Flink writes per record, it does not accumulate buffers. Yes, what I mean about the "batch" is not about the behavior or

Re: [DISCUSS] Rethink the abstraction of current client

2021-01-19 Thread Danny Chan
> It contains three components: - Two objects: a table, a batch of records; For the Spark client, this is true, because whether it is the Spark or the Spark Streaming engine, they write in batches; but things are different for pure streaming engines like Flink: Flink writes per record, it does not

Re: [DISCUSS] Support multiple ordering fields

2021-01-19 Thread Pratyaksh Sharma
countered such scenarios and look forward to > supporting > > > > > --Original Message-- > From: > "dev" > < > danny0...@apache.or
