Re: New site/docs navigation

2021-10-28 Thread Vinoth Chandar
ho want to work together to translate? > Please contact me. > > > On Oct 28, 2021, at 20:35, Vinoth Chandar wrote: > > > > Hi all, > > > > https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the > > content, that can showcase all of the Hudi capabilities. P

New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Hi all, https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the content that can showcase all of the Hudi capabilities. Please chime in and help merge the PR. As a follow-on, we can also fix the Chinese site docs after this? Thanks Vinoth

Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Vinoth Chandar
On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote: > > Hi Nicolas, > > > > Thanks for raising this! I think it's a very valid ask. > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised. > > > > As a proof of concept, would you be able t

Re: Limitations of non unique keys

2021-10-28 Thread Vinoth Chandar
Hi, Are you asking if there are advantages to allowing duplicates or not having keys in your table? Having keys helps with other practical scenarios, in addition to what you called out. e.g: Oftentimes, you would want to backfill an insert-only table and you don't want to introduce duplicates

Re: Monthly or Bi-Monthly Dev meeting?

2021-10-22 Thread Vinoth Chandar
We could, but just need storage space over the longer term. :) On Wed, Oct 20, 2021 at 9:56 PM Raymond Xu wrote: > Timing looks ok. Are we going to record the sessions too? > > On Wed, Oct 20, 2021 at 7:17 PM Vinoth Chandar wrote: > > > I think we can do 7AM PST winte

Re: feature request/proposal: leverage bloom indexes for reading

2021-10-22 Thread Vinoth Chandar
Hi Nicolas, Thanks for raising this! I think it's a very valid ask. https://issues.apache.org/jira/browse/HUDI-2601 has been raised. As a proof of concept, would you be able to give filterExists() a shot and see if the filtering time improves?

Re: Monthly or Bi-Monthly Dev meeting?

2021-10-20 Thread Vinoth Chandar
I think we can do 7AM PST winters and 8AM summers. Will draft a page with a zoom link we can use and put up a PR. On Thu, Oct 14, 2021 at 9:48 AM Vinoth Chandar wrote: > Yes. I can do 7AM PST. Can others in PST chime in please? > > We can wrap this up this week. > > On Tue, Oc

Re: [Phishing Risk] [External] is there solution to solve hbase data screw issue

2021-10-18 Thread Vinoth Chandar
mit with BloomIndex, > if data is new , it may need to append them to the existing file group. > meanwhile may cause concurrent issue with async compaction thread if > compaction plan contains same file group,how Hudi avoid that? > > On Fri, Oct 15, 2021 at 12:50 AM Vinoth Chandar wrote: >

Re: [DISCUSS] Presto Plugin for Hudi

2021-10-17 Thread Vinoth Chandar
+1 here in general. Raised some points on the Trino thread. The same apply here as well. For Presto, would the new connector work with the Aria/RaptorX work done by Facebook engineers? On Sat, Oct 16, 2021 at 11:12 PM sagar sumit wrote: > Dear Hudi Community, > > I initiated a discussion thread

Re: [DISCUSS] Trino Plugin for Hudi

2021-10-17 Thread Vinoth Chandar
Hi Sagar, Thanks for the detailed write up. +1 on the separate connector in general. I would love to understand a few aspects which work really well for the Hive connector path (which is kind of why we did it this way to begin with) - what's the new user experience for users? With the hive plugin

Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-10-14 Thread Vinoth Chandar
n Presto/Trino, and then copy the modified Parquet > (i.e. output Parquet from Hudi dataset) from bucket2 to bucket1. *Will that > result in duplicate (if we are on Copy On Write mode)?* > > Besides the potential duplicate, any other pitfall that I need to pay > special attention to?

Release 0.10.0 planning

2021-10-14 Thread Vinoth Chandar
Hi all, It's time for our next release again! I have marked out some blockers here on JIRA. https://issues.apache.org/jira/projects/HUDI/versions/12350285 Quick highlights: - Metadata table v2, which is synchronously updated - Row writing (Spark) for all write operations - Kafka Connect for

Re: [Phishing Risk] [External] is there solution to solve hbase data screw issue

2021-10-14 Thread Vinoth Chandar
solution will work in the future,since > bloomindex cannot index mor log file,hence new insert data still write into > parquet ,that why I choose hbase index ,get better performance. > > On Tue, Oct 5, 2021 at 7:29 PM, Vinoth Chandar wrote: > > > +1 on that answer. It's pretty spot on. >

Re: Monthly or Bi-Monthly Dev meeting?

2021-10-14 Thread Vinoth Chandar
na. It might be a bit late > IMO. Does 3 PM UTC(7 AM PST in winter, 8 AM in summer) work? > > Best, > Gary > > On Tue, Oct 5, 2021 at 9:20 PM Pratyaksh Sharma > wrote: > > > Works for me in India :) > > > > On Tue, Oct 5, 2021 at 9:41 AM Vinoth Chandar wro

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-10-06 Thread Vinoth Chandar
itialization. > Based on the log I see, constructing the filesystem view of a partition > with 500 filegroups is taking 200ms. If the AppendHandle is only flushing a > few records to disk, the actual flush could be faster than filesystem view > construction. > > On Fri, Sep 24,

Re: [Phishing Risk] [External] is there solution to solve hbase data screw issue

2021-10-05 Thread Vinoth Chandar
+1 on that answer. It's pretty spot on. Even as random prefix helps with HBase balancing, the issue then becomes that you lose all the key ordering inside the Hudi table, which can be a nice thing if you even want range pruning/indexing to be effective. To paint a picture of all the work being
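A minimal sketch of the trade-off described above, assuming a simple salting scheme in which a hash-derived prefix is prepended to each record key. The function names, the `_` separator, and the salt count are hypothetical and are not taken from Hudi or HBase. It only illustrates that scanning in salted-key order generally no longer follows the original key order, which is what range pruning relies on.

```python
import zlib


def salted_key(key: str, num_salts: int = 16) -> str:
    """Prepend a hash-derived salt so writes spread across regions (hypothetical scheme)."""
    salt = zlib.crc32(key.encode("utf-8")) % num_salts
    return f"{salt:02x}_{key}"


def strip_salt(salted: str) -> str:
    """Recover the original key by dropping the salt prefix."""
    return salted.split("_", 1)[1]


# Original keys are naturally ordered, so a range such as
# user_0002..user_0005 is one contiguous slice.
keys = [f"user_{i:04d}" for i in range(8)]
salted = [salted_key(k) for k in keys]

# Sorting by the salted key groups rows by salt first, so neighbouring
# original keys can end up far apart in scan order.
scan_order = [strip_salt(s) for s in sorted(salted)]
print(scan_order)
```

With salting, a range query on the original key has to probe every salt bucket instead of reading one contiguous slice.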

Re: Monthly or Bi-Monthly Dev meeting?

2021-10-04 Thread Vinoth Chandar
> > > > > > > > > Monthly should be good. Been a long time since we connected in > these > > > > > meetings. :) > > > > > > > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar < > > > > > mail.vinoth.chan..

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Vinoth Chandar
when we try to commit very aggressively. The timeline server > and remote filesystem view are helpful, but I feel like there is still some > room for improvement. > > Best, > Gary > > On Fri, Sep 24, 2021 at 3:04 AM Vinoth Chandar wrote: > > > Hi Gary, > > > >

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Vinoth Chandar
riting the data. Is that possible? A > HoodieTable could have many Handles writing data at the same time and it > will look cleaner if we can keep the timeline and file system view inside > the table itself. > > Best, > Gary > > On Sat, Sep 18, 2021 at 12:06 AM Vinoth Chanda

Re: Monthly or Bi-Monthly Dev meeting?

2021-09-23 Thread Vinoth Chandar
1 hour monthly is what I was proposing to be specific. On Thu, Sep 23, 2021 at 6:30 AM Gary Li wrote: > +1 for monthly. > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar wrote: > > > Hi all, > > > > Once upon a time, we used to have a weekly community sync. Wond

Re: How to do apache hudi performance test?

2021-09-23 Thread Vinoth Chandar
Hi, Those numbers you see are from production at Uber, which I no longer have access to. So they are not synthetic numbers. I use my own little script for testing write performance - tpcds does not really have good support for updates/delete workloads. I am happy to throw it up, but I think we

Monthly or Bi-Monthly Dev meeting?

2021-09-23 Thread Vinoth Chandar
Hi all, Once upon a time, we used to have a weekly community sync. Wondering if there is interest in having a monthly or bi-monthly dev meeting? Agenda could be - Update/Summary of all dev work tracks - Show and tell, where people can present their ongoing work - Open floor discussions, bring up

Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

2021-09-22 Thread Vinoth Chandar
Hi, There is no format difference whatsoever. Hudi just adds additional footers for min, max key values and bloom filters to parquet and some meta fields for tracking commit times for incremental queries and keys. Any standard parquet reader can read the parquet files in a Hudi table. These

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-17 Thread Vinoth Chandar
nterest in major features > and > > > don't like to spend time in such foundational work. But as the project > > > scales, these foundational work will have a higher returns in the long > > run. > > > > > > On Wed, Sep 15, 2021 at 8:29 AM Vinoth Chan

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread Vinoth Chandar
Another +1, the HoodieData abstraction will go a long way in reducing LoC. Happy to work with you to see this through! I really encourage top contributors to the Flink and Java clients as well to actively review all PRs, given there are subtle differences everywhere. This will help us smoothly

Re: [ANNOUNCEMENT] CI changes

2021-09-07 Thread Vinoth Chandar
+1 this is a truly champion effort. On Tue, Sep 7, 2021 at 11:51 AM Sivabalan wrote: > Really great job Raymond! Good to see improvements on CI infra. Definitely > helps developer experience a lot better. > > > On Mon, Sep 6, 2021 at 9:19 PM vino yang wrote: > > > awesome! Great job! > > > >

Re: Apache Hudi release voting process

2021-08-22 Thread Vinoth Chandar
Hi all, First of all, it's great to see us debating around ensuring high quality, timely releases. Shows we have developers who care and are passionate around the project! Thanks for establishing the timelines, Siva. I would like to add the following data points that all 4 PRs in question

Re: [VOTE] Release 0.9.0, release candidate #2

2021-08-22 Thread Vinoth Chandar
+1 (binding) RC check [1] passed [1] https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b On Sun, Aug 22, 2021 at 1:28 PM Sivabalan wrote: > We can keep the specific discussion out of this voting thread. Have started > a new thread here > < >

Re: DISCUSS RFC RFC-32 Kafka Connect Sink for Hudi

2021-08-19 Thread Vinoth Chandar
+1 on this. Thanks for driving this! On Wed, Aug 18, 2021 at 4:44 PM Rajesh Mahindra wrote: > Hi All, > > We have a new RFC, RFC-32 > < > https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi > > > that > details the design and implementation of a Kafka Sink for

Re: [DISCUSS] Enable Github Discussions

2021-08-18 Thread Vinoth Chandar
> > > Best, > > > > Vino > > > > > > > > On Thu, Aug 12, 2021 at 2:16 AM, Pratyaksh Sharma wrote: > > > > > > > > > +1 > > > > > > > > > > I have never used it, but we can try this out. :) > > > > >

Re: [VOTE] Release 0.9.0, release candidate #1

2021-08-17 Thread Vinoth Chandar
at 7:48 AM Vinoth Chandar wrote: > -1 (binding) > > An issue was surfaced yesterday, that affects the re-definition of some > configs in HoodieWriteConfig e.g TABLE_NAME, AVRO_SCHEMA, > AVRO_SCHEMA_VALIDATE > which unfortunately do not have the _PROP suffix added. These are

Re: [VOTE] Release 0.9.0, release candidate #1

2021-08-17 Thread Vinoth Chandar
-1 (binding) An issue was surfaced yesterday that affects the redefinition of some configs in HoodieWriteConfig, e.g. TABLE_NAME, AVRO_SCHEMA, AVRO_SCHEMA_VALIDATE, which unfortunately do not have the _PROP suffix added. These are now redefined as ConfigProperty members and jobs using them

Re: How to read hudi files with Mapreduce?

2021-08-13 Thread Vinoth Chandar
Hi Jian, We have a hoodie-hadoop-mr package with some InputFormat. You can try using HoodieParquetInputFormat to read from a MR job. I have only tested with Hive this way myself. So wondering if anyone else here has real experience trying with MR itself. Thanks Vinoth On Wed, Aug 11, 2021 at

Re: Website redesign

2021-08-10 Thread Vinoth Chandar
> > the content as-is with few minor changes to the docusaurus platform, > > > then > > > > > we can make incremental changes to add more features. > > > > > > > > > > I'll be adding the search bar in the next iteration. > > >

Re: [DISCUSS] Hudi 0.9.0 Release

2021-08-05 Thread Vinoth Chandar
e, feel free to chime in. > > On Tue, Aug 3, 2021 at 8:10 PM Vinoth Chandar wrote: > > > Thanks Udit! I propose we set end of next week as a hard deadline for > > cutting the RC. Any thoughts? > > > > A good amount of progress is being made on these blockers, I think. >

Re: [DISCUSS] Hudi is the data lake platform

2021-08-04 Thread Vinoth Chandar
with - "Hudi brings transactions, record-level updates/deletes and change streams to data lakes" then explain the platform, in the next level of detail. https://github.com/apache/hudi/pull/3406 On Mon, Aug 2, 2021 at 9:39 AM Vinoth Chandar wrote: > Thanks! Will work on it this week. >

Re: [DISCUSS] Hudi 0.9.0 Release

2021-08-03 Thread Vinoth Chandar
Thanks Udit! I propose we set end of next week as a hard deadline for cutting the RC. Any thoughts? A good amount of progress is being made on these blockers, I think. On Tue, Aug 3, 2021 at 5:13 PM Udit Mehrotra wrote: > Hi Community, > > As we draw close to doing Hudi 0.9.0 release, I am

Re: [DISCUSS] Disable ASF GitHub Bot comments under the JIRA issue

2021-08-02 Thread Vinoth Chandar
+1 as well. Danny, please go ahead and remove, land this. On Sun, Aug 1, 2021 at 2:51 AM leesf wrote: > +1 to disable. > > On Wed, Jul 28, 2021 at 12:37 AM, Vinoth Chandar wrote: > > > Anybody with strong opinions to keep them? > > I am happy to go back to clicking to get to github li

Re: [DISCUSS] Hudi is the data lake platform

2021-08-02 Thread Vinoth Chandar
oop FileSystem compatible storage)." > > > > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar > wrote: > > > >> Thanks Vino! Got a bunch of emoticons on the PR as well. > >> > >> Will land this monday, giving it more time over the weekend as well.

Re: Long test run times

2021-07-30 Thread Vinoth Chandar
There is probably a good amount of low-hanging fruit; we can make some headway and see how it goes from there? On Thu, Jul 29, 2021 at 7:18 PM Danny Chan wrote: > What should we do for these long running tests ? Simplify them to more > simple UTs ? > > On Jul 30, 2021, Vinoth Chandar

Re: Long test run times

2021-07-29 Thread Vinoth Chandar
aStreamer > org.apache.hudi.spark3.internal.TestHoodieDataSourceInternalBatchWrite > org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer > org.apache.hudi.spark3.internal.TestHoodieBulkInsertDataInternalWriter > > > > > > > > > On Thu, Jul 29, 2021

Long test run times

2021-07-29 Thread Vinoth Chandar
Folks, Our tests are now exceeding 60 minutes and suffering timeouts on Azure (Travis is slow to actually schedule, but seems to finish on time). Following is the list of top slow tests. I am working from the top of the list. Anyone interested in chipping in? Please respond with the test you

Re: Website redesign

2021-07-29 Thread Vinoth Chandar
Folks, the PR is up! https://github.com/apache/hudi/pull/3366 Please review. This is truly heroic work, vingov, fixing all the broken links and cleaning up a lot of debt in the Jekyll-based theme! On Mon, Jul 12, 2021 at 10:48 PM Vinoth Chandar wrote: > Hi, > > Sounds good! Pl

Re: [DISCUSS] Disable ASF GitHub Bot comments under the JIRA issue

2021-07-27 Thread Vinoth Chandar
Anybody with strong opinions to keep them? I am happy to go back to clicking to get to github links. On Tue, Jul 27, 2021 at 6:33 AM xuedong luan wrote: > +1 > > On Tue, Jul 27, 2021 at 10:38 AM, Danny Chan wrote: > > > I found that there are many ASF GitHub Bot comments under our issue now, > it > > messes

Re: How to disable the ASF GitHub Bot comments under the issue ticket ?

2021-07-26 Thread Vinoth Chandar
Hi Danny, Worth discussing. It was turned on by adding "comment" here. https://github.com/apache/hudi/blob/master/.asf.yaml#L41 The intention is that all GH activity is reflected once you see a JIRA. Otherwise, one has to keep clicking back and forth. Maybe open this for discussion and see

Re: [DISCUSS] Hudi is the data lake platform

2021-07-23 Thread Vinoth Chandar
Thanks Vino! Got a bunch of emoticons on the PR as well. Will land this monday, giving it more time over the weekend as well. On Wed, Jul 21, 2021 at 7:36 PM vino yang wrote: > Thanks vc > > Very good blog, in-depth and forward-looking. Learned! > > Best, > Vino > > Vi

Re: [VOTE] Move content off cWiki

2021-07-23 Thread Vinoth Chandar
Vote is now closed. Vote passed with 13 +1s and no -1s. Thanks all! On Fri, Jul 23, 2021 at 7:31 AM Vinoth Chandar wrote: > +1 > > On Fri, Jul 23, 2021 at 2:41 AM Gary Li wrote: > >> +1 >> >> On Tue, Jul 20, 2021 at 8:06 PM vino yang wrote: >> >> >

Re: [VOTE] Move content off cWiki

2021-07-23 Thread Vinoth Chandar
+1 On Fri, Jul 23, 2021 at 2:41 AM Gary Li wrote: > +1 > > On Tue, Jul 20, 2021 at 8:06 PM vino yang wrote: > > > +1 > > > > On Tue, Jul 20, 2021 at 11:01 AM, Navinder Brar wrote: > > > > > +1 > > > Navinder > > > > > > > > > Sent from Yahoo Mail for iPhone > > > > > > > > > On Tuesday, July 20, 2021, 7:28

Re: [DISCUSS] Create Spark and Flink utilities module

2021-07-21 Thread Vinoth Chandar
il with `java.lang.NoSuchMethodError: > org.apache.parquet.column.ParquetProperties.getColumnIndexTruncateLength()` > > > Regards, > Vinay Patil > > > On Tue, Jul 20, 2021 at 9:46 PM Vinoth Chandar wrote: > > > Hi Vinay. > > > > Thanks for kicking this off. > > > >

Re: [DISCUSS] Hudi is the data lake platform

2021-07-21 Thread Vinoth Chandar
large preformance > optimization on query end. > Can continuous develop. > cache service may the necessary component in cloud native environment. > > On 2021/04/13 05:29:55, Vinoth Chandar wrote: > > Hello all, > > > > Reading one more article today, positioning Hu

Re: Hive integration Improvment

2021-07-20 Thread Vinoth Chandar
Thanks for this! Will review this week! On Thu, Jul 15, 2021 at 5:15 AM 18717838093 <18717838...@126.com> wrote: > > > Hi, experts. > > > Currently, Hudi sql statements for DML are executed by Hive Driver with > concatenation SQL statements in most cases. The way SQL is concatenated is > hard to

Re: [DISCUSS] Move to spark v2 datasource API

2021-07-20 Thread Vinoth Chandar
di_v2 > should also be an option. I need to explore more on these lines, but just > putting it out. > > Once we make some headway in this(by some spark expertise), I can > definitely contribute from my side on this project. > > > On Thu, Jul 15, 2021 at 12:13 AM Vinoth Chandar

Re: [DISCUSS] Create Spark and Flink utilities module

2021-07-20 Thread Vinoth Chandar
Hi Vinay. Thanks for kicking this off. I wonder if it's possible to structure the code in separate modules, but have a single bundle. Or is that a painful experience? (if so, can you share what issues we are running into?) We have rarely done backwards incompatible changes and users appreciate

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-19 Thread Vinoth Chandar
d more data points to convince myself > if > > GitHub issues will provide all the issue tracking functionality that Jira > > provides today. > > > > Thanks, > > Udit > > > > On Fri, Jul 16, 2021 at 2:33 PM Vinoth Chandar > wrote: > > > > > Lo

[VOTE] Move content off cWiki

2021-07-19 Thread Vinoth Chandar
Hi all, Starting a vote based on the DISCUSS thread here [1], to consolidate content from cWiki into Github wiki and project's master branch (for design docs). Please chime in with +1 - Approve the move -1 - Disapprove the move (please state your reasoning) The vote will use lazy consensus,

Re: Welcome New Committers: Pengzhiwei and DannyChan

2021-07-16 Thread Vinoth Chandar
Congrats both! Your impact is amazing! More miles to travel. Looking forward On Fri, Jul 16, 2021 at 4:43 PM 18717838093 <18717838...@126.com> wrote: > Congratulations! Well deserved! > > > > | | > 18717838093 > | > | > 18717838...@126.com > | > Signature customized by NetEase Mail Master > > > On 07/16/2021 19:50,wangxianghu

Welcome our PMC Member, Raymond Xu

2021-07-16 Thread Vinoth Chandar
Folks, I am incredibly happy to share the addition of Raymond Xu to the Hudi PMC. Raymond has been a valuable member of our community over the past few years now. Always hustling and taking on the most underappreciated, but extremely valuable aspects of the project, most recently with getting

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-16 Thread Vinoth Chandar
ld also document the labels in detail so that anyone > >>> looking to take a look at untriaged issues should know how/where to > look > >>> at. If we plan to use GH issues for all, I am sure there will be a lot > of > >>> proliferation of issues. > >>>

[DISCUSS] Move to spark v2 datasource API

2021-07-14 Thread Vinoth Chandar
Folks, As you may know, we still use the V1 API, given the flexibility it offers to further transform the dataframe after one calls `df.write.format()`, to implement a fully featured write pipeline with precombining, indexing, custom partitioning. The V2 API takes this away and rather provides a very

[DISCUSS] Enable Github Discussions

2021-07-14 Thread Vinoth Chandar
Hi all, I would like to propose that we explore the use of GitHub Discussions. A few other Apache projects have also been trying this out. Please chime in. Thanks Vinoth

Re: Release manager for 0.9.0

2021-07-14 Thread Vinoth Chandar
> Thanks, > Udit > > Sent from my iPhone > > > On Jul 14, 2021, at 7:07 PM, Vinoth Chandar wrote: > > > > Hi all, > > > > 0.9.0 is upon us. Any volunteers to drive this forward? > > > > Thanks > > Vinoth >

Release manager for 0.9.0

2021-07-14 Thread Vinoth Chandar
Hi all, 0.9.0 is upon us. Any volunteers to drive this forward? Thanks Vinoth

Re: Website redesign

2021-07-12 Thread Vinoth Chandar
on > this re-design. > > Best, > Vinoth > > > On Fri, Jul 2, 2021 at 6:45 PM Vinoth Chandar wrote: > > > At this point, scoping the work itself is a good first task, breaking > into > > sub tasks. > > > > I am willing to partner with someone closely, to

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-09 Thread Vinoth Chandar
Based on this, I will start consolidating more of the cWiki content to github wiki and master branch? JIRA vs GH Issue still probably needs more feedback. I do see the tradeoffs there. On Fri, Jul 9, 2021 at 2:39 AM wei li wrote: > +1 > > On 2021/07/02 03:40:51, Vinoth Chandar wrot

PSA : Rebase PRs before landing

2021-07-08 Thread Vinoth Chandar
Hi all, We had a large Config framework change that went in and since then there have been at least two master breaks, from not rebasing before landing. Please check if your PR changes any configs and if so you may need to rebase and rework before landing (even if there are no conflicts per se

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-08 Thread Vinoth Chandar
Any more strong opinions around these? On Mon, Jul 5, 2021 at 7:43 AM Vinoth Chandar wrote: > I had similar views on A actually. JIRA is pretty powerful, queryable. > But, I convinced myself on labelling and then building out dashboards using > SQL (for reports/analytics). > Stil

Re: [DISCUSS] scenario-based quickstart demo

2021-07-06 Thread Vinoth Chandar
Hi Raymond, Are you suggesting a fix to the dev workflow or general site/quickstart docs? Agree, that the current doc is all-at-once and at least better docs on incrementally testing parts could be useful. It takes a while to learn what to skip and what not to. Thanks Vinoth On Sat, Jul 3,

Re: [DISCUSS] Consolidate all dev collaboration to Github

2021-07-05 Thread Vinoth Chandar
> > > > > > > > On 03-Jul-2021, at 8:12 AM, vbal...@apache.org wrote: > > > > > > +1 for both A and B. Makes sense to centralize bug tracking and RFCs > > in github. > > > Balaji.V > > > > > > > > > On Friday,

Re: Website redesign

2021-07-02 Thread Vinoth Chandar
> Danny Chan > > On Thu, Jul 1, 2021 at 6:00 AM, Vinoth Chandar wrote: > > > Any volunteers? Also worth asking in slack? > > > > On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu > > wrote: > > > > > Hi all, > > > > > > We've completed a re-design of Hudi's

[DISCUSS] Consolidate all dev collaboration to Github

2021-07-01 Thread Vinoth Chandar
Hi all, When we incubated Hudi, we made some initial choices around collaboration tools. I am wondering if these are still optimal, given the scale of the community at this point. Specifically, two points. A) Our issue tracker is JIRA, while we just use Github Issues for support

Re: [DISCUSS] Hash Index for HUDI

2021-06-30 Thread Vinoth Chandar
I see that we already have a PR up. Will catch up on it and provide some initial comments. Thanks! On Wed, Jun 16, 2021 at 9:02 AM Shawy Geng wrote: > Combining bucket index and bloom filter is a great idea. There is no > conflict between the two in implementation, and the bloom filter info can

Re: Website redesign

2021-06-30 Thread Vinoth Chandar
Any volunteers? Also worth asking in slack? On Sat, Jun 26, 2021 at 5:03 PM Raymond Xu wrote: > Hi all, > > We've completed a re-design of Hudi's website (hudi.apache.org) , in the > goal of making the navigation more organized and information more > discoverable. The design document can be

Re: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-30 Thread Vinoth Chandar
This is now done, Thanks to Navi! Also checked that an update to doc is properly reflected after this On Fri, Jun 25, 2021 at 8:21 AM Vinoth Chandar wrote: > Thanks! Happy to jump in as needed! > > On Thu, Jun 24, 2021 at 12:46 PM Navinder Brar > wrote: > >> Hi Vinoth

Re: [HELP] unstable tests in the travis CI

2021-06-30 Thread Vinoth Chandar
uild?definitionId=3&_a=summary > > Looks like some flakiness only happened in Travis CI. Let's also keep > observing how it goes in Azure. > > On Wed, Jun 23, 2021 at 12:48 PM Vinoth Chandar wrote: > > > yes. CI is pretty flaky atm. There is a compiled list here > > ht

Re: Could Hudi Data lake support low latency, high throughput random reads?

2021-06-26 Thread Vinoth Chandar
exported periodically. > > Not sure if this is the right pattern, appreciated if you can point me to > any similar architecture that I could study. > > Best regards, > Bill > > On Wed, Jun 23, 2021 at 3:51 PM Vinoth Chandar wrote: > > > >>>>Maybe it is jus

Re: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-25 Thread Vinoth Chandar
NavinderOn Friday, 25 June, 2021, 12:23:26 am IST, Vinoth Chandar < > vin...@apache.org> wrote: > > Hi Navinder, > > Our site is pushed from the asf-site branch and it has a README with > building the site locally etc. that’s a good starting point. I don’t > believe th

Re: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-24 Thread Vinoth Chandar
> Thanks, > Navinder > > On Thursday, 24 June, 2021, 04:10:41 am IST, Vinoth Chandar < > vin...@apache.org> wrote: > > Hi all, > > Looks like this will apply to our site? Any volunteers to help fix this? > > Thanks > Vinoth > > -- Forwarded

Re: Could Hudi Data lake support low latency, high throughput random reads?

2021-06-23 Thread Vinoth Chandar
Maybe it is just not sane to serve online request-response service using Data lake as backend? In general, data lakes have not evolved beyond analytics, ML at this point, i.e optimized for large batch scans. Not to say that this cannot be possible, but I am skeptical that it will ever be as

Fwd: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-06-23 Thread Vinoth Chandar
Hi all, Looks like this will apply to our site? Any volunteers to help fix this? Thanks Vinoth -- Forwarded message - From: Daniel Gruno Date: Mon, May 31, 2021 at 6:41 AM Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st To: Users TL;DR:

Re: [HELP] unstable tests in the travis CI

2021-06-23 Thread Vinoth Chandar
Yes. CI is pretty flaky atm. There is a compiled list here https://issues.apache.org/jira/browse/HUDI-1248 Siva and I are looking into some of this and will try to get everything back to normal again. That schema evolution test, I have tried reproducing a few times, without luck. :/ On Wed, Jun 23,

Re: [Discuss] Provide a Flag to choose between Flink or Spark

2021-06-16 Thread Vinoth Chandar
+1 on this effort overall. It will be a little tricky, but doable. First thing is to see how we can replace raw usages of Spark APIs with HoodieEngineContext. It will be cool if we can completely generify DeltaStreamer, but I suspect we need Flink/Spark specific modules ultimately. On Wed, Jun

Re: Why hudi consider the Avro be the MOR's log format?

2021-06-15 Thread Vinoth Chandar
Hi, We wanted a row-based format to quickly log changes to the base files and flexibly compact the file groups we wanted. If we wrote parquet, for example, we would incur the cost of writing parquet (which can be up to 10x even) once during ingest and once again during compaction. Of course, this trades off

Re: [DISCUSS] Hash Index for HUDI

2021-06-04 Thread Vinoth Chandar
Thanks for opening the RFC! At first glance, it seemed similar to RFC-08, but the proposal seems to be adding a bucket id to each file group ID? If I may suggest, we should call this BucketedIndex? Instead of changing the existing file name, can we simply assign the filegroupID as the hash mod
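The "hash mod" suggestion above can be sketched as follows. This is only an illustration under stated assumptions: the CRC32 hash, the function names, and the zero-padded ID format are hypothetical choices made for the example, not the scheme proposed in the RFC or implemented in Hudi.

```python
import zlib


def bucket_for_key(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket id with a stable hash modulo the bucket count."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_group_id(bucket: int) -> str:
    """Hypothetical naming: use the zero-padded bucket id directly as the file group ID."""
    return f"{bucket:08d}"


# The same key always lands in the same file group, so a writer can locate
# the target file group without consulting a separate index.
bucket = bucket_for_key("uuid-1234", num_buckets=4)
print(file_group_id(bucket))
```

Because the mapping depends on the bucket count, changing the number of buckets would move keys between file groups, which is one design consideration for this style of index.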

Welcome new committers and PMC Members!

2021-05-11 Thread Vinoth Chandar
Hello all, Please join me in congratulating our newest set of committers and PMCs. *Wenning Ding (Committer) * Wenning has been a consistent contributor to Hudi, over the past year or so. He has added some critical bug fixes, lots of good contributions around Spark! *Gary Li (PMC Member) * Gary

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-28 Thread Vinoth Chandar
:06, par3 > > with streaming query "select name, sum(age) from t1 group by name" returns: > > change_flag | name | age_sum > I, Danny, 24 > I Stephen, 34 > > The result is the same as a batch snapshot query. > > Best, > Danny Chan > > Vi

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
ithout the new metadata columns, I > think there is no need to add a config option which brings in > unnecessary overhead. If we do not ensure backward compatibility for new > column, then we should add such a config option and by default > disable it. > > Best, > Danny Ch

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
erator that has accumulate state can handle > >> retractions(UPDATE_BEFORE or > >> >> DELETE) then apply new changes (INSERT or UPDATE_AFTER), so that each > >> >> operator can consume the CDC format messages in streaming way. > >> >> > >> >

Re: Re[2]:Re: About re-run Travis CI

2021-04-20 Thread Vinoth Chandar
> empty commit with the message only. > > > > > https://stackoverflow.com/questions/20138640/pushing-empty-commits-to-remote > > > >Susu > > > >Friday, April 16, 2021 19:35 +0900 from flin...@126.com >: > >>Hi,Vinoth Chandar,Thanks you for your reply. &

Re: [DISCUSS] Hudi is the data lake platform

2021-04-19 Thread Vinoth Chandar
is rebranding (and hopefully some code/package structuring down > the > > road..), it's easier for us to communicate the value add of Hudi and its > > associated features and generate interest for future contributors. > > > > Thanks, > > Nishith > > > > >

Re: PR Tracker board

2021-04-19 Thread Vinoth Chandar
t 9:40 AM Vinoth Chandar wrote: > Hi all, > > I know we have a build up of great contributions :) [great problem to > have], that kind of exceeded our existing triaging processes. > > So, in order to generate more transparency into the review process and > understand where PRs are i

Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-19 Thread Vinoth Chandar
ace it with HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP.key() I think this is a small cost we can take, in return for much better docs and maintainability. On Mon, Apr 19, 2021 at 1:16 PM Vinoth Chandar wrote: > +1 from me. Long time coming. > > On Mon, Apr 19, 2021 at 12:02 PM Ding, Wenn

Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-19 Thread Vinoth Chandar
+1 from me. Long time coming. On Mon, Apr 19, 2021 at 12:02 PM Ding, Wenning wrote: > Hi, > I planned to refactor the current Hudi configuration framework. lamberken< > https://github.com/lamberken> did similar things before: > https://github.com/apache/hudi/pull/1094 and I’d like to continue

PR Tracker board

2021-04-19 Thread Vinoth Chandar
Hi all, I know we have a build up of great contributions :) [great problem to have], that kind of exceeded our existing triaging processes. So, in order to generate more transparency into the review process and understand where PRs are in the pipeline, made a tracker board here

Re: About re-run Travis CI

2021-04-15 Thread Vinoth Chandar
If you leave a comment "rerun tests", I think there is a bot that also kicks it off again. Please report if that still works, and if possible kindly send us a PR to update the contributing page with this info :) On Thu, Apr 15, 2021 at 9:56 PM Vinoth Chandar wrote: > Hi Roc, &

Re: About re-run Travis CI

2021-04-15 Thread Vinoth Chandar
Hi Roc, You should be able to click the travis build, and restart from the travis-ci page. Thanks Vinoth On Thu, Apr 15, 2021 at 8:00 PM Roc Marshal wrote: > Hello, all. > In some cases, the Travis CI test failed shown github-PR page. > How to rerun Travis CI? > Thank you . >

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-15 Thread Vinoth Chandar
Hi, Is the intent of the flag to convey whether an insert, delete, or update changed the record? If so, I would imagine that we do this even for COW tables, since that also supports a logical notion of a change stream using the commit_time meta field. You may be right, but I am trying to understand the

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-15 Thread Vinoth Chandar
dnesday, April 14, 2021 at 3:49 PM > > To: dev > > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table > > Caution: This e-mail originated from outside of Philips, be careful for > > phishing. > > > > > > Felix, > > > > Happy to h

Re: I want to contribute to Apache Hudi.

2021-04-15 Thread Vinoth Chandar
Done. You should have access now. On Thu, Apr 15, 2021 at 1:27 AM 蒋龙 wrote: > Hi, > > I want to contribute to Apache Hudi. Would you please give me the > contributor permission? My JIRA ID is > > Username: long jiang > Full name: john > >

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Vinoth Chandar
> To: dev > > Subject: Re: GDPR deletes and Consenting deletes of data from hudi table > > Caution: This e-mail originated from outside of Philips, be careful for > > phishing. > > > > > > Felix, > > > > Happy to help you through trying and rolling out

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Vinoth Chandar
Hi Felix, Most people, I think, publish this data into Kafka and apply the deletes as a part of the streaming job itself. This works because typically only a small fraction of users leave the service (say << 0.1% weekly is what I have heard). So, the cost of storage on
