Re: [Hudi Improvement]: Modification of partition path format to support simplified queries

2019-08-13 Thread vbal...@apache.org
Hi Pratyaksh, The partitioning format is pluggable in Hudi. 1. For Hudi writing, you can simply use one of the several implementations of org.apache.hudi.KeyGenerator, or write your own implementation to control the partition path format. You can configure partition-path using
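As a rough illustration of the plug point described above (not from the original mail), here is a minimal sketch of a custom key generator that turns a created_at timestamp into a yyyy/MM/dd partition path. It assumes the org.apache.hudi.KeyGenerator abstract class takes TypedProperties in its constructor and exposes getKey(GenericRecord) returning a HoodieKey, as in Hudi releases of that era; the record fields "id" and "created_at" and the exact import packages are illustrative assumptions and may differ between versions.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.KeyGenerator;
import org.apache.hudi.common.model.HoodieKey;        // package location may vary by release
import org.apache.hudi.common.util.TypedProperties;   // package location may vary by release

// Sketch of a custom key generator producing slash-separated date partition paths.
public class SlashDatePartitionKeyGenerator extends KeyGenerator {

  private static final DateTimeFormatter INPUT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S");
  private static final DateTimeFormatter OUTPUT =
      DateTimeFormatter.ofPattern("yyyy/MM/dd");

  public SlashDatePartitionKeyGenerator(TypedProperties props) {
    super(props);
  }

  @Override
  public HoodieKey getKey(GenericRecord record) {
    // "id" and "created_at" are hypothetical field names for this example.
    String recordKey = record.get("id").toString();
    String createdAt = record.get("created_at").toString();  // e.g. "2019-08-13 10:15:30.0"
    String partitionPath = LocalDateTime.parse(createdAt, INPUT).format(OUTPUT);
    return new HoodieKey(recordKey, partitionPath);
  }
}
```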

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
>> Currently Spark Streaming micro-batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling, etc.: 1 Spark micro-batch = 1 Hudi commit. With the per-record model in Flink, I am not sure how useful it will be to support Hudi.. for e.g., 1 input record cannot be 1 Hudi

Re: [DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-13 Thread vino yang
Hi guys, Thanks for agreeing with this proposal. To Vinoth: > I also suggest we add a new component in JIRA with a few volunteers to help review PRs that come in this area? +1, yes, we really need a new component in JIRA. Best, Vino. Y. Ethan Guo wrote on Wed, Aug 14, 2019 at 2:22 AM: > +1 I can also help

Re: Unable to subscribe to slack group

2019-08-13 Thread Pratyaksh Sharma
Hi Vinoth, I have commented my email id on the mentioned GitHub issue. Sure, I will update the documentation. On Tue, Aug 13, 2019 at 11:41 PM Vinoth Chandar wrote: > Hi Pratyaksh, > > We have pre-approved anyone with an @apache.org email and a few others.. > Typically,

Re: [DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-13 Thread Y. Ethan Guo
+1 I can also help with the Chinese version of the docs. On Tue, Aug 13, 2019 at 11:08 AM Vinoth Chandar wrote: > +1 Thanks for starting this initiative, Vino. > > I also suggest we add a new component in JIRA with a few volunteers to help > review PRs that come in this area? > > On Tue, Aug 13,

Re: Contributing to Apache Hudi

2019-08-13 Thread Vinoth Chandar
Done! On Tue, Aug 13, 2019 at 6:44 AM leesf wrote: > Hi, > > I want to contribute to Apache Hudi. > Would you please give me the contributor permission? > My JIRA ID is xleesf. > > leesf wrote on Tue, Aug 13, 2019 at 9:42 PM: > > Hi, > > > > I want to contribute to Apache Calcite. > > Would you please give

Re: Unable to subscribe to slack group

2019-08-13 Thread Vinoth Chandar
Hi Pratyaksh, We have pre-approved anyone with an @apache.org email and a few others.. Typically, https://github.com/apache/incubator-hudi/issues/143 is used for reporting the email to be added.. Can you provide your email there and we will add you in. P.S: I realize there is a documentation gap on

Re: [DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-13 Thread Vinoth Chandar
+1 Thanks for starting this initiative, Vino. I also suggest we add a new component in JIRA with a few volunteers to help review PRs that come in this area? On Tue, Aug 13, 2019 at 9:02 AM Gary Li wrote: > +1 This is a great idea. I think there is also some room for improvement > for the

Re: [DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-13 Thread Gary Li
+1 This is a great idea. I think there is also some room for improvement for the English version as well. Some of my colleagues are very interested in Hudi, but they found the documentation a little bit challenging to understand. Same for me when I first started to work on Hudi. I am

[DISCUSS] Promote Hudi Chinese Documentation into the official website

2019-08-13 Thread vino yang
Currently, Hudi has not gained much attention in China, partly because of the lack of Chinese resources and documents. I personally think we should add more documentation and support multiple languages. Just as Flink has official Chinese documentation[1], this can quickly let Chinese developers know

Re: Unable to subscribe to slack group

2019-08-13 Thread Kuo, Shinray
Leave a comment here: https://github.com/apache/incubator-hudi/issues/143 and the team will get back to you shortly! Shinray K. On 8/13/19, 2:25 AM, "Pratyaksh Sharma" wrote: Hi, I was going through the pre-requisites here

Re: [DISCUSS] Refactor the package name of Hudi

2019-08-13 Thread Vinoth Chandar
Thanks! Def glad to get this done. Credit should go to Balaji :) My 2c is that it's okay for classes to remain how they are. There are diminishing returns to doing that, and “Hoodie” is the suggested pronunciation :) anyway. On Sun, Aug 11, 2019 at 6:23 PM vino yang wrote: > Hi Vinoth, > > Thanks

Re: Contributing to Apache Hudi

2019-08-13 Thread leesf
Hi, I want to contribute to Apache Hudi. Would you please give me the contributor permission? My JIRA ID is xleesf. leesf wrote on Tue, Aug 13, 2019 at 9:42 PM: > Hi, > > I want to contribute to Apache Calcite. > Would you please give me the contributor permission? > My JIRA ID is xleesf. >

Contributing to Apache Hudi

2019-08-13 Thread leesf
Hi, I want to contribute to Apache Calcite. Would you please give me the contributor permission? My JIRA ID is xleesf.

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
Hi Nick and Taher, I just want to answer Nishith's question. Quoting his earlier description here: > You can do a parallel investigation while we are deciding on the module structure. You could be looking at all the patterns in Hudi's Spark API usage (RDD/DataSource/SparkContext) and see if such

[Hudi Improvement]: Modification of partition path format to support simplified queries

2019-08-13 Thread Pratyaksh Sharma
Hi, I have been working on Hudi for some time and have an improvement suggestion. When we build a CDC pipeline, generally the field used for partitioning is a date (created_at), and the general format of created_at is yyyy-MM-dd HH:mm:ss.S. If we have this field formatted to yyyy/MM/dd, then
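For illustration only (not part of the original mail), a tiny sketch of the proposed reformatting: converting a created_at value in yyyy-MM-dd HH:mm:ss.S form into a slash-separated yyyy/MM/dd partition path with java.time.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class PartitionPathFormat {
  public static void main(String[] args) {
    DateTimeFormatter in = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S");
    DateTimeFormatter out = DateTimeFormatter.ofPattern("yyyy/MM/dd");
    String createdAt = "2019-08-13 10:15:30.0";
    // Prints "2019/08/13", which lines up with day-level partition pruning in queries.
    System.out.println(LocalDateTime.parse(createdAt, in).format(out));
  }
}
```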

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread taher koitawala
Hi Vino, According to what I've seen, Hudi has a lot of Spark components flowing through it, like TaskContexts, JavaSparkContexts, etc. The main classes I guess we should focus on are HoodieTable and the Hoodie write clients. Also Vino, I don't think we should be providing Flink dataset

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
Hi all, After doing some research, let me share what I found: - Limitation of computing engine capabilities: Hudi uses Spark's RDD#persist, and Flink currently has no API to cache datasets. Maybe we can only choose to use external storage or not use a cache? For the use of other
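For context, here is a minimal standalone sketch (not Hudi code) of the Spark-specific caching pattern referred to above: persisting an intermediate JavaRDD so several actions can reuse it without recomputation, which is the kind of engine capability the decoupling discussion has to account for.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("persist-example").setMaster("local[*]");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      JavaRDD<Integer> records = jsc.parallelize(Arrays.asList(1, 2, 3, 4));
      // Cache the intermediate result so the two actions below do not recompute the map step.
      JavaRDD<Integer> doubled = records.map(x -> x * 2)
          .persist(StorageLevel.MEMORY_AND_DISK_SER());
      long count = doubled.count();
      int sum = doubled.reduce(Integer::sum);
      System.out.println("count=" + count + ", sum=" + sum);
    }
  }
}
```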