Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-10-02 Thread Vinoth Chandar
Based on some conversations I had with Flink folks, including Hudi's very own mentor Thomas, it seems future-proof to look into supporting the Flink streaming APIs. The batch APIs, IIUC, will move towards converging with the streaming APIs, which matches Hudi's model anyway. From Hudi's perspective,

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-26 Thread Taher Koitawala
Hi Vinoth, IMHO we should stick to Spark for micro batching for 2 reasons. 1: Ease of use. 2: Performance. Flink batch is not as fast as Spark. Also, the rich library of functions and the ease of integration with Hive etc. that Spark has are not there in Flink batch. Regards, Taher

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-25 Thread Taher Koitawala
Hi Vino, Agree with your suggestion. We all know that even though Flink is streaming, we can control how files get rolled out through checkpointing configuration. With a bad config, small files get rolled out; with a good config, files are properly sized. Also I understand the concern of

Re: [DISCUSS] Decouple Hudi and Spark (in wiki design page)

2019-09-25 Thread vino yang
Hi, A simple example. In the Hudi project, you can find many code snippets like `spark.read().format().load()`. The load method can take any path, especially DFS paths. However, if we only want to use Flink streaming, there is no good way to read HDFS now. In addition, we also need to consider other
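Vino's point about `spark.read().format().load()` can be sketched as an engine-neutral reader interface. This is a minimal, hypothetical sketch — none of these names are actual Hudi or Flink APIs — stubbed with an in-memory implementation so it runs without either engine:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical engine-agnostic reader interface; names are illustrative,
// not actual Hudi APIs.
interface RecordReader {
    List<String> read(String path);
}

// In Spark this would delegate to spark.read().format(...).load(path);
// here it is stubbed with an in-memory map so the sketch is runnable.
class InMemoryReader implements RecordReader {
    private final Map<String, List<String>> files;
    InMemoryReader(Map<String, List<String>> files) { this.files = files; }
    public List<String> read(String path) {
        return files.getOrDefault(path, List.of());
    }
}

public class ReaderSketch {
    public static void main(String[] args) {
        RecordReader reader = new InMemoryReader(
            Map.of("hdfs://warehouse/t1", Arrays.asList("r1", "r2")));
        // Calling code depends only on the interface, so a Flink-backed
        // implementation could be swapped in without touching callers.
        if (reader.read("hdfs://warehouse/t1").size() != 2) throw new AssertionError();
        System.out.println("ok");
    }
}
```

The point of the indirection is that the many `spark.read()` call sites the mail mentions would collapse into one interface, leaving a single place to supply a Flink-backed reader.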

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi Vino, This is not a design for Hudi on Flink; it was simply a mock-up of porting the tagLocations() Spark cache to Flink state, as Vinoth wanted to see. As for Flink batch and streaming, I am well aware of Flink's batch and stream unification efforts. However I think that is still on

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread vino yang
Hi Taher, As I mentioned in the previous mail, things may not be too easy with just the Flink state API. Copied here: "Hudi can connect with many different Source/Sinks. Some file-based reads are not appropriate for Flink Streaming." Although unifying batch and streaming is Flink's goal, it

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-24 Thread Taher Koitawala
Hi All, Sample code to see how record tagging will be handled in Flink is posted on [1]. The main class to run is MockHudi.java, with a sample path for checkpointing. As of now this is just a sample to show how we could be caching in Flink state with bare-minimum configs.

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-21 Thread Vinay Patil
Hi Taher, I agree with this; if the state is becoming too large, we should have an option of storing it in an external state backend like the file system or RocksDB. @Vinoth Chandar can the state of HoodieBloomIndex go beyond 10-15 GB? Regards, Vinay Patil On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-20 Thread Taher Koitawala
Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex with HeapState, RocksDBState and FsState, but on Spark. On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala wrote: > Hi Vinoth, > Having seen the doc and code. I understand the > HoodieBloomIndex mainly caches key

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-17 Thread Taher Koitawala
Hi Vinoth, Having seen the doc and code, I understand the HoodieBloomIndex mainly caches key and partition path. Can we address how Flink does it? Like, have HeapState where the user chooses to cache the index on heap, RocksDBState where indexes are written to RocksDB and finally
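The pluggable-state idea above can be sketched roughly as follows — a hypothetical interface (these names are illustrative, not Hudi or Flink APIs) mapping record key to partition path, with a heap-backed implementation; RocksDB- or filesystem-backed variants would implement the same contract:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical pluggable index-state interface (illustrative names, not
// Hudi or Flink APIs): record key -> partition path, the mapping that
// HoodieBloomIndex caches during tagging.
interface IndexState {
    void put(String recordKey, String partitionPath);
    Optional<String> get(String recordKey);
}

// Heap-backed variant; a RocksDB- or filesystem-backed variant would
// implement the same interface for state that outgrows memory.
class HeapIndexState implements IndexState {
    private final Map<String, String> state = new HashMap<>();
    public void put(String k, String p) { state.put(k, p); }
    public Optional<String> get(String k) { return Optional.ofNullable(state.get(k)); }
}

public class IndexStateSketch {
    // Tagging: an incoming key is an update if the state already holds it,
    // otherwise it is an insert.
    static boolean isUpdate(IndexState state, String recordKey) {
        return state.get(recordKey).isPresent();
    }

    public static void main(String[] args) {
        IndexState state = new HeapIndexState();
        state.put("uuid-1", "2019/08/13");
        if (!isUpdate(state, "uuid-1")) throw new AssertionError();
        if (isUpdate(state, "uuid-2")) throw new AssertionError();
        System.out.println("ok");
    }
}
```

With this shape, the heap/RocksDB/filesystem choice Taher and Vinay discuss becomes a configuration knob rather than a change to the tagging logic.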

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
Alright then. Happy to take the lead here. But please give me a week or so, to finish up the spark bundling and other jar issues.. Too much context switching :) On Mon, Sep 16, 2019 at 6:57 PM vino yang wrote: > Hi guys, > > Currently, I am busy with HUDI-203[1] and other things. > > I agree

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vino yang
Hi guys, Currently, I am busy with HUDI-203[1] and other things. I agree with Vinoth that we should try to find a new solution to decouple the dependency with the Spark RDD cache. It's an excellent way to start this big work. [1]: https://issues.apache.org/jira/browse/HUDI-203

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread vbal...@apache.org
+1 This is a pretty large undertaking. While the community is getting their hands dirty and ramping up on Hudi internals, it would be productive if Vinoth shepherds this. Balaji.V On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar wrote: sg. :) I will wait for others on

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
sg. :) I will wait for others on this thread as well to chime in. On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala wrote: > Vinoth, I think right now given your experience with the project you should > be scoping out what needs to be done to take us there. So +1 for giving you > more work :) >

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Taher Koitawala
Vinoth, I think right now given your experience with the project you should be scoping out what needs to be done to take us there. So +1 for giving you more work :) We want to reach a point where we can start scoping out addition of Flink and Beam components within. Then I think will tremendous

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Vinoth Chandar
I still feel the key thing here is reimplementing HoodieBloomIndex without needing Spark caching. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global) documents the Spark DAG in detail. If everyone feels it's best for me to scope the work out, then happy

Re: [DISCUSS] Decouple Hudi and Spark

2019-09-16 Thread Taher Koitawala
Guys, I think we are slowing down on this again. We need to start planning small tasks towards this. VC, please can you help fast-track this? Regards, Taher Koitawala On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar wrote: > Look forward to the analysis. A key class to read would be >

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-14 Thread Vinoth Chandar
Look forward to the analysis. A key class to read would be HoodieBloomIndex, which uses a lot of spark caching and shuffles. On Tue, Aug 13, 2019 at 7:52 PM vino yang wrote: > >> Currently Spark Streaming micro batching fits well with Hudi, since it > amortizes the cost of indexing, workload

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
>> Currently Spark Streaming micro batching fits well with Hudi, since it amortizes the cost of indexing, workload profiling etc. 1 spark micro batch = 1 hudi commit. With the per-record model in Flink, I am not sure how useful it will be to support hudi.. for e.g., 1 input record cannot be 1 hudi
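The "1 micro batch = 1 hudi commit" mapping above can be illustrated with a toy batcher that folds a per-record stream into commit-sized batches. This is purely illustrative (not Hudi code), and it uses a record count where a real Flink pipeline would more likely align commits to checkpoint barriers:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hudi code): a per-record stream is buffered and
// flushed as one "commit" per batch, mimicking how a checkpoint boundary in
// Flink could play the role of a Spark micro-batch.
public class CommitBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int commits = 0;

    CommitBatcher(int batchSize) { this.batchSize = batchSize; }

    void onRecord(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            commit();
        }
    }

    // In a real pipeline this would be driven by a checkpoint barrier or a
    // timer, not a record count.
    void commit() {
        commits++;       // one batch -> one commit on the timeline
        buffer.clear();
    }

    int commitCount() { return commits; }

    public static void main(String[] args) {
        CommitBatcher batcher = new CommitBatcher(3);
        for (int i = 0; i < 7; i++) batcher.onRecord("r" + i);
        // 7 records with batch size 3 -> 2 completed commits, 1 record pending
        if (batcher.commitCount() != 2) throw new AssertionError();
        System.out.println("commits=" + batcher.commitCount());
    }
}
```

This also shows why one input record cannot reasonably be one hudi commit: the amortization of indexing and profiling only happens across a batch.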

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
Hi Nick and Taher, I just want to answer Nishith's question. Reference his old description here: > You can do a parallel investigation while we are deciding on the module structure. You could be looking at all the patterns in Hudi's Spark APIs usage (RDD/DataSource/SparkContext) and see if such

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread taher koitawala
Hi Vino, According to what I've seen, Hudi has a lot of Spark components flowing through it, like TaskContexts, JavaSparkContexts, etc. The main classes I guess we should focus on are HoodieTable and the Hoodie write clients. Also Vino, I don't think we should be providing Flink dataset

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-13 Thread vino yang
Hi all, After doing some research, let me share my information: - Limitation of computing engine capabilities: Hudi uses Spark's RDD#persist, and Flink currently has no API to cache datasets. Maybe we can only choose to use external storage or not use a cache? For the use of other

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-07 Thread taher koitawala
Thanks a ton Vinoth. On Wed, Aug 7, 2019 at 4:34 PM Vinoth Chandar wrote: > >>Are there some tasks I can take up to ramp up the code? > Certainly. There are some open tasks that touch the hoodie-client and > hoodie-utilities module. > https://issues.apache.org/jira/browse/HUDI-37 >

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-07 Thread Vinoth Chandar
>>Are there some tasks I can take up to ramp up the code? Certainly. There are some open tasks that touch the hoodie-client and hoodie-utilities module. https://issues.apache.org/jira/browse/HUDI-37 https://issues.apache.org/jira/browse/HUDI-194 https://issues.apache.org/jira/browse/HUDI-145

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread nishith agarwal
+1 for Approach 1 Point integration with each framework. Pros for point integration - Hudi community is already familiar with spark and spark based actions/shuffles etc. Since both modules can be decoupled, this enables us to have a steady release for Hudi for 1 execution engine (spark) while we

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread vbal...@apache.org
+1 on approach 1. As pointed out approach 2 has a risk for performance regression when introducing beam abstraction. To keep things simpler and start iterating, we can try an incremental route where beam can be thought of another engine supporting Hudi. When there is material confidence that

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread taher koitawala
Hi Vinoth, Are there some tasks I can take up to ramp up the code? Want to get more used to the code and understand the existing implementation better. Thanks, Taher Koitawala On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar wrote: > Let's see if others have any thoughts as well. We can

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread Vinoth Chandar
Let's see if others have any thoughts as well. We can plan to fix the approach by EOW. On Mon, Aug 5, 2019 at 7:06 PM vino yang wrote: > Hi guys, > > Also, +1 for Approach 1 like Taher. > > > If we can do a comprehensive analysis of this model and come up with means > to refactor this

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-05 Thread taher koitawala
+1 for Approach 1, point integration with each framework. Approach 2 has a problem, as you said: "Developers need to think about what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may not be the panacea that it seems to be" We have seen various pipelines in the beam dag being

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-05 Thread Vinoth Chandar
Great discussions! Responded on the original thread on decoupling.. Let's continue there? On Mon, Aug 5, 2019 at 1:39 AM Semantic Beeng wrote: > "design is more important. When we have a clear idea, it is not too late > to create an issue" > > 100% with Vino > > > On August 5, 2019 at 2:50 AM

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-05 Thread Vinoth Chandar
Would like to highlight that there are two distinct approaches here with different tradeoffs. Think of this as my braindump, as I have been thinking about this quite a bit in the past. *Approach 1 : Point integration with each framework * >>We may need a pure client module named for example
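Approach 1's "pure client module" could look roughly like the following — a hypothetical engine-neutral context (names invented for illustration, not actual Hudi APIs) that the write client programs against, with each framework supplying its own implementation. A local single-JVM stub stands in here so the sketch runs:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch of the "pure client module" idea: client code depends
// only on an engine-neutral context. Names are illustrative, not Hudi APIs.
interface EngineContext {
    <I, O> List<O> map(List<I> data, Function<I, O> fn);
}

// Local (single-JVM) engine; a Spark implementation would use
// JavaSparkContext.parallelize(...).map(...), a Flink one a DataStream map.
class LocalEngineContext implements EngineContext {
    public <I, O> List<O> map(List<I> data, Function<I, O> fn) {
        return data.stream().map(fn).collect(Collectors.toList());
    }
}

public class EngineContextSketch {
    public static void main(String[] args) {
        EngineContext ctx = new LocalEngineContext();
        // The "client" logic below never touches Spark or Flink classes.
        List<Integer> sizes = ctx.map(List.of("a", "bb", "ccc"), String::length);
        if (!sizes.equals(List.of(1, 2, 3))) throw new AssertionError();
        System.out.println(sizes);
    }
}
```

Under this shape, each point integration (Spark, Flink, Beam) is one implementation of the context, which matches the trade-off Vinoth describes: per-engine code stays per-engine, while the table/index logic stays shared.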

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-05 Thread vino yang
Hi Taher, IMO, Let's listen to more comments, after all, this discussion took place over the weekend. Then listen to Vinoth and the community's comments and suggestions. I personally think that design is more important. When we have a clear idea, it is not too late to create an issue. I am

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-05 Thread taher koitawala
If everyone agrees that we should decouple Hudi and Spark to enable processing engine abstraction. Should I open a jira ticket for that? On Sun, Aug 4, 2019 at 6:59 PM taher koitawala wrote: > If anyone wants to see a Flink Streaming pipeline here is a really small > and basic Flink pipeline. >

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-04 Thread taher koitawala
If anyone wants to see a Flink Streaming pipeline here is a really small and basic Flink pipeline. https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example Consider users playing a game across multiple platforms and we only get the timestamp,

Re: [DISCUSS] Decouple Hudi and Spark (HudiLink / approach)

2019-08-04 Thread vino yang
Hi Nick, Thank you for your more detailed thoughts, and I fully agree with your thoughts about HudiLink, which should also be part of the long-term planning of the Hudi Ecology. *But I found that the angle of our thinking and the starting point are not consistent. I pay more attention to the

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread Suneel Marthi
+1 for Beam -- agree with Semantic Beeng's analysis. On Sat, Aug 3, 2019 at 10:30 PM taher koitawala wrote: > So the way to go around this is that file a hip. Chalk all th classes our > and start moving towards Pure client. > > Secondly should we want to try beam? > > I think there is to much

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread taher koitawala
So the way to go around this is to file a HIP, chalk all the classes out, and start moving towards a pure client. Secondly, should we want to try Beam? I think there is too much going on here and I'm not able to follow. If we want to try out Beam all along, I don't think it makes sense to do anything

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread Vinoth Chandar
>>More for my own edification, how does the recently introduced timeline service play into the delta writer components? TimelineService runs in the Spark driver (DeltaStreamer is a Hudi Spark app) and answers metadata/timeline api calls from the executors.. it is not aware of Spark vs Flink or

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-03 Thread Vinoth Chandar
Decoupling Spark and Hudi is the first step to bring in a Flink runtime, and it's also the hardest part. On the decoupling itself, the IOHandle classes are (almost) unaware of Spark itself, whereas the Write/ReadClient and the Table classes are very aware.. First step here is to probably draw out

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-02 Thread vino yang
Hi Suneel, Thank you for your suggestion, let me clarify. *The context of this email is that we are evaluating how to implement a Stream Delta writer based on Flink.* About the discussion between me, Taher and Vinay, those are just some trivial details in the preparation of the document, and the

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-02 Thread Suneel Marthi
Please keep all discussions to Mailing lists here - no offline discussions please. On Fri, Aug 2, 2019 at 10:22 AM vino yang wrote: > Hi guys, > > Currently, I, Taher and Vinay are working on issue HUDI-184.[1] > > As a first step, we are discussing the design doc. > > After diving into the