Based on some conversations I had with Flink folks, including Hudi's very
own mentor Thomas, it seems future-proof to look into supporting the Flink
streaming APIs. The batch APIs, IIUC, will move towards converging with the
streaming APIs, which matches Hudi's model anyway.
From Hudi's perspective,
Hi Vinoth,
IMHO we should stick to Spark for micro batching, for two reasons: 1)
ease of use; 2) performance. Flink batch is not as fast as Spark. Also, the
rich library of functions and the ease of integration that Spark has with
Hive etc. is not there in Flink batch.
Regards,
Taher
Hi Vino,
Agree with your suggestion. We all know that even though Flink is
streaming, we can control how files get rolled out through checkpointing
configurations. With a bad config, small files get rolled out; with a good
config, files are properly sized.
Also I understand the concern of
Hi, a simple example: in the Hudi project, you can find many code snippets like
`spark.read().format().load()`. The load method can take any path, especially
DFS paths. Whereas if we only use Flink streaming, there is not a good
way to read HDFS now. In addition, we also need to consider other
Hi Vino,
This is not a design for Hudi on Flink. This was simply a mock up of
tagLocations() spark cache to Flink state as Vinoth wanted to see.
As for Flink batch and streaming, I am well aware of the batch and
stream unification efforts in Flink. However, I think that is still on
Hi Taher, as I mentioned in the previous mail, things may not be too easy by
just using the Flink state API. Copied here: "Hudi can connect with many different
sources/sinks. Some file-based reads are not appropriate for Flink Streaming."
Although unifying batch and streaming is Flink's goal, it
Hi All,
Sample code to see how record tagging will be handled in
Flink is posted on [1]. The main class to run it is MockHudi.java,
with a sample path for checkpointing.
As of now this is just a sample to show that we should be caching in Flink
state with bare-minimum configs.
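To make the idea concrete, here is a rough plain-Java sketch (all names are hypothetical and this is not the posted sample): a HashMap stands in for Flink keyed state such as MapState, and tagging a record key returns its known file location if one exists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a plain map plays the role of Flink keyed state
// (e.g. MapState after a keyBy on the record key).
class MockTagLocations {
    private final Map<String, String> state = new HashMap<>();

    // Tag a record key: returns the file it was previously written to, if any.
    Optional<String> tag(String recordKey) {
        return Optional.ofNullable(state.get(recordKey));
    }

    // After a write, remember where the key landed (analogous to index updates).
    void update(String recordKey, String fileId) {
        state.put(recordKey, fileId);
    }

    public static void main(String[] args) {
        MockTagLocations idx = new MockTagLocations();
        System.out.println(idx.tag("k1").isPresent()); // false: insert path
        idx.update("k1", "file-001");
        System.out.println(idx.tag("k1").get());       // file-001: update path
    }
}
```

In real Flink, the map would be checkpointed state scoped per key, so the index would survive restarts via the configured state backend.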
Hi Taher,
I agree with this; if the state is becoming too large we should have an
option of storing it in external state like the file system or RocksDB.
@Vinoth Chandar can the state of HoodieBloomIndex go
beyond 10-15 GB?
Regards,
Vinay Patil
On Fri, Sep 20, 2019 at 11:37 AM Taher Koitawala
Hey Guys, Any thoughts on the above idea? To handle HoodieBloomIndex with
HeapState, RocksDBState and FsState but on Spark.
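One way to sketch the pluggable-state idea (the interface and class names here are invented for illustration; this is not an actual Hudi or Flink API): the index code talks to a small interface, and heap, RocksDB, or filesystem variants plug in behind it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: index code depends only on IndexState, while the
// backing store (heap, RocksDB, filesystem) is a user-selected implementation.
interface IndexState {
    String get(String key);
    void put(String key, String fileId);
}

// Heap-backed variant; RocksDB/FS variants would implement the same interface.
class HeapIndexState implements IndexState {
    private final Map<String, String> map = new HashMap<>();
    public String get(String key) { return map.get(key); }
    public void put(String key, String fileId) { map.put(key, fileId); }

    public static void main(String[] args) {
        IndexState s = new HeapIndexState();
        s.put("uuid-1", "file-9");
        System.out.println(s.get("uuid-1")); // file-9
    }
}
```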
On Tue, Sep 17, 2019 at 1:41 PM Taher Koitawala wrote:
Hi Vinoth,
Having seen the doc and code, I understand the
HoodieBloomIndex mainly caches the record key and partition path. Can we address how
Flink does it? Like, have HeapState where the user chooses to cache the
index on heap, RocksDBState where indexes are written to RocksDB, and finally
Alright then. Happy to take the lead here. But please give me a week or so,
to finish up the spark bundling and other jar issues.. Too much context
switching :)
On Mon, Sep 16, 2019 at 6:57 PM vino yang wrote:
Hi guys,
Currently, I am busy with HUDI-203[1] and other things.
I agree with Vinoth that we should try to find a new solution to decouple
the dependency with the Spark RDD cache.
It's an excellent way to start this big work.
[1]: https://issues.apache.org/jira/browse/HUDI-203
+1 This is a pretty large undertaking. While the community is getting their
hands dirty and ramping up on Hudi internals, it would be productive if Vinoth
shepherds this
Balaji.V
On Monday, September 16, 2019, 11:30:44 AM PDT, Vinoth Chandar wrote:
sg. :)
I will wait for others on this thread as well to chime in.
On Mon, Sep 16, 2019 at 11:27 AM Taher Koitawala wrote:
Vinoth, I think right now given your experience with the project you should
be scoping out what needs to be done to take us there. So +1 for giving you
more work :)
We want to reach a point where we can start scoping out the addition of Flink
and Beam components within. Then I think it will be tremendous
I still feel the key thing here is reimplementing HoodieBloomIndex without
needing spark caching.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103093742#Design(non-global)
documents the spark DAG in detail.
If everyone feels it's best for me to scope the work out, then happy
Guys, I think we are slowing down on this again. We need to start planning
small tasks towards this. VC, can you please help fast-track this?
Regards,
Taher Koitawala
On Thu, Aug 15, 2019, 10:07 AM Vinoth Chandar wrote:
Look forward to the analysis. A key class to read would be
HoodieBloomIndex, which uses a lot of spark caching and shuffles.
On Tue, Aug 13, 2019 at 7:52 PM vino yang wrote:
>> Currently Spark Streaming micro batching fits well with Hudi, since it
amortizes the cost of indexing, workload profiling etc. 1 spark micro batch
= 1 hudi commit
With the per-record model in Flink, I am not sure how useful it will be to
support Hudi. E.g., 1 input record cannot be 1 Hudi
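The amortization point can be illustrated with a toy buffer (illustrative only, not Hudi code): records accumulate during a micro batch and a single commit covers them all, whereas a per-record model would pay the commit-time costs on every record.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of micro-batch commits: per-commit costs (indexing,
// workload profiling) are paid once per batch instead of once per record.
class MicroBatchWriter {
    private final List<String> buffer = new ArrayList<>();
    private int commits = 0;

    void write(String record) { buffer.add(record); }

    // Called at the batch boundary: one commit absorbs the whole buffer.
    int commitBatch() {
        commits++;
        int size = buffer.size();
        buffer.clear();
        return size;
    }

    int totalCommits() { return commits; }

    public static void main(String[] args) {
        MicroBatchWriter w = new MicroBatchWriter();
        w.write("r1"); w.write("r2"); w.write("r3");
        System.out.println(w.commitBatch());   // 3 records, one commit
        System.out.println(w.totalCommits());  // 1
    }
}
```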
Hi Nick and Taher,
I just want to answer Nishith's question. Reference his old description
here:
> You can do a parallel investigation while we are deciding on the module
structure. You could be looking at all the patterns in Hudi's Spark APIs
usage (RDD/DataSource/SparkContext) and see if such
Hi Vino,
According to what I've seen, Hudi has a lot of Spark components flowing
through it, like TaskContexts, JavaSparkContexts etc. The main classes I
guess we should focus on are HoodieTable and the Hoodie write clients.
Also Vino, I don't think we should be providing Flink dataset
Hi all,
After doing some research, let me share my information:
- Limitation of computing engine capabilities: Hudi uses Spark's
RDD#persist, and Flink currently has no API to cache datasets. Maybe we can
only choose to use external storage, or not use a cache? For the use of
other
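A minimal sketch of the "external storage" fallback, using a local temp file in place of any real DFS (names and approach are illustrative, not a proposal for Hudi's actual design): the intermediate dataset is written out once and re-read instead of being held in an engine-managed cache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Illustrative only: where Spark would RDD#persist an intermediate dataset,
// an engine without a cache API could spill it to storage and re-read it.
class ExternalSpill {
    // Write the intermediate records out once.
    static Path spill(List<String> records) throws IOException {
        Path tmp = Files.createTempFile("hudi-intermediate", ".txt");
        Files.write(tmp, records);
        return tmp;
    }

    // Re-read them in a later stage, instead of hitting a cache.
    static List<String> reload(Path path) throws IOException {
        return Files.readAllLines(path);
    }

    public static void main(String[] args) throws IOException {
        Path p = spill(Arrays.asList("uuid-1,part-a", "uuid-2,part-b"));
        System.out.println(reload(p));
    }
}
```

The tradeoff versus RDD#persist is an extra round trip to storage per re-use of the dataset.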
Thanks a ton Vinoth.
On Wed, Aug 7, 2019 at 4:34 PM Vinoth Chandar wrote:
>>Are there some tasks I can take up to ramp up the code?
Certainly. There are some open tasks that touch the hoodie-client and
hoodie-utilities module.
https://issues.apache.org/jira/browse/HUDI-37
https://issues.apache.org/jira/browse/HUDI-194
https://issues.apache.org/jira/browse/HUDI-145
+1 for Approach 1 Point integration with each framework.
Pros for point integration
- The Hudi community is already familiar with Spark and Spark-based
actions/shuffles etc. Since both modules can be decoupled, this enables us
to have a steady release for Hudi for one execution engine (Spark) while we
+1 on approach 1. As pointed out, approach 2 has a risk of performance
regression when introducing the Beam abstraction. To keep things simpler and start
iterating, we can try an incremental route where Beam can be thought of as another
engine supporting Hudi. When there is material confidence that
Hi Vinoth,
Are there some tasks I can take up to ramp up the code? Want to get
more used to the code and understand the existing implementation better.
Thanks,
Taher Koitawala
On Tue, Aug 6, 2019, 10:02 PM Vinoth Chandar wrote:
Let's see if others have any thoughts as well. We can plan to fix the
approach by EOW.
On Mon, Aug 5, 2019 at 7:06 PM vino yang wrote:
> Hi guys,
>
> Also, +1 for Approach 1 like Taher.
>
> > If we can do a comprehensive analysis of this model and come up with
> > means to refactor this
+1 for Approach 1, point integration with each framework.
Approach 2 has a problem, as you said: "Developers need to think about
what-if-this-piece-of-code-ran-as-spark-vs-flink.. So in the end, this may
not be the panacea that it seems to be."
We have seen various pipelines in the beam dag being
Great discussions! Responded on the original thread on decoupling.
Let's continue there?
On Mon, Aug 5, 2019 at 1:39 AM Semantic Beeng
wrote:
> "design is more important. When we have a clear idea, it is not too late
> to create an issue"
>
> 100% with Vino
>
>
> On August 5, 2019 at 2:50 AM
Would like to highlight that there are two distinct approaches here with
different tradeoffs. Think of this as my braindump, as I have been thinking
about this quite a bit in the past.
*Approach 1 : Point integration with each framework *
>>We may need a pure client module named for example
Hi Taher,
IMO, Let's listen to more comments, after all, this discussion took place
over the weekend. Then listen to Vinoth and the community's comments and
suggestions.
I personally think that design is more important. When we have a clear
idea, it is not too late to create an issue.
I am
If everyone agrees that we should decouple Hudi and Spark to enable
processing-engine abstraction, should I open a JIRA ticket for that?
On Sun, Aug 4, 2019 at 6:59 PM taher koitawala wrote:
If anyone wants to see a Flink Streaming pipeline here is a really small
and basic Flink pipeline.
https://github.com/taherk77/FlinkHudi/tree/master/FlinkHudiExample/src/main/java/com/flink/hudi/example
Consider users playing a game across multiple platforms and we only get the
timestamp,
Hi Nick,
Thank you for your more detailed thoughts, and I fully agree with your
thoughts about HudiLink, which should also be part of the long-term
planning of the Hudi Ecology.
*But I found that the angle of our thinking and the starting point are not
consistent. I pay more attention to the
+1 for Beam -- agree with Semantic Beeng's analysis.
On Sat, Aug 3, 2019 at 10:30 PM taher koitawala wrote:
So the way to go around this is to file a HIP, chalk all the classes out,
and start moving towards a pure client.
Secondly, should we want to try Beam?
I think there is too much going on here and I'm not able to follow. If we
want to try out Beam all along, I don't think it makes sense to do anything
>>More for my own edification, how does the recently introduced
timeline service play into the delta writer components?
TimelineService runs in the Spark driver (DeltaStreamer is a Hudi Spark
app) and answers metadata/timeline API calls from the executors. It is not
aware of Spark vs Flink or
Decoupling Spark and Hudi is the first step to bring in a Flink runtime,
and it's also the hardest part.
On the decoupling itself, the IOHandle classes are (almost) unaware of
Spark itself, where the Write/ReadClient and the Table classes are very
aware..
First step here is to probably draw out
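A hedged sketch of what drawing out the abstraction could look like (the interface and names are invented here for illustration, not Hudi's actual API): engine-specific parallelism hides behind a small context interface, so Table/Client code need not reference JavaSparkContext directly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine abstraction: Table/WriteClient code would call this
// instead of JavaSparkContext, and each engine (Spark, Flink) supplies an impl.
interface EngineContext {
    <I, O> List<O> map(List<I> data, Function<I, O> fn);
}

// Trivial local implementation; a Spark impl would delegate the same call
// to RDD transformations, and a Flink impl to DataStream operators.
class LocalEngineContext implements EngineContext {
    public <I, O> List<O> map(List<I> data, Function<I, O> fn) {
        return data.stream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        EngineContext ctx = new LocalEngineContext();
        List<Integer> doubled = ctx.map(Arrays.asList(1, 2, 3), x -> x * 2);
        System.out.println(doubled); // [2, 4, 6]
    }
}
```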
Hi Suneel,
Thank you for your suggestion, let me clarify.
*The context of this email is that we are evaluating how to implement a
Stream Delta writer based on Flink.*
About the discussion between me, Taher and Vinay, those are just some
trivial details in the preparation of the document, and the
Please keep all discussions to Mailing lists here - no offline discussions
please.
On Fri, Aug 2, 2019 at 10:22 AM vino yang wrote:
> Hi guys,
>
> Currently, I, Taher and Vinay are working on issue HUDI-184.[1]
>
> As a first step, we are discussing the design doc.
>
> After diving into the