Re: Recursive Split Detection + same split optimization

Hang Ruan Mon, 10 Jul 2023 00:19:40 -0700

Hi, Benoit.

A split enumerator responsible for discovering the source splits, and
assigning them to the reader. It seems like that your connector discovering
splits in TM and assigning them in JM.


I think there are 2 choices:
1. If you need the enumerator to assign splits, you have to send the events
about the splits between the source reader and the enumerator.
2. If you can make use of the subtaskId and let every reader read some
scope of the IDs, the enumerator is useless for you.

I am not sure whether you are able to move the discovering splits task back
to the enumerator by multi thread. Putting it to the TM may be weird and
error-prone.

Finally, I have a second problem which is about avoiding extracting
> multiple times the same split. We can imagine, based on my previous
> explanation, that same ID might be detected through multiple parent splits.
> To avoid losing time doing the same job multiple times, I need to avoid
> extracting the same ID.
> Actually, I am thinking about storing the already extracted ID into the
> state and storing it into my state backend. What do you think about this ?
>

For choice 1, put the information in the enumerator's state.
For choice 2, no need to consider that issue.

Best,
Hang

Benoit Tailhades <benoit.tailha...@gmail.com> 于2023年7月10日周一 12:59写道：

> Hello Everyone,
>
> I am trying to implement a custom source where split detection is an
> expensive operation and I would like to benefit from the split reader
> results to build my next splits.
>
> Basically, my source is taking as input an id from my external system,
> let's name it ID1.
>
> From ID1, I can get a list of other sub splits but getting this list is an
> expensive operation so I want it to be done on a task manager during the
> split reading of ID1. Now we can imagine sub splits of ID1 are ID1.1 and
> ID1.2.
> So, to sum up my split reader of ID1 will be responsible for:
> 1. Collecting content of ID1
> 2. Producing n sub splits
> Then, the split enumerator will receive these sub splits and schedule
> ID1.1, ... ID1.n for split reading.
>
> As of now, I have implemented this mechanism using events between split
> reader and split enumerator but I think there might be a better
> architecture using Flink.
>
> Finally, I have a second problem which is about avoiding extracting
> multiple times the same split. We can imagine, based on my previous
> explanation, that same ID might be detected through multiple parent splits.
> To avoid losing time doing the same job multiple times, I need to avoid
> extracting the same ID.
> Actually, I am thinking about storing the already extracted ID into the
> state and storing it into my state backend. What do you think about this ?
>
> Thank you.
>
> Benoit
>
>

Re: Recursive Split Detection + same split optimization

Reply via email to