Re: Controlling group partitioning with DataStream

Ken Krugler Fri, 18 Mar 2022 15:18:43 -0700

Hi Guowei,

Thanks for following up on this, sorry I missed your email earlier.


Unfortunately I don’t think auto-rebalancing will help my situation, because I 
have a small number of unique key values (low cardinality).

And processing these groups (training one deep-learning model per group) 
requires a lot fo memory, so I need to ensure only one group per slot.

Regards,

— Ken


> On Mar 8, 2022, at 8:35 PM, Guowei Ma <guowei....@gmail.com> wrote:
> 
> Hi, Ken
> 
> If you are talking about the Batch scene, there may be another idea that the 
> engine automatically and evenly distributes the amount of data to be 
> processed by each Stage to each worker node. This also means that, in some 
> cases, the user does not need to manually define a Partitioner.
> 
> At present, Flink has a FLIP-187 [1], which is working in this direction, but 
> to achieve the above goals may also require the follow up work of FLIP-186 
> [2]. After the release of 1.15, we will carry out the "Auto-rebalancing of 
> workloads" related work as soon as possible, you can pay attention to the 
> progress of this FLIP.
> 
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-187%3A+Adaptive+Batch+Job+Scheduler
>  
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-187%3A+Adaptive+Batch+Job+Scheduler>
> [2] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-187%3A+Adaptive+Batch+Job+Scheduler#FLIP187:AdaptiveBatchJobScheduler-Futureimprovements
>  
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-187%3A+Adaptive+Batch+Job+Scheduler#FLIP187:AdaptiveBatchJobScheduler-Futureimprovements>
> 
> Best,
> Guowei
> 
> 
> On Wed, Mar 9, 2022 at 8:44 AM Ken Krugler <kkrugler_li...@transpac.com 
> <mailto:kkrugler_li...@transpac.com>> wrote:
> Hi Dario,
> 
> Just to close the loop on this, I answered my own question on SO.
> 
> Unfortunately it seems like the recommended solution is to do the same hack I 
> did a while ago, which is to generate (via trial-and-error) a key that gets 
> assigned to the target slot.
> 
> I was hoping for something a bit more elegant :)
> 
> I think it’s likely I could make it work by implementing my own version of 
> KeyGroupStreamPartitioner, but as I’d noted in my SO question, that would 
> involve use of some internal-only classes, so maybe not a win.
> 
> — Ken
> 
> 
>> On Mar 4, 2022, at 3:14 PM, Dario Heinisch <dario.heini...@gmail.com 
>> <mailto:dario.heini...@gmail.com>> wrote:
>> 
>> Hi, 
>> 
>> I think you are looking for this answer from David: 
>> https://stackoverflow.com/questions/69799181/flink-streaming-do-the-events-get-distributed-to-each-task-slots-separately-acc
>>  
>> <https://stackoverflow.com/questions/69799181/flink-streaming-do-the-events-get-distributed-to-each-task-slots-separately-acc>
>> I think then you could technically create your partitioner - though little 
>> bit cubersome - by mapping your existing keys to new keys who will have then 
>> an output to the desired
>> group & slot. 
>> 
>> Hope this may help, 
>> 
>> Dario
>> 
>> On 04.03.22 23:54, Ken Krugler wrote:
>>> Hi all,
>>> 
>>> I need to be able to control which slot a keyBy group goes to, in order to 
>>> compensate for a badly skewed dataset.
>>> 
>>> Any recommended approach to use here?
>>> 
>>> Previously (with a DataSet) I used groupBy followed by a withPartitioner, 
>>> and provided my own custom partitioner.
>>> 
>>> I posted this same question to 
>>> https://stackoverflow.com/questions/71357833/equivalent-of-dataset-groupby-withpartitioner-for-datastream
>>>  
>>> <https://stackoverflow.com/questions/71357833/equivalent-of-dataset-groupby-withpartitioner-for-datastream>
>>> 
>>> Thanks,
>>> 
>>> — Ken
> 
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com <http://www.scaleunlimited.com/>
> Custom big data solutions
> Flink, Pinot, Solr, Elasticsearch
> 
> 
> 

--------------------------
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch

Re: Controlling group partitioning with DataStream

Reply via email to