Re: A "per operator instance" window all ?

Julien Tue, 20 Feb 2018 00:28:48 -0800

Hi Xingcan, Ken and Till,

OK, thank you. It is clear.


I have various options then:

 * the one suggested by Ken where I can find a way to build a key that
   will be well distributed (1 key per task)
     o it relies on the way Flink partitions the key, but it will do
       the job
 * or I can also go with another way to build my key where I will have
   more keys than the parallelism, so the distribution will be better
     o I will still have a small number of requests (much fewer than the
       number of resource ids, as 1 key will cover multiple resource ids)
     o I will potentially do multiple requests on the same task, but it
       may be acceptable, especially if I go with AsyncIO
 * or I can go with the OperatorState and implement my own firing logic
     o I am in a case where the memory-based mechanism should be fine
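
The second option (more keys than the parallelism) could look roughly like the sketch below. It is plain Java with no Flink dependency; `BucketKeys`, `bucketKey` and `groupByBucket` are hypothetical names, and the bucket count is an assumption (for instance a small multiple of the parallelism). It uses `Math.floorMod` rather than `%` because `hashCode()` can be negative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: maps resource ids onto a fixed number of bucket keys. */
final class BucketKeys {

    /** Returns a bucket in [0, numBuckets), even when hashCode() is negative. */
    static int bucketKey(String resourceId, int numBuckets) {
        return Math.floorMod(resourceId.hashCode(), numBuckets);
    }

    /** Groups resource ids by bucket, i.e. what one request per bucket would cover. */
    static Map<Integer, List<String>> groupByBucket(List<String> resourceIds, int numBuckets) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String id : resourceIds) {
            buckets.computeIfAbsent(bucketKey(id, numBuckets), k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }
}
```

In a job, `bucketKey` would be the key selector, so each window covers one bucket and triggers one request for all resource ids in that bucket.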


Thanks again,
Regards.


Julien.


On 20/02/2018 02:48, Xingcan Cui wrote:

Hi Julien,
you could use the OperatorState <https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/state.html#using-managed-operator-state> to cache the data in a window and the last time your window fired. Then you check the ctx.timerService().currentProcessingTime() in processElement() and once it exceeds the next window boundary, all the cached data should be processed as if the window is fired.
Note that currently, there are only memory-based operator states provided.
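
The firing logic described above can be sketched without any Flink dependency. The class below is only an illustration, not Flink API: `ManualTumblingBuffer`, its `processElement` method and the `onFire` callback are hypothetical names, the in-memory list stands in for the operator state, and the caller is assumed to pass in the value it reads from ctx.timerService().currentProcessingTime().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: buffers elements and flushes them when a window boundary is crossed. */
final class ManualTumblingBuffer<T> {
    private static final long UNSET = Long.MIN_VALUE;

    private final long windowSizeMs;
    private final Consumer<List<T>> onFire;
    private final List<T> buffer = new ArrayList<>(); // stands in for operator state
    private long nextBoundary = UNSET;                // end of the current window

    ManualTumblingBuffer(long windowSizeMs, Consumer<List<T>> onFire) {
        this.windowSizeMs = windowSizeMs;
        this.onFire = onFire;
    }

    /** Called once per element, with the current processing time in milliseconds. */
    void processElement(T element, long currentProcessingTime) {
        if (nextBoundary == UNSET) {
            nextBoundary = windowEnd(currentProcessingTime);
        } else if (currentProcessingTime >= nextBoundary) {
            // The boundary has passed: handle the cached data as if the window fired.
            if (!buffer.isEmpty()) {
                onFire.accept(new ArrayList<>(buffer));
                buffer.clear();
            }
            nextBoundary = windowEnd(currentProcessingTime);
        }
        buffer.add(element);
    }

    /** End of the tumbling window that contains the given timestamp. */
    private long windowEnd(long timestamp) {
        return timestamp - Math.floorMod(timestamp, windowSizeMs) + windowSizeMs;
    }
}
```

Because the check only runs when an element arrives, the last buffered batch waits until the next element shows up; a real job would have to accept that delay or flush in some other way.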

Hope this helps,
Xingcan
On 19 Feb 2018, at 4:34 PM, Julien <jmassio...@gmail.com> wrote:
Hello,
I've already tried to key my stream with "resourceId.hashCode % parallelism" (with a parallelism of 4 in my example). So all my keys will be either 0, 1, 2 or 3. I can then benefit from a time window on this keyed stream and do only 4 queries to my external system. But it is not well distributed with the default partitioner on a keyed stream (keys 0, 1, 2 and 3 only go to operator idx 2 and 3).
I think I should explore the custom partitioner, as you suggested Xingcan. Maybe my last question on this will be: can you give me more details on this point, "and simulate a window operation by yourself in a ProcessFunction" ?
When I look at the documentation about the custom partitioner, I can see that the result of partitionCustom is a DataStream.
It is not a KeyedStream.
So the only window I have will be windowAll (which will bring me back to a parallelism of 1, no ?).
And if I do something like "myStream.partitionCustom(<my new partitioner>, <my key>).keyBy(<myKey>).window(...)", will it preserve my custom partitioner ? When looking at the "KeyedStream" class, it seems that it will go back to the "KeyGroupStreamPartitioner" and forget my custom partitioner ?
Thanks again for your feedback,

Julien.


On 19/02/2018 03:45, 周思华 wrote:
Hi Julien,
If I am not misunderstanding, I think you can key your stream on `Random.nextInt() % parallelism`; this way you can "group" together alerts from different resources and benefit from multiple parallel instances.
On 02/19/2018 09:08, Xingcan Cui <xingc...@gmail.com> wrote:
Hi Julien,
sorry for my misunderstanding before. For now, the window can only be defined on a KeyedStream or an ordinary DataStream but with parallelism = 1. I'd like to provide three options for your scenario.
1. If your external data is static and can fit into memory, you can use ManagedStates to cache it without considering the querying problem.
2. Or you can use a CustomPartitioner to manually distribute your alert data and simulate a window operation by yourself in a ProcessFunction.
3. You may also choose to use some external system such as an in-memory store, which can work as a cache for your queries.
Best,
Xingcan
On 19 Feb 2018, at 5:55 AM, Julien <jmassio...@gmail.com> wrote:
Hi Xingcan,

Thanks for your answer.
Yes, I understand that point:
• if I have 100 resource IDs with a parallelism of 4, then each operator instance will handle about 25 keys
The issue I have is that I want, on a given operator instance, to group those 25 keys together in order to do only 1 query to an external system per operator instance:
• on a given operator instance, I will do 1 query for my 25 keys
• so with the 4 operator instances, I will do 4 queries in parallel (with about 25 keys per query)
I do not know how I can do that.
If I define a window on my keyed stream (with for example stream.keyBy(_.resourceId).window(TumblingProcessingTimeWindows.of(Time.milliseconds(500)))), then my understanding is that the window is "associated" to the key. So in this case, on a given operator instance, I will have 25 of those windows (one per key), and I will do 25 queries (instead of 1).
Do you understand my point ?
Or maybe am I missing something ?
I'd like to find a way on operator instance 1 to group all the alerts received on those 25 resource ids and do 1 query for those 25 resource ids.
Same thing for operator instance 2, 3 and 4.


Thank you,
Regards.


On 18/02/2018 14:43, Xingcan Cui wrote:
Hi Julien,
the cardinality of your keys (e.g., resource ID) will not be restricted to the parallelism. For instance, if you have 100 resource IDs processed by KeyedStream with parallelism 4, each operator instance will handle about 25 keys.
Hope that helps.

Best,
Xingcan
On 18 Feb 2018, at 8:49 PM, Julien <jmassio...@gmail.com> wrote:

Hi,
I am pretty new to Flink and I don't know what will be the best way to deal with the following use case:
• as an input, I receive some alerts from a Kafka topic
• an alert is linked to a network resource (like router-1, router-2, switch-1, switch-2, ...)
• so an alert has two main pieces of information (the alert id and the resource id of the resource on which this alert has been raised)
• then I need to do a query to an external system in order to enrich the alert with additional information on the resource
(A "natural" candidate for the key on this stream will be the resource id)
The issue I have concerns the query to the external system:
• I do not want to do 1 query per resource id
• I want to do a small number of queries in parallel (for example 4 queries in parallel every 500ms), each query requesting the external system for several alerts linked to several resource ids
Currently, I don't know what will be the best way to deal with that:
• I can key my stream on the resource id and then define a processing time window of 500ms and when the trigger is ok, then I do my query
   • by doing so, I will "group" several alerts in a single query, but they will all be linked to the same resource.
   • so I will do 1 query per resource id (which will be too much in my use case)
• I can also do a windowAll on a non keyed stream
   • by doing so, I will "group" together alerts from different resource ids, but from what I've read in such a case the parallelism will always be one.
   • so in this case, I will only do 1 query whereas I'd like to have some parallelism
I am thinking that a way to deal with that will be:
• define the resource id as the key of the stream and set a parallelism of 4
• and then having a way to do a windowAll on this keyed stream
   • that is, on a given operator instance, I will "group" in the same window all the keys (i.e. all the resource ids) managed by this operator instance
   • with a parallelism of 4, I will do 4 queries in parallel (1 per operator instance, and each query will be for several alerts linked to several resource ids)
But after looking at the documentation, I cannot see this ability (having a windowAll on a keyed stream).
Am I missing something?

What will be the best way to deal with such a use case?
I've tried for example to review my key and to do something like "resourceId.hashCode % <max nb of queries in parallel>" and then to use a time window.
In my example above, the <max nb of queries in parallel> will be 4. And all my keys will be 0, 1, 2 or 3.
The issue with this approach is that due to the way the operatorIdx is computed based on the key, it does not distribute my processing well:
• when this partitioning logic from the "KeyGroupRangeAssignment" class is applied

    /**
     * Assigns the given key to a parallel operator index.
     *
     * @param key the key to assign
     * @param maxParallelism the maximum supported parallelism, aka the number of key-groups.
     * @param parallelism the current parallelism of the operator
     * @return the index of the parallel operator to which the given key should be routed.
     */
    public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
        return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
    }

    /**
     * Assigns the given key to a key-group index.
     *
     * @param key the key to assign
     * @param maxParallelism the maximum supported parallelism, aka the number of key-groups.
     * @return the key-group to which the given key is assigned
     */
    public static int assignToKeyGroup(Object key, int maxParallelism) {
        return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
    }

• keys 0, 1, 2 and 3 are only assigned to operators 2 and 3 (so 2 of my 4 operators will not have anything to do)
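
One possible way around this skew is to search for keys that each land on a different operator index, instead of using 0, 1, 2 and 3 directly. The sketch below is plain Java and only illustrates the search: `BalancedKeySearch`, `findOneKeyPerOperator` and `standInAssignment` are hypothetical names, and `standInAssignment` is a made-up stand-in for the key-to-operator mapping, NOT Flink's actual hashing, so the keys it finds do not apply to a real job.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch: brute-force search for one key per parallel operator. */
final class BalancedKeySearch {

    /**
     * Entry i of the result is the smallest key in [0, searchLimit) that
     * keyToOperator routes to operator i, or -1 if no such key was found.
     */
    static int[] findOneKeyPerOperator(int parallelism, int searchLimit, IntUnaryOperator keyToOperator) {
        int[] keys = new int[parallelism];
        Arrays.fill(keys, -1);
        int found = 0;
        for (int candidate = 0; candidate < searchLimit && found < parallelism; candidate++) {
            int operator = keyToOperator.applyAsInt(candidate);
            if (operator >= 0 && operator < parallelism && keys[operator] == -1) {
                keys[operator] = candidate;
                found++;
            }
        }
        return keys;
    }

    /** Stand-in only (assumption): scrambles the key, picks a key group, then an operator. */
    static int standInAssignment(int key, int maxParallelism, int parallelism) {
        int mixed = key * 0x9E3779B9;
        mixed ^= mixed >>> 16;
        int keyGroup = Math.floorMod(mixed, maxParallelism);
        return keyGroup * parallelism / maxParallelism;
    }
}
```

In a real job the function passed in would presumably wrap the assignKeyToParallelOperator(key, maxParallelism, parallelism) method quoted above. The keys found that way depend on Flink's internal hashing and would need to be recomputed whenever the parallelism or the max parallelism changes.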
So, what will be the best way to deal with that?



Thank you in advance for your support.

Regards.



Julien.
