Achieving 1 ms in any distributed system might be problematic, because even the 
simplest ping messages between worker nodes take ~0.2 ms.

However, since your desired throughput (40k records/s) is modest and your state 
is small, maybe there is no need for a distributed system at all? You could try 
running a single-node Flink instance (or a 2-node instance with parallelism set 
to 1, just for automatic failure recovery). 
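For example, something along these lines - the socket source, the comma-separated 
key layout, and the counter state below are only placeholders for your actual 
Kafka source and operators - keeps the whole pipeline in one JVM and flushes 
buffers immediately:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LowLatencySketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);    // keep the whole pipeline in a single task slot / JVM
        env.setBufferTimeout(0);  // flush records immediately instead of waiting for buffers to fill
        env.enableCheckpointing(1000, CheckpointingMode.AT_LEAST_ONCE);  // recovery, at-least-once

        // socketTextStream stands in for the Kafka source (FlinkKafkaConsumer) of the real job
        env.socketTextStream("localhost", 9999)
           .keyBy(line -> line.split(",")[0])   // first keyBy on the parsed key
           .map(new CountPerKey())              // tiny keyed-state update
           .keyBy(line -> line.split(",")[1])   // second keyBy, second simple operator
           .map(new CountPerKey())
           .print();

        env.execute("low-latency-poc-sketch");
    }

    // Minimal keyed-state update: counts records per key in Flink managed state.
    public static class CountPerKey extends RichMapFunction<String, String> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public String map(String value) throws Exception {
            Long current = count.value();
            count.update(current == null ? 1L : current + 1);
            return value;
        }
    }
}

Setting the buffer timeout to 0 trades a bit of throughput for latency, which 
should be fine at 40k records/s, and checkpointing in AT_LEAST_ONCE mode covers 
your at-least-once requirement.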

As Jörn wrote earlier, it might be even simpler to write a small custom 
standalone Java application for this. As long as your state fits into the memory 
of a single node, you should easily be able to process millions of records per 
second on a single machine. 
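A rough sketch of such a standalone application - broker address, topic name, 
comma-separated payload, and the per-key counter are all assumptions - with the 
whole state kept in a plain HashMap:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StandaloneProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "poc-processor");
        props.put("enable.auto.commit", "false");          // commit manually after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // All state lives in this one JVM: a plain HashMap keyed by the event key.
        Map<String, Long> state = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));  // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(10);
                for (ConsumerRecord<String, String> record : records) {
                    String key = record.value().split(",")[0];  // "parse" step
                    state.merge(key, 1L, Long::sum);            // tiny per-key state update
                    // emit the result to your sink here
                }
                consumer.commitSync();  // committing after processing gives at-least-once
            }
        }
    }
}

Note that with a purely in-memory HashMap you would need to snapshot the state or 
replay the topic after a crash, since only the Kafka offsets survive a restart.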

Piotrek

> On Aug 31, 2017, at 3:01 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> If you really need to get that low, something else might be more suitable. 
> Given those times, a custom solution might be necessary. Flink is a generic, 
> powerful framework - hence it does not target such low latencies. 
> 
>> On 31. Aug 2017, at 14:50, Marchant, Hayden <hayden.march...@citi.com> wrote:
>> 
>> We're about to get started on a 9-person-month PoC using Flink Streaming. 
>> Before we get started, I am interested to know how low a latency I can 
>> expect for my end-to-end flow for a single event (from source to sink). 
>> 
>> Here is a very high-level description of our Flink design: 
>> We need at-least-once semantics. Our main application flow is parsing 
>> a message (< 50 microseconds) from Kafka, then doing a keyBy on the 
>> parsed event (< 1 KB), then updating a very small user state in the 
>> KeyedStream, and then doing another keyBy followed by another operator on 
>> that KeyedStream. Each of the operators is a very simple operation - very 
>> little calculation and no I/O.
>> 
>> 
>> ** Our requirement is to get close to 1 ms (99th percentile) or lower for 
>> end-to-end processing (the timer starts once we get the message from Kafka). 
>> Is this at all realistic if our flow contains 2 aggregations? If so, what 
>> optimizations might we need to get there regarding cluster configuration 
>> (both Flink and hardware)? Our throughput is possibly small enough (40,000 
>> events per second) that we could run on one node - which might eliminate 
>> some network latency. 
>> 
>> I did read in 
>> https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
>> (Exactly Once vs. At Least Once) that a few milliseconds is considered 
>> super low latency - wondering if we can get lower.
>> 
>> Any advice or 'war stories' are very welcome.
>> 
>> Thanks,
>> Hayden Marchant
>> 
>> 
