'Custom' mapping function on keyed WindowedStream

2018-02-26 Thread Marchant, Hayden
I would like to create a custom aggregator function for a windowed KeyedStream which I have complete control over - i.e. instead of implementing an AggregatorFunction, I would like to control the lifecycle of the flink state by implementing the CheckpointedFunction interface, though I still

RE: Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
Forget to mention that my target Kafka version is 0.11.x with aim to upgrade to 1.0 when 1.0.x fixpack is released. From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org] Sent: Thursday, February 08, 2018 8:05 PM To: user@flink.apache.org; Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.

RE: Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
? Thanks Hayden From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org] Sent: Thursday, February 08, 2018 8:05 PM To: user@flink.apache.org; Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com> Subject: RE: Kafka as source for batch job Hi Marchant, Yes I agree. In general, the isEndOf

RE: Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
of KafkaFetcher.runFetchLoop that has slightly different logic for changing running to be false. What would you recommend in this case? From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org] Sent: Thursday, February 08, 2018 12:24 PM To: user@flink.apache.org; Marchant, Hayden [ICG-IT] <h

Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
I know that traditionally Kafka is used as a source for a streaming job. In our particular case, we are looking at extracting records from a Kafka topic from a particular well-defined offset range (per partition) - i.e. from offset X to offset Y. In this case, we'd somehow want the application

RE: S3 for state backend in Flink 1.4.0

2018-02-07 Thread Marchant, Hayden
WE actually got it working. Essentially, it's an implementation of HadoopFilesytem, and was written with the idea that it can be used with Spark (since it has broader adoption than Flink as of now). We managed to get it configured, and found the latency to be much lower than by using the s3

RE: Latest version of Kafka

2018-02-07 Thread Marchant, Hayden
Thanks for the info! -Original Message- From: Piotr Nowojski [mailto:pi...@data-artisans.com] Sent: Friday, February 02, 2018 4:37 PM To: Marchant, Hayden [ICG-IT] <hm97...@imceu.eu.ssmb.com> Cc: user@flink.apache.org Subject: Re: Latest version of Kafka Hi, Flink as for now pr

Latest version of Kafka

2018-02-01 Thread Marchant, Hayden
What is the newest version of Kafka that is compatible with Flink 1.4.0? I see the last version of Kafka supported is 0.11 , from documentation, but has any testing been done with Kafka 1.0? Hayden Marchant

Reading bounded data from Kafka in Flink job

2018-02-01 Thread Marchant, Hayden
I have 2 datasets that I need to join together in a Flink batch job. One of the datasets needs to be created dynamically by completely 'draining' a Kafka topic in an offset range (start and end), and create a file containing all messages in that range. I know that in Flink streaming I can

RE: S3 for state backend in Flink 1.4.0

2018-02-01 Thread Marchant, Hayden
Edward, We are using Object Storage for checkpointing. I'd like to point out that we were seeing performance problems using the S3 protocol. Btw, we had quite a few problems using the flink-s3-fs-hadoop jar with Object Storage and had to do some ugly hacking to get it working all over. We

RE: Joining data in Streaming

2018-01-30 Thread Marchant, Hayden
Stefan, So are we essentially saying that in this case, for now, I should stick to DataSet / Batch Table API? Thanks, Hayden -Original Message- From: Stefan Richter [mailto:s.rich...@data-artisans.com] Sent: Tuesday, January 30, 2018 4:18 PM To: Marchant, Hayden [ICG-IT] <h

Joining data in Streaming

2018-01-30 Thread Marchant, Hayden
We have a use case where we have 2 data sets - One reasonable large data set (a few million entities), and a smaller set of data. We want to do a join between these data sets. We will be doing this join after both data sets are available. In the world of batch processing, this is pretty

RE: S3 for state backend in Flink 1.4.0

2018-01-28 Thread Marchant, Hayden
that interfaces directly to IBM Object Storage that can be configured through the hdfs-site.xml called stocator that might speed things up. -Original Message- From: Aljoscha Krettek [mailto:aljos...@apache.org] Sent: Thursday, January 25, 2018 6:30 PM To: Marchant, Hayden [ICG-IT] <h

S3 for state backend in Flink 1.4.0

2018-01-24 Thread Marchant, Hayden
Hi, We have a Flink Streaming application that uses S3 for storing checkpoints. We are not using 'regular' S3, but rather IBM Object Storage which has an S3-compatible connector. We had quite some challenges in overiding the endpoint from the default s3.amnazonaws.com to our internal IBM

Hardware Reference Architecture

2017-12-07 Thread Marchant, Hayden
Hi, I'm looking for guidelines for Reference architecture for Hardware for a small/medium Flink cluster - we'll be installing on in-house bare-metal servers. I'm looking for guidance for: 1. Number and spec of CPUs 2. RAM 3. Disks 4. Network 5. Proximity of servers to each other (Most

TaskManager HA on YARN

2017-12-04 Thread Marchant, Hayden
Hi, WE are currently start to test Flink running on YARN. Till now, we've been testing on Standalone Cluster. One thing lacking in standalone is that we have to manually restart a Task Manager if it dies. I looked at

Garbage collection concerns with Task Manager memory

2017-10-18 Thread Marchant, Hayden
I read in the Flink documentation that the TaskManager runs all tasks within its own JVM, and that the recommendation is to set the taskmanager.heap.mb to be as much as is available on the server. I have a very large server with 192GB so thinking of giving most of it to the Task Manager. I

start-cluster.sh not working in HA mode

2017-10-16 Thread Marchant, Hayden
I am attempting to run Flink 1.3.2 in HA mode with zookeeper. When I run the start-cluster.sh, the job manager is not started, even though the task manager is started. When I delved into this, I saw that the command: ssh -n $FLINK_SSH_OPTS $master -- "nohup /bin/bash -l

In-memory cache

2017-10-02 Thread Marchant, Hayden
We have an operator in our streaming application that needs to access 'reference data' that is updated by another Flink streaming application. This reference data has about ~10,000 entries and has a small footprint. This reference data needs to be updated ~ every 100 ms. The required latency

Testing recoverable job state

2017-09-13 Thread Marchant, Hayden
I'm a newbie to Flink and am trying to understand how the recovery works using state backends. I've read the documentation and am now trying to run a simple test to demonstrate the abilities - I'd like to test the recovery of a flink job and how the state is recovered from where it left off

RE: Queryable State

2017-09-13 Thread Marchant, Hayden
I can see the job running in the FlinkUI for the job, and specifically specified the port for the Job Manager. When I provided a different port, I got an akka exception. Here, it seems that the code is getting further. I think that it might be connected with how I am creating the

QueryableState - No KvStateLocation found for KvState instance

2017-09-13 Thread Marchant, Hayden
I am trying to use queryable state, and am encountering issues when querying the state from the client. I get the following exception: Exception in thread "main" org.apache.flink.runtime.query.UnknownKvStateLocation: No KvStateLocation found for KvState instance with name 'word_sums'.

Shuffling between map and keyBy operator

2017-09-05 Thread Marchant, Hayden
I have a streaming application that has a keyBy operator followed by an operator working on the keyed values (a custom sum operator). If the map operator and aggregate operator are running on same Task Manager , will Flink always serialize and deserialize the tuples, or is there an optimization

Very low-latency - is it possible?

2017-08-31 Thread Marchant, Hayden
We're about to get started on a 9-person-month PoC using Flink Streaming. Before we get started, I am interested to know how low-latency I can expect for my end-to-end flow for a single event (from source to sink). Here is a very high-level description of our Flink design: We need at least

RE: Using local FS for checkpoint

2017-08-31 Thread Marchant, Hayden
I didn’t think about NFS. That would save me the hassle of installing HDFS cluster just for that, especially if my organization already has an NFS ‘handy’. Thanks Hayden From: Tony Wei [mailto:tony19920...@gmail.com] Sent: Thursday, August 31, 2017 12:12 PM To: Marchant, Hayden [ICG-IT] Cc

Using local FS for checkpoint

2017-08-31 Thread Marchant, Hayden
Whether I use RocksDB or FS State backends, if my requirements are to have fault-tolerance and ability to recover with 'at-least once' semantics for my Flink job, is there still a valid case for using a backing local FS for storing states? i.e. If a Flink Node is invalidated, I would have