Re: Specifying Schema dynamically

2017-02-12 Thread Tzu-Li (Gordon) Tai
Hi Luqman, From your description, it seems like you want to infer the type (case class, tuple, etc.) of a stream dynamically at runtime. AFAIK, I don’t think this is supported in Flink. You’re required to have defined types for your DataStreams. Could you also provide an example code of
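
A minimal sketch of what "defined types" means in practice, assuming a Kafka 0.9 source (the `EventSchema` deserialization schema and the `Event` POJO are hypothetical names):

```java
// Element types must be fixed when the job graph is built; a POJO works:
public class Event {
    public String id;   // Flink POJOs need public fields (or getters/setters)
    public long count;
    public Event() {}   // ...and a public no-argument constructor
}

// The DataStream is then statically typed to that class:
DataStream<Event> events = env.addSource(
    new FlinkKafkaConsumer09<>("events", new EventSchema(), properties));
```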

Re: Start streaming tuples depending on another streams rate

2017-02-09 Thread Tzu-Li (Gordon) Tai
Hi Jonas, A few things to clarify first: Stream A has a rate of 100k tuples/s. After processing the whole Kafka queue, the rate drops to 10 tuples/s. From this description it seems like the job is re-reading from the beginning from the topic, and once you reach the latest record at the head

Re: Where to put "pre-start" logic and how to detect recovery?

2017-02-09 Thread Tzu-Li (Gordon) Tai
Hi Dmitry, I think currently the simplest way to do this is simply to add a program argument as a flag to whether or not the current run is from a savepoint (so you manually supply the flag whenever you’re starting the job from a savepoint), and check that flag in the main method. The main
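
A sketch of that flag check, assuming the flag is passed as a program argument named `fromSavepoint` (the flag and the `runPreStartLogic` hook are made-up names for illustration):

```java
public static void main(String[] args) throws Exception {
    ParameterTool params = ParameterTool.fromArgs(args);
    // supply --fromSavepoint true whenever you start the job from a savepoint
    boolean fromSavepoint = params.getBoolean("fromSavepoint", false);

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    if (!fromSavepoint) {
        runPreStartLogic();  // hypothetical "pre-start" hook, fresh submissions only
    }
    // ... build the topology ...
    env.execute("my-job");
}
```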

Re: Fink: KafkaProducer Data Loss

2017-02-02 Thread Tzu-Li (Gordon) Tai
Hi Ninad and Till, Thank you for looking into the issue! This is actually a bug. Till’s suggestion is correct: The producer holds a `pendingRecords` value that is incremented on each invoke() and decremented on each callback, used to check if the producer needs to sync on pending callbacks on
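
A simplified sketch of the mechanism described above (not the actual connector source; the lock object and helper names are illustrative):

```java
private final Object pendingRecordsLock = new Object();
private long pendingRecords = 0;

@Override
public void invoke(T value) {
    synchronized (pendingRecordsLock) {
        pendingRecords++;  // one more record in flight
    }
    producer.send(toProducerRecord(value), (metadata, exception) -> {
        synchronized (pendingRecordsLock) {
            pendingRecords--;  // callback fired, record acknowledged
            pendingRecordsLock.notifyAll();
        }
    });
}

// On a checkpoint barrier: block until every pending callback has fired, so
// records the checkpoint claims to cover cannot be lost on failure.
private void flushPendingRecords() throws InterruptedException {
    synchronized (pendingRecordsLock) {
        while (pendingRecords > 0) {
            pendingRecordsLock.wait();
        }
    }
}
```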

Re: Support for Auto scaling

2017-02-01 Thread Tzu-Li (Gordon) Tai
Hi Sandeep! While auto scaling jobs in Flink still isn’t possible, in Flink 1.2 you will be able to rescale jobs by stopping and restarting. This works by taking a savepoint of the job before stopping the job, and then redeploy the job with a higher / lower parallelism using the savepoint. Upon
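
A sketch of that stop-and-restart rescale with the CLI (job ID, savepoint path, and parallelism are placeholders):

```sh
bin/flink savepoint <jobId>                      # take a savepoint; note the returned path
bin/flink cancel <jobId>                         # stop the running job
bin/flink run -s <savepointPath> -p 8 myJob.jar  # redeploy with the new parallelism
```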

Re: Fault tolerance guarantees of Elasticsearch sink in flink-elasticsearch2?

2017-01-18 Thread Tzu-Li (Gordon) Tai
the affected sinks participate in checkpointing is enough of a solution - is there anything special about `SinkFunction[T]` extending `Checkpointed[S]`, or can I just implement it as I would for e.g. a mapping function? Thanks, Andrew On Jan 13, 2017, at 4:34 PM, Tzu-Li (Gordon) Tai <tz
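
For illustration, a sketch of a sink implementing the (pre-1.2) `Checkpointed` interface exactly as one would for any other function; the buffering logic here is an assumption:

```java
import java.util.ArrayList;
import org.apache.flink.streaming.api.checkpoint.Checkpointed;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class BufferingSink extends RichSinkFunction<String>
        implements Checkpointed<ArrayList<String>> {

    private ArrayList<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value) {
        buffer.add(value);
        // ... flush the buffer to the external system when appropriate ...
    }

    @Override
    public ArrayList<String> snapshotState(long checkpointId, long timestamp) {
        return buffer;  // pending elements become part of the checkpoint
    }

    @Override
    public void restoreState(ArrayList<String> state) {
        buffer = state;  // re-populate the buffer on recovery
    }
}
```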

Re: Kafka KeyedStream source

2017-01-15 Thread Tzu-Li (Gordon) Tai
and I could filter the data more efficiently because the data would not need to go over the network before this filter. Afterwards I can scale it up to 'many' tasks for the heavier processing that follows. As a concept: Could that be made to work? Niels  On Mon, Jan 9, 2017 at 9:14 AM, Tzu-Li (G

Re: Deduplicate messages from Kafka topic

2017-01-15 Thread Tzu-Li (Gordon) Tai
Hi, You’re correct that the FlinkKafkaProducer may emit duplicates to Kafka topics, as it currently only provides at-least-once guarantees. Note that this isn’t a restriction only in the FlinkKafkaProducer, but a general restriction for Kafka's message delivery. This can definitely be improved
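
One common way to deduplicate on the consuming side is keyed state, sketched below under the assumption that `Message` is a POJO with a unique `id` field:

```java
DataStream<Message> deduped = stream
    .keyBy("id")  // partition by the message's unique identifier
    .flatMap(new RichFlatMapFunction<Message, Message>() {
        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class, false));
        }

        @Override
        public void flatMap(Message msg, Collector<Message> out) throws Exception {
            if (!seen.value()) {  // only emit the first occurrence of each id
                seen.update(true);
                out.collect(msg);
            }
        }
    });
```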

Re: Fault tolerance guarantees of Elasticsearch sink in flink-elasticsearch2?

2017-01-13 Thread Tzu-Li (Gordon) Tai
Hi Andrew, Your observations are correct. Like you mentioned, the current problem circles around how we deal with the pending buffered requests with accordance to Flink’s checkpointing. I’ve filed a JIRA for this, as well as some thoughts for the solution in the description: 

Re: Kafka topic partition skewness causes watermark not being emitted

2017-01-13 Thread Tzu-Li (Gordon) Tai
Hi, This is expected behaviour due to how the per-partition watermarks are designed in the Kafka consumer, but I think it’s probably a good idea to handle idle partitions also when the Kafka consumer itself emits watermarks. I’ve filed a JIRA issue for this: 

Re: Kafka 09 consumer does not commit offsets

2017-01-09 Thread Tzu-Li (Gordon) Tai
Henri Heiskanen <henri.heiska...@gmail.com> wrote: Hi, We had the same problem when running 0.9 consumer against 0.10 Kafka. Upgrading Flink Kafka connector to 0.10 fixed our issue. Br, Henkka On Mon, Jan 9, 2017 at 5:39 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote: Hi, N

Re: Kafka 09 consumer does not commit offsets

2017-01-09 Thread Tzu-Li (Gordon) Tai
Hi, Not sure what might be going on here. I’m pretty certain that for FlinkKafkaConsumer09 when checkpointing is turned off, the internally used KafkaConsumer client will auto commit offsets back to Kafka at a default interval of 5000ms (the default value for “auto.commit.interval.ms”). Could
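
A sketch of the properties involved (broker address, group id, and topic are placeholders):

```java
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-group");
props.setProperty("enable.auto.commit", "true");       // used when checkpointing is off
props.setProperty("auto.commit.interval.ms", "5000");  // the 5000ms default mentioned above

FlinkKafkaConsumer09<String> consumer =
    new FlinkKafkaConsumer09<>("my-topic", new SimpleStringSchema(), props);
```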

Re: Kafka KeyedStream source

2017-01-09 Thread Tzu-Li (Gordon) Tai
Hi Niels, Thank you for bringing this up. I recall there was some previous discussion related to this before: [1]. I don’t think this is possible at the moment, mainly because of how the API is designed. On the other hand, a KeyedStream in Flink is basically just a DataStream with a hash

Re: Joining two kafka streams

2017-01-08 Thread Tzu-Li (Gordon) Tai
Hi Igor! What you can actually do is let a single FlinkKafkaConsumer consume from both topics, producing a single DataStream which you can keyBy afterwards. All versions of the FlinkKafkaConsumer support consuming multiple Kafka topics simultaneously. This is logically the same as union and
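
A sketch of a single consumer over two topics followed by a keyBy (topic names and the key are placeholders):

```java
FlinkKafkaConsumer09<String> consumer = new FlinkKafkaConsumer09<>(
    Arrays.asList("topicA", "topicB"),  // one consumer, multiple topics
    new SimpleStringSchema(), props);

KeyedStream<String, String> keyed = env
    .addSource(consumer)
    .keyBy(new KeySelector<String, String>() {
        @Override
        public String getKey(String value) {
            return value;  // replace with your actual join key
        }
    });
```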

Re: Serializers and Schemas

2016-12-07 Thread Tzu-Li (Gordon) Tai
Hi Matt, 1. There’s some in-progress work on wrapper util classes for Kafka de/serializers here [1] that allows Kafka de/serializers to be used with the Flink Kafka Consumers/Producers with minimal user overhead. The PR also has some proposed additions to the documentation for the wrappers. 2. I

Re: Partitioning operator state

2016-12-07 Thread Tzu-Li (Gordon) Tai
Hi Dominik, Do you mean how Flink redistributes an operator’s state when the parallelism of the operator is changed? If so, you can take a look at [1] and [2]. Cheers, Gordon [1] https://issues.apache.org/jira/browse/FLINK-3755 [2] 

Re: Flink Kafka producer with a topic per message

2016-12-07 Thread Tzu-Li (Gordon) Tai
Hi, The FlinkKafkaProducers currently support a per-record target topic through the user-provided serialization schema, which has a "getTargetTopic(T element)" method called per record. The partition the record will be sent to is decided through a custom KafkaPartitioner, which is also provided
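
A sketch of such a schema; the `Event` type and the routing rule are assumptions:

```java
public class RoutingSchema implements KeyedSerializationSchema<Event> {
    @Override
    public byte[] serializeKey(Event element) {
        return element.id.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public byte[] serializeValue(Event element) {
        return element.payload.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String getTargetTopic(Event element) {
        // the target topic is chosen per record here
        return element.isAlert ? "alerts" : "events";
    }
}
```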

Re: Why use Kafka after all?

2016-11-22 Thread Tzu-Li (Gordon) Tai
Hi Matt, Just to be clear, what I'm looking for is a way to serialize a POJO class for Kafka but also for Flink. I'm not sure the interfaces of both frameworks are compatible, but it seems they aren't. For Kafka (producer) I need a Serializer and a Deserializer class, and for Flink (consumer) a 

Re: flink-dist shading

2016-11-18 Thread Tzu-Li (Gordon) Tai
Hi Craig, I think the email wasn't sent to the ‘dev’ list, somehow. Have you tried this:

    mvn clean install -DskipTests
    # In Maven 3.3 the shading of flink-dist doesn't work properly in one run,
    # so we need to run mvn for flink-dist again.
    cd flink-dist
    mvn clean install -DskipTests

I agree that

Re: Subtask keeps on discovering new Kinesis shard when using Kinesalite

2016-11-16 Thread Tzu-Li (Gordon) Tai
Hi Philipp, Thanks for testing it. From your log and my own tests, I can confirm the problem is with Kinesalite not correctly mocking the official Kinesis behaviour for the `describeStream` API. There’s a PR for the fix here: https://github.com/apache/flink/pull/2822. With this change, shard

Re: Why use Kafka after all?

2016-11-15 Thread Tzu-Li (Gordon) Tai
Hi Matt, Here’s an example of writing a DeserializationSchema for your POJOs: [1]. As for simply writing messages from WebSocket to Kafka using a Flink job, while it is absolutely viable, I would not recommend it, mainly because you’d never know if you might need to temporarily shut down Flink
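
The linked example isn't reproduced here, but a minimal `DeserializationSchema` for a POJO could look like the following sketch, assuming JSON payloads, a Jackson `ObjectMapper`, and a hypothetical `Event` POJO:

```java
public class EventSchema implements DeserializationSchema<Event> {
    private transient ObjectMapper mapper;

    @Override
    public Event deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();  // created lazily; the field is transient
        }
        return mapper.readValue(message, Event.class);
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        return false;  // keep consuming indefinitely
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeExtractor.getForClass(Event.class);
    }
}
```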

Re: Subtask keeps on discovering new Kinesis shard when using Kinesalite

2016-11-15 Thread Tzu-Li (Gordon) Tai
Hi Philipp, When used against Kinesalite, can you tell if the connector is already reading data from the test shard before any of the shard discovery messages? If you have any spare time to test this, you can set a larger value for the `ConsumerConfigConstants.SHARD_DISCOVERY_INTERVAL_MILLIS`
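
A sketch of raising that interval (region, stream name, and value are placeholders):

```java
Properties config = new Properties();
config.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
config.setProperty(
    ConsumerConfigConstants.SHARD_DISCOVERY_INTERVAL_MILLIS, "60000");  // probe once a minute

DataStream<String> stream = env.addSource(
    new FlinkKinesisConsumer<>("test-stream", new SimpleStringSchema(), config));
```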

Re: ExpiredIteratorException when reading from a Kinesis stream

2016-11-03 Thread Tzu-Li (Gordon) Tai
the job from my last savepoint, it's consuming the stream fine again with no problems. Do you have any idea what might be causing this, or anything I should do to investigate further? Cheers, Josh On Wed, Oct 5, 2016 at 4:55 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote: Hi S

Re: A question regarding to the checkpoint mechanism

2016-10-16 Thread Tzu-Li (Gordon) Tai
processing? Thanks, Li On Oct 17, 2016, at 11:10 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote: Hi! No, the operator does not need to pause processing input records while the checkpointing of its state is in progress. The checkpointing of operator state is asynchronous. The op

Re: Presented Flink use case in Japan

2016-10-04 Thread Tzu-Li (Gordon) Tai
Really great to hear this! Cheers, Gordon On October 4, 2016 at 8:13:27 PM, Till Rohrmann (trohrm...@apache.org) wrote: It's always great to hear Flink success stories :-) Thanks for sharing it with the community.  I hope Flink helps you to solve even more problems. And don't hesitate to

Re: Flink Checkpoint runs slow for low load stream

2016-10-04 Thread Tzu-Li (Gordon) Tai
Hi, Helping out here: this is the PR for async Kafka offset committing - https://github.com/apache/flink/pull/2574. It has already been merged into the master and release-1.1 branches, so you can try out the changes now if you’d like. The change should also be included in the 1.1.3 release,

Re: Using Flink

2016-10-04 Thread Tzu-Li (Gordon) Tai
operators? If I need the state stored on previous operators, how can I fetch it? I don’t think this is possible. Fine, thanks. Thanks. On Mon, Oct 3, 2016 at 12:23 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote: Hi! - I'm currently running my flink program on 1.2 SNAPSHOT with kafka s

Re: Using Flink

2016-10-03 Thread Tzu-Li (Gordon) Tai
Hi! - I'm currently running my flink program on 1.2 SNAPSHOT with kafka source and I have checkpoint enabled. When I look at the consumer offsets in kafka it appears to be stagnant and there is a huge lag. But I can see my flink program is in pace with kafka source in JMX metrics and outputs.
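
For context: with checkpointing enabled, the 0.9 consumer commits offsets back to Kafka only when a checkpoint completes, so the externally visible offsets can lag behind actual progress. A sketch of the relevant setting:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);  // offsets become visible in Kafka roughly every 5s
```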

Re: FlinkKafkaConsumer and Kafka topic/partition change

2016-09-27 Thread Tzu-Li (Gordon) Tai
Hi! This is definitely a planned feature for the Kafka connectors, there’s a JIRA exactly for this [1]. We’re currently going through some blocking tasks to make this happen, I also hope to speed up things over there :) Your observation is correct that the Kafka consumer uses “assign()” instead

Re: setting the name of a subtask ?

2016-09-12 Thread Tzu-Li (Gordon) Tai
Hi! Yes, you can set custom operator names by calling `.name(…)` on DataStreams after a transformation. For example, `.addSource(…).map(...).name(…)`. This name will be used for visualization on the dashboard, and also for logging. Regards, Gordon On September 12, 2016 at 3:44:58 PM, Bart
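
A short sketch (the source and mapper are placeholders):

```java
DataStream<String> lowercased = env
    .addSource(new MySource()).name("my-source")
    .map(new MapFunction<String, String>() {
        @Override
        public String map(String s) {
            return s.toLowerCase();
        }
    }).name("lowercase-mapper");
// both names show up on the dashboard and in the logs
```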

Re: Kafka SimpleStringConsumer NPE

2016-09-04 Thread Tzu-Li (Gordon) Tai
Hi David, Is it possible that your Kafka installation is an older version than 0.9? Or you may have used a different Kafka client major version in your job jar's dependency? This seems like an odd incompatible protocol with the Kafka broker to me, as the client in the Kafka consumer is reading

Re: Kinesis connector - Iterator expired exception

2016-08-26 Thread Tzu-Li (Gordon) Tai
57 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote: Hi Josh, Thank you for reporting this, I’m looking into it. There were some major changes to the Kinesis connector after mid June, but the changes don’t seem to be related to the iterator timeout, so it may be a bug that had always

Re: Kinesis connector - Iterator expired exception

2016-08-26 Thread Tzu-Li (Gordon) Tai
Hi Josh, Thank you for reporting this, I’m looking into it. There were some major changes to the Kinesis connector after mid June, but the changes don’t seem to be related to the iterator timeout, so it may be a bug that had always been there. I’m not sure yet if it may be related, but may I

Re: Threading Model for Kinesis

2016-08-23 Thread Tzu-Li (Gordon) Tai
:) Regards, Gordon On August 23, 2016 at 9:31:58 PM, Tzu-Li (Gordon) Tai (tzuli...@gmail.com) wrote: Slight misunderstanding here. The one thread per Kafka broker happens *after* the assignment of Kafka partitions to the source instances. So, with a total of 10 partitions and 10 source instances

Re: Threading Model for Kinesis

2016-08-23 Thread Tzu-Li (Gordon) Tai
stamps receiving and sending data. If there is only one source instance per broker how does that happen? Thanks, Sameer On Tue, Aug 23, 2016 at 7:17 AM, Tzu-Li (Gordon) Tai <tzuli...@gmail.com> wrote: > Hi! > > Kinesis shards should be ideally evenly assigned to the source insta

Re: Default timestamps for Event Time when no Watermark Assigner used?

2016-08-23 Thread Tzu-Li (Gordon) Tai
on the source stream: assignTimestampsAndWatermarks(new IngestionTimeExtractor<>()) Sameer On Tue, Aug 23, 2016 at 7:29 AM, Tzu-Li (Gordon) Tai <tzuli...@gmail.com> wrote: > Hi, > > For the Kinesis consumer, when you use Event Time but do not explicitly > assign timestamps,

Re: Default timestamps for Event Time when no Watermark Assigner used?

2016-08-23 Thread Tzu-Li (Gordon) Tai
Hi, For the Kinesis consumer, when you use Event Time but do not explicitly assign timestamps, the Kinesis server-side timestamp (the time which Kinesis received the record) is attached to the record as default, not Flink’s ingestion time. Does this answer your question? Regards, Gordon On
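
A sketch of that setup: enable event time and consume from Kinesis without assigning timestamps, so the server-side arrival timestamps are used (stream name and config are placeholders):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// No assignTimestampsAndWatermarks() call: records keep the timestamp at
// which Kinesis received them, as described above.
DataStream<String> stream = env.addSource(
    new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), config));
```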

Re: Threading Model for Kinesis

2016-08-23 Thread Tzu-Li (Gordon) Tai
Hi! Kinesis shards should be ideally evenly assigned to the source instances. So, with your example of source parallelism of 10 and 20 shards, each source instance will have 2 shards and will have 2 threads consuming them (therefore, not in round robin). For the Kafka consumer, in the source

Re: How to get latest offsets with FlinkKafkaConsumer

2016-08-05 Thread Tzu-Li (Gordon) Tai
Hi, Please also note that the “auto.offset.reset” property is only respected when there are no offsets under the same consumer group in ZK. So, currently, in order to make sure you read from the latest / earliest offsets every time you restart your Flink application, you’d have to use a unique
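
A sketch of that workaround, using a random suffix so each run gets a fresh consumer group (the UUID is purely for illustration):

```java
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-app-" + UUID.randomUUID());  // no prior offsets in ZK
// "latest"/"earliest" for the 0.9 consumer; "largest"/"smallest" for 0.8
props.setProperty("auto.offset.reset", "latest");
```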

Re: API request to submit job takes over 1hr

2016-06-13 Thread Tzu-Li (Gordon) Tai
Hi Shannon, Thanks for your investigation on the issue and the JIRA. There's actually a previous JIRA on this problem already: https://issues.apache.org/jira/browse/FLINK-4023. Would you be ok with tracking this issue on FLINK-4023 and closing FLINK-4069 as a duplicate? As you can see, I've

Re: S3 as streaming source

2016-06-03 Thread Tzu-Li (Gordon) Tai
Hi Soumya, No, currently there is no Flink standard supported S3 streaming source. As far as I know, there isn't one out in the public yet either. The community is open to submissions for new connectors, so if you happen to be working on one for S3, you can file a JIRA to let us know. Also,

Re: Local Cluster have problem with connect to elasticsearch

2016-05-11 Thread Tzu-Li (Gordon) Tai
Hi Rafal, From your description, it seems like Flink is complaining because it cannot access the Elasticsearch API related dependencies as well. You'd also have to include the following into your Maven build, under : org.elasticsearch elasticsearch 2.3.2 jar false
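
The archive flattened the XML of that snippet; a hedged reconstruction of the dependency it appears to list (the enclosing POM section and the trailing "false" flag are not recoverable from the preview):

```xml
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>2.3.2</version>
  <type>jar</type>
</dependency>
```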

Re: Checkpoint for exact-once proccessing

2016-01-13 Thread Tzu-Li (Gordon) Tai
Hi Francis, Part of every complete snapshot is the record positions associated with the barrier that triggered the checkpointing of this snapshot. The snapshot is completed only when all the records within the checkpoint reach the sink. When a topology fails, all the operators' state will

Re: Flink DataStream and KeyBy

2016-01-13 Thread Tzu-Li (Gordon) Tai
Hi Saiph, In Flink, the key for keyBy() can be provided in different ways: https://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#specifying-keys (the doc is for DataSet API, but specifying keys is basically the same for DataStream and DataSet). As described in the
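
Sketches of those keying styles (the `Event` POJO and its `userId` field are assumptions):

```java
stream.keyBy(0);         // by tuple field position
stream.keyBy("userId");  // by POJO field name
stream.keyBy(new KeySelector<Event, String>() {  // by explicit key selector
    @Override
    public String getKey(Event event) {
        return event.userId;
    }
});
```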

Re: Flink with Yarn

2016-01-11 Thread Tzu-Li (Gordon) Tai
Hi Sourav, To add a bit more clarification on your last comment: in the sense of "where" the driver program is executed, yes, the Flink driver program runs in a mode similar to Spark's YARN-client. However, the "role" of the driver program and the work it is responsible for is quite

Re: Kinesis Connector

2016-01-08 Thread Tzu-Li (Gordon) Tai
Hi Giancarlo, Since it has been a while since the last post and there hasn't been a JIRA ticket opened for the Kinesis connector yet, I'm wondering how you are doing on the Kinesis connector and hope to help out with this feature :) I've opened a JIRA
