Re: Cannot cancel job with savepoint due to timeout

2017-02-01 Thread Bruno Aranda
I conclude that the "akka.client.timeout" setting is what affects this. It defaults to 60 seconds. I'm not sure why this setting is not documented, as well as many other "akka.*" settings - maybe there are some good reasons behind it. Regards, Yury …
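If the snippet above is right about `akka.client.timeout` controlling this, raising it in `flink-conf.yaml` before cancelling with a savepoint would look like the fragment below (the key is undocumented, as the message notes, so treat the exact name and value as assumptions):

```yaml
# flink-conf.yaml - raise the client-side Akka timeout from its 60 s default
# so that "cancel with savepoint" has time to complete before the client
# gives up. Key name as discussed in the thread; not in the official
# configuration reference.
akka.client.timeout: 600 s
```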

Re: Can't run flink on yarn on version 1.2.0

2017-02-17 Thread Bruno Aranda
Hi Howard, We run Flink 1.2 on Yarn without issues. Sorry, I don't have a specific solution, but are you sure you don't have some sort of Flink version mix? In your logs I can see: "The configuration directory ('/home/software/flink-1.1.4/conf') contains both LOG4J and Logback configuration files."
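That warning goes away once the conf directory contains only one logging backend. A minimal sketch of the cleanup, using a throwaway directory here (the real one in the thread is `/home/software/flink-1.1.4/conf`):

```shell
# Simulate a conf dir that ships both LOG4J and Logback configuration files
mkdir -p /tmp/flink-conf-demo
touch /tmp/flink-conf-demo/log4j.properties /tmp/flink-conf-demo/logback.xml

# Keep a single logging backend (log4j here) so Flink stops warning
rm /tmp/flink-conf-demo/logback.xml
ls /tmp/flink-conf-demo
```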

Re: Rolling sink parquet/Avro output

2017-01-18 Thread Bruno Aranda
Sorry, something went wrong with the code for the Writer. Here it is again: import org.apache.avro.Schema import org.apache.flink.streaming.connectors.fs.Writer import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.parquet.avro.AvroParquetWriter import …
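Based on the imports visible in the truncated snippet, the Writer presumably wrapped `AvroParquetWriter`. A hedged sketch of such a Writer follows; everything beyond the Flink `Writer` interface (class name, position handling) is an assumption, and the `AvroParquetWriter` builder API differs between Parquet versions:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.streaming.connectors.fs.Writer
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter

// Sketch: a rolling-sink Writer that delegates to Parquet's Avro writer.
class AvroParquetSinkWriter(schemaString: String) extends Writer[GenericRecord] {
  @transient private var writer: ParquetWriter[GenericRecord] = _

  override def open(fs: FileSystem, path: Path): Unit = {
    val schema = new Schema.Parser().parse(schemaString)
    writer = AvroParquetWriter
      .builder[GenericRecord](path)
      .withSchema(schema)
      .build()
  }

  override def write(element: GenericRecord): Unit = writer.write(element)

  override def flush(): Long = getPos // Parquet buffers internally

  override def getPos: Long = 0L // position tracking omitted in this sketch

  override def close(): Unit = if (writer != null) writer.close()

  override def duplicate(): Writer[GenericRecord] =
    new AvroParquetSinkWriter(schemaString)
}
```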

Re: Telling if a job has caught up with Kafka

2017-03-20 Thread Bruno Aranda
Hi, Thanks! The proposal sounds very good to us too. Bruno. On Sun, 19 Mar 2017 at 10:57 Florian König wrote: "Thanks Gordon for the detailed explanation! That makes sense and explains the expected behaviour. The JIRA for the new metric also sounds very good."

Re: Telling if a job has caught up with Kafka

2017-03-17 Thread Bruno Aranda
Hi, We are interested in this too. So far we flag the records with timestamps at different points of the pipeline and use metrics gauges to measure latency between the different components, but it would be good to know if there is something more specific to Kafka that we can do out of the box in …
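The notion of "caught up" discussed in this thread can be sketched independently of Flink: compare the consumer's current offset with the partition's log-end offset. The function name and threshold below are illustrative, not a Flink or Kafka API:

```scala
// Sketch: a job is "caught up" on a partition when its current offset is
// within `threshold` records of the partition's latest (log-end) offset.
def isCaughtUp(currentOffset: Long, logEndOffset: Long, threshold: Long = 100): Boolean =
  (logEndOffset - currentOffset) <= threshold

// partition -> (current offset, log-end offset), values made up for the example
val offsets = Map(0 -> (9990L, 10000L), 1 -> (5000L, 10000L))
val caughtUp = offsets.map { case (p, (cur, end)) => p -> isCaughtUp(cur, end) }
// partition 0 lags by 10 records (caught up); partition 1 lags by 5000 (not)
```

Newer Kafka connector versions expose a records-lag metric that makes this comparison unnecessary, which is what the proposal in the thread was about.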

Any good ideas for online/offline detection of devices that send events?

2017-03-03 Thread Bruno Aranda
Hi all, We are trying to write an online/offline detector for devices that keep streaming data through Flink. We know roughly how often to expect events from those devices, and we want to be able to detect when any of them stops sending events (goes offline) or starts sending again (comes back online).
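One way to do this in Flink, which came up later in the thread, is a `ProcessFunction` on a stream keyed by device that re-arms a timer on every event and emits an "offline" marker when the timer fires. A sketch, with the timeout value, event type, and output format as assumptions:

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

case class DeviceEvent(deviceId: String, timestamp: Long)

// Sketch: applied to a stream keyed by deviceId. If no event arrives for a
// device within `timeoutMs`, the timer fires and we report it offline.
class OfflineDetector(timeoutMs: Long)
    extends ProcessFunction[DeviceEvent, String] {

  override def processElement(
      event: DeviceEvent,
      ctx: ProcessFunction[DeviceEvent, String]#Context,
      out: Collector[String]): Unit = {
    // (Re)arm a processing-time timer for this key; a full implementation
    // would also delete the previously registered timer, tracked in keyed state.
    ctx.timerService().registerProcessingTimeTimer(
      ctx.timerService().currentProcessingTime() + timeoutMs)
  }

  override def onTimer(
      timestamp: Long,
      ctx: ProcessFunction[DeviceEvent, String]#OnTimerContext,
      out: Collector[String]): Unit =
    out.collect(s"device offline (no events for $timeoutMs ms)")
}
```

Detecting "back online" is then a matter of emitting a marker in `processElement` when the previous state for the key was offline.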

Re: AWS exception serialization problem

2017-03-08 Thread Bruno Aranda
Hi, We have seen something similar in Flink 1.2. We have an operation that parses some JSON, and when it fails to parse it, we can see the ClassNotFoundException for the relevant exception (in our case JsResultException from the play-json library). The library is indeed in the shaded JAR, …

Re: AWS exception serialization problem

2017-03-08 Thread Bruno Aranda
Hi Stephan, we are running Flink 1.2.0 on Yarn (AWS EMR cluster). On Wed, 8 Mar 2017, 21:41 Stephan Ewen, <se...@apache.org> wrote: "@Bruno: How are you running Flink? On yarn, standalone, mesos, docker?" On Wed, Mar 8, 2017 at 2:13 PM, Bruno Aranda <brunoara...@gmail.co…

Re: Any good ideas for online/offline detection of devices that send events?

2017-03-07 Thread Bruno Aranda
…The CEP parts in the slides in 2. also provide some good examples of timeout detection using CEP. Hope this helps! Cheers, Gordon. On March 4, 2017 at 1:27:51 AM, Bruno Aranda (bara...@apache.org) wrote: "Hi all, We are trying to write an o…"

Flink Graphite Reporter stops reporting via TCP after a network issue

2017-05-05 Thread Bruno Aranda
Hi, We are using the Graphite reporter from Flink 1.2.0 to send the metrics via TCP. Due to our network configuration we cannot use UDP at the moment. We have observed that if there is any problem with Graphite or the network (basically, the TCP connection times out or something), the metrics …
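For reference, the reporter discussed here is wired up in `flink-conf.yaml` roughly as below; the host is a placeholder and the reporter name `grph` is our own choice:

```yaml
metrics.reporters: grph
metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.grph.host: graphite.example.com   # placeholder
metrics.reporter.grph.port: 2003
metrics.reporter.grph.protocol: TCP                # UDP not usable in this network
```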

Re: Flink Graphite Reporter stops reporting via TCP after a network issue

2017-05-05 Thread Bruno Aranda
…exception or not. Could you check the log for warning statements from the MetricRegistry? Regards, Chesnay. On 05.05.2017 13:26, Bruno Aranda wrote: "Hi, We are using the Graphite reporter from Flink 1.2.0 to send the metrics vi…"

Job Manager killed by Kubernetes during recovery

2018-08-18 Thread Bruno Aranda
Hi, I am experiencing an issue when a job manager is trying to recover using an HA setup. When the job manager starts again and tries to resume from the last checkpoints, it gets killed by Kubernetes (I guess), since I can see the following in the logs while the jobs are being deployed: INFO …

Re: Job Manager killed by Kubernetes during recovery

2018-08-22 Thread Bruno Aranda
…and your K8s deployment specification would be helpful. If you have some memory limits specified, these would also be interesting to know. Cheers, Till. On Sun, Aug 19, 2018 at 2:43 PM vino yang wrote: "Hi Bruno, Ping Till for you, he may g…"

Re: IndexOutOfBoundsException on deserialization after updating to 1.6.1

2018-10-17 Thread Bruno Aranda
…an Option[Boolean], and the failure seems not to happen anymore. We may continue with the Boolean for now; I guess, though, this was not a problem in an earlier Flink version - possibly a Kryo change? Cheers, Bruno. On Wed, 17 Oct 2018 at 15:40 aitozi wrote: "Hi, Bruno Aranda …"

IndexOutOfBoundsException on deserialization after updating to 1.6.1

2018-10-17 Thread Bruno Aranda
Hi, We are trying to update from 1.3.2 to 1.6.1, but one of our jobs keeps throwing an exception during deserialization: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:657) at java.util.ArrayList.get(ArrayList.java:433) at …

Rich variant for Async IO in Scala

2018-11-08 Thread Bruno Aranda
Hi, I see that the AsyncFunction for Scala does not seem to have a rich variant like the Java one. Is there a particular reason for this? Is there any workaround? Thanks! Bruno
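As noted in the reply that follows, the Java `RichAsyncFunction` can be used directly from Scala as a workaround. A sketch, where the client and the enrichment logic are placeholders:

```scala
import java.util.concurrent.CompletableFuture
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.async.{ResultFuture, RichAsyncFunction}
import scala.collection.JavaConverters._

// Sketch: using the Java RichAsyncFunction from Scala, which provides
// open()/close() and the RuntimeContext that the Scala AsyncFunction
// trait lacked at the time of this thread.
class EnrichAsync extends RichAsyncFunction[String, String] {
  @transient private var client: AnyRef = _ // placeholder for a real async client

  override def open(parameters: Configuration): Unit = {
    client = new Object // initialize the real client here
  }

  override def asyncInvoke(input: String, resultFuture: ResultFuture[String]): Unit =
    CompletableFuture
      .supplyAsync(() => s"enriched:$input") // stand-in for the real async call
      .thenAccept(result => resultFuture.complete(List(result).asJava))
}
```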

Re: Rich variant for Async IO in Scala

2018-11-13 Thread Bruno Aranda
…`RichFilterFunction` are also shared between both APIs. Is there anything that blocks you from using it? Regards, Timo. On 09.11.18 at 01:38, Bruno Aranda wrote: "Hi, I see that the AsyncFunction for Scala does not seem to have a rich …"

Can't get the FlinkKinesisProducer to work against Kinesalite for tests

2018-09-26 Thread Bruno Aranda
Hi, We have started to use Kinesis with Flink, and we need to be able to test when a Flink job writes to Kinesis. For that, we use a docker image with Kinesalite. To configure the producer, we do as explained in the docs [1]. However, if we use this code, the job submission is going to …
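For context, the connector docs referenced as [1] configure the producer against a local Kinesalite roughly like this; the port and dummy credentials match a typical Kinesalite docker setup and are placeholders:

```scala
import java.util.Properties
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants

// Sketch: producer configuration aimed at a local Kinesalite container.
// KinesisEndpoint / KinesisPort / VerifyCertificate are Amazon KPL settings
// forwarded by FlinkKinesisProducer, as described in the connector docs.
val producerConfig = new Properties()
producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
producerConfig.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, "fakeAccessKey")
producerConfig.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, "fakeSecretKey")
producerConfig.put("KinesisEndpoint", "localhost")
producerConfig.put("KinesisPort", "4567")
producerConfig.put("VerifyCertificate", "false")
```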

Re: Can't get the FlinkKinesisProducer to work against Kinesalite for tests

2018-09-27 Thread Bruno Aranda
…image. If you want the Amazon KPL to work fine, it will need to be one of the Debian images running in Docker. Hope this saves someone all the days we have spent looking at it :) Cheers, Bruno. On Wed, 26 Sep 2018 at 14:59 Bruno Aranda wrote: "Hi, We have started to use Kinesis …"

Re: Subtask much slower than the others when creating checkpoints

2019-01-15 Thread Bruno Aranda
…ate - assuming that the load is kind of balanced between partitions. Best, Stefan. On 15. Jan 2019, at 11:42, Bruno Aranda wrote: "Hi, Just an update from our side. We couldn't find anything specific in the logs, and the problem is not easily reproducibl…"

Re: Subtask much slower than the others when creating checkpoints

2019-01-15 Thread Bruno Aranda
…uses the same key strategy as the Kafka partitions; I've tried to use murmur2 for hashing, but it didn't help either. The subtask that seems to be causing problems appears to be a CoProcessFunction. I am going to debug Flink, but since I'm relatively new to it, it might take a whi…

Subtask much slower than the others when creating checkpoints

2019-01-08 Thread Bruno Aranda
Hi, We are using Flink 1.6.1 at the moment, and we have a streaming job configured to create a checkpoint every 10 seconds. Looking at the checkpointing times in the UI, we can see that one subtask is much slower creating the checkpoint, at least in its "End to End Duration", and this seems caused by a …
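For reference, the 10-second interval in question is set on the execution environment like this; the checkpoint timeout value below is illustrative, not from the thread:

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Checkpoint every 10 seconds, as in the job described above.
env.enableCheckpointing(10000L)
// Give a slow subtask room before checkpoints start timing out
// (illustrative value).
env.getCheckpointConfig.setCheckpointTimeout(60000L)
```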

Re: StreamingFileSink seems to be overwriting existing part files

2019-03-29 Thread Bruno Aranda
…ally the reason why, contrary to the BucketingSink, the StreamingFileSink relies on Flink's own state to determine the "next" part counter. Cheers, Kostas. On Fri, Mar 29, 2019 at 4:22 PM Bruno Aranda wrote: "Hi, One of the mai…"

StreamingFileSink seems to be overwriting existing part files

2019-03-29 Thread Bruno Aranda
Hi, One of the main reasons we moved to version 1.7 (and 1.7.2 in particular) was because of the possibility of using a StreamingFileSink with S3. We've configured a StreamingFileSink to use a DateTimeBucketAssigner to bucket by day. It's got a parallelism of 1 and is writing to S3 from an EMR …
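The sink described above would be assembled roughly as follows; the bucket path and the row-format/encoder choice are assumptions, since the message does not show the code:

```scala
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner

// Sketch: a row-format StreamingFileSink bucketing by day into S3.
val sink: StreamingFileSink[String] = StreamingFileSink
  .forRowFormat(new Path("s3://my-bucket/output"), // placeholder bucket
                new SimpleStringEncoder[String]("UTF-8"))
  .withBucketAssigner(new DateTimeBucketAssigner[String]("yyyy-MM-dd"))
  .build()

// Attached with parallelism 1, as in the job described above:
// stream.addSink(sink).setParallelism(1)
```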

1.7.2 requires several attempts to start in AWS EMR's Yarn

2019-03-26 Thread Bruno Aranda
Hi, I wrote recently about our problems with 1.7.2, for which we still haven't found a solution, and the cluster is very unstable. I am now trying to point to a different problem that may be related somehow and that we don't understand. When we restart a Flink Session in Yarn, we see it takes …

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-21 Thread Bruno Aranda
…Thanks, Bruno. On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin wrote: "Hi Bruno, could you also share the job master logs? Thanks, Andrey." On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda wrote: "Hi, This is causing seriou…"

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-04-08 Thread Bruno Aranda
…for not taking a faster look at your problem, and for the inconveniences with the upload. Cheers, Till. On Thu, Mar 21, 2019 at 4:30 PM Bruno Aranda wrote: "Ok, here it goes: https://transfer.sh/12qMre/jobmanager-debug.log"

Re: StreamingFileSink on EMR

2019-02-26 Thread Bruno Aranda
Hi, That jar should exist for all the 1.7 versions, but I was replacing the libs of the Flink provided by AWS EMR (1.7.0) with the more recent ones. But you could download the 1.7.0 distribution and copy the flink-s3-fs-hadoop-1.7.0.jar from there into the /usr/lib/flink/lib folder. But knowing …

Re: StreamingFileSink on EMR

2019-02-26 Thread Bruno Aranda
Hi, I am having the same issue, but it is related to what Kostas is pointing out. I was trying to stream to the "s3" scheme and not "hdfs", and then getting that exception. I have realised that somehow I need to reach the S3RecoverableWriter, and found out it is in a different library …

Re: StreamingFileSink on EMR

2019-02-26 Thread Bruno Aranda
Hey, Got it working; basically you need to add the flink-s3-fs-hadoop-1.7.2.jar library from the /opt folder of the Flink distribution into /usr/lib/flink/lib. That has done the trick for me. Cheers, Bruno. On Tue, 26 Feb 2019 at 16:28, kb wrote: "Hi Bruno, Thanks for verifying. We …"
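As commands on the EMR master, the fix from this message looks roughly like the sketch below; the assumption is that a 1.7.2 distribution tarball is already available locally, since mirror URLs change:

```shell
# On the EMR master: take flink-s3-fs-hadoop from the distribution's /opt
# folder and drop it into EMR's Flink lib directory.
cd /tmp
# assumes flink-1.7.2-bin-scala_2.11.tgz was downloaded here beforehand
tar xzf flink-1.7.2-bin-scala_2.11.tgz
sudo cp flink-1.7.2/opt/flink-s3-fs-hadoop-1.7.2.jar /usr/lib/flink/lib/
# restart the Flink session for the change to take effect
```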

Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-19 Thread Bruno Aranda
Hi, This is causing serious instability and data loss in our production environment. Any help figuring out what's going on here would be really appreciated. We recently updated our two EMR clusters from flink 1.6.1 to flink 1.7.2 (running on AWS EMR). The road to the upgrade was fairly rocky,

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-21 Thread Bruno Aranda
…behaviour. The community intends to add support for ranges of how many TMs must be active at any given time [1]. [1] https://issues.apache.org/jira/browse/FLINK-11078 Cheers, Till. On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda wrote: …

Re: Flink and S3 AWS keys rotation

2019-02-07 Thread Bruno Aranda
Hi, You can give specific IAM instance roles to the instances running Flink. This way you never expose access keys anywhere. As the docs say, that is the recommended way (and not just for Flink, but for any service you want to use; never set it up with AWS credentials in config). IAM will …

Re: Adding metadata to the jar

2019-04-08 Thread Bruno Aranda
Hi Avi, I don't know if there are better ways, but we store the version of the running job and other metadata as part of the "User configuration" of the job, so it shows in the UI under the job's Configuration tab. To do so, when we create the job: val buildInfo = new …
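The truncated snippet above presumably builds something like the following. A sketch using `ParameterTool` (which implements `GlobalJobParameters`); the key names and values are our own convention, not a Flink API:

```scala
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Build-time metadata; shows up under the job's Configuration tab in the UI.
val buildInfo = ParameterTool.fromMap(Map(
  "buildVersion" -> "1.2.3",                // placeholder values
  "buildTime"    -> "2019-04-08T12:00:00Z"
).asJava)

env.getConfig.setGlobalJobParameters(buildInfo)
```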

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-04-09 Thread Bruno Aranda
…just died and, hence, cannot be connected to anymore? [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#zookeeper-based-ha-mode Cheers, Till. On Mon, Apr 8, 2019 at 12:33 PM Bruno Aranda wrote: "Hi Till, …"