Flink Forward SF 2018 Videos

2018-05-10 Thread Rafi Aroch
Hi, Are there any plans to upload the videos to the Flink Forward YouTube channel? There are so many interesting talks I would like to watch and filling out a form for each and every video makes it more difficult... Thanks, Rafi

FlinkKinesisProducer weird behaviour

2018-05-24 Thread Rafi Aroch
Hi, We're using Kinesis as our input & output of a job and experiencing a parsing exception while reading from the output stream. All streams contain 1 shard only. While investigating the issue I noticed a weird behaviour where records get a PartitionKey I did not assign and the record Data is

Re: FlinkKinesisProducer weird behaviour

2018-05-24 Thread Rafi Aroch
> Piotrek > > On 24 May 2018, at 09:40, Rafi Aroch <rafi.ar...@gmail.com> wrote: > > Hi, > > We're using Kinesis as our input & output of a job and experiencing > parsing exception while reading from the output stream. All streams contain > 1 shard only.

Re: Passing records between two jobs

2018-06-20 Thread Rafi Aroch
Hi Avihai, The problem is that every message queuing sink only provides an at-least-once guarantee. From what I see, a possible messaging queue which guarantees exactly-once is Kafka 0.11, while using the Kafka transactional messaging
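
A minimal sketch of the transactional setup (assuming flink-connector-kafka-0.11 and an existing DataStream<String> named stream; broker and topic names are placeholders):

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
    import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "broker:9092");
    // must not exceed the broker-side transaction.max.timeout.ms
    props.setProperty("transaction.timeout.ms", "600000");

    FlinkKafkaProducer011<String> producer = new FlinkKafkaProducer011<>(
            "output-topic",
            new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
            props,
            FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);
    stream.addSink(producer);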

Job Records In/Out metrics clarification

2018-10-15 Thread Rafi Aroch
Hi, Below are the In/Out metrics as they appear in the Flink UI. I was wondering what the possible reasons are that the "Records sent" of one operator is different than the "Records received" of the next one. I would expect to see the same number... * We're using Flink 1.5.0

Re: How do I initialize the window state on first run?

2018-10-17 Thread Rafi Aroch
Hi Jiayi, This topic has been discussed by others; take a look here for some options by Lyft: https://youtu.be/WdMcyN5QZZQ Rafi On Fri, Oct 12, 2018, 16:51 bupt_ljy wrote: > Yes…that’s an option, but it’ll be very complicated because of our storage > and business. > > Now I’m trying to write

Re: Job Records In/Out metrics clarification

2018-10-15 Thread Rafi Aroch
Awesome, thanks! On Mon, Oct 15, 2018, 17:36 Chesnay Schepler wrote: > There is a known issue in 1.5.0 with the numRecordsIn/Out metrics for > chained tasks: https://issues.apache.org/jira/browse/FLINK-9530 > > This has been fixed in 1.5.1. > > On 15.10.2018 14:37, Rafi A

Re: Flink Kafka to BucketingSink to S3 - S3Exception

2018-10-28 Thread Rafi Aroch
Hi, I'm also experiencing this with Flink 1.5.2. This is probably related to BucketingSink not working properly with S3 as the filesystem, because of S3's eventual consistency. I see that https://issues.apache.org/jira/browse/FLINK-9752 will be part of the 1.6.2 release. It might help if you use

BucketingSink capabilities for DataSet API

2018-10-25 Thread Rafi Aroch
Hi, I'm writing a batch job which reads Parquet, does some aggregations and writes back as Parquet files. I would like the output to be partitioned by year, month, day by event time, similar to the functionality of the BucketingSink. I was able to achieve the reading/writing to/from Parquet by
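
A rough sketch of that Hadoop-compatibility write path (assuming a parquet-avro version where AvroParquetOutputFormat is generic, and GenericRecord data; note this writes a single directory, not the date-partitioned layout asked about):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.parquet.avro.AvroParquetOutputFormat;

    static void writeParquet(DataSet<Tuple2<Void, GenericRecord>> records,
                             Schema schema, String outputPath) throws Exception {
        Job job = Job.getInstance();
        AvroParquetOutputFormat.setSchema(job, schema);
        FileOutputFormat.setOutputPath(job, new Path(outputPath));
        // Hadoop OutputFormats are exposed to the DataSet API as Tuple2<K, V>
        records.output(new HadoopOutputFormat<Void, GenericRecord>(
                new AvroParquetOutputFormat<GenericRecord>(), job));
    }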

Re: FLINK Kinesis Connector Error - ProvisionedThroughputExceededException

2018-11-12 Thread Rafi Aroch
Hi Steve, We've encountered this also. We have way more than enough shards, but were still getting exceptions. We think we know the reason and would love for someone to confirm/reject it. What we suspect is happening is as follows: The KPL's RateLimit parameter is tracking the amount of
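
If that is indeed the cause, the KPL's RateLimit (a percentage of a shard's throughput, 150 by default) can be lowered through the producer Properties — a sketch, with placeholder region and stream names:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;

    Properties producerConfig = new Properties();
    producerConfig.put("aws.region", "us-east-1");
    // throttle the KPL below the shard limit to leave headroom for other writers
    producerConfig.put("RateLimit", "100");

    FlinkKinesisProducer<String> producer =
            new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
    producer.setDefaultStream("output-stream");
    producer.setDefaultPartition("0");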

Re: Stream in loop and not getting to sink (Parquet writer )

2018-11-28 Thread Rafi Aroch
Hi Avi, I can't see the part where you use assignTimestampsAndWatermarks. If this part is not set properly, it's possible that watermarks are not sent and nothing will be written to your Sink. See here for more details:
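
For example, with the pre-1.11 DataStream API (a sketch; MyEvent and its timestamp accessor are placeholders):

    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    stream.assignTimestampsAndWatermarks(
            // emit watermarks that lag 10s behind the highest timestamp seen
            new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
                @Override
                public long extractTimestamp(MyEvent element) {
                    return element.getEventTimeMillis(); // placeholder accessor
                }
            });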

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-27 Thread Rafi Aroch
t it will do the job. > > Cheers, > Kostas > > > > On Mon, Mar 25, 2019 at 9:49 AM Rafi Aroch wrote: > >> Hi Kostas, >> >> Thank you. >> I'm currently testing my job against a small file, so it's finishing >> before the checkpointing starts.

Support for Parquet schema evolution (a.k.a mergeSchema)

2019-03-27 Thread Rafi Aroch
Hi, In my job I want to read Parquet files from buckets by a date range. For that I'm using the Hadoop Compatibility features to use ProtoParquetInputFormat. If, in the processed date range, the Parquet schema underwent changes (even valid ones), the job fails with

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-25 Thread Rafi Aroch
et, for example, will not write the footer that is needed to be able > to properly read the file. > > Kostas > > > On Thu, Mar 21, 2019 at 8:03 PM Rafi Aroch wrote: > >> Hi Kostas, >> >> Yes I have. >> >> Rafi >> >> On Thu, Mar 21, 2019, 20:

Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-20 Thread Rafi Aroch
Hi, I'm trying to stream events in Protobuf format into a Parquet file. I looked into both streaming-file options: BucketingSink & StreamingFileSink. I first tried using the newer StreamingFileSink with the forBulkFormat API. I noticed there's currently support only for the Avro format with
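
For the Avro route that is supported, the wiring looks roughly like this (a sketch; for Protobuf you would currently need a custom BulkWriter.Factory or a Protobuf-to-Avro conversion first — schema and path are placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    StreamingFileSink<GenericRecord> sink = StreamingFileSink
            .forBulkFormat(new Path("s3://bucket/parquet-out"),
                           ParquetAvroWriters.forGenericRecord(schema))
            .build();
    stream.addSink(sink);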

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Rafi Aroch
t shouldn’t you be just reading committed files and >> ignore in-progress? Maybe Kostas could add more insight to this topic. >> >> Piotr Nowojski >> >> On 20 Mar 2019, at 12:23, Rafi Aroch wrote: >> >> Hi, >> >> I'm trying to stream events in Pror

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Rafi Aroch
Hi Kostas, Yes I have. Rafi On Thu, Mar 21, 2019, 20:47 Kostas Kloudas wrote: > Hi Rafi, > > Have you enabled checkpointing for your job? > > Cheers, > Kostas > > On Thu, Mar 21, 2019 at 5:18 PM Rafi Aroch wrote: > >> Hi Piotr and Kostas, >> >> Tha
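
For reference, enabling checkpointing is a one-liner on the environment — without it, StreamingFileSink never finalizes its in-progress part files:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000); // checkpoint (and commit part files) every 60 seconds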

Replacing AWS credentials provider for S3 filesystem

2019-05-26 Thread Rafi Aroch
Hi, I'm seeing an issue when trying to set a different credentials provider for AWS S3. I'm setting fs.s3a.aws.credentials.provider in flink-conf.yaml to a different value. I'm using the flink-s3-fs-hadoop dependency and I get an exception when the job starts: RuntimeException: Failed to
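
For context, the flink-conf.yaml entry would look like this (the provider class is only an example; one likely culprit, as an assumption, is that flink-s3-fs-hadoop ships a shaded AWS SDK, so a provider class compiled against the unshaded SDK can fail to load):

    # flink-conf.yaml -- example provider, adjust to your setup
    fs.s3a.aws.credentials.provider: com.amazonaws.auth.EnvironmentVariableCredentialsProvider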

Re: Calculate a 4 hour time window for sum,count with sum and count output every 5 mins

2019-06-17 Thread Rafi Aroch
Hi Vijay, When using windows, you may use 'trigger' to set a custom Trigger which would fire your ProcessWindowFunction accordingly. In your case, you would probably use: .trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(5))) Thanks, Rafi On Mon, Jun 17, 2019 at 9:01 PM
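
Put together, the pipeline would look roughly like this (a sketch; the key selector, window size and aggregate/window functions are placeholders):

    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

    input
        .keyBy(r -> r.getKey()) // placeholder key selector
        .window(TumblingProcessingTimeWindows.of(Time.hours(4)))
        // fire every 5 minutes with the current accumulation, without purging the window
        .trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(5)))
        .aggregate(new SumAndCountAggregate(), new EmitSumAndCount()); // placeholders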

Re: AvroSerializer

2019-05-14 Thread Rafi Aroch
Hi Debasish, It would be a bit tedious, but in order to override the default AvroSerializer you could specify a TypeInformation object where needed. You would need to implement your own MyAvroTypeInfo instead of the provided AvroTypeInfo. For example: env.addSource(kafkaConsumer)
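
Roughly like this (a sketch; MyAvroTypeInfo and MyRecord are the hypothetical classes described above):

    // Override the TypeInformation Flink would otherwise extract, so serialization
    // goes through your custom serializer instead of the default AvroSerializer.
    DataStream<MyRecord> stream = env
            .addSource(kafkaConsumer)
            .returns(new MyAvroTypeInfo<>(MyRecord.class));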

Re: DateTimeBucketAssigner using Element Timestamp

2019-05-06 Thread Rafi Aroch
Hi Peter, I also encountered this issue. As far as I know, it is not currently possible to stream from files (or any bounded stream) into a *StreamingFileSink*. This is because files are rolled over only on checkpoints and NOT when the stream closes. This is due to the fact that at the function

Re: Organize env using files

2019-04-22 Thread Rafi Aroch
Hi, If it helps, we're using Lightbend's Config for that: * https://github.com/lightbend/config * https://www.stubbornjava.com/posts/environment-aware-configuration-with-typesafe-config Thanks, Rafi On Wed, Apr 17, 2019 at 7:07 AM Andy Hoang wrote: > I have 3 different files for env: test,
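
Basic usage is just a couple of lines (a sketch; the config key is a placeholder — per-environment files can be selected with -Dconfig.resource=test.conf and similar JVM flags):

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    // loads application.conf from the classpath, with system-property overrides applied
    Config config = ConfigFactory.load();
    String brokers = config.getString("kafka.brokers"); // placeholder key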

Re: Using S3 as a sink (StreamingFileSink)

2019-08-17 Thread Rafi Aroch
Hi, S3 would delete files only if you have 'lifecycle rules' [1] defined on the bucket. Could that be the case? If so, make sure to disable / extend the object expiration period. [1] https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html Thanks, Rafi On Sat, Aug 17, 2019

Re: Checkpointing is not performing well

2019-09-07 Thread Rafi Aroch
Hi Ravi, Consider moving to the RocksDB state backend, where you can enable incremental checkpointing. This will make your checkpoint size stay pretty much constant even as your state grows.
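
The flink-conf.yaml side of that would be (checkpoint directory is a placeholder):

    state.backend: rocksdb
    state.backend.incremental: true
    state.checkpoints.dir: s3://bucket/checkpoints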

Re: serialization issue in streaming job run with scala Future

2019-09-18 Thread Rafi Aroch
Hi Debasish, Have you taken a look at the AsyncIO API for running async operations? I think this is the preferred way of doing it. [1] So it would look something like this: class AsyncDatabaseRequest extends AsyncFunction[String, (String, String)] { /** The database specific client that can
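
Wiring such a function into the pipeline would then look like this in Java (a sketch; timeout and capacity values are illustrative):

    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;

    DataStream<Tuple2<String, String>> enriched = AsyncDataStream.unorderedWait(
            input,                       // DataStream<String>
            new AsyncDatabaseRequest(),  // the AsyncFunction from the docs snippet
            1000, TimeUnit.MILLISECONDS, // per-request timeout
            100);                        // max in-flight async requests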

JM fails connecting to TM Metrics service on AWS ECS

2019-09-22 Thread Rafi Aroch
Hi, I have a Flink 1.9.0 cluster deployed on AWS ECS. The cluster is running, but metrics are not showing in the UI. For other services (RPC / Data) it works, because the connection is initiated from the TM to the JM through a load-balancer. But it does not work for metrics, where the JM tries to initiate
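
One thing worth trying (an assumption, not verified on ECS): pin the metrics query service to a fixed port range in flink-conf.yaml and open that range between the containers, since by default the port is chosen randomly:

    metrics.internal.query-service.port: 50100-50110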

Re: How to reprocess certain events in Flink?

2019-12-17 Thread Rafi Aroch
Hi Pooja, Here's an implementation from Jamie Grier for de-duplication using in-memory cache with some expiration time: https://github.com/jgrier/FilteringExample/blob/master/src/main/java/com/dataartisans/DedupeFilteringJob.java If for your use-case you can limit the time period where
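
A keyed-state variant of the same idea (using Flink state with a TTL rather than the in-memory cache from that example; Event and its key accessor are placeholders):

    import org.apache.flink.api.common.functions.RichFilterFunction;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.configuration.Configuration;

    public class DedupeFilter extends RichFilterFunction<Event> {
        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            ValueStateDescriptor<Boolean> desc =
                    new ValueStateDescriptor<>("seen", Types.BOOLEAN);
            // expire entries so the dedup state does not grow forever
            desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(1)).build());
            seen = getRuntimeContext().getState(desc);
        }

        @Override
        public boolean filter(Event event) throws Exception {
            if (seen.value() != null) {
                return false; // seen within the TTL window -- drop as duplicate
            }
            seen.update(true);
            return true;
        }
    }

    // usage: stream.keyBy(Event::getId).filter(new DedupeFilter())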

Re: Emit intermediate accumulator state of a session window

2019-12-08 Thread Rafi Aroch
Hi Chandu, Maybe you can use a custom trigger: .trigger(ContinuousEventTimeTrigger.of(Time.minutes(15))) This would repeatedly fire your aggregation at the configured interval. Thanks, Rafi On Thu, Dec 5, 2019 at 1:09 PM Andrey Zagrebin wrote: > Hi Chandu, > > I am not sure whether

Re: Operator latency metric not working in 1.9.1

2020-03-01 Thread Rafi Aroch
Hi Ori, Make sure that latency tracking is enabled. It's disabled by default. Also verify that the scope is set properly. https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#latency-tracking
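
For reference, the flink-conf.yaml entry (interval in milliseconds; 0 disables latency tracking):

    metrics.latency.interval: 30000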

Re: StreamingFileSink Not Flushing All Data

2020-03-02 Thread Rafi Aroch
Hi, This happens because StreamingFileSink does not support a finite input stream. In the docs it's mentioned under "Important Considerations". This behaviour often surprises users... There's a FLIP

Re: Emit message at start and end of event time session window

2020-02-20 Thread Rafi Aroch
I think one "edge" case which is not handled would be when the first event (by event-time) arrives late; a wrong "started-window" would then be reported. Rafi On Thu, Feb 20, 2020 at 12:36 PM Manas Kale wrote: > Is the reason ValueState cannot be use because session windows are always >

Re: BucketingSink capabilities for DataSet API

2020-02-19 Thread Rafi Aroch
e > data in partitioned. > > Thanks, > Anuj > > > On Thu, Oct 25, 2018 at 5:35 PM Andrey Zagrebin > wrote: > >> Hi Rafi, >> >> At the moment I do not see any support of Parquet in DataSet API >> except HadoopOutputFormat, mentioned in stack overflow

Re: Apache Flink - Flink Metrics collection using Prometheus on EMR from streaming mode

2019-12-25 Thread Rafi Aroch
Hi, Take a look here: https://github.com/eastcirclek/flink-service-discovery I used it successfully quite a while ago, so things might have changed since. Thanks, Rafi On Wed, Dec 25, 2019, 05:54 vino yang wrote: > Hi Mans, > > IMO, the mechanism of metrics reporter does not depend on any

Re: Setting app Flink logger

2020-03-10 Thread Rafi Aroch
Hi Eyal, Sounds trivial, but can you verify that the file actually exists at /opt/flink/conf/log4j-console.properties? Also, verify that the user running the process has read permissions to that file. You said you use Flink in YARN mode, but in the example above you run inside a Docker image, so

Re: Does Flink use EMRFS?

2020-05-24 Thread Rafi Aroch
Hi Peter, I've dealt with the cross-account delegation issues in the past (with no relation to Flink) and ran into the same ownership problems (accounts can't access data, account A 'loses' access to its own data). My 2 cents: - The account that produces the data (A) should be the

Re: Savepoint fails due to RocksDB 2GiB limit

2020-07-11 Thread Rafi Aroch
Hi Ori, In your code, are you using the process() API? .process(new MyProcessWindowFunction()); If you do, the ProcessWindowFunction receives as its argument an Iterable with ALL elements collected along the session. This will make the state per key potentially huge (like you're experiencing).
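
The usual fix is incremental aggregation, so state holds only a small accumulator per key instead of every element — a sketch with placeholder names:

    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    stream
        .keyBy(e -> e.getSessionKey()) // placeholder key selector
        .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
        // MyAggregate folds each element into the accumulator as it arrives;
        // MyProcessWindowFunction then sees only the final result, not all elements
        .aggregate(new MyAggregate(), new MyProcessWindowFunction());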

Re: EventTimeSessionWindow firing too soon

2020-06-16 Thread Rafi Aroch
Hi Ori, I guess you consume from Kafka from the earliest offset, so you consume historical data and Flink is catching up. Regarding "My event-time timestamps also do not have big gaps": just to verify, if you do keyBy sessionId, do you check the gaps of events from the same session? Rafi On

Re: State leak

2020-06-24 Thread Rafi Aroch
Hi Ori, Once a session ends, its state should get purged. You should take care that a session does end. For example, if you wait for a 'session-end' event, limit it with some time threshold. If it's defined with an inactivity gap and your client sends infinite events, you could limit the session

Re: Flink Stream job to parquet sink

2020-06-25 Thread Rafi Aroch
Hi Arvid, Would it be possible to implement a BucketAssigner that, for example, loads the configuration periodically from an external source and, according to the event type, decides on a different sub-folder? Thanks, Rafi On Wed, Jun 24, 2020 at 10:53 PM Arvid Heise wrote: > Hi Anuj, > > There
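
Such an assigner could look roughly like this (a sketch; Event, the routing lookup and its periodic refresh are placeholders):

    import org.apache.flink.core.io.SimpleVersionedSerializer;
    import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

    public class EventTypeBucketAssigner implements BucketAssigner<Event, String> {

        @Override
        public String getBucketId(Event element, Context context) {
            // map the event type to a sub-folder; the mapping could be
            // refreshed periodically from an external source
            return "type=" + element.getType(); // placeholder routing
        }

        @Override
        public SimpleVersionedSerializer<String> getSerializer() {
            return SimpleVersionedStringSerializer.INSTANCE;
        }
    }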

Re: Dynamic StreamingFileSink

2020-12-26 Thread Rafi Aroch
Hi Sidney, Have a look at implementing a BucketAssigner for StreamingFileSink: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html#bucket-assignment Rafi On Sat, Dec 26, 2020 at 11:48 PM Sidney Feiner wrote: > Hey, > I would like to create a dynamic

Re: Flink handle both kafka source and db source

2021-10-26 Thread Rafi Aroch
Hi, Take a look at the new 1.14 feature called Hybrid Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/ Rafi On Tue, Oct 26, 2021 at 7:46 PM Qihua Yang wrote: > Hi, > > My flink app has two data sources. One is from a Kafka topic, one
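
From the 1.14 docs, the wiring is roughly (a sketch; file path, topic and broker are placeholders):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.source.hybrid.HybridSource;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineFormat;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.core.fs.Path;

    FileSource<String> fileSource = FileSource
            .forRecordStreamFormat(new TextLineFormat(), new Path("s3://bucket/history"))
            .build();
    KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")
            .setTopics("events")
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
    // reads the bounded source first, then switches over to the Kafka source
    HybridSource<String> source =
            HybridSource.builder(fileSource).addSource(kafkaSource).build();
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "hybrid-source");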