Re: KinesisIO checkpointing

2020-06-24 Thread Lars Almgren Schwartz
We had the exact same problem, but have not spent any time trying to solve it, we just skipped checkpointing for now. When we saw this problem we ran Spark 2.4.5 (local mode) and Kinesis 2.18 and 2.19. On Wed, Jun 24, 2020 at 6:22 PM Sunny, Mani Kolbe wrote: > We are on spark 2.4 and Beam 2.22.0

Unable to commit offset using KafkaIO

2020-06-24 Thread Praveen K Viswanathan
Hello Everyone, I am having issues in committing offsets using KafkaIO. My underlying streaming is OSS (Oracle Streaming Service) which is a Kafka-Compatible one. When I try to commit offset using "ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true" I am getting below error. I am having both my Kakfa

Re: Designing an existing pipeline in Beam

2020-06-24 Thread Praveen K Viswanathan
Thanks Luke, I would like to try the latter approach. Would be able to share any pseudo-code or point to any example on how to call a common method inside a DoFn's, let's say, ProcessElement method? On Tue, Jun 23, 2020 at 6:35 PM Luke Cwik wrote: > You can apply the same DoFn / Transform instan

Re: KafkaIO Exactly once vs At least Once

2020-06-24 Thread Eleanore Jin
Hi Alex, Thanks a lot for the info. Eleanore On Wed, Jun 24, 2020 at 9:26 AM Alexey Romanenko wrote: > Well, I think, in general, it will be a question of trade-off between > latency and performance in case of EOS sink (since EOS can’t be "for > free"). > > I can’t recommend specific numbers f

Re: How to create a marker file when each window completes?

2020-06-24 Thread Luke Cwik
You can use the @OnWindowExpiration with a stateful DoFn that consumes the PCollection (you'll need to convert the PCollection to be a keyed PCollection which you can do that with WithKeys.of(null)) . The window expiration will only be invoked once all upstream processing for that window has been c

Re: KafkaIO Exactly once vs At least Once

2020-06-24 Thread Alexey Romanenko
Well, I think, in general, it will be a question of trade-off between latency and performance in case of EOS sink (since EOS can’t be "for free"). I can’t recommend specific numbers for Flink (maybe Maximilian Michels or others with more Flink knowledge can do), but I’d just try different numbe

RE: KinesisIO checkpointing

2020-06-24 Thread Sunny, Mani Kolbe
We are on spark 2.4 and Beam 2.22.0 From: Alexey Romanenko Sent: Wednesday, June 24, 2020 5:15 PM To: user@beam.apache.org Subject: Re: KinesisIO checkpointing CAUTION: This email originated from outside of D&B. Please do not click links or open attachments unless you recognize the sender and k

RE: How to create a marker file when each window completes?

2020-06-24 Thread Sunny, Mani Kolbe
Lets say I set FixedWindows.of(Duration.standardMinutes(10)) Since my event time is determined by WithTimestamps.of(Instant.now()), I can safely assume window is closed once that 10 min period is passed. But how do I ensure that all records belonged to that window are already flushed to disk. T

Re: KinesisIO checkpointing

2020-06-24 Thread Alexey Romanenko
Yes, KinesisIO supports restart from checkpoints and it’s based on runner checkpoints support [1]. Could you specify which version of Spark and Beam you use? [1] https://stackoverflow.com/questions/62259364/how-apache-beam-manage-kinesis-checkpointing/62349838#62349838

Re: How to create a marker file when each window completes?

2020-06-24 Thread Luke Cwik
Knowing when a window is "closed" is based upon having the watermark advance which is based upon even time. On Wed, Jun 24, 2020 at 9:05 AM Sunny, Mani Kolbe wrote: > Hi Luke, > > > > Sorry forgot to mention, we override the event timestamp to current using > WithTimestamps.of(Instant.now()) as

RE: How to create a marker file when each window completes?

2020-06-24 Thread Sunny, Mani Kolbe
Hi Luke, Sorry forgot to mention, we override the event timestamp to current using WithTimestamps.of(Instant.now()) as we don’t really care actual event time. So FixedWindow closes when current time passes window.end time. It is a standard practice in oozie world to trigger downstream jobs base

Re: [Announce] Flink Forward Call for Proposals Extended

2020-06-24 Thread Seth Wiesman
As a reminder, the CfP for Flink Forward is open until this Sunday, June 28th. If you've never spoken at a conference before and are thinking about submitting, out amazing event manager Laura just wrote an article on dev.to about why virtual conferences are the best way to get started. [1] https:

Re: Flink/Portable Runner error on AWS EMR

2020-06-24 Thread Jesse Lord
Hi Max, Thanks for the help. I will certainly look into shading the library, but my setup is very straightforward if someone is interested in reproducing this issue. I started and AWS EMR cluster version 5.30 with flink 1.10.0 installed. I created the beam wordcount example project using maven

Re: How to create a marker file when each window completes?

2020-06-24 Thread Luke Cwik
What do you consider complete? (I ask this since you are using element count and processing time triggers) Generally the idea is that you can feed the output PCollection to a stateful DoFn with an @OnWindowExpiration setup but this works only if you completeness is controlled by watermark advancem

KinesisIO checkpointing

2020-06-24 Thread Sunny, Mani Kolbe
Hello, We are developing a beam pipeline which runs on SparkRunner on streaming mode. This pipeline read from Kinesis, do some translations, filtering and finally output to S3 using AvroIO writer. We are using Fixed windows with triggers based on element count and processing time intervals. Out

How to avoid data loss during streaming stops

2020-06-24 Thread Sunny, Mani Kolbe
Hello, We are developing a beam pipeline which runs on SparkRunner on streaming mode. This pipeline read from Kinesis, do some translations, filtering and finally output to S3 using AvroIO writer. We are using Fixed windows with triggers based on element count and processing time intervals. Out

How to create a marker file when each window completes?

2020-06-24 Thread Sunny, Mani Kolbe
Hello, We are developing a beam pipeline which runs on SparkRunner on streaming mode. This pipeline read from Kinesis, do some translations, filtering and finally output to S3 using AvroIO writer. We are using Fixed windows with triggers based on element count and processing time intervals. Out

Re: Flink/Portable Runner error on AWS EMR

2020-06-24 Thread Maximilian Michels
Hi Jesse, This is hard to debug without knowing more about the setup. You have conflicting versions of the Jackson library. One is present in Beam, one may be loaded by the AWS setup. You mentioned in the issue that using parent-child first classloading did not resolve the issue. I'd suggest to r