[ANNOUNCE] Beam 2.23.0 Released

2020-07-29 Thread Valentyn Tymofieiev
The Apache Beam team is pleased to announce the release of version 2.23.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See: https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on
the Beam blog: https://beam.apache.org/blog/beam-2.23.0/

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.23.0.

-- Valentyn Tymofieiev, on behalf of The Apache Beam team


Re: Program and registration for Beam Digital Summit

2020-07-29 Thread Maximilian Michels
Thanks Pedro! Great to see the program! This is going to be an exciting 
event.


Forwarding to the dev mailing list, in case people haven't seen this yet.

-Max

On 29.07.20 20:25, Pedro Galvan wrote:

Hello!

Just a quick message to let everybody know that we have published the 
program for the Beam Digital Summit. It is available at 
https://2020.beamsummit.org/program


With more than 30 talks and workshops covering everything from 
introductory sessions to advanced scenarios and use cases, we hope that 
everybody will find useful content at Beam Digital Summit.


Beam Digital Summit will be broadcast through the Crowdcast platform. It is 
a free event, but you need to register. Please visit 
https://www.crowdcast.io/e/beamsummit to register.



--
*Pedro Galvan*
Beam Digital Summit Team


Program and registration for Beam Digital Summit

2020-07-29 Thread Pedro Galvan

Hello!

Just a quick message to let everybody know that we have published the 
program for the Beam Digital Summit. It is available at 
https://2020.beamsummit.org/program


With more than 30 talks and workshops covering everything from 
introductory sessions to advanced scenarios and use cases, we hope that 
everybody will find useful content at Beam Digital Summit.


Beam Digital Summit will be broadcast through the Crowdcast platform. It is 
a free event, but you need to register. Please visit 
https://www.crowdcast.io/e/beamsummit to register.



--
*Pedro Galvan*
Beam Digital Summit Team


Re: Exceptions: Attempt to deliver a timer to a DoFn, but timers are not supported in Dataflow.

2020-07-29 Thread Mohil Khare
Hi Kenneth,
I am on Beam Java SDK 2.19 with enableStreamingEngine set to true, using the
default machine type (n1-standard-2).

Thanks and regards
Mohil



On Wed, Jul 29, 2020 at 10:36 AM Kenneth Knowles  wrote:

> Hi Mohil,
>
> It also helps to tell us which version of Beam you are using, along with
> some more details. This looks related to
> https://issues.apache.org/jira/browse/BEAM-6855, which claims to be
> resolved in 2.17.0.
>
> Kenn
>
> On Mon, Jul 27, 2020 at 11:47 PM Mohil Khare  wrote:
>
>> Hello all,
>>
>> I think I found the reason for the issue. Since the exception was thrown
>> by StreamingSideInputDoFnRunner.java, I realized that I had recently added a
>> side input to one of my ParDos that does stateful transformations.
>> It looks like there is some issue when you add a side input (my side input
>> was coming via a global window to a ParDo in a fixed window) to a stateful DoFn.
>>
>> As a workaround, instead of adding the side input to the stateful ParDo, I
>> introduced another ParDo that enriches the streaming data with the side input
>> before it flows into the stateful DoFn (a sketch follows below). That seems
>> to have fixed the problem.
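A minimal sketch of this workaround pattern, assuming a map-valued side input
and the usual org.apache.beam.sdk.transforms and values imports; the names
config, input, and MyStatefulDoFn are hypothetical, not from this thread:

    // Stateless enrichment ParDo: reads the side input and attaches
    // whatever the stateful step needs to each element.
    PCollectionView<Map<String, String>> configView =
        config.apply("ConfigAsMap", View.asMap());

    PCollection<KV<String, String>> enriched =
        input.apply(
            "EnrichWithSideInput",
            ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    Map<String, String> cfg = c.sideInput(configView);
                    c.output(KV.of(
                        c.element().getKey(),
                        cfg.getOrDefault(c.element().getKey(), "")
                            + c.element().getValue()));
                  }
                })
                .withSideInputs(configView));

    // The stateful ParDo now consumes already-enriched elements and takes
    // no side inputs, so it stays off the StreamingSideInputDoFnRunner path.
    enriched.apply("StatefulStep", ParDo.of(new MyStatefulDoFn()));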
>>
>>
>> Thanks and regards
>> Mohil
>>
>>
>>
>> On Mon, Jul 27, 2020 at 10:50 AM Mohil Khare  wrote:
>>
>>> Hello All,
>>>
>>> Any idea how to debug this and find out which stage, which DoFn, or which
>>> side input is causing the problem?
>>> Do I need to implement @OnTimer in every DoFn to avoid this problem?
>>> I thought that some uncaught exceptions were causing this, so I added
>>> various checks and exception handling in every DoFn, but I am still
>>> seeing this issue.
>>> It has been driving me nuts. And now, forget DRAIN: it happens during
>>> normal operation as well. Any help would be appreciated.
>>>
>>> java.lang.UnsupportedOperationException: Attempt to deliver a timer to a
>>> DoFn, but timers are not supported in Dataflow.
>>>
>>>   at org.apache.beam.runners.dataflow.worker.StreamingSideInputDoFnRunner.onTimer(StreamingSideInputDoFnRunner.java:86)
>>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processUserTimer(SimpleParDoFn.java:360)
>>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.access$600(SimpleParDoFn.java:73)
>>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$TimerType$1.processTimer(SimpleParDoFn.java:444)
>>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processTimers(SimpleParDoFn.java:473)
>>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processTimers(SimpleParDoFn.java:353)
>>>   at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.finish(ParDoOperation.java:52)
>>>   at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
>>>   at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1350)

Re: Exceptions: Attempt to deliver a timer to a DoFn, but timers are not supported in Dataflow.

2020-07-29 Thread Kenneth Knowles
Hi Mohil,

It also helps to tell us which version of Beam you are using, along with
some more details. This looks related to
https://issues.apache.org/jira/browse/BEAM-6855, which claims to be resolved
in 2.17.0.

Kenn

On Mon, Jul 27, 2020 at 11:47 PM Mohil Khare  wrote:

> Hello all,
>
> I think I found the reason for the issue. Since the exception was thrown
> by StreamingSideInputDoFnRunner.java, I realized that I had recently added a
> side input to one of my ParDos that does stateful transformations.
> It looks like there is some issue when you add a side input (my side input
> was coming via a global window to a ParDo in a fixed window) to a stateful DoFn.
>
> As a workaround, instead of adding the side input to the stateful ParDo, I
> introduced another ParDo that enriches the streaming data with the side input
> before it flows into the stateful DoFn. That seems to have fixed the problem.
>
>
> Thanks and regards
> Mohil
>
>
>
> On Mon, Jul 27, 2020 at 10:50 AM Mohil Khare  wrote:
>
>> Hello All,
>>
>> Any idea how to debug this and find out which stage, which DoFn, or which
>> side input is causing the problem?
>> Do I need to implement @OnTimer in every DoFn to avoid this problem?
>> I thought that some uncaught exceptions were causing this, so I added
>> various checks and exception handling in every DoFn, but I am still
>> seeing this issue.
>> It has been driving me nuts. And now, forget DRAIN: it happens during
>> normal operation as well. Any help would be appreciated.
>>
>> java.lang.UnsupportedOperationException: Attempt to deliver a timer to a
>> DoFn, but timers are not supported in Dataflow.
>>
>>   at org.apache.beam.runners.dataflow.worker.StreamingSideInputDoFnRunner.onTimer(StreamingSideInputDoFnRunner.java:86)
>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processUserTimer(SimpleParDoFn.java:360)
>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.access$600(SimpleParDoFn.java:73)
>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn$TimerType$1.processTimer(SimpleParDoFn.java:444)
>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processTimers(SimpleParDoFn.java:473)
>>   at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processTimers(SimpleParDoFn.java:353)
>>   at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.finish(ParDoOperation.java:52)
>>   at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
>>   at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1350)
>>   at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:152)

Re: KafkaUnboundedReader

2020-07-29 Thread Maximilian Michels

Hi Dinh,

The check only de-duplicates in case the consumer processes the same 
offset multiple times; it ensures that the offset is always increasing.

If this has been fixed in Kafka, as the comment assumes, the condition 
will never be true. A small illustration follows below.
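
For intuition, here is a minimal, self-contained sketch of that guard,
paraphrased from the KafkaUnboundedReader snippet quoted further down (the
variable names and sample offsets are illustrative, not from the actual class):

    // Paraphrased guard: offsets below the next expected offset were
    // already consumed (e.g. re-delivered inside a compressed batch), so
    // the duplicate record is skipped rather than emitted twice.
    long expected = 100L; // next offset we expect for this partition

    for (long offset : new long[] {98L, 99L, 100L, 101L}) {
      if (offset < expected) {
        System.out.printf("ignoring already consumed offset %d%n", offset);
        continue; // duplicate: skip it, nothing unseen is lost
      }
      System.out.printf("processing offset %d%n", offset);
      expected = offset + 1; // advance past the record just consumed
    }

In other words, the guard only drops records at offsets the reader has already
emitted; it never skips messages it has not yet seen, however large the
consumer lag. The "(say > 1M)" in the comment reads as a suggested sanity
check, not a threshold at which records are dropped.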


Which Kafka version are you using?

-Max

On 29.07.20 09:16, wang Wu wrote:

Hi,
I am curious about this comment:

if (offset < expected) { // -- (a)
  // this can happen when compression is enabled in Kafka (seems to be fixed in 0.10)
  // should we check if the offset is way off from consumedOffset (say > 1M)?
  LOG.warn(
      "{}: ignoring already consumed offset {} for {}",
      this,
      offset,
      pState.topicPartition);
  continue;
}


https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L167

Does it mean that Beam KafkaIO may skip processing some Kafka messages 
if the lag in consuming Kafka messages is > 1M?

Why might Kafka compression result in this bug?
Is there any way to prevent losing messages and enable at-least-once delivery?

Context: We enable at-least-once delivery semantics in our Beam code with 
this snippet:


input
    .getPipeline()
    .apply(
        "ReadFromKafka",
        KafkaIO.readBytes()
            .withBootstrapServers(
                getSource().getKafkaSourceConfig().getBootstrapServers())
            .withTopics(getTopics())
            .withConsumerConfigUpdates(
                ImmutableMap.of(
                    ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false,
                    ConsumerConfig.GROUP_ID_CONFIG, groupId))
            .withReadCommitted()
            .commitOffsetsInFinalize())

However, we notice that if we send more than 1 million Kafka messages and 
the batch processing cannot keep up, Beam seems to process fewer messages 
than we sent.


Regards
Dinh


KafkaUnboundedReader

2020-07-29 Thread wang Wu
Hi,
I am curious about this comment:

if (offset < expected) { // -- (a)
  // this can happen when compression is enabled in Kafka (seems to be fixed in 0.10)
  // should we check if the offset is way off from consumedOffset (say > 1M)?
  LOG.warn(
      "{}: ignoring already consumed offset {} for {}",
      this,
      offset,
      pState.topicPartition);
  continue;
}

https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L167

Does it mean that Beam KafkaIO may skip processing some Kafka messages if the 
lag in consuming Kafka messages is > 1M?
Why might Kafka compression result in this bug?
Is there any way to prevent losing messages and enable at-least-once delivery?

Context: We enable at-least-once delivery semantics in our Beam code with 
this snippet:

input
    .getPipeline()
    .apply(
        "ReadFromKafka",
        KafkaIO.readBytes()
            .withBootstrapServers(
                getSource().getKafkaSourceConfig().getBootstrapServers())
            .withTopics(getTopics())
            .withConsumerConfigUpdates(
                ImmutableMap.of(
                    ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false,
                    ConsumerConfig.GROUP_ID_CONFIG, groupId))
            .withReadCommitted()
            .commitOffsetsInFinalize())

However, we notice that if we send more than 1 million Kafka messages and 
the batch processing cannot keep up, Beam seems to process fewer messages 
than we sent.
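
One way to confirm whether records are actually dropped, rather than still in
flight, is to count them with a Beam metric immediately after the read and
compare against the number sent. A hedged sketch, assuming the usual
org.apache.beam.sdk.metrics imports; records stands for the PCollection
produced by the ReadFromKafka step above, and the step and counter names are
hypothetical:

    // Pass-through DoFn that counts every record arriving from Kafka, so
    // the runner UI shows how many records the pipeline actually saw.
    records.apply(
        "CountRecordsSeen",
        ParDo.of(new DoFn<KafkaRecord<byte[], byte[]>,
                          KafkaRecord<byte[], byte[]>>() {
              private final Counter seen =
                  Metrics.counter("kafka", "records_seen");

              @ProcessElement
              public void processElement(ProcessContext c) {
                seen.inc();            // one increment per record read
                c.output(c.element()); // pass the record through unchanged
              }
            }));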

Regards
Dinh