Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2018-02-11 Thread Sushil Ks
Thanks, Raghu.

On Tue, Feb 6, 2018 at 6:41 AM, Raghu Angadi <rang...@google.com> wrote:

> Hi Sushil,
>
> That is expected behavior. If you don't have any saved checkpoint, the
> pipeline would start from scratch. It does not have any connection to the
> previous run.
>
> On Thu, Feb 1, 2018 at 1:29 AM, Sushil Ks <sushil...@gmail.com> wrote:
>
>> Hi,
>> Apologies for the delay in my reply,
>>
>> @Raghu Angadi
>> The pipeline checkpoints every 20 minutes, and as you mentioned, if it
>> restarts before any checkpoint is created, it reads from the
>> latest offset.
>>
>> @Mingmin
>> Thanks a lot for sharing your learnings. However, in case of any
>> *UserCodeException* while processing an element in a ParDo after
>> the window is materialized, the pipeline drops the unprocessed elements and
>> restarts. Is this expected from Beam?
>>
>>
>> On Wed, Jan 17, 2018 at 2:13 AM, Kenneth Knowles <k...@google.com> wrote:
>>
>>> Is there a JIRA filed for this? I think this discussion should live in a
>>> ticket.
>>>
>>> Kenn
>>>
>>> On Wed, Jan 10, 2018 at 11:00 AM, Mingmin Xu <mingm...@gmail.com> wrote:
>>>
>>>> @Sushil, I have several jobs running on KafkaIO+FlinkRunner, hope my
>>>> experience can help you a bit.
>>>>
>>>> In short, `ENABLE_AUTO_COMMIT_CONFIG` doesn't meet your requirement;
>>>> you need to leverage exactly-once checkpoints/savepoints in Flink. The reason
>>>> is that with `ENABLE_AUTO_COMMIT_CONFIG`, KafkaIO commits offsets after data
>>>> is read, and once the job is restarted KafkaIO reads from the last committed offset.
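[For reference, a minimal sketch of the auto-commit setup being contrasted here. The broker address, topic, and group id are placeholders, and `updateConsumerProperties` is assumed to be the KafkaIO hook for passing consumer config in Beam releases of this era:]

```java
import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

// With auto-commit, the consumer commits offsets shortly after reading,
// so a restarted job resumes from the last committed offset -- records
// read but not fully processed before a crash can be skipped.
KafkaIO.Read<String, String> read = KafkaIO.<String, String>read()
    .withBootstrapServers("broker1:9092")             // placeholder address
    .withTopic("my_topic")                            // placeholder topic
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(ImmutableMap.<String, Object>of(
        ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true",
        ConsumerConfig.GROUP_ID_CONFIG, "my-group")); // placeholder group id
```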
>>>>
>>>> In my jobs, I enable externalized checkpoints (externalized should be
>>>> optional, I think?) in exactly-once mode on the Flink cluster. When the job
>>>> auto-restarts on failures, it doesn't lose data. When I redeploy the job
>>>> manually, I use a savepoint to cancel and relaunch it.
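[A sketch of enabling checkpointing from the Beam side via the FlinkRunner's `FlinkPipelineOptions`. The 60-second interval is an arbitrary example, and externalized checkpoints plus the state backend are assumed to be configured on the Flink cluster itself:]

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CheckpointedPipeline {
  public static void main(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    // Snapshot state every 60s; on a restart from a checkpoint, KafkaIO
    // resumes from the stored offsets instead of the latest offset.
    options.setCheckpointingInterval(60_000L);

    Pipeline pipeline = Pipeline.create(options);
    // ... build sources/transforms here ...
    pipeline.run();
  }
}
```

[For the manual redeploy path, Flink's CLI can take a savepoint while cancelling (`flink cancel -s <savepointDir> <jobId>`) and resume from it (`flink run -s <savepointPath> ...`).]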
>>>>
>>>> Mingmin
>>>>
>>>> On Wed, Jan 10, 2018 at 10:34 AM, Raghu Angadi <rang...@google.com>
>>>> wrote:
>>>>
>>>>> How often does your pipeline checkpoint/snapshot? If the failure
>>>>> happens before the first checkpoint, the pipeline could restart without
>>>>> any state, in which case KafkaIO would read from the latest offset. There
>>>>> is probably some way to verify whether the pipeline is restarting from a
>>>>> checkpoint.
>>>>>
>>>>> On Sun, Jan 7, 2018 at 10:57 PM, Sushil Ks <sushil...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Aljoscha,
>>>>>> The issue is: let's say I consume 100 elements in a
>>>>>> 5-minute fixed window with *GroupByKey* and later apply a *ParDo* to
>>>>>> all those elements. If there is an issue while processing element 70 in
>>>>>> the *ParDo* and the pipeline restarts with a *UserCodeException*, it
>>>>>> skips the remaining 30 elements. I wanted to know if this is expected.
>>>>>> If you still have doubts, let me know and I will share a code snippet.
>>>>>>
>>>>>> Regards,
>>>>>> Sushil Ks
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> 
>>>> Mingmin
>>>>
>>>
>>>
>>
>


Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2018-01-07 Thread Sushil Ks
Hi Aljoscha,
   The issue is: let's say I consume 100 elements in a 5-minute
fixed window with *GroupByKey* and later apply a *ParDo* to all those
elements. If there is an issue while processing element 70 in the *ParDo* and
the pipeline restarts with a *UserCodeException*, it skips the remaining 30
elements. I wanted to know if this is expected. If you still have doubts,
let me know and I will share a code snippet.

Regards,
Sushil Ks
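
[One common Beam pattern for the failure mode described above is a dead-letter output: catch exceptions inside the DoFn and divert the failing element to a side output, instead of letting the bundle fail and the pipeline restart. A minimal sketch, assuming an existing `PCollection<String> windowedElements` and using `toUpperCase` as a stand-in for the real per-element logic:]

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Tags for the main output and the dead-letter output.
final TupleTag<String> mainTag = new TupleTag<String>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

PCollectionTuple results = windowedElements.apply("Process",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          c.output(c.element().toUpperCase()); // stand-in for real logic
        } catch (Exception e) {
          // Divert the failing element instead of rethrowing, so one bad
          // record doesn't fail the bundle and restart the pipeline.
          c.output(deadLetterTag, c.element());
        }
      }
    }).withOutputTags(mainTag, TupleTagList.of(deadLetterTag)));

PCollection<String> processed = results.get(mainTag);
PCollection<String> failures = results.get(deadLetterTag); // log or sink these
```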


Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2018-01-04 Thread Sushil Ks
*bump*

On Dec 15, 2017 11:22 PM, "Lukasz Cwik" <lc...@google.com> wrote:

> +d...@beam.apache.org
>
> On Thu, Dec 14, 2017 at 11:27 PM, Sushil Ks <sushil...@gmail.com> wrote:
>
>> Hi Lukasz,
>> I am not sure whether I can reproduce this in the DirectRunner, as
>> I am taking Flink's retry and checkpoint mechanisms into consideration. In
>> other words, the issue I am facing is: if any exception occurs in an operation
>> after the GroupByKey and the pipeline restarts, those particular elements are
>> not processed in the next run.
>>
>> On Wed, Dec 13, 2017 at 4:01 AM, Lukasz Cwik <lc...@google.com> wrote:
>>
>>> That seems incorrect. Please file a JIRA and provide an example + data
>>> that shows the error using the DirectRunner.
>>>
>>> On Tue, Dec 12, 2017 at 2:51 AM, Sushil Ks <sushil...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>  I am running a fixed window with GroupByKey on the FlinkRunner and
>>>> have noticed that on any exception and restart before the GroupByKey
>>>> operation, the Kafka consumer replays the data from the stored offset;
>>>> however, if an exception occurs after it and the pipeline restarts, Kafka
>>>> consumes from the latest offset. Is this expected?
>>>>
>>>> Regards,
>>>> Sushil Ks
>>>>
>>>
>>>
>>
>
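
[For reference, a sketch of the pipeline shape under discussion; the broker address, topic name, and `ProcessGroupsFn` are hypothetical, and `pipeline` is an already-created `Pipeline`. The comment marks the GroupByKey boundary the thread keeps returning to:]

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

pipeline
    .apply("ReadKafka", KafkaIO.<String, String>read()
        .withBootstrapServers("broker1:9092")          // hypothetical address
        .withTopic("events")                           // hypothetical topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata())
    .apply("Window", Window.<KV<String, String>>into(
        FixedWindows.of(Duration.standardMinutes(5))))
    // Per the thread: a failure upstream of this GroupByKey replays from
    // the Kafka offsets stored in the reader's checkpoint, while a failure
    // downstream of it restarted the job reading from the latest offset
    // when no checkpoint was available to restore.
    .apply(GroupByKey.<String, String>create())
    .apply("Process", ParDo.of(new ProcessGroupsFn())); // hypothetical DoFn
```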


Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2017-12-14 Thread Sushil Ks
Hi Lukasz,
   I am not sure whether I can reproduce this in the DirectRunner, as I am
taking Flink's retry and checkpoint mechanisms into consideration. In other
words, the issue I am facing is: if any exception occurs in an operation after
the GroupByKey and the pipeline restarts, those particular elements are not
processed in the next run.

On Wed, Dec 13, 2017 at 4:01 AM, Lukasz Cwik <lc...@google.com> wrote:

> That seems incorrect. Please file a JIRA and provide an example + data
> that shows the error using the DirectRunner.
>
> On Tue, Dec 12, 2017 at 2:51 AM, Sushil Ks <sushil...@gmail.com> wrote:
>
>> Hi,
>>  I am running a fixed window with GroupByKey on the FlinkRunner and
>> have noticed that on any exception and restart before the GroupByKey
>> operation, the Kafka consumer replays the data from the stored offset;
>> however, if an exception occurs after it and the pipeline restarts, Kafka
>> consumes from the latest offset. Is this expected?
>>
>> Regards,
>> Sushil Ks
>>
>
>


Re: GroupByKey not happening on SparkRunner for multiple triggers.

2017-11-17 Thread Sushil Ks
Oh okay! Thanks Jean.

On Nov 17, 2017 6:02 PM, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Hi,
>
> I think it's related to the identified issue:
>
> https://issues.apache.org/jira/browse/BEAM-3193
>
> I'm working on a fix.
>
> To avoid mixing different changes in the runner, I'm holding the fix a bit
> due to the Spark 2 update.
>
> Regards
> JB
>
> On 11/17/2017 11:50 AM, Sushil Ks wrote:
>
>> Hi,
>>   I have configured multiple triggers on an
>> *UnboundedSource* for a Beam pipeline with *GroupByKey*, and it works as
>> expected on the DirectRunner. However, when I deploy it on the Spark cluster,
>> it seems not to apply the *GroupByKey* transformation at all. Is this
>> expected? If so, any help to get unblocked would be appreciated.
>>
>> Regards,
>> Sushil Ks
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: GroupByKey not happening on SparkRunner for multiple triggers.

2017-11-17 Thread Sushil Ks
Thanks, Nishu.

On Fri, Nov 17, 2017 at 4:47 PM, Nishu <nishuta...@gmail.com> wrote:

> Hi Sushil,
>
> I also faced a similar issue last week and reported it to the team. There
> is an issue with the Spark runner when you run the pipeline in streaming
> mode. It has been logged in JIRA:
> https://issues.apache.org/jira/browse/BEAM-3193
>
> Regards,
> Nishu
>
> On Fri, Nov 17, 2017 at 11:50 AM, Sushil Ks <sushil...@gmail.com> wrote:
>
>> Hi,
>>  I have configured multiple triggers on an *UnboundedSource* for
>> a Beam pipeline with *GroupByKey*, and it works as expected on the
>> DirectRunner. However, when I deploy it on the Spark cluster, it seems not to
>> apply the *GroupByKey* transformation at all. Is this expected? If so, any
>> help to get unblocked would be appreciated.
>>
>> Regards,
>> Sushil Ks
>>
>
>
>
> --
> Thanks & Regards,
> Nishu Tayal
>


GroupByKey not happening on SparkRunner for multiple triggers.

2017-11-17 Thread Sushil Ks
Hi,
 I have configured multiple triggers on an *UnboundedSource* for
a Beam pipeline with *GroupByKey*, and it works as expected on the
DirectRunner. However, when I deploy it on the Spark cluster, it seems not to
apply the *GroupByKey* transformation at all. Is this expected? If so, any
help to get unblocked would be appreciated.

Regards,
Sushil Ks
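
[For concreteness, "multiple triggers" here means a windowing strategy whose trigger fires more than once per window, for example early speculative panes plus the on-time pane. A minimal sketch, assuming an existing `PCollection<KV<String, Integer>> input`; the window size and firing delay are arbitrary:]

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<KV<String, Iterable<Integer>>> grouped = input
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(5)))
        // Fire an early pane one minute after the first element arrives,
        // then fire again on time when the watermark passes the window end.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(GroupByKey.<String, Integer>create());
```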


Re: Regarding Beam Slack Channel

2017-10-14 Thread Sushil Ks
+1
Kindly add me to the Beam Slack channel.

On Oct 14, 2017 5:02 AM, "Lukasz Cwik"  wrote:

> Invite sent, welcome.
>
> On Fri, Oct 13, 2017 at 3:07 PM, NerdyNick  wrote:
>
>> Hello
>>
>> Can someone please add me to the Beam Slack channel?
>>
>> Thanks.
>>
>
>