unsubscribe

2023-03-03 Thread Xiangyu Su via user



Re: Flink 1.14 stuck on INITIALIZING state after job restarting

2021-11-11 Thread Xiangyu Su
Thanks Jake!
I run one pod/TaskManager per instance. On the same TaskManager/instance,
some slots are able to switch from INITIALIZING to RUNNING as usual, but
other slots on the same instance are not, so the connection of the new
TaskManager should be fine, and the number of available slots is correct.

And by the way, with the same cluster settings and the same test case,
there are no such issues when I run on Flink 1.13.2.

Best


On Thu, 11 Nov 2021 at 11:16, Jake  wrote:

> Check the connection status of the new TaskManager instances. If the
> connection to the JobManager is normal, check the available slots and any
> connections created during function initialization.
>
>
>
> On Nov 11, 2021, at 18:01, Xiangyu Su  wrote:
>
> Thank you, Jake!
> For enabling debug-level logging I have to ask our system engineer first ;)
> Do you know how to resolve this issue?
>
> Best
>
> On Thu, 11 Nov 2021 at 10:57, Jake  wrote:
>
>>
>> Set the root log level to DEBUG and check the JobManager logs; that should show what is going on.
>>
>>
>>
>> On Nov 11, 2021, at 17:02, Xiangyu Su  wrote:
>>
>> Hello Everyone,
>>
>> We are facing an issue on Flink 1.14.0.
>> Every time the job gets restarted, some tasks/slots get stuck in the
>> INITIALIZING state and never switch to RUNNING.
>>
>> Any idea/suggestion about a solution to this issue?
>>
>> By the way, our Flink cluster runs on EKS and uses AWS spot instances for
>> the TaskManagers; once spot instances are lost and replaced by new ones,
>> this issue always happens.
>>
>> Best,
>>
>>
>>
>
>
>
>



Re: Flink 1.14 stuck on INITIALIZING state after job restarting

2021-11-11 Thread Xiangyu Su
Thank you, Jake!
For enabling debug-level logging I have to ask our system engineer first ;)
Do you know how to resolve this issue?

Best

On Thu, 11 Nov 2021 at 10:57, Jake  wrote:

>
> Set the root log level to DEBUG and check the JobManager logs; that should show what is going on.
>
>
>
> On Nov 11, 2021, at 17:02, Xiangyu Su  wrote:
>
> Hello Everyone,
>
> We are facing an issue on Flink 1.14.0.
> Every time the job gets restarted, some tasks/slots get stuck in the
> INITIALIZING state and never switch to RUNNING.
>
> Any idea/suggestion about a solution to this issue?
>
> By the way, our Flink cluster runs on EKS and uses AWS spot instances for
> the TaskManagers; once spot instances are lost and replaced by new ones,
> this issue always happens.
>
> Best,
>
>
>



Flink 1.14 stuck on INITIALIZING state after job restarting

2021-11-11 Thread Xiangyu Su
Hello Everyone,

We are facing an issue on Flink 1.14.0.
Every time the job gets restarted, some tasks/slots get stuck in the
INITIALIZING state and never switch to RUNNING.

Any idea/suggestion about a solution to this issue?

By the way, our Flink cluster runs on EKS and uses AWS spot instances for
the TaskManagers; once spot instances are lost and replaced by new ones,
this issue always happens.

Best,


Re: Potential bug when assuming roles from AWS EKS when using S3 as RocksDb checkpoint backend?

2021-09-25 Thread Xiangyu Su
Hi Thomas,
Did you try to log in to an EKS node and run an AWS CLI command such as
"aws s3 ls"?
It sounds like an EKS issue, but I am not 100% sure.
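
(If logging in to the node is not possible, an equivalent check can be run
from inside the pod itself. The snippet below is only a sketch, assuming
the AWS SDK v2 STS module is on the classpath, as Thomas mentions bundling
SDK 2.x in the application jar; it prints which IAM identity the process
actually resolves, which should tell the IRSA role apart from the EKS node
role.)

import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.model.GetCallerIdentityRequest;

public class WhoAmI {
    public static void main(String[] args) {
        // Resolve credentials via the default provider chain and print the caller ARN.
        try (StsClient sts = StsClient.create()) {
            System.out.println(
                sts.getCallerIdentity(GetCallerIdentityRequest.builder().build()).arn());
        }
    }
}
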
Best


On Sat, 25 Sept 2021 at 22:12, Ingo Bürk  wrote:

> Hi Thomas,
>
> I think you might be looking for this:
> https://github.com/apache/flink/pull/16717
>
>
> Best
> Ingo
>
> On Sat, Sep 25, 2021, 20:46 Thomas Wang  wrote:
>
>> Hi,
>>
>> I'm using the official docker image:
>> apache/flink:1.12.1-scala_2.11-java11
>>
>> I'm trying to run a Flink job on an EKS cluster. The job is running under
>> a k8s service account that is tied to an IAM role. If I'm not using s3 as
>> RocksDB checkpoint backend, everything works just fine. However, when I
>> enabled s3 as RocksDB checkpoint backend, I got permission denied.
>>
>> The IAM role tied to the service account has the appropriate permissions
>> to s3. However the underlying role tied to the EKS node doesn't. After
>> debugging with AWS support, it looks like the request to s3 was made under
>> the EKS node role, not the role tied to the service account. Thus the
>> permission denial.
>>
>> With the same Flink application, I'm also making requests to AWS Secrets
>> Manager to get some sensitive information and those requests were made
>> explicitly with AWS Java SDK 2.x bundled in the same application Jar file.
>> Those requests were made correctly with the IAM role tied to the service
>> account.
>>
>> Based on the info above, I suspect Flink may be using an older version of
>> the AWS SDK that doesn't support assuming an IAM role via an OIDC web
>> identity token file. Please see AWS doc here:
>> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html
>>
>> Could someone help me confirm this bug and maybe have it fixed some time?
>> Thanks.
>>
>> Thomas
>>
>



Re: FLINK-14316 happens on version 1.13.2

2021-09-14 Thread Xiangyu Su
Hi Guys,
Sorry for the late reply.
We found out the issue is not related to Flink; there was a connection
issue with ZooKeeper. We deploy our whole infrastructure on k8s using AWS
spot EC2 instances, and once a pod gets restarted or a spot instance is
lost we lose the log files... so sorry for not being able to share them.

Sharing some of our experience:
"Job leader lost leadership" issues can have different causes, most
probably related to ZooKeeper, and this failure does not cause a job
failure as far as we have seen.
The checkpoint timeout issue can also be caused by a ZooKeeper problem,
because the metadata of the last successful checkpoint is stored in ZK; if
ZK has an issue, Flink will not be able to restore from the last
checkpoint.

On Tue, 7 Sept 2021 at 10:00, Matthias Pohl  wrote:

> Hi Xiangyu,
> thanks for reaching out to the community. Could you share the entire
> TaskManager and JobManager logs with us? That might help investigating
> what's going on.
>
> Best,
> Matthias
>
> On Fri, Sep 3, 2021 at 10:07 AM Xiangyu Su  wrote:
>
>> Hi Yun,
>> Thanks a lot.
>> I am running a test and facing the "Job Leader lost leadership..." issue
>> and the checkpointing timeout at the same time; I am not sure whether the
>> two are related to each other.
>> Regarding your questions:
>> 1. GC looks ok.
>> 2. It seems that once the "Job Leader lost leadership..." error happens,
>> the Flink job cannot successfully restart.
>> For example, here are some logs from one job failure:
>> ---
>> 2021-09-02 20:41:11,345 WARN  org.apache.flink.runtime.taskmanager.Task
>>  [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18
>> (9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with
>> failure cause: org.apache.flink.util.FlinkException: Disconnect from
>> JobManager responsible for ec6fd88643747aafac06ee906e421a96.
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181)
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189)
>> at java.util.Optional.ifPresent(Optional.java:159)
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> Caused by: java.lang.Exception: Job leader for job id
>> ec6fd88643747aafac06ee906e421a96 lost leadership.
>> ... 24 more
>>
>> ---
>> 2021-09-02 20:47:22,388 ERROR
>> org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] -
>> Authentication failed
>> 2021-09-02 20:47:22,388 INFO
>>  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
>> Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/
>> 10.168.175.10:2181
>> 2021-09-02 20:47:22,388 WARN
>>  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
>> SASL configuration failed: javax.security.auth.login.LoginException: 

Re: FLINK-14316 happens on version 1.13.2

2021-09-03 Thread Xiangyu Su
_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_302]
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution
attempt deb6e9dd535069eb66e2139fde5b77cd was not found.
2021-09-02 20:47:21,870 INFO
 org.apache.flink.runtime.taskexecutor.TaskExecutor   [] - Cannot
find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with
exception:
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_302]
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
~[flink-dist_2.11-1.13.2.jar:1.13.2]

Thanks for your support.
Best Regards,

On Thu, 2 Sept 2021 at 16:43, Yun Gao  wrote:

> Hi Xiangyu,
>
> There might be different reasons for the "Job Leader... lost leadership"
> problem. Do you see the errors in the TM log? If so, the root cause might
> be that the connection between the TM and ZK is lost or timed out. Have
> you checked the GC status on the TM side? If the GC is ok, could you
> provide a more detailed exception stack?
>
> Best,
> Yun
>
>
> --Original Mail --
> *Sender:*Xiangyu Su 
> *Send Date:*Wed Sep 1 15:31:03 2021
> *Recipients:*use

Checkpointing failure, subtasks get stuck

2021-09-02 Thread Xiangyu Su
Hello Everyone,
Hello Till,
We have been facing a checkpointing failure issue since version 1.9;
currently we are using version 1.13.2.

We are using the filesystem (S3) state backend with a 10-minute checkpoint
timeout; usually a checkpoint takes 10-30 seconds.
But sometimes I have seen the job fail and restart due to a checkpoint
timeout without any large increase in incoming data... and I have also seen
the checkpointing progress of some subtasks get stuck at e.g. 7% for 10
minutes.
My guess would be that the thread doing the checkpointing somehow gets
blocked...
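
(For reference, the setup described above corresponds roughly to this kind
of configuration. This is only a sketch; the bucket path and the checkpoint
interval shown here are placeholders, not our actual values.)

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Filesystem (S3) state backend, as mentioned above; the bucket path is a placeholder.
        env.setStateBackend(new FsStateBackend("s3://my-bucket/checkpoints"));
        env.enableCheckpointing(60_000L);                                 // checkpoint interval (placeholder)
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);  // 10 min checkpoint timeout, as above
    }
}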

Any suggestions or ideas would be helpful, thanks.


Best Regards,



Job leader ... lost leadership with version 1.13.2

2021-09-02 Thread Xiangyu Su
Hello Everyone,
Hello Till,
We upgraded Flink to 1.13.2, and we are randomly facing the "Job leader ...
lost leadership" error; the job keeps restarting and failing...
It behaves like this ticket:
https://issues.apache.org/jira/browse/FLINK-14316

Did anybody have the same issue, or any suggestions?

Best Regards,



Checkpointing failure, subtasks get stuck

2021-09-01 Thread Xiangyu Su
Hello Everyone,
We have been facing a checkpointing failure issue since version 1.9;
currently we are using version 1.13.2.

We are using the filesystem (S3) state backend with a 10-minute checkpoint
timeout; usually a checkpoint takes 10-30 seconds.
But sometimes I have seen the job fail and restart due to a checkpoint
timeout without any large increase in incoming data... and I have also seen
the checkpointing progress of some subtasks get stuck at e.g. 7% for 10
minutes.
My guess would be that the thread doing the checkpointing somehow gets
blocked...

Any suggestions or ideas would be helpful, thanks.


Best Regards,


FLINK-14316 happens on version 1.13.2

2021-09-01 Thread Xiangyu Su
Hello Everyone,
We upgraded Flink to 1.13.2, and we are randomly facing the "Job leader ...
lost leadership" error; the job keeps restarting and failing...
It behaves like this ticket:
https://issues.apache.org/jira/browse/FLINK-14316

Did anybody have the same issue, or any suggestions?

Best Regards,




flink: keyed process function, why are timestamp of register event timer different as "on timer" function timestamp

2019-09-19 Thread Xiangyu Su
Hi User,

We are using a keyed process function with event time in a Flink streaming
application.
We register an event-time timer in the "processElement" function, and
noticed that the "onTimer" function receives a different "timestamp" than
the one registered in "processElement".

If we understand correctly, "onTimer" should be triggered with the same
timestamp (see the Flink API: InternalTimerServiceImpl.advanceWatermark).
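
(A minimal sketch of the setup described above, using made-up names and a
Long element type purely for illustration: an event-time timer is
registered in processElement, and the expectation is that onTimer receives
exactly that timestamp once the watermark passes it.)

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimerCheck extends KeyedProcessFunction<String, Long, String> {

    @Override
    public void processElement(Long value, Context ctx, Collector<String> out) {
        // Assumes event-time timestamps are assigned upstream, so ctx.timestamp() is non-null.
        long fireAt = ctx.timestamp() + 60_000L;
        ctx.timerService().registerEventTimeTimer(fireAt);
        out.collect("registered timer for " + fireAt);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // 'timestamp' is expected to equal the fireAt value registered above.
        out.collect("timer fired at " + timestamp);
    }
}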

Thanks for any suggestions.

Best regards
Xiangyu


Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-23 Thread Xiangyu Su
Hi Zili,

Here are the release notes for 1.8.1:
https://flink.apache.org/news/2019/07/02/release-1.8.1.html
But I could not find any ticket related to the unexpected delay. I have
just tested our application with both versions: the issue can be reproduced
every time with version 1.8.0, and so far it has not happened with version
1.8.1.

Best regards
Xiangyu

On Tue, 23 Jul 2019 at 08:49, Zili Chen  wrote:

> Hi Xiangyu,
>
> Could you share the corresponding JIRA that fixed this issue?
>
> Best,
> tison.
>
>
> Xiangyu Su  wrote on Fri, 19 Jul 2019 at 20:47:
>
>> btw. it seems like this issue has been fixed in 1.8.1
>>
>> On Fri, 19 Jul 2019 at 12:21, Xiangyu Su  wrote:
>>
>>> Ok, thanks.
>>>
>>> And so far this delay always happens after the 3rd checkpoint, and the
>>> unexpected extra time has always been consistent (~4 min with under
>>> 4G/min of incoming traffic).
>>>
>>> On Fri, 19 Jul 2019 at 11:06, Biao Liu  wrote:
>>>
>>>> Hi Xiangyu,
>>>>
>>>> Just took a glance at the relevant code. There is a gap between
>>>> calculating the duration and logging it out. I guess checkpoint 4
>>>> finished within 1 minute, but there was an unexpected time-consuming
>>>> operation during that time; I can't tell which part it is, though.
>>>>
>>>>
>>>> Xiangyu Su  wrote on Fri, 19 Jul 2019 at 16:14:
>>>>
>>>>> Dear flink community,
>>>>>
>>>>> We are doing a POC with Flink (1.8) to process data in real time, using
>>>>> global checkpointing (S3) and local checkpointing (EBS), with the cluster
>>>>> deployed on EKS. Our application consumes data from Kinesis.
>>>>>
>>>>> For my test I am using a checkpointing interval of 5 min and a minimum
>>>>> pause of 2 min.
>>>>>
>>>>> The issue we saw is: the Flink checkpointing process seems to be idle
>>>>> for 3-4 min before the JobManager gets the completion notification.
>>>>>
>>>>> here is some logging from job manager:
>>>>>
>>>>> 2019-07-10 11:59:03,893 INFO  
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - 
>>>>> Triggering checkpoint 4 @ 1562759941082 for job 
>>>>> e7a97014f5799458f1c656135712813d.
>>>>> 2019-07-10 12:05:01,836 INFO  
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>>>>> checkpoint 4 for job e7a97014f5799458f1c656135712813d (22387207650 bytes 
>>>>> in 58645 ms).
>>>>>
>>>>> As I understand the logging above, the CompletedCheckpoint
>>>>> (CheckpointCoordinator) object was completed in 58645 ms, but the whole
>>>>> checkpointing process took ~6 min.
>>>>>
>>>>> This logging is for the 4th checkpoint, but the first 3 checkpoints
>>>>> finished on time.
>>>>> Could you please tell me why checkpointing in my test sat "idle" for a
>>>>> few minutes after the 3rd checkpoint?
>>>>>
>>>>> Best Regards
>>>>>
>>>>
>>>

Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Xiangyu Su
btw. it seems like this issue has been fixed in 1.8.1

On Fri, 19 Jul 2019 at 12:21, Xiangyu Su  wrote:

> Ok, thanks.
>
> And so far this delay always happens after the 3rd checkpoint, and the
> unexpected extra time has always been consistent (~4 min with under
> 4G/min of incoming traffic).
>
> On Fri, 19 Jul 2019 at 11:06, Biao Liu  wrote:
>
>> Hi Xiangyu,
>>
>> Just took a glance at the relevant code. There is a gap between
>> calculating the duration and logging it out. I guess checkpoint 4
>> finished within 1 minute, but there was an unexpected time-consuming
>> operation during that time; I can't tell which part it is, though.
>>
>>
>> Xiangyu Su  wrote on Fri, 19 Jul 2019 at 16:14:
>>
>>> Dear flink community,
>>>
>>> We are doing a POC with Flink (1.8) to process data in real time, using
>>> global checkpointing (S3) and local checkpointing (EBS), with the cluster
>>> deployed on EKS. Our application consumes data from Kinesis.
>>>
>>> For my test I am using a checkpointing interval of 5 min and a minimum
>>> pause of 2 min.
>>>
>>> The issue we saw is: the Flink checkpointing process seems to be idle
>>> for 3-4 min before the JobManager gets the completion notification.
>>>
>>> here is some logging from job manager:
>>>
>>> 2019-07-10 11:59:03,893 INFO  
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>>> checkpoint 4 @ 1562759941082 for job e7a97014f5799458f1c656135712813d.
>>> 2019-07-10 12:05:01,836 INFO  
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>>> checkpoint 4 for job e7a97014f5799458f1c656135712813d (22387207650 bytes in 
>>> 58645 ms).
>>>
>>> As I understand the logging above, the CompletedCheckpoint
>>> (CheckpointCoordinator) object was completed in 58645 ms, but the whole
>>> checkpointing process took ~6 min.
>>>
>>> This logging is for the 4th checkpoint, but the first 3 checkpoints
>>> finished on time.
>>> Could you please tell me why checkpointing in my test sat "idle" for a
>>> few minutes after the 3rd checkpoint?
>>>
>>> Best Regards
>>>
>>
>
>




Re: apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Xiangyu Su
Ok, thanks.

And so far this delay always happens after the 3rd checkpoint, and the
unexpected extra time has always been consistent (~4 min with under
4G/min of incoming traffic).

On Fri, 19 Jul 2019 at 11:06, Biao Liu  wrote:

> Hi Xiangyu,
>
> Just took a glance at the relevant code. There is a gap between
> calculating the duration and logging it out. I guess checkpoint 4
> finished within 1 minute, but there was an unexpected time-consuming
> operation during that time; I can't tell which part it is, though.
>
>
> Xiangyu Su  wrote on Fri, 19 Jul 2019 at 16:14:
>
>> Dear flink community,
>>
>> We are doing a POC with Flink (1.8) to process data in real time, using
>> global checkpointing (S3) and local checkpointing (EBS), with the cluster
>> deployed on EKS. Our application consumes data from Kinesis.
>>
>> For my test I am using a checkpointing interval of 5 min and a minimum
>> pause of 2 min.
>>
>> The issue we saw is: the Flink checkpointing process seems to be idle
>> for 3-4 min before the JobManager gets the completion notification.
>>
>> here is some logging from job manager:
>>
>> 2019-07-10 11:59:03,893 INFO  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
>> checkpoint 4 @ 1562759941082 for job e7a97014f5799458f1c656135712813d.
>> 2019-07-10 12:05:01,836 INFO  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
>> checkpoint 4 for job e7a97014f5799458f1c656135712813d (22387207650 bytes in 
>> 58645 ms).
>>
>> As I understand the logging above, the CompletedCheckpoint
>> (CheckpointCoordinator) object was completed in 58645 ms, but the whole
>> checkpointing process took ~6 min.
>>
>> This logging is for the 4th checkpoint, but the first 3 checkpoints
>> finished on time.
>> Could you please tell me why checkpointing in my test sat "idle" for a
>> few minutes after the 3rd checkpoint?
>>
>> Best Regards
>>
>



apache flink: Why checkpoint coordinator takes long time to get completion

2019-07-19 Thread Xiangyu Su
Dear flink community,

We are doing a POC with Flink (1.8) to process data in real time, using
global checkpointing (S3) and local checkpointing (EBS), with the cluster
deployed on EKS. Our application consumes data from Kinesis.

For my test I am using a checkpointing interval of 5 min and a minimum
pause of 2 min.
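
(For context, those two settings map roughly onto the following; a sketch
only, not the actual job code.)

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTiming {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5 * 60 * 1000L);                                  // 5 min checkpoint interval
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2 * 60 * 1000L);  // 2 min minimum pause
    }
}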

The issue we saw is: the Flink checkpointing process seems to be idle for
3-4 min before the JobManager gets the completion notification.

here is some logging from job manager:

2019-07-10 11:59:03,893 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator -
Triggering checkpoint 4 @ 1562759941082 for job
e7a97014f5799458f1c656135712813d.
2019-07-10 12:05:01,836 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator -
Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d
(22387207650 bytes in 58645 ms).

As I understand the logging above, the CompletedCheckpoint
(CheckpointCoordinator) object was completed in 58645 ms, but the whole
checkpointing process took ~6 min.
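
(Lining up the two log lines above, and noting that the trigger timestamp
1562759941082 corresponds to roughly 11:59:01 UTC on 2019-07-10:)

   11:59:01.082   checkpoint 4 triggered
 + 58645 ms       measured checkpoint duration
 = ~11:59:59.7    checkpoint data presumably written
   12:05:01.836   "Completed checkpoint 4" logged  ->  ~5 minutes unaccounted for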

This logging is for the 4th checkpoint, but the first 3 checkpoints
finished on time.
Could you please tell me why checkpointing in my test sat "idle" for a few
minutes after the 3rd checkpoint?

Best Regards