Re: Profiling on flink jobs

2023-12-01 Thread Matthias Pohl via user
I missed the Reply All button in my previous message. For the sake of
transparency, here's my previous email sent to the user ML once more:

Hi Oscar,
sorry for the late reply. I didn't notice that you had already posted the
question at the beginning of the month.

I used jmap [1] in the past to get some statistics out and generate *.hprof
files. I haven't looked into creating dump files as documented in [2].

env.java.opts.all will be passed to each Java process that's triggered
within Apache Flink. "dumponexit" (which is used in the documented
parameter list) suggests that the dump file would be created when the JVM
process exits. Without any more detailed investigation into how the Java
Flight Recorder works, I'd assume that a *.jfr file should be created
when killing the JobManager/TaskManager process rather than when cancelling
an individual job. Cancelling the job should only trigger this file creation
if you're using Flink in Application Mode, because terminating the job would
trigger the shutdown of the entire Flink cluster in that case.
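
For reference, here's roughly how I'd pull a dump from a running process
without waiting for the JVM to exit (a sketch only; the PID and file paths
are placeholders, and the JFR commands require a JVM where Flight Recorder
is available):

  # heap dump of a running JobManager/TaskManager via jmap
  jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <taskmanager-pid>

  # start and dump a Flight Recorder recording via jcmd
  jcmd <taskmanager-pid> JFR.start name=profiling settings=profile
  jcmd <taskmanager-pid> JFR.dump name=profiling filename=/tmp/profiling.jfr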

Best,
Matthias

[1]
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr014.html
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/

On Thu, Nov 9, 2023 at 9:39 AM Oscar Perez via user 
wrote:

> hi, I am trying to do profiling on one of our flink jobs
> according to these docs:
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/
> We are using OpenJDK 8.0. I am adding this line to the flink properties file
> in docker-compose:
>
> env.java.opts.all: "-XX:+UnlockCommercialFeatures 
> -XX:+UnlockDiagnosticVMOptions -XX:+FlightRecorder -XX:+DebugNonSafepoints 
> -XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/dump.jfr"
>
> I would expect to see the dump.jfr file created once I cancel the job, but
> unfortunately I don't see anything created. How can I manage to get a valid
> profile file? Thanks!
> Regards,
> Oscar
>


Re: Doubts about state and table API

2023-11-29 Thread Matthias Pohl via user
Hi Oscar,
could you provide the Java code to illustrate what you were doing?
The difference between version A and B might be especially helpful. I
assume you already looked into the FAQ about operator IDs [1]?

Adding the JM and TM logs might help as well to investigate the issue, as
Yu Chen mentioned.
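
As a quick way to check the operator-ID theory (a sketch with placeholder
paths; -n is the shorthand for --allowNonRestoredState):

  # current behaviour: unmatched state is silently dropped
  ./bin/flink run -s /path/to/savepoint-123 -n my-job.jar
  # without the flag, the restore fails fast and names the unmatched operator ID
  ./bin/flink run -s /path/to/savepoint-123 my-job.jar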

Best,
Matthias

On Sun, Nov 26, 2023 at 2:18 PM Yu Chen  wrote:

> Hi Oscar,
>
> The operator IDs of the SQL job were generated by
> `StreamingJobGraphGenerator`; they are related to the topology of the
> stream graph.
> If you would like to confirm whether the problem was caused by a change of
> operator IDs, please remove --allowNonRestoredState, and you will
> get an exception naming the operator ID that failed to restore.
>
> However, the loss of the operator state would only produce some erroneous
> results; it would not result in the job being `not able to return any row`.
> It would be better to provide logs from after the restore to locate a more
> specific problem.
>
> Best,
> Yu Chen
> --
> *From:* Oscar Perez via user 
> *Sent:* November 25, 2023, 0:08
> *To:* Oscar Perez via user 
> *Subject:* Doubts about state and table API
>
> Hi,
>
> We are having a job in production where we use table API to join multiple
> topics. The query looks like this:
>
>
> SELECT *
> FROM topic1 AS t1
> JOIN topic2 AS t2 ON t1.userId = t2.userId
> JOIN topic3 AS t3 ON t1.userId = t3.accountUserId
>
>
> This works and produces an EnrichedActivity any time any of the topics
> receives a new event, which is what we expect. This SQL query is linked to
> a processor function and the processElement gets triggered whenever a new
> EnrichedActivity occurs
>
> We have experienced an issue a couple of times in production where we have
> deployed a new version from savepoint and then suddenly we
> stopped receiving EnrichedActivities in the process function.
>
> Our assumption is that this is related to the Table API state and that
> some operator state is lost when going from one savepoint to the new
> deployment.
>
> Let me illustrate with one example:
>
> version A of the job is deployed
> version B of the job is deployed
>
> In version B, the UID of some Table API operators changes, and these
> operators' state is removed when deploying version B as it is unable to be
> mapped (we have --allowNonRestoredState enabled).
>
> The state for the Table API stores both the committed offsets and the
> contents of the topics, but just the contents are lost while the committed
> offsets remain in the state.
>
> Therefore, when doing the join of the query, it is not able to return any
> row, as it is unable to get data from topic2 or topic3.
>
> Can this be the case?
> We are having a hard time trying to understand how the Table API state
> works internally, so any help in this regard would be truly helpful!
>
> Thanks,
> Oscar
>
>
>


Re: Java 17 as default

2023-11-29 Thread Matthias Pohl via user
The 1.18 Docker images were pushed on Oct 31. This also included Java 17
images [1].

[1] https://hub.docker.com/_/flink/tags?page=1&name=java17
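
For example (a sketch; check the tags page above for the exact tag names):

  docker pull flink:1.18.0-scala_2.12-java17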

On Wed, Nov 15, 2023 at 7:56 AM Tauseef Janvekar 
wrote:

> Dear Team,
>
> I saw in the documentation for 1.18 that Java 17 is not supported and the
> image is created from Java 11. I guess there is a separate docker image for
> Java 17.
> When do we plan to release the main image with Java 17?
>
> Thanks,
> Tauseef
>


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-10-30 Thread Matthias Pohl via user
Thanks for your proposal, Zhanghao Chen. I think it adds more transparency
to the configuration documentation.

+1 from my side on the proposal

On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen 
wrote:

> Hi Flink users and developers,
>
> Currently, Flink won't generate docs for the deprecated options. This might
> confuse users when upgrading from an older version of Flink: they have to
> either carefully read the release notes or check the source code for
> upgrade guidance on deprecated options.
>
> I propose to document deprecated options as well, with a "(deprecated)"
> tag placed at the beginning of the option description to highlight the
> deprecation status [1].
>
> Looking forward to your feedback on it.
>
> [1] https://issues.apache.org/jira/browse/FLINK-33240
>
> Best,
> Zhanghao Chen
>


Re: [ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-27 Thread Matthias Pohl
Congratulations and good luck with pushing the project forward.

On Mon, Mar 27, 2023 at 2:35 PM Jing Ge via user 
wrote:

> Congrats!
>
> Best regards,
> Jing
>
> On Mon, Mar 27, 2023 at 2:32 PM Leonard Xu  wrote:
>
>> Congratulations!
>>
>>
>> Best,
>> Leonard
>>
>> On Mar 27, 2023, at 5:23 PM, Yu Li  wrote:
>>
>> Dear Flinkers,
>>
>>
>>
>> As you may have noticed, we are pleased to announce that Flink Table Store 
>> has joined the Apache Incubator as a separate project called Apache 
>> Paimon(incubating) [1] [2] [3]. The new project still aims at building a 
>> streaming data lake platform for high-speed data ingestion, change data 
>> tracking and efficient real-time analytics, with the vision of supporting a 
>> larger ecosystem and establishing a vibrant and neutral open source 
>> community.
>>
>>
>>
>> We would like to thank everyone for their great support and efforts for the 
>> Flink Table Store project, and warmly welcome everyone to join the 
>> development and activities of the new project. Apache Flink will continue to 
>> be one of the first-class citizens supported by Paimon, and we believe that 
>> the Flink and Paimon communities will maintain close cooperation.
>>
>>
>> Best Regards,
>> Yu (on behalf of the Apache Flink PMC and Apache Paimon PPMC)
>>
>>
>> [1] https://paimon.apache.org/
>> [2] https://github.com/apache/incubator-paimon
>> [3] https://cwiki.apache.org/confluence/display/INCUBATOR/PaimonProposal
>>
>>
>>


Re: Issue with the flink version 1.10.1

2023-03-27 Thread Matthias Pohl via user
Hi Kiran,
it's really hard to come up with an answer based on your description.
Usually, it helps to share some logs with the exact error that's appearing
and a clear description of what you're observing and what you're expecting.
A plain "no jobs are running" is too general to come up with a conclusion.
Sorry.

Additionally, let me state that Flink 1.10 and 1.9 are quite old versions.
The community doesn't support those versions anymore. It might be the case
that you're running into issues that are already fixed in newer versions.
Investigating code from years ago can be quite tedious.

Best,
Matthias

On Mon, Mar 27, 2023 at 2:29 PM Kiran Kumar Kathe <
kirankumarkathe...@gmail.com> wrote:

> When I submit a job using flink version 1.10.1, it is not updating the
> jobs that are running and completed successfully in the Web UI of the YARN
> resource manager. But when I use flink version 1.9.3 it is working fine
> and I am able to see the jobs that are running and completed in the
> YARN resource manager Web UI. To find out why this is happening, I
> tried replacing jars in the application and lib folders: when I use
> the flink_dist jar of version 1.9.3 in place of the flink_dist jar of
> version 1.10.1, it runs fine and I am able to see the jobs running and
> completed. Is this the right way? If not, will I face any compatibility
> issues in the future with this change of the flink_dist jar in the lib
> folder?
>


Re: [ANNOUNCE] Apache Flink 1.17.0 released

2023-03-27 Thread Matthias Pohl via user
Here are a few things I noticed from the 1.17 release retrospectively which
I want to share (other release managers might have a different view or
might disagree):

- Google Meet might not be the best choice for the release sync. We need to
be able to invite attendees even if the creator of the meeting isn't
available (maybe try Zoom or jitsi instead?)

- Release sync every 2 weeks and a switch to weekly after feature freeze
felt reasonable

- Slack worked well as a collaboration tool to document the monitoring
tasks (#builds [1], #flink-dev-benchmarks [2]) in a team with multiple
release managers

- The Slack Azure Pipeline bot seems to be buggy. It swallows some build
failures. It's not a severe issue, though. We created #builds-debug [3] to
monitor whether it's happening consistently. The issue is covered in
FLINK-30733 [4].

- Having dedicated people for monitoring the build failures helps getting a
more consistent picture of test instabilities

- We experienced occasional issues in the manual steps of the release
creation in the past (e.g. japicmp config was not properly pushed).
Creating Jira issues for the release helped to make the release creation
more transparent and made the steps more reviewable [5][6][7][8].
Additionally, it helped to distribute subtasks to different people with
Jira being the tool for documentation and synchronization. That's
especially helpful when there is more than one person in charge of creating
the release.

- Occasionally during the 1.17 release, committers did backports/merges
without PRs, which broke the master/release branches (probably, changes were
done locally before merging that were not part of the PR, to have a faster
backport experience). It might make sense to remind everyone that this
should be avoided. Not sure whether we want/can restrict that.

- We observed a good response on fixing test instabilities by the end of
the release cycle, but had some long-running issues earlier in the cycle
which caused extra effort for the release managers due to recurring test
failures.

- Release testing picked up “slowly”: Initially, we planned 2 weeks for
release testing. But there was not really any progress (tickets being
created and worked on) in the first week. In the end, we had to extend the
phase by another week resulting in 3 instead of 2 weeks of release testing.
I guess we could encourage the community to create release testing tasks
earlier and label them properly to be able to monitor the effort. That
would even enable us to do release testing for a certain feature after the
feature is done and not necessarily only at the end of the release cycle.

- Manual test data generation is tedious (FLINK-31593 [9]). But this should
be fixed in 1.18 with FLINK-27518 [10] being almost done.

- We started creating documentation for release management [11]. The goal
is to collect what tasks are there to help support a Flink release to
encourage newcomers to pick up the task.

I'm going to add these to the Flink 1.17 release documentation [12] as
feedback as well.

Best,
Matthias

[1] https://apache-flink.slack.com/archives/C03MR1HQHK2
[2] https://apache-flink.slack.com/archives/C0471S0DFJ9
[3] https://apache-flink.slack.com/archives/C04LZM3EE9E
[4] https://issues.apache.org/jira/browse/FLINK-30733
[5] https://issues.apache.org/jira/browse/FLINK-31146
[6] https://issues.apache.org/jira/browse/FLINK-31154
[7] https://issues.apache.org/jira/browse/FLINK-31562
[8] https://issues.apache.org/jira/browse/FLINK-31567
[9] https://issues.apache.org/jira/browse/FLINK-31593
[10] https://issues.apache.org/jira/browse/FLINK-27518
[11]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Release+Management
[12] https://cwiki.apache.org/confluence/display/FLINK/1.17+Release

On Sat, Mar 25, 2023 at 8:29 AM Hang Ruan  wrote:

> Thanks for the great work ! Congrats all!
>
> Best,
> Hang
>
> Panagiotis Garefalakis  wrote on Sat, Mar 25, 2023 at 03:22:
>
>> Congrats all! Well done!
>>
>> Cheers,
>> Panagiotis
>>
>> On Fri, Mar 24, 2023 at 2:46 AM Qingsheng Ren  wrote:
>>
>> > I'd like to say thank you to all contributors of Flink 1.17. Your
>> support
>> > and great work together make this giant step forward!
>> >
>> > Also like Matthias mentioned, feel free to leave us any suggestions and
>> > let's improve the releasing procedure together.
>> >
>> > Cheers,
>> > Qingsheng
>> >
>> > On Fri, Mar 24, 2023 at 5:00 PM Etienne Chauchot 
>> > wrote:
>> >
>> >> Congrats to all the people involved!
>> >>
>> >> Best
>> >>
>> >> Etienne
>> >>
>> >> On 23/03/2023 at 10:19, Leonard Xu wrote:
>> >> > The Apache Flink community is very happy to announce the release of
>> >> Apache Flink 1.17.0, which is the first release for the Apache Flink
>> 1.17
>> >> series.
>> >> >
>> >> > Apache Flink® is an open-source unified stream and batch data
>> >> processing framework for distributed, high-performing,
>> always-available,
>> >> and accurate data applications.
>> >> >
>> >> > The release is available 

Re: [ANNOUNCE] Apache Flink 1.17.0 released

2023-03-23 Thread Matthias Pohl
Thanks for getting this release over the finish line.

One additional thing:
Feel free to reach out to the release managers (or respond to this thread)
with feedback on the release process. Our goal is to constantly improve the
release process. Feedback on what could be improved or things that didn't
go so well during the 1.17.0 release cycle is much appreciated.

Best,
Matthias

On Thu, Mar 23, 2023 at 11:02 AM Jing Ge via user 
wrote:

> Excellent work! Congratulations! Appreciate the hard work and
> contributions of everyone in the Apache Flink community who helped make
> this release possible. Looking forward to those new features. Cheers!
>
> Best regards,
> Jing
>
> On Thu, Mar 23, 2023 at 10:24 AM Leonard Xu  wrote:
>
>> The Apache Flink community is very happy to announce the release of Apache 
>> Flink
>> 1.17.0, which is the first release for the Apache Flink 1.17 series.
>>
>> Apache Flink® is an open-source unified stream and batch data processing 
>> framework
>> for distributed, high-performing, always-available, and accurate data
>> applications.
>>
>> The release is available for download at:
>>
>> https://flink.apache.org/downloads.html
>> Please check out the release blog post for an overview of the improvements
>> for this release:
>>
>> https://flink.apache.org/2023/03/23/announcing-the-release-of-apache-flink-1.17/
>> The full release notes are available in Jira:
>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351585
>> We would like to thank all contributors of the Apache Flink community who
>> made this release possible!
>>
>> Best regards,
>> Qingsheng, Martijn, Matthias and Leonard
>>
>


Re: Job Cancellation Failing

2023-02-21 Thread Matthias Pohl via user
I noticed a test instability that sounds quite similar to what you're
experiencing. I created FLINK-31168 [1] to follow up on this one.

[1] https://issues.apache.org/jira/browse/FLINK-31168

On Mon, Feb 20, 2023 at 4:50 PM Matthias Pohl 
wrote:

> What do you mean by "earlier it used to fail due to ExecutionGraphStore
> not existing in /tmp" folder? Did you get the error message "Could not
> create executionGraphStorage directory in /tmp." and creating this folder
> fixed the issue?
>
> It also looks like the stacktrace doesn't match any of the 1.15 versions
> in terms of line numbers. Or I might miss something here. Could you provide
> the exact Flink version you're using?
>
> It might also help to share the JobManager logs to understand the context
> in which the cancel operation was triggered.
>
> Matthias
>
> On Mon, Feb 20, 2023 at 1:53 AM Puneet Duggal 
> wrote:
>
>> Flink Cluster Context:
>>
>>
>>- Flink Version - 1.15
>>- Deployment Mode - Session
>>- Number of Job Managers - 3 (HA)
>>- Number of Task Managers - 1
>>
>>
>> Cancellation of Job fails due to following
>>
>> org.apache.flink.runtime.rest.NotFoundException: Job
>> 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found
>> at
>> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99)
>> at
>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
>> at
>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387)
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45)
>> at akka.dispatch.OnComplete.internal(Future.scala:299)
>> at akka.dispatch.OnComplete.internal(Future.scala:297)
>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224)
>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221)
>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>> at
>> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65)
>> at
>> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>> at
>> scala.co

Re: Job Cancellation Failing

2023-02-20 Thread Matthias Pohl via user
What do you mean by "earlier it used to fail due to ExecutionGraphStore not
existing in /tmp" folder? Did you get the error message "Could not create
executionGraphStorage directory in /tmp." and creating this folder fixed
the issue?

It also looks like the stacktrace doesn't match any of the 1.15 versions in
terms of line numbers. Or I might miss something here. Could you provide
the exact Flink version you're using?

It might also help to share the JobManager logs to understand the context in
which the cancel operation was triggered.
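
For reference, this is the REST call I'd expect the cancellation to go
through (a sketch; host and job ID are placeholders):

  curl -X PATCH "http://<jobmanager-host>:8081/jobs/<job-id>?mode=cancel"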

Matthias

On Mon, Feb 20, 2023 at 1:53 AM Puneet Duggal 
wrote:

> Flink Cluster Context:
>
>
>- Flink Version - 1.15
>- Deployment Mode - Session
>- Number of Job Managers - 3 (HA)
>- Number of Task Managers - 1
>
>
> Cancellation of Job fails due to following
>
> org.apache.flink.runtime.rest.NotFoundException: Job
> 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found
> at
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99)
> at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
> at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45)
> at akka.dispatch.OnComplete.internal(Future.scala:299)
> at akka.dispatch.OnComplete.internal(Future.scala:297)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
> at
> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65)
> at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
> at
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
> at
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
> at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621)
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:118)
> at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:1144)
> at akka.actor.Actor.aroundReceive(Actor.scala:537)
> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:540)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> at
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> at 

Re: Blob server connection problem

2023-01-24 Thread Matthias Pohl via user
We had issues like that in the past (e.g. FLINK-24923 [1], FLINK-10683
[2]). The error you're observing is caused by an unexpected byte being read
from the socket. The BlobServer protocol expects either 0 (for put
messages) or 1 (for get messages) as the header byte for new
message blocks [3].
Reading different values might mean that there is some other process
sending data to the port the BlobServer is listening on. Could you check
your network traffic?
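
One way to watch for such traffic (a sketch; it assumes you pin the
BlobServer port, which is picked randomly by default):

  # flink-conf.yaml: pin the BlobServer port
  blob.server.port: 6124
  # on the JobManager host, list connections to that port
  ss -tnp | grep 6124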

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24923
[2] https://issues.apache.org/jira/browse/FLINK-10683
[3]
https://github.com/apache/flink/blob/ab264e4ab5a3bc6961a5128b1c7e19752508a7ca/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServerConnection.java#L115

On Fri, Jan 20, 2023 at 11:26 PM Yang Liu  wrote:

> Hello,
>
> Is anyone familiar with the "blob server connection"? We have constantly
> been seeing the "Error while executing Blob connection" error, which
> sometimes causes a job to get stuck in the middle of a run if there are too
> many connection errors, eventually causing a failure. Most of the time
> the streaming run mode can recover from that failure in the subsequent
> iterations of runs, but that slows down the entire process. We tried
> adjusting blob.fetch.num-concurrent and some other blob parameters, but
> it was not very helpful, so we want to know what might be the root cause of
> the issue. Are there any Flink metrics or tools to help us monitor the blob
> server connections?
>
> We use:
>
>- Flink Kubernetes Operator
>- Flink 1.15.3 and 1.16.0
>- Kafka, filesystem(S3)
>- Hudi 0.11.1
>
> Full error message:
>
> java.io.IOException: Unknown operation 71
>   at 
> org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:116)
>  [flink-dist-1.15.3.jar:1.15.3]
> 2023-01-19 16:44:37,448 ERROR 
> org.apache.flink.runtime.blob.BlobServerConnection   [] - Error while 
> executing BLOB connection.
>
>
> Best regards,
> Yang
>


Re: How does Flink plugin system work?

2023-01-02 Thread Matthias Pohl via user
Yes, Ruibin confirmed in a private message that using the factory class
works. But thanks for digging into it once more, Yanfei. I failed to
consider in my previous message that the plugin classes are loaded using
their own class loaders, which, indeed, can result in a
ClassNotFoundException being thrown.

Best,
Matthias

On Tue, Jan 3, 2023 at 4:45 AM Yanfei Lei  wrote:

> Hi Ruibin,
>
> "metrics.reporter.prom.class" is deprecated in 1.16, maybe "
> metrics.reporter.prom.factory.class"[1] can solve your problem.
> After reading the related code[2], I think the root cause is that  "
> metrics.reporter.prom.class" would load the code via flink's classpath
> instead of MetricReporterFactory, due to "Plugins cannot access classes
> from other plugins or from Flink that have not been specifically
> whitelisted"[3], so ClassNotFoundException is thrown.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#prometheus
> [2]
> https://github.com/apache/flink/blob/release-1.16/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/ReporterSetup.java#L457
> [3]
> https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/
>
> Matthias Pohl via user  wrote on Mon, Jan 2, 2023 at 20:27:
>
>> Hi Ruibin,
>> could you switch to using the currently supported way for instantiating
>> reporters using the factory configuration parameter [1][2]?
>>
>> Based on the ClassNotFoundException, your suspicion might be right that
>> the plugin didn't make it onto the classpath. Could you share the
>> startup logs of the JM and TMs? That might help to get a bit more context
>> on what's going on. Your approach of integrating the reporter through the
>> plugin system [3] sounds about right as far as I can see.
>>
>> Matthias
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#factory-class
>> [2]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#prometheus
>> [3]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/plugins/
>>
>> On Fri, Dec 30, 2022 at 11:42 AM Ruibin Xing  wrote:
>>
>>> Hi community,
>>>
>>> I am having difficulty understanding the Flink plugin system. I am
>>> attempting to enable the Prometheus exporter with the official Flink image
>>> 1.16.0, but I am experiencing issues with library dependencies. According
>>> to the plugin documentation (
>>> https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/),
>>> as long as the library is located in the /opt/flink/plugins/
>>> directory, Flink should automatically load it, similar to how it loads
>>> libraries in the /opt/flink/lib directory. However, Flink does not seem to
>>> detect the plugin.
>>>
>>> Here is the directory structure for /opt/flink:
>>> > tree /opt/flink
>>> .
>>> 
>>> ├── plugins
>>> │   ├── metrics-prometheus
>>> │   │   └── flink-metrics-prometheus-1.16.0.jar
>>> ...
>>>
>>> And here is the related Flink configuration:
>>> > metrics.reporter.prom.class:
>>> org.apache.flink.metrics.prometheus.PrometheusReporter
>>>
>>> The error logs in the task manager show the following:
>>> 2022-12-30 10:03:55,840 WARN
>>>  org.apache.flink.runtime.metrics.ReporterSetup   [] - The
>>> reporter configuration of 'prom' configures the reporter class, which is a
>>> deprecated approach to configure reporters. Please configure a factory
>>> class instead: 'metrics.reporter.prom.factory.class: ' to
>>> ensure that the configuration continues to work with future versions.
>>> 2022-12-30 10:03:55,841 ERROR
>>> org.apache.flink.runtime.metrics.ReporterSetup   [] - Could not
>>> instantiate metrics reporter prom. Metrics might not be exposed/reported.
>>> java.lang.ClassNotFoundException:
>>> org.apache.flink.metrics.prometheus.PrometheusReporter
>>> at jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
>>> ~[?:?]
>>> at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
>>> Source) ~[?:?]
>>> at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
>>> at java.lang.Class.forName0(Native Method) ~[?:?]
>>> at java.lang.Class.forName(Unknown Source) ~[?:?]
>>> at
>>> org.apache.flink.runtime.metrics.ReporterSetup.loadViaReflection(ReporterSetup.java:456)
>>

Re: The use of zookeeper in flink

2023-01-02 Thread Matthias Pohl via user
And I screwed up the reply again. -.- Here's my previous response for the
ML thread and not only spoon_lz:

Hi spoon_lz,
Thanks for reaching out to the community and sharing your use case. You're
right about the fact that Flink's HA feature relies on the leader election.
The HA backend not being responsive for too long might cause problems. I'm
not sure I fully understand what you mean by the standby JobManagers
struggling with the ZK outage not affecting the running jobs. If ZK is
not responding for the standby JMs, the actual JM leader should be affected
as well, which, as a consequence, would affect the job execution. But I
might misunderstand your post. Logs would be helpful to get a better
understanding of your post's context.

Best,
Matthias

FYI: There is also (a kind of stalled) discussion in the dev ML [1] about
recovery of too many jobs affecting Flink's performance.

[1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj

On Thu, Dec 29, 2022 at 8:55 AM spoon_lz  wrote:

> Hi All,
> We use zookeeper to achieve high availability of jobs. Recently, a failure
> occurred in our flink cluster. Due to the abnormal downtime of the
> zookeeper service, all the flink jobs using this zookeeper went through
> failover. The failover restart of a large number of jobs in a short period
> of time put too much pressure on the cluster, which in turn caused
> the cluster to crash.
> Afterwards, I checked the HA functions of zk:
> 1. Leader election
> 2. Service discovery
> 3. State persistence
>
> The unavailability of the zookeeper service leads to failover of the flink
> jobs. It seems that this is because of the first point: the JM cannot
> confirm whether it is Active or Standby; the other two points should not
> affect it. But we didn't use the Standby JobManager.
> So in my opinion, if the Standby JobManager is not used, whether the zk
> service is available should not affect jobs that are running
> normally (of course, it is understandable that a task cannot be recovered
> correctly if an exception occurs). I don't know if there is a way to
> achieve a similar purpose.
>


Re: How does Flink plugin system work?

2023-01-02 Thread Matthias Pohl via user
Hi Ruibin,
could you switch to using the currently supported way for instantiating
reporters using the factory configuration parameter [1][2]?

Based on the ClassNotFoundException, your suspicion might be right that the
plugin didn't make it onto the classpath. Could you share the startup logs
of the JM and TMs? That might help to get a bit more context on what's
going on. Your approach of integrating the reporter through the plugin
system [3] sounds about right as far as I can see.
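
For completeness, a configuration sketch (the port is just an example, not
something your setup necessarily needs):

  metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
  metrics.reporter.prom.port: 9249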

Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#factory-class
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#prometheus
[3]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/plugins/

On Fri, Dec 30, 2022 at 11:42 AM Ruibin Xing  wrote:

> Hi community,
>
> I am having difficulty understanding the Flink plugin system. I am
> attempting to enable the Prometheus exporter with the official Flink image
> 1.16.0, but I am experiencing issues with library dependencies. According
> to the plugin documentation (
> https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/),
> as long as the library is located in the /opt/flink/plugins/
> directory, Flink should automatically load it, similar to how it loads
> libraries in the /opt/flink/lib directory. However, Flink does not seem to
> detect the plugin.
>
> Here is the directory structure for /opt/flink:
> > tree /opt/flink
> .
> 
> ├── plugins
> │   ├── metrics-prometheus
> │   │   └── flink-metrics-prometheus-1.16.0.jar
> ...
>
> And here is the related Flink configuration:
> > metrics.reporter.prom.class:
> org.apache.flink.metrics.prometheus.PrometheusReporter
>
> The error logs in the task manager show the following:
> 2022-12-30 10:03:55,840 WARN
>  org.apache.flink.runtime.metrics.ReporterSetup   [] - The
> reporter configuration of 'prom' configures the reporter class, which is a
> deprecated approach to configure reporters. Please configure a factory
> class instead: 'metrics.reporter.prom.factory.class: ' to
> ensure that the configuration continues to work with future versions.
> 2022-12-30 10:03:55,841 ERROR
> org.apache.flink.runtime.metrics.ReporterSetup   [] - Could not
> instantiate metrics reporter prom. Metrics might not be exposed/reported.
> java.lang.ClassNotFoundException:
> org.apache.flink.metrics.prometheus.PrometheusReporter
> at jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source) ~[?:?]
> at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
> Source) ~[?:?]
> at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
> at java.lang.Class.forName0(Native Method) ~[?:?]
> at java.lang.Class.forName(Unknown Source) ~[?:?]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.loadViaReflection(ReporterSetup.java:456)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.loadReporter(ReporterSetup.java:409)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.setupReporters(ReporterSetup.java:328)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.fromConfiguration(ReporterSetup.java:209)
> ~[flink-dist-1.16.0.jar:1.16.0]
>
> The Java commands for Flink process:
> flink  1  3.0  4.6 2168308 765936 ?  Ssl  10:03   1:08
> /opt/java/openjdk/bin/java -XX:+UseG1GC -Xmx697932173 -Xms697932173
> -XX:MaxDirectMemorySize=300647712 -XX:MaxMetaspaceSize=268435456
> -Dlog.file=/opt/flink/log/flink--kubernetes-taskmanager-0-checkpoint-ha-example-taskmanager-1-1.log
> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
> -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
> -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
> -classpath
> /opt/flink/lib/flink-cep-1.16.0.jar:/opt/flink/lib/flink-connector-files-1.16.0.jar:/opt/flink/lib/flink-csv-1.16.0.jar:/opt/flink/lib/flink-json-1.16.0.jar:/opt/flink/lib/flink-scala_2.12-1.16.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.4.1-10.0.jar:/opt/flink/lib/flink-shaded-zookeeper-3.5.9.jar:/opt/flink/lib/flink-table-api-java-uber-1.16.0.jar:/opt/flink/lib/flink-table-planner-loader-1.16.0.jar:/opt/flink/lib/flink-table-runtime-1.16.0.jar:/opt/flink/lib/log4j-1.2-api-2.17.1.jar:/opt/flink/lib/log4j-api-2.17.1.jar:/opt/flink/lib/log4j-core-2.17.1.jar:/opt/flink/lib/log4j-slf4j-impl-2.17.1.jar:/opt/flink/lib/flink-dist-1.16.0.jar
> org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner
> --configDir /opt/flink/conf -Djobmanager.rpc.address=172.17.0.7
> -Dpipeline.classpaths= -Djobmanager.memory.off-heap.size=134217728b
> -Dweb.tmpdir=/tmp/flink-web-57b9e638-f313-4389-a75b-988509697edb
> -Djobmanager.rpc.port=6123
> -D.pipeline.job-id=a6f1c9fb
> -Drest.address=172.17.0.7 

Re: Cleanup for high-availability.storageDir

2022-12-08 Thread Matthias Pohl via user
Yes, the wrong button was pushed when replying last time. -.-

Looking into the code once again [1], you're right. It looks like for
"last-state", no job is cancelled but the cluster deployment is just
deleted. I was assuming that the artifacts the documentation about the
JobResultStore resource leak [2] is referring to are the
JobResultStoreEntry files rather than other artifacts (e.g. jobgraphs). But
yeah, if we only delete the deployment, no Flink-internal cleanup is done.

I'm wondering what the reasoning behind that is.

[1]
https://github.com/apache/flink-kubernetes-operator/blob/ea01e294cf1b68d597244d0a11b3c81822a163e7/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L336
[2]
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/#jobresultstore-resource-leak

On Thu, Dec 8, 2022 at 11:04 AM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi Matthias,
>
> I think you didn't include the mailing list in your response.
>
> According to my experiments, using last-state means the operator simply
> deletes the Flink pods, and I believe that doesn't count as Cancelled, so
> the artifacts for blobs and submitted job graphs are not cleaned up. I
> imagine the same logic Gyula mentioned before applies, namely keep the
> latest one and clean the older ones.
>
> Regards,
> Alexis.
>
> Am Do., 8. Dez. 2022 um 10:37 Uhr schrieb Matthias Pohl <
> matthias.p...@aiven.io>:
>
>> I see, I confused the Flink-internal recovery with what the Flink
>> Kubernetes Operator does for redeploying the Flink job. AFAIU, when you do
>> an upgrade of your job, the operator will cancel the Flink job (I'm
>> assuming now that you use Flink's Application mode rather than Session
>> mode). The operator cancels your job and shuts down the cluster.
>> Checkpoints are retained and, therefore, can be used as the so-called "last
>> state" when redeploying your job using the same Job ID. In that case, the
>> corresponding jobGraph and other BLOBs should be cleaned up by Flink
>> itself. The checkpoint files are retained, i.e. survive the Flink cluster
>> shutdown.
>>
>> When redeploying the Flink cluster with the (updated) job, a new JobGraph
>> file is created by Flink internally. BLOBs are recreated as well. New
>> checkpoints are going to be created and old ones (that are not needed
>> anymore) are cleaned up.
>>
>> Just to recap what I said before (to make it more explicit to
>> differentiate what the k8s operator does and what Flink does internally):
>> Removing the artifacts you were talking about in your previous post would
>> harm Flink's internal recovery mechanism. That's probably not what you want.
>>
>> @Gyula: Please correct me if I misunderstood something here.
>>
>> I hope that helped.
>> Matthias
>>
>> On Wed, Dec 7, 2022 at 4:19 PM Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com> wrote:
>>
>>> I see, thanks for the details.
>>>
>>> I do mean replacing the job without stopping it terminally.
>>> Specifically, I mean updating the container image with one that contains
>>> an updated job jar. Naturally, the new version must not break state
>>> compatibility, but as long as that is fulfilled, the job should be able to
>>> use the last checkpoint as starting point. It's my understanding that this
>>> is how the Kubernetes operator's "last-state" upgrade mode works [1].
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>
>>> Regards,
>>> Alexis.
>>>
>>> Am Mi., 7. Dez. 2022 um 15:54 Uhr schrieb Matthias Pohl <
>>> matthias.p...@aiven.io>:
>>>
>>>> > - job_name/submittedJobGraphX
>>>> submittedJobGraph* is the persisted JobGraph that would be picked up in
>>>> case of a failover. Deleting this file would result in Flink's failure
>>>> recovery not working properly anymore if the job is still executed but
>>>> needs to be restarted because the actual job definition is gone.
>>>>
>>>> > completedCheckpointXYZ
>>>> This is the persisted CompletedCheckpoint with a reference to the
>>>> actual Checkpoint directory. Deleting this file would be problematic if the
>>>> state recovery relies in some way on this specific checkpoint. The HA data
>>>> relies on this file to be present. Failover only fails if there's no newer
>>

Re: How's JobManager bring up TaskManager in Application Mode or Session Mode?

2022-11-28 Thread Matthias Pohl via user
Hi Mark,
the JobManager is not necessarily in charge of spinning up TaskManager
instances. It depends on the resource provider configuration you choose.
Flink differentiates between active and passive Resource Management (see
the two available implementations of ResourceManager [1]).

Active Resource Management actually takes care of spinning up new
TaskManager instances if needed (i.e. Flink runs out of free task slots).
This is handled by the corresponding AbstractResourceManagerDriver
implementations [2].

In contrast, passive Resource Management (i.e. through the standalone
resource provider configurations [3]) doesn't do anything like that. Here,
Flink works with the TaskManagers that were instantiated by an external
process. Each TaskManager instance registers itself to the JobManager that
is specified in the Flink configuration which is provided to the
corresponding TaskManager instance.
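
As a minimal sketch of the passive/standalone case (the host is a
placeholder): each TaskManager process is pointed at the JobManager via its
flink-conf.yaml and registers itself on startup; nothing in Flink spawns the
process:

  jobmanager.rpc.address: <jobmanager-host>
  jobmanager.rpc.port: 6123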

I hope that helps. For future posts, please solely use the user mailing
list for questions around understanding Flink or troubleshooting. The dev
mailing list is reserved for development-related questions [4].

Matthias

[1]
https://github.com/apache/flink/blob/55a8d1a76067204e00839f1b6a2c09965434eaa4/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L117
[2]
https://github.com/apache/flink/blob/9815caad271a561640ffe0df7193c04270d53a25/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/AbstractResourceManagerDriver.java#L33
[3]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/resource-providers/standalone/overview/
[4] https://flink.apache.org/community.html#mailing-lists

On Tue, Nov 29, 2022 at 5:23 AM 李  wrote:

> Hi,
>
>    How does the JobManager bring up TaskManagers in Application Mode or
> Session Mode? I can't figure it out even after reading the source code of
> the flink operator.
>
> Any help will be appreciated. Thank you.
>
>  Mark
>
>
>


Re: [Security] - Critical OpenSSL Vulnerability

2022-11-01 Thread Matthias Pohl via user
The Docker image for Flink 1.12.7 uses an older base image which comes with
openssl 1.1.1k. There was a previous post in the OpenSSL mailing list
reporting a low vulnerability being fixed with 3.0.6 and 1.1.1r (both
versions being explicitly mentioned) [1]. Therefore, I understand the post
to mean that only 3.0.x would be affected and, as a consequence, Docker
images for 1.13 and below would be fine.

I verified Mason's finding that only 1.14+ Docker images are affected. No
entire release is necessary as far as I understand. Theoretically, we would
only have to push newer Docker images to the registry. I'm not sure what
the right approach is when it comes to versioning. I'm curious about
Chesnay's opinion on that one (CC'd).
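
To check what a given image actually ships, something like this should work
on the Debian-based images (a sketch):

  docker run --rm flink:1.16.0 sh -c 'dpkg -l | grep -i openssl'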

[1]
https://mta.openssl.org/pipermail/openssl-announce/2022-October/000233.html

On Tue, Nov 1, 2022 at 7:06 AM Prasanna kumar 
wrote:

> Could we also get an emergency patch for the 1.12 version as well?
> Upgrading Flink to a newer version in production on short notice would be
> high in effort and long in duration.
>
> Thanks,
> Prasanna
>
> On Tue, Nov 1, 2022 at 11:30 AM Prasanna kumar <
> prasannakumarram...@gmail.com> wrote:
>
>> Is Flink version 1.12 also affected?
>>
>> Thanks,
>> Prasanna.
>>
>> On Tue, Nov 1, 2022 at 10:40 AM Mason Chen 
>> wrote:
>>
>>> Hi Tamir and Martjin,
>>>
>>> We have also noticed this internally. So far, we have found that the
>>> *latest* Flink Java 11/Scala 2.12 docker images *1.14, 1.15, and 1.16*
>>> are affected, which all have the *openssl 3.0.2 *dependency. It would
>>> be good to discuss an emergency release when this patch comes out
>>> tomorrow, as it is the highest priority level from their severity rating.
>>>
>>> Best,
>>> Mason
>>>
>>> On Mon, Oct 31, 2022 at 1:10 PM Martijn Visser 
>>> wrote:
>>>
 Hi Tamir,

 That depends on a) if Flink is vulnerable and b) if yes, how vulnerable
 that would be.

 Best regards,

 Martijn

 Op ma 31 okt. 2022 om 19:22 schreef Tamir Sagi <
 tamir.s...@niceactimize.com>

> Hey all,
>
> Following that link
> https://eu01.z.antigena.com/l/CjXA7qEmnn79gc24BA2Hb6K2OVR-yGlLfMyp4smo5aXj5Z6WC0dSiHCRPqjSz972DkRNssUoTbxKmp5Pi3IaaVB983yfLJ9MUZY9LYtnBMEKJP5DcQqmhR3SktltkbVG8b7nSRa84kWSnwNJFuXFLA2GrMLTVG7mXdy59-ykolsAWAVAJSDgRdWCv6xN0iczvQ
>
>
> due to critical vulnerability , there will be an important release of
> OpenSSl v3.0.7 tomorrow November 1st.
>
> Is there any plan to update Flink with the newest version?
>
> Thanks.
> Tamir
>
>
 --
 Martijn
 https://twitter.com/MartijnVisser82
 https://github.com/MartijnVisser

>>>


Re: Watermark generating mechanism in Flink SQL

2022-10-17 Thread Matthias Pohl via user
Hi Hunk,
there is documentation about watermarking in FlinkSQL [1]. There is also a
FlinkSQL cookbook entry about watermarking [2]. Essentially, you define the
watermark strategy in your CREATE TABLE statement and specify the lateness
for a given event (not the period in which watermarks are automatically
generated!). You have to apply the `WATERMARK FOR` phrase on a column that
is declared as a time attribute [3]. Watermarks are based on event time,
i.e. based on an event being processed that provides the event time. Your
idea of generating them "every 5 seconds" does not work out of the box
because a watermark wouldn't be generated if the source idles for more than
5 seconds (in case of your specific example). Sending periodic dummy events
extrapolating the current event time would be a way to work around this
issue. Keep in mind that mixing processing time (what you would do if you
create a watermark based on the system's current time rather than relying
on events) and event time is usually not advised. I hope that helps.
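
To illustrate, a minimal sketch of such a declaration (table and field names
are made up; the 5 seconds are the allowed out-of-orderness, not an emission
period):

-- event-time attribute plus a bounded-out-of-orderness watermark
CREATE TABLE orders (
    order_id   STRING,
    order_time TIMESTAMP(3),
    -- events may arrive up to 5 seconds out of order
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'datagen'
);

As far as I know, the cadence at which periodic watermarks are emitted is
governed by pipeline.auto-watermark-interval (200 ms by default), but the
emitted value still only advances with incoming events.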

Best,
Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/sql/create/#watermark
[2]
https://github.com/ververica/flink-sql-cookbook/blob/main/aggregations-and-analytics/02_watermarks/02_watermarks.md
[3]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/concepts/time_attributes/#event-time

On Tue, Oct 18, 2022 at 5:32 AM wang <24248...@163.com> wrote:

> Hi dear engineers,
>
> I have one question about the watermark generating mechanism in Flink SQL.
> There are two mechanisms, called Periodic Watermarks and Punctuated
> Watermarks. I want to use Periodic Watermarks with a 5 second interval
> (meaning watermarks will be generated every 5 seconds); how should I set
> this in Flink SQL? thanks in advance!
>
> Regards,
> Hunk
>


Re: jobmaster's fatal error will kill the session cluster

2022-10-17 Thread Matthias Pohl via user
; ~[?:?]
> at
> org.apache.hudi.sink.StreamWriteOperatorCoordinator.start(StreamWriteOperatorCoordinator.java:179)
> ~[?:?]
> at
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:194)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:164)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:82)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:624)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:1010)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:927)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:388)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.lambda$start$0(AkkaRpcActor.java:612)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:611)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:185)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[?:?]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) ~[?:?]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> ~[flink-scala_2.12-1.15.0.jar:1.15.0]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> ~[flink-scala_2.12-1.15.0.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
> ~[?:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[flink-scala_2.12-1.15.0.jar:1.15.0]
>
> I’m not sure whether it’s proper to kill the cluster just because of using
> a wrong job configuration (setting a relative path).
>
>
> 2022年10月14日 19:53,Matthias Pohl via user  写道:
>
> Hi Jie Han,
> welcome to the community. Just a little side note: These kinds of
> questions are more suitable to be asked in the user mailing list. The dev
> mailing list is rather used for discussing feature development or
> project-related topics. See [1] for further details.
>
> About your question: The stacktrace you're providing indicates that
> something went wrong while initiating the job execution. Unfortunately, the
> actual reason is not clear because that's not included in your stacktrace
> (it should be listed as a cause for the JobMasterException in your logs).
> You're right in assuming that Flink is able to handle certain kinds of user
> code and infrastructure-related errors by restarting the job. But there
> might be other Flink cluster internal errors that could cause a Flink
> cluster shutdown. It's hard to tell from the logs you provided. Usually,
> it's a good habit to share a reasonable amount of logs to make
> investigating the issue easier right away.
>
> Let's move the discussion into the user mailing list in case you have
> further questions.
>
> Best,
> Matthias
>
> [1] https://flink.apache.org/community.html#mailing-lists
>
> On Fri, Oct 14, 2022 at 10:13 AM Jie Han  wrote:
>
>> Hi, guys, I’m new to apache flink. It’s exciting to join the community!
>>
>> When I tried out flink 1.15.0, I ran into some confusing problems; here is
>> the streamlined log:
>>
>> org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Could not
>> start RpcEndpoint jobmanager_2.
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:617)
>> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:185)
>> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[?:?]

Re: Sometimes checkpoints to s3 fail

2022-10-14 Thread Matthias Pohl via user
Hi Evgeniy,
is it Ceph which you're using as an S3 server? All the Google search entries
point to Ceph when looking for the error message. Could it be that there's
a problem with the version of the underlying system? The stacktrace you
provided suggests that Flink struggles to close the file and, therefore, fails
to create the checkpoint.

Best,
Matthias

On Thu, Oct 6, 2022 at 11:25 AM Evgeniy Lyutikov 
wrote:

> Hello all.
> I can't pin down this intermittent problem: sometimes checkpoints stop
> completing, sometimes they only complete every other attempt.
> Flink 1.14.4 in kubernetes application mode.
>
>
> 2022-10-06 09:08:04,731 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] -
> Triggering checkpoint 18314 (type=CHECKPOINT) @ 1665047284716 for job
> .
> 2022-10-06 09:11:29,130 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Decline
> checkpoint 18314 by task 048169f0e3c2efd473d3cef9c9d2cd70 of job
>  at job-name-taskmanager-3-1 @ 10.109.0.168
> (dataPort=43795).
> org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint
> failed.
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:301)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: Could not
> materialize checkpoint 18314 for operator Process rec last clicks -> Cast
> rec last clicks type (30/44)#0.
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException:
> Could not flush to file and close the file system output stream to
> s3p://flink-checkpoints/k8s-checkpoint-job-name//shared/7c09fcf1-49b9-4b72-b756-81cd7778e396
> in order to obtain the stream state handle
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> ~[?:?]
> at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
> at
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:645)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:177)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: Could not flush to
> file and close the file system output stream to
> s3p://flink-checkpoints/k8s-checkpoint-job-name//shared/7c09fcf1-49b9-4b72-b756-81cd7778e396
> in order to obtain the stream state handle
> at
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:373)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:143)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:101)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> ~[?:?]
> ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable:
> com.amazonaws.services.s3.model.AmazonS3Exception: This multipart
> completion is already in progress (Service: Amazon S3; Status Code: 500;
> Error Code: InternalError; Request ID:
> tx000ced9f8-00633e9bc1-18489a52-default; S3 Extended Request
> ID: 18489a52-default-default; Proxy: null), S3 Extended Request ID:
> 18489a52-default-default
> at
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3OutputStream.uploadObject(PrestoS3FileSystem.java:1278)
> ~[?:?]
> at
> 

Re: jobmaster's fatal error will kill the session cluster

2022-10-14 Thread Matthias Pohl via user
Hi Jie Han,
welcome to the community. Just a little side note: These kinds of questions
are more suitable to be asked in the user mailing list. The dev mailing
list is rather used for discussing feature development or project-related
topics. See [1] for further details.

About your question: The stacktrace you're providing indicates that
something went wrong while initiating the job execution. Unfortunately, the
actual reason is not clear because that's not included in your stacktrace
(it should be listed as a cause for the JobMasterException in your logs).
You're right in assuming that Flink is able to handle certain kinds of user
code and infrastructure-related errors by restarting the job. But there
might be other Flink cluster internal errors that could cause a Flink
cluster shutdown. It's hard to tell from the logs you provided. Usually,
it's a good habit to share a reasonable amount of logs to make
investigating the issue easier right away.

Let's move the discussion into the user mailing list in case you have
further questions.

Best,
Matthias

[1] https://flink.apache.org/community.html#mailing-lists

On Fri, Oct 14, 2022 at 10:13 AM Jie Han  wrote:

> Hi, guys, I’m new to apache flink. It’s exciting to join the community!
>
> When I tried out flink 1.15.0, I ran into some confusing problems; here is
> the streamlined log:
>
> org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Could not
> start RpcEndpoint jobmanager_2.
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:617)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:185)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.Actor.aroundReceive(Actor.scala:537)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
> [flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
> [flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
> [flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> [?:1.8.0_301]
> at
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1067)
> [?:1.8.0_301]
> at
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1703)
> [?:1.8.0_301]
> at
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:172)
> [?:1.8.0_301]
> Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: Could
> not start the JobMaster.
> at
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:390)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.lambda$start$0(AkkaRpcActor.java:612)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> 

Re: Cancel a job in status INITIALIZING

2022-09-26 Thread Matthias Pohl via user
Can you provide the JobManager logs for this case? It sounds odd that the
job was stuck in the INITIALIZING phase.
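
For reference, the usual cancellation paths are sketched below (host and job
id are placeholders); if both time out, as in your case, the logs are needed
to say more:

# via the CLI
./bin/flink cancel <job-id>

# or via the REST API
curl -X PATCH "http://<jobmanager-host>:8081/jobs/<job-id>?mode=cancel"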

Matthias

On Wed, Sep 21, 2022 at 11:50 AM Christian Lorenz via user <
user@flink.apache.org> wrote:

> Hi,
>
>
>
> we’re running a Flink Cluster in standalone/session mode. During a restart
> of a jobmanager one job was stuck in status INITIALIZING.
>
> When trying to cancel the job via CLI the command failed with a
> java.util.concurrent.TimeoutException.
>
> The only way to get rid of this job for us was to stop the jobmanagers and
> delete the zookeeper root node.
>
> Is there a better way of handling this issue, as this seems very
> unclean to me?
>
>
>
> Kind regards,
>
> Christian
>


Re: Jobmanager fails to come up if the job has an issue

2022-09-26 Thread Matthias Pohl via user
Yes, the JobManager will failover in HA mode and all jobs would be
recovered.

On Mon, Sep 26, 2022 at 2:06 PM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> Thanks @Matthias Pohl  . This is informative.  So
> generally in a session cluster if I have more than one job and only one of
> them has this issue, will we still face the same problem?
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 4:32 PM Matthias Pohl 
> wrote:
>
>> I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
>> order for the job to not be cleaned up entirely after a failure while
>> submitting the job, the JobManager is failed fatally resulting in a
>> failover. That's what you're experiencing.
>>
>> One solution is to fix the permission issue to make the job recover
>> without problems. If that's not what you want to do, you could delete the
>> entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
>> JobGraphStore ConfigMap (based on your logs it should
>> be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
>> prevent the JobManager from recovering this specific job. Keep in mind that
>> you have to clean up any job-related data by yourself in that case.
>>
>> I hope that helps.
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9097
>>
>> On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
>> ramvasu.fl...@gmail.com> wrote:
>>
>>> I got some logs and stack traces from our backend storage. This is not
>>> the entire log though. Can this be useful?  With these set of logs messages
>>> the job manager kept restarting.
>>>
>>> Regards
>>> Ram
>>>
>>> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
>>>> Thank you very much for the reply. I have lost the k8s cluster in this
>>>> case before I could capture the logs. I will try to repro this and get back
>>>> to you.
>>>>
>>>> Regards
>>>> Ram
>>>>
>>>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl 
>>>> wrote:
>>>>
>>>>> Hi Ramkrishna,
>>>>> thanks for reaching out to the Flink community. Could you share the
>>>>> JobManager logs to get a better understanding of what's going on? I'm
>>>>> wondering why the JobManager is failing when the actual problem is that 
>>>>> the
>>>>> job is struggling to access a folder. It sounds like there are multiple
>>>>> problems here.
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>>>> ramvasu.fl...@gmail.com> wrote:
>>>>>
>>>>>> Hi all
>>>>>>
>>>>>> I have a simple job where we watch a given path in cloud storage
>>>>>> for new files in a given folder. While I set up my job there was
>>>>>> some permission issue on the folder. The job is a STREAMING job.
>>>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>>>> The job manager has since been failing to come back up, and every time
>>>>>> it fails with the permission issue. But the point is: how should I recover
>>>>>> my cluster in this case? Since the JM is not there, the UI is also not
>>>>>> working, and how do I remove the bad job from the JM?
>>>>>>
>>>>>> Regards
>>>>>> Ram
>>>>>>
>>>>>


Re: JobManager restarts on job failure

2022-09-26 Thread Matthias Pohl via user
That's a good point. I forgot about these options. You're right. Cleanup
wouldn't be done in that case. So, upgrading would be a viable option as
you suggested.
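
For reference, a sketch of the corresponding flink-conf.yaml keys in 1.15
(taken from DeploymentOptions; please double-check them against your
version):

# keep the cluster entrypoint alive after the application finishes
execution.shutdown-on-application-finish: false
# surface an application error as a FAILED job instead of failing the
# entrypoint
execution.submit-failed-job-on-application-error: true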

Matthias

On Mon, Sep 26, 2022 at 12:53 PM Gyula Fóra  wrote:

> Maybe it is a stupid question but in Flink 1.15 with the following configs
> enabled:
>
> SHUTDOWN_ON_APPLICATION_FINISH = false
> SUBMIT_FAILED_JOB_ON_APPLICATION_ERROR = true
>
> I think jobmanager pod would not restart but simply go to a terminal
> failed state right?
>
> Gyula
>
> On Mon, Sep 26, 2022 at 12:31 PM Matthias Pohl 
> wrote:
>
>> Thanks Evgeniy for reaching out to the community and Gyula for picking it
>> up. I haven't looked into the k8s operator in much detail, yet. So, help me
>> out if I miss something here. But I'm afraid that this is not something
>> that would be fixed by upgrading to 1.15.
>> The issue here is that we're recovering from an external checkpoint using
>> the same job ID (the default one used for any Flink cluster in Application
>> Mode) and the same cluster ID, if I understand correctly. Now, the job is
>> failing during initialization. Currently, this causes a global cleanup [1].
>> All HA data including the checkpoints are going to be deleted. I created
>> FLINK-29415 [2] to cover this.
>>
>> I'm wondering whether we could work around this problem by specifying a
>> random job ID through PipelineOptionsInternal [3] in the Kubernetes
>> Operator. But I haven't looked into all the consequences around that. And
>> it feels wrong to make this configuration parameter publicly usable.
>>
>> Another option might be to use ExecutionMode.RECOVERY in case of an
>> initialization failure when recovering from an external Checkpoint in
>> Application Mode (like we do it for internal recovery already).
>>
>> I'm looking forward to your opinion.
>> Matthias
>>
>> [1]
>> https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L663
>> [2] https://issues.apache.org/jira/browse/FLINK-29415
>> [3]
>> https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/configuration/PipelineOptionsInternal.java#L29
>>
>> On Tue, Sep 20, 2022 at 3:45 PM Gyula Fóra  wrote:
>>
>>> I see I think we have seen this issue with others before, in Flink 1.15
>>> it is solved by the newly introduced JobResultStore. The operator also
>>> configures that automatically for 1.15 to avoid this.
>>>
>>> Gyula
>>>
>>> On Tue, Sep 20, 2022 at 3:27 PM Evgeniy Lyutikov 
>>> wrote:
>>>
>>>> Thanks for the answer.
>>>> I think this is not about the operator issue, kubernetes deployment
>>>> just restarts the fallen pod, restarted jobmanager without HA metadata
>>>> starts the job itself from an empty state.
>>>>
>>>> I'm looking for a way to prevent it from exiting in case of an
>>>> job error (we use application mode cluster).
>>>>
>>>>
>>>>
>>>> --
>>>> *From:* Gyula Fóra 
>>>> *Sent:* September 20, 2022, 19:49:37
>>>> *To:* Evgeniy Lyutikov
>>>> *Cc:* user@flink.apache.org
>>>> *Subject:* Re: JobManager restarts on job failure
>>>>
>>>> The best thing for you to do would be to upgrade to Flink 1.15 and the
>>>> latest operator version.
>>>> In Flink 1.15 we have the option to interact with the Flink jobmanager
>>>> even after the job FAILED and the operator leverages this for a much more
>>>> robust behaviour.
>>>>
>>>> In any case the operator should not ever start the job from an empty
>>>> state (even if it FAILED), if you think that's happening could you please
>>>> open a JIRA ticket with the accompanying JM and Operator logs?
>>>>
>>>> Thanks
>>>> Gyula
>>>>
>>>> On Tue, Sep 20, 2022 at 1:00 PM Evgeniy Lyutikov 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> We using flink 1.14.4 with flink kubernetes operator.
>>>>>
>>>>> Sometimes when updating a job, it fails on startup and flink removes
>>>>> all HA metadata and exits the jobmanager.
>>>>>
>>>>>
>>>>> 2022-09-14 14:54:44,534 INFO
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - 
>>>>> Rest

Re: Jobmanager fails to come up if the job has an issue

2022-09-26 Thread Matthias Pohl via user
I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
order for the job to not be cleaned up entirely after a failure while
submitting the job, the JobManager is failed fatally resulting in a
failover. That's what you're experiencing.

One solution is to fix the permission issue to make the job recover without
problems. If that's not what you want to do, you could delete the entry
with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
JobGraphStore ConfigMap (based on your logs it should
be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
prevent the JobManager from recovering this specific job. Keep in mind that
you have to clean up any job-related data by yourself in that case.
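
As a sketch (untested; the ConfigMap name and key are taken from your logs,
so verify them against your cluster first), the entry could be removed like
this:

kubectl patch configmap flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map \
  --type=json \
  -p='[{"op": "remove", "path": "/data/jobGraph-04ae99777ee2ed34c13fe8120e68436e"}]'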

I hope that helps.
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-9097

On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> I got some logs and stack traces from our backend storage. This is not the
> entire log though. Can this be useful?  With these set of logs messages the
> job manager kept restarting.
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
> ramvasu.fl...@gmail.com> wrote:
>
>> Thank you very much for the reply. I have lost the k8s cluster in this
>> case before I could capture the logs. I will try to repro this and get back
>> to you.
>>
>> Regards
>> Ram
>>
>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl 
>> wrote:
>>
>>> Hi Ramkrishna,
>>> thanks for reaching out to the Flink community. Could you share the
>>> JobManager logs to get a better understanding of what's going on? I'm
>>> wondering why the JobManager is failing when the actual problem is that the
>>> job is struggling to access a folder. It sounds like there are multiple
>>> problems here.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> I have a simple job where we watch a given path in cloud storage for
>>>> new files in a given folder. While I set up my job there was some
>>>> permission issue on the folder. The job is a STREAMING job.
>>>> The cluster is set up in session mode and is running on Kubernetes.
>>>> The job manager has since been failing to come back up, and every time it
>>>> fails with the permission issue. But the point is: how should I recover my
>>>> cluster in this case? Since the JM is not there, the UI is also not working,
>>>> and how do I remove the bad job from the JM?
>>>>
>>>> Regards
>>>> Ram
>>>>
>>>


Re: JobManager restarts on job failure

2022-09-26 Thread Matthias Pohl via user
Thanks Evgeniy for reaching out to the community and Gyula for picking it
up. I haven't looked into the k8s operator in much detail, yet. So, help me
out if I miss something here. But I'm afraid that this is not something
that would be fixed by upgrading to 1.15.
The issue here is that we're recovering from an external checkpoint using
the same job ID (the default one used for any Flink cluster in Application
Mode) and the same cluster ID, if I understand correctly. Now, the job is
failing during initialization. Currently, this causes a global cleanup [1].
All HA data including the checkpoints are going to be deleted. I created
FLINK-29415 [2] to cover this.

I'm wondering whether we could work around this problem by specifying a
random job ID through PipelineOptionsInternal [3] in the Kubernetes
Operator. But I haven't looked into all the consequences around that. And
it feels wrong to make this configuration parameter publicly usable.
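
For illustration only, the internal option would presumably be set like this
in the configuration (it is intentionally undocumented, and the value below
is a made-up 32-character hex JobID):

$internal.pipeline.job-id: 00000000000000000000000000000001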

Another option might be to use ExecutionMode.RECOVERY in case of an
initialization failure when recovering from an external Checkpoint in
Application Mode (like we do it for internal recovery already).

I'm looking forward to your opinion.
Matthias

[1]
https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L663
[2] https://issues.apache.org/jira/browse/FLINK-29415
[3]
https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/configuration/PipelineOptionsInternal.java#L29

On Tue, Sep 20, 2022 at 3:45 PM Gyula Fóra  wrote:

> I see I think we have seen this issue with others before, in Flink 1.15 it
> is solved by the newly introduced JobResultStore. The operator also
> configures that automatically for 1.15 to avoid this.
>
> Gyula
>
> On Tue, Sep 20, 2022 at 3:27 PM Evgeniy Lyutikov 
> wrote:
>
>> Thanks for the answer.
>> I think this is not about an operator issue; the kubernetes deployment just
>> restarts the failed pod, and the restarted jobmanager, without HA metadata,
>> starts the job from an empty state.
>>
>> I'm looking for a way to prevent it from exiting in case of a job error
>> (we use an application mode cluster).
>>
>>
>>
>> --
>> *From:* Gyula Fóra 
>> *Sent:* September 20, 2022, 19:49:37
>> *To:* Evgeniy Lyutikov
>> *Cc:* user@flink.apache.org
>> *Subject:* Re: JobManager restarts on job failure
>>
>> The best thing for you to do would be to upgrade to Flink 1.15 and the
>> latest operator version.
>> In Flink 1.15 we have the option to interact with the Flink jobmanager
>> even after the job FAILED and the operator leverages this for a much more
>> robust behaviour.
>>
>> In any case the operator should not ever start the job from an empty
>> state (even if it FAILED), if you think that's happening could you please
>> open a JIRA ticket with the accompanying JM and Operator logs?
>>
>> Thanks
>> Gyula
>>
>> On Tue, Sep 20, 2022 at 1:00 PM Evgeniy Lyutikov 
>> wrote:
>>
>>> Hi,
>>> We using flink 1.14.4 with flink kubernetes operator.
>>>
>>> Sometimes when updating a job, it fails on startup and flink removes all
>>> HA metadata and exits the jobmanager.
>>>
>>>
>>> 2022-09-14 14:54:44,534 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Restoring
>>> job  from Checkpoint 30829 @ 1663167158684
>>> for  located at
>>> s3p://flink-checkpoints/k8s-checkpoint-job-name//chk-30829.
>>> 2022-09-14 14:54:44,638 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>>>  reached terminal state FAILED.
>>> org.apache.flink.runtime.client.JobInitializationException: Could not
>>> start the JobMaster.
>>> Caused by: java.util.concurrent.CompletionException:
>>> java.lang.IllegalStateException: There is no operator for the state
>>> 4e1d9dde287c33a35e7970cbe64a40fe
>>> 2022-09-14 14:54:44,930 ERROR
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal
>>> error occurred in the cluster entrypoint.
>>> 2022-09-14 14:54:45,020 INFO
>>> org.apache.flink.kubernetes.highavailability.KubernetesHaServices [] -
>>> Clean up the high availability data for job
>>> .
>>> 2022-09-14 14:54:45,020 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting
>>> KubernetesApplicationClusterEntrypoint down with application status
>>> UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
>>> 2022-09-14 14:54:45,026 INFO
>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting
>>> down rest endpoint.
>>> 2022-09-14 14:54:46,122 INFO
>>> akka.remote.RemoteActorRefProvider$RemotingTerminator[] - Shutting
>>> down remote daemon.
>>> 2022-09-14 14:54:46,321 INFO
>>> 

Re: Jobmanager fails to come up if the job has an issue

2022-09-26 Thread Matthias Pohl via user
Hi Ramkrishna,
thanks for reaching out to the Flink community. Could you share the
JobManager logs to get a better understanding of what's going on? I'm
wondering why the JobManager is failing when the actual problem is that the
job is struggling to access a folder. It sounds like there are multiple
problems here.

Best,
Matthias

On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> Hi all
>
> I have a simple job where we watch a given path in cloud storage for
> new files in a given folder. While I set up my job there was some
> permission issue on the folder. The job is a STREAMING job.
> The cluster is set up in session mode and is running on Kubernetes.
> The job manager has since been failing to come back up, and every time it
> fails with the permission issue. But the point is: how should I recover my
> cluster in this case? Since the JM is not there, the UI is also not working,
> and how do I remove the bad job from the JM?
>
> Regards
> Ram
>


Re: Classloading issues with Flink Operator / Kubernetes Native

2022-09-16 Thread Matthias Pohl via user
Are you deploying the job in session or application mode? Could you provide
the stacktrace? I'm wondering whether that would be helpful to pin down a
code location for further investigation.
So far, I couldn't come up with a definite answer about placing the jar in
the lib directory. Initially, I would have thought that it's fine
considering that all dependencies are included and the job jar itself ends
up on the user classpath. I'm curious whether Chesnay (CC'd) has an answer
to that one.
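
As a general side note while investigating: "No suitable driver found" is
often a symptom of the driver not registering itself with DriverManager
under the classloader in use. A minimal, hedged Java sketch of the usual
workaround (URL and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;

public class DriverRegistrationCheck {
    public static void main(String[] args) throws Exception {
        // Load the driver class explicitly so it registers itself with
        // DriverManager; ServiceLoader-based auto-registration can be missed
        // when the driver lives in a child or user classloader.
        Class.forName("org.postgresql.Driver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret")) {
            // A cheap sanity check that the connection actually works.
            System.out.println(conn.getMetaData().getDatabaseProductVersion());
        }
    }
}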

On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko 
wrote:

> Hey everyone,
>
> I’m migrating a Flink Kubernetes standalone job to the Flink operator
> (with Kubernetes native mode).
>
> I have a lot of classloading issues when trying to run with the operator
> in native mode. For example, I have a Postgres driver as a dependency (I
> can confirm the files are included in the uber jar), but I still get
> "java.sql.SQLException: No suitable driver found for jdbc:postgresql:..."
> exception.
>
> In the Kubernetes standalone setup my uber jar is placed in the
> /opt/flink/lib folder; this is what I specify as "jarURI" in the operator
> config. Is this supported? Should I only be using /opt/flink/usrlib?
>
> Thanks for any suggestions.
>


Re: New licensing for Akka

2022-09-09 Thread Matthias Pohl via user
Looks like there will be a bit of a grace period till Sep 2023 for
vulnerability fixes in akka 2.6.x [1]

[1] https://discuss.lightbend.com/t/2-6-x-maintenance-proposal/9949

On Wed, Sep 7, 2022 at 4:30 PM Robin Cassan via user 
wrote:

> Thanks a lot for your answers, this is reassuring!
>
> Cheers
>
> Le mer. 7 sept. 2022 à 13:12, Chesnay Schepler  a
> écrit :
>
>> Just to squash concerns, we will make sure this license change will not
>> affect Flink users in any way.
>>
>> On 07/09/2022 11:14, Robin Cassan via user wrote:
>> > Hi all!
>> > It seems Akka have announced a licensing change
>> > https://www.lightbend.com/blog/why-we-are-changing-the-license-for-akka
>> > If I understand correctly, this could end up increasing costs a lot for
>> > companies using Flink in production. Do you know if the Flink
>> > developers have any initial reaction as to how this could be handled
>> > (using a Fork? moving out of akka, even though it's probably
>> > incredibly complex?)? Are we right to assume that this license applies
>> > when using akka through Flink?
>> >
>> > Thanks a lot!
>> > Robin
>>
>>
>>


Re: New licensing for Akka

2022-09-07 Thread Matthias Pohl via user
There is some more discussion going on in the related PR [1]. Based on the
current state of the discussion, akka 2.6.20 will be the last version under
Apache 2.0 license. But, I guess, we'll have to see where this discussion
is heading considering that it's kind of fresh.

[1] https://github.com/akka/akka/pull/31561

On Wed, Sep 7, 2022 at 11:30 AM Chesnay Schepler  wrote:

> We'll have to look into it.
>
> The license would apply to usages of Flink.
> That said, I'm not sure if we'd even be allowed to use Akka under that
> license since it puts significant restrictions on the use of the software.
> If that is the case, then it's either use a fork created by another
> party or switch to a different library.
>
> On 07/09/2022 11:14, Robin Cassan via user wrote:
> > Hi all!
> > It seems Akka have announced a licensing change
> > https://www.lightbend.com/blog/why-we-are-changing-the-license-for-akka
> > If I understand correctly, this could end up increasing costs a lot for
> > companies using Flink in production. Do you know if the Flink
> > developers have any initial reaction as to how this could be handled
> > (using a Fork? moving out of akka, even though it's probably
> > incredibly complex?)? Are we right to assume that this license applies
> > when using akka through Flink?
> >
> > Thanks a lot!
> > Robin
>
>
>


Re: Slow Tests in Flink 1.15

2022-09-06 Thread Matthias Pohl via user
Hi David,
I guess you're referring to [1]. But as Chesnay already pointed out in the
previous thread: It would be helpful to get more insights into what exactly
your tests are executing (logs, code, ...). That would help identify the
cause.
> Can you give us a more complete stacktrace so we can see what call in
> Flink is waiting for something?
>
> Does this happen to all of your tests?
> Can you provide us with an example that we can try ourselves? If not,
> can you describe the test structure (e.g., is it using a
> MiniClusterResource).

Matthias

[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
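
For reference, a minimal sketch of a MiniClusterWithClientResource-based test
(JUnit 4; class and job names are illustrative). Sharing something in this
shape, plus logs, would make the slowdown easier to reproduce:

import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.test.util.MiniClusterWithClientResource;
import org.junit.ClassRule;
import org.junit.Test;

public class ExampleJobTest {

    // One shared mini cluster for the whole test class.
    @ClassRule
    public static final MiniClusterWithClientResource FLINK_CLUSTER =
            new MiniClusterWithClientResource(
                    new MiniClusterResourceConfiguration.Builder()
                            .setNumberTaskManagers(1)
                            .setNumberSlotsPerTaskManager(2)
                            .build());

    @Test
    public void pipelineRuns() throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).map(x -> x * 2).print();
        env.execute("example-job");
    }
}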

On Mon, Sep 5, 2022 at 4:59 PM David Jost  wrote:

> Hi,
>
> we were going to upgrade our application from Flink 1.14.4 to Flink
> 1.15.2, when we noticed, that all our job tests, using a
> MiniClusterWithClientResource, are multiple times slower in 1.15 than
> before in 1.14. I, unfortunately, have not found mentions in that regard in
> the changelog or documentation. The slowdown is rather extreme I hope to
> find a solution to this. I saw it mentioned once in the mailing list, but
> there was no (public) outcome to it.
>
> I would appreciate any help on this. Thank you in advance.
>
> Best
>  David


Re: flink ci build run longer than the maximum time of 310 minutes.

2022-09-05 Thread Matthias Pohl via user
Usually, it would be more helpful to provide a link to the PR to get a
better picture of the problem. I'm not 100% sure whether I grasp what's
wrong.

It looks like your branch is based on apache/flink:release-1.15 [1].
Therefore, you should fetch the most recent version from upstream and then
do a git rebase upstream/release-1.15. This will put your 4 commits which
you've added to your local branch so far "on top" of everything that is
already part of upstream/release-1.15. This should resolve your branch
being 11 commits behind the 1.15 release branch. Force-pushing the changes
in your local branch to your remote repo (your fork) will update the PR.

Keep in mind that you have to specify the right base branch in your Github
PR (pointing to the 1.15 release branch in your case) as well to have the
right diff.
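
Sketched as commands, assuming the remotes are named upstream (apache/flink)
and origin (your fork):

git fetch upstream
git rebase upstream/release-1.15
# rewriting history requires a force push; --force-with-lease refuses to
# overwrite commits you haven't fetched yet
git push --force-with-lease origin release-1.15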

I hope that helps. Best,
Matthias

[1] https://github.com/apache/flink/tree/release-1.15

On Sat, Sep 3, 2022 at 10:18 AM hjw <1010445...@qq.com> wrote:

> Hi Matthias,
> The ci build error is in the e2e_1_ci job:
> Sep 02 11:01:51 ##[group]Top 15 biggest directories in terms of used disk
> space
> Sep 02 11:01:52 Searching for .dump, .dumpstream and related files in
> '/home/vsts/work/1/s'
> dmesg: read kernel buffer failed: Operation not permitted
> Sep 02 11:01:53 No taskexecutor daemon to stop on host fv-az158-417.
> Sep 02 11:01:53 No standalonesession daemon to stop on host fv-az158-417.
> Sep 02 11:10:27 The command 'docker build --no-cache --network=host -t
> test_docker_embedded_job dev/test_docker_embedded_job-debian' (pid: 188432)
> did not finish after 600 seconds.
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line
> 900: kill: (188432) - No such process
> Sep 02 11:11:06 The command 'docker build --no-cache --network=host -t
> test_docker_embedded_job dev/test_docker_embedded_job-debian' (pid: 188484)
> did not finish after 600 seconds.
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line
> 900: kill: (188484) - No such process
>
> I think the issue  applies to  my case.
> However, I have submited some commit to my fork repo and create a pr.The
> pr  has not been merged in to flink repo. My fork repo status :This
> branch is 4 commits ahead
> <https://github.com/SwimSweet/flink/compare/apache:flink:release-1.15...release-1.15>
> , 11 commits behind
> <https://github.com/SwimSweet/flink/compare/release-1.15...apache:flink:release-1.15>
>  apache:release-1.15.
>
> When I rebase the branch from upstream and push to my fork repo, the 11
> commits
> <https://github.com/SwimSweet/flink/compare/release-1.15...apache:flink:release-1.15>
>  behind
> <https://github.com/SwimSweet/flink/compare/release-1.15...apache:flink:release-1.15>
>  apache:release-1.15
> also appear in my pr change files. How can I handle this situation? thx.
>
> --
> Best,
> Hjw
>
>
>
> -- Original Message --
> *From:* "Matthias Pohl" ;
> *Sent:* Friday, September 2, 2022, 7:29 PM
> *To:* "Martijn Visser";
> *Cc:* "hjw"<1010445...@qq.com>;"user";
> *Subject:* Re: flink ci build run longer than the maximum time of 310 minutes.
>
> Not sure whether that applies to your case, but there was a recent issue
> [1] where the e2e_1_ci job ran into a timeout. If that's what you were
> observing, rebasing your branch might help.
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-29161
>
> On Fri, Sep 2, 2022 at 10:51 AM Martijn Visser 
> wrote:
>
>> You can ask the Flinkbot to run again by typing as comment
>>
>> @flinkbot run azure
>>
>> Best regards,
>>
>> Martijn
>>
>>> On Fri, Sep 2, 2022 at 08:40, hjw <1010445...@qq.com> wrote:
>>
>>> I commit a pr to Flink Github .
>>> A error happened in building ci.
>>> [error]The job running on agent Azure Pipelines 6 ran longer than the
>>> maximum time of 310 minutes. For more information, see
>>> https://go.microsoft.com/fwlink/?linkid=2077134
>>>
>>> How to solve this problem?
>>> How to trigger the CI build again?
>>> thx.
>>>
>>


Re: flink ci build run longer than the maximum time of 310 minutes.

2022-09-02 Thread Matthias Pohl via user
Not sure whether that applies to your case, but there was a recent issue
[1] where the e2e_1_ci job ran into a timeout. If that's what you were
observing, rebasing your branch might help.

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-29161

On Fri, Sep 2, 2022 at 10:51 AM Martijn Visser 
wrote:

> You can ask the Flinkbot to run again by typing as comment
>
> @flinkbot run azure
>
> Best regards,
>
> Martijn
>
> On Fri, Sep 2, 2022 at 08:40, hjw <1010445...@qq.com> wrote:
>
>> I commit a pr to Flink Github .
>> A error happened in building ci.
>> [error]The job running on agent Azure Pipelines 6 ran longer than the
>> maximum time of 310 minutes. For more information, see
>> https://go.microsoft.com/fwlink/?linkid=2077134
>>
>> How to solve this problem?
>> How to trigger the CI build again?
>> thx.
>>
>


Re: Failing to maven compile install Flink 1.15

2022-08-22 Thread Matthias Pohl via user
Hi hjw,
it would be interesting to know the exact Maven commands you used for the
successful run (where you compiled the flink-clients module individually)
and the failed run (where you tried to build everything at once) and
probably a more complete version of the Maven output.

The path
D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\src\test\assembly\test-assembly.xml
appears
to be strange. The Flink sources have a test-assembly.xml file
configuration in two locations:
$ find . -name test-assembly.xml
./flink-clients/src/test/assembly/test-assembly.xml
./flink-formats/flink-avro/src/test/assembly/test-assembly.xml

There's also no src folder in Flink's root folder which indicates there's
something which (at least I) don't understand about your setup.
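
For comparison, a typical invocation from a clean checkout is sketched below
(standard Maven flags; Maven 3.2.5 is the version recommended for building
Flink):

# full build from the repository root, skipping tests
mvn clean install -DskipTests

# build a single module plus the upstream modules it depends on
mvn clean install -pl flink-clients -am -DskipTests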

Best,
Matthias

On Fri, Aug 19, 2022 at 3:16 AM yuxia  wrote:

> which mvn version do you use? It's recommended to use maven 3.2.5
>
> Best regards,
> Yuxia
>
> --
> *From: *"hjw" <1010445...@qq.com>
> *To: *"User" 
> *Sent: *Thursday, August 18, 2022, 10:48:57 PM
> *Subject: *Failing to maven compile install Flink 1.15
>
> I tried to run mvn clean install on the Flink 1.15 parent, but it fails.
> An error happened while compiling flink-clients.
> Error Log:
> Failed to execute goal
> org.apache.maven.plugins:maven-assembly-plugin:2.4:single
> (create-test-dependency) on project flink-clients: Error reading
> assemblies: Error locating assembly descriptor:
> src/test/assembly/test-assembly.xml
>
> [1] [INFO] Searching for file location:
> D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\flink-clients\target\src\test\assembly\test-assembly.xml
>
> [2] [INFO] File:
> D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\flink-clients\target\src\test\assembly\test-assembly.xml
> does not exist.
>
> [3] [INFO] File:
> D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\src\test\assembly\test-assembly.xml
> does not exist.
>
>
> However, mvn clean package on the Flink 1.15 parent and on flink-clients
> alone are successful.
>
>
>


Re: Jobmanager trying to be registered for Zombie Job

2022-04-26 Thread Matthias Pohl
Hi Peter,
based on our analysis the issue already existed before 1.15, yes. We
couldn't come up with any other reasoning. It was just never reported... or
noticing an older ticket.

Matthias

On Mon, Apr 25, 2022 at 6:21 PM Peter Schrott  wrote:

> Hi Matthias,
>
> You are welcome & thanks a lot for your help too!
>
> It's not quite clear to me: the bug has existed since 1.13.6 but was not
> reported until now (FLINK-27354 is a new ticket)?
>
> Best, Peter
>
>
> On Mon, Apr 25, 2022 at 5:48 PM Matthias Pohl 
> wrote:
>
>> Thanks again, Peter for sharing your logs. I looked into the issue with
>> the help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
>> issue. We couldn't come up with a reason why it should have popped up just
>> now with 1.15. The bug itself is already present in 1.14. You can find more
>> details on the investigation in FLINK-27354 [1] itself.
>>
>> Best,
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl 
>> wrote:
>>
>>> Thanks Peter, we're looking into it...
>>>
>>> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> sorry for the late reply. It took me quite some time to get the logs
>>>> out of the system. I have attached them now.
>>>>
>>>> It's logs of 2 jobmanagers and 2 taskmanagers. It can be seen on jm 1
>>>> that the job starts crashing and recovering a few times. This happens
>>>> until 2022-04-20 12:12:14,607. After that the above described behavior can
>>>> be seen.
>>>>
>>>> I hope this helps.
>>>>
>>>> Best, Peter
>>>>
>>>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl 
>>>> wrote:
>>>>
>>>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to
>>>>> connect to the RM while shutting down the JobMaster.
>>>>>
>>>>> This doesn't explain your issue though, Peter. It's still unclear why
>>>>> the JobMaster is still around as stated in my previous email.
>>>>>
>>>>> Matthias
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>>>
>>>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl 
>>>>> wrote:
>>>>>
>>>>>> Just by looking through the code, it appears that these logs could be
>>>>>> produced while stopping the job. The ResourceManager sends a confirmation
>>>>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>>>>> the JobMaster is still around to process the request, it would try to
>>>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>>>> mode already and wouldn't need to re-establish the connection). The
>>>>>> message would otherwise have been swallowed if the JobMaster had already
>>>>>> terminated.
>>>>>>
>>>>>> The only explanation I can come up with right now (without having any
>>>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>>>> that it would be helpful to look at the logs to see whether there is some
>>>>>> other issue that causes the JobMaster to stop entirely.
>>>>>>
>>>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <
>>>>>> matth...@ververica.com> wrote:
>>>>>>
>>>>>>> ...if possible it would be good to get debug rather than only info
>>>>>>> logs. Did you encounter anything odd in the TaskManager logs as well.
>>>>>>> Sharing those might be of value as well.
>>>>>>>
>>>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <
>>>>>>> matth...@ververica.com> wrote:
>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>>>>>> entire jobmanager logs?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Matthias
>>>>>>>>
>>>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <
>>>>>>>> pe...@bluerootlabs.io> wrote:
>>>>>>>

Re: Jobmanager trying to be registered for Zombie Job

2022-04-25 Thread Matthias Pohl
Thanks again, Peter for sharing your logs. I looked into the issue with the
help of Chesnay. Essentially, it's FLINK-27354 [1] that is causing this
issue. We couldn't come up with a reason why it should have popped up just
now with 1.15. The bug itself is already present in 1.14. You can find more
details on the investigation in FLINK-27354 [1] itself.

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354

On Mon, Apr 25, 2022 at 2:00 PM Matthias Pohl 
wrote:

> Thanks Peter, we're looking into it...
>
> On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott 
> wrote:
>
>> Hi,
>>
>> sorry for the late reply. It took me quite some time to get the logs out
>> of the system. I have attached them now.
>>
>> It's logs of 2 jobmanagers and 2 taskmanagers. It can be seen on jm 1 that
>> the job starts crashing and recovering a few times. This happens
>> until 2022-04-20 12:12:14,607. After that the above described behavior can
>> be seen.
>>
>> I hope this helps.
>>
>> Best, Peter
>>
>> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl 
>> wrote:
>>
>>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>>> to the RM while shutting down the JobMaster.
>>>
>>> This doesn't explain your issue though, Peter. It's still unclear why
>>> the JobMaster is still around as stated in my previous email.
>>>
>>> Matthias
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>>
>>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl 
>>> wrote:
>>>
>>>> Just by looking through the code, it appears that these logs could be
>>>> produced while stopping the job. The ResourceManager sends a confirmation
>>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>>> the JobMaster is still around to process the request, it would try to
>>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>>> mode already and wouldn't need to re-establish the connection). This method
>>>> would have been swallowed otherwise if the JobMaster was already 
>>>> terminated.
>>>>
>>>> The only explanation I can come up with right now (without having any
>>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>>> that it would be helpful to look at the logs to see whether there is some
>>>> other issue that causes the JobMaster to stop entirely.
>>>>
>>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl 
>>>> wrote:
>>>>
>>>>> ...if possible it would be good to get debug rather than only info
>>>>> logs. Did you encounter anything odd in the TaskManager logs as well.
>>>>> Sharing those might be of value as well.
>>>>>
>>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl 
>>>>> wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>> thanks for sharing. That doesn't sound right. May you provide the
>>>>>> entire jobmanager logs?
>>>>>>
>>>>>> Best,
>>>>>> Matthias
>>>>>>
>>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Flink-Users,
>>>>>>>
>>>>>>> I am not sure if this does something to my cluster or not. But since
>>>>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>>
>>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>> @akka.tcp://
>>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>>
>>>>>>> as many times as number of parallelisms (here 10 times). These logs
>>>>>>> are triggered every 5 minutes.
>>>>>>>
>>>>>>> Then they are followed by:
>>>>>>>
>>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>>> @akka.tcp://
>>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>>
>>>>>>> also 10 log entries.
>>>>>>>
>>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>>>>> standalone cluster. After some recovery attempts it finally failed 
>>>>>>> (using
>>>>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>>>>> logs started to appear. Now there was no other job running on my cluster
>>>>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>>>>> jobmanager service.
>>>>>>>
>>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>>
>>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>>
>>>>>>> Thanks & Best
>>>>>>> Peter
>>>>>>>
>>>>>>

-- 

Matthias Pohl | Engineer

Follow us @VervericaData Ververica <https://www.ververica.com/>

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
Wehner


Re: Jobmanager trying to be registered for Zombie Job

2022-04-25 Thread Matthias Pohl
Thanks Peter, we're looking into it...

On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott 
wrote:

> Hi,
>
> sorry for the late reply. It took me quite some time to get the logs out
> of the system. I have attached them now.
>
> It's the logs of 2 jobmanagers and 2 taskmanagers. It can be seen on JM 1
> that the job starts crashing and recovering a few times. This happens
> until 2022-04-20 12:12:14,607. After that, the above-described behavior can
> be seen.
>
> I hope this helps.
>
> Best, Peter
>
> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl 
> wrote:
>
>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>> to the RM while shutting down the JobMaster.
>>
>> This doesn't explain your issue though, Peter. It's still unclear why the
>> JobMaster is still around as stated in my previous email.
>>
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl 
>> wrote:
>>
>>> Just by looking through the code, it appears that these logs could be
>>> produced while stopping the job. The ResourceManager sends a confirmation
>>> of the JobMaster being disconnected at the end back to the JobMaster. If
>>> the JobMaster is still around to process the request, it would try to
>>> reconnect (I'd consider that a bug because the JobMaster is in shutdown
>>> mode already and wouldn't need to re-establish the connection). This
>>> message would otherwise have been swallowed if the JobMaster had already
>>> been terminated.
>>>
>>> The only explanation I can come up with right now (without having any
>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>> that it would be helpful to look at the logs to see whether there is some
>>> other issue that prevents the JobMaster from stopping entirely.
>>>
>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl 
>>> wrote:
>>>
>>>> ...if possible it would be good to get debug rather than only info
>>>> logs. Did you encounter anything odd in the TaskManager logs as well?
>>>> Sharing those might be of value, too.
>>>>
>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl 
>>>> wrote:
>>>>
>>>>> Hi Peter,
>>>>> thanks for sharing. That doesn't sound right. Could you provide the
>>>>> entire jobmanager logs?
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott 
>>>>> wrote:
>>>>>
>>>>>> Hi Flink-Users,
>>>>>>
>>>>>> I am not sure if this does something to my cluster or not. But since
>>>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>
>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>
>>>>>> as many times as number of parallelisms (here 10 times). These logs
>>>>>> are triggered every 5 minutes.
>>>>>>
>>>>>> Then they are followed by:
>>>>>>
>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>
>>>>>> also 10 log entries.
>>>>>>
>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>>>> standalone cluster. After some recovery attempts it finally failed (using
>>>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>>>> logs started to appear. Now there was no other job running on my cluster
>>>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>>>> jobmanager service.
>>>>>>
>>>>>> This job was just an example, it happens to other jobs too.
>>>>>>
>>>>>> It's just INFO logs but it does not look healthy either.
>>>>>>
>>>>>> Thanks & Best
>>>>>> Peter
>>>>>>
>>>>>


Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect to
the RM while shutting down the JobMaster.

This doesn't explain your issue though, Peter. It's still unclear why the
JobMaster is still around as stated in my previous email.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-27354

On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl 
wrote:

> Just by looking through the code, it appears that these logs could be
> produced while stopping the job. The ResourceManager sends a confirmation
> of the JobMaster being disconnected at the end back to the JobMaster. If
> the JobMaster is still around to process the request, it would try to
> reconnect (I'd consider that a bug because the JobMaster is in shutdown
> mode already and wouldn't need to re-establish the connection). This
> message would otherwise have been swallowed if the JobMaster had already
> been terminated.
>
> The only explanation I can come up with right now (without having any
> logs) is that stopping the JobMaster didn't finish for some reason. For
> that it would be helpful to look at the logs to see whether there is some
> other issue that prevents the JobMaster from stopping entirely.
>
> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl 
> wrote:
>
>> ...if possible it would be good to get debug rather than only info logs.
>> Did you encounter anything odd in the TaskManager logs as well? Sharing
>> those might be of value, too.
>>
>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl 
>> wrote:
>>
>>> Hi Peter,
>>> thanks for sharing. That doesn't sound right. Could you provide the entire
>>> jobmanager logs?
>>>
>>> Best,
>>> Matthias
>>>
>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott 
>>> wrote:
>>>
>>>> Hi Flink-Users,
>>>>
>>>> I am not sure if this does something to my cluster or not. But since
>>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>
>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>>>
>>>> as many times as number of parallelisms (here 10 times). These logs are
>>>> triggered every 5 minutes.
>>>>
>>>> Then they are followed by:
>>>>
>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>>> failed.
>>>>
>>>> also 10 log entries.
>>>>
>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>>> it was a long-running sql streaming job, started on Apr 13th on a
>>>> standalone cluster. After some recovery attempts it finally failed (using
>>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>>> logs started to appear. Now there was no other job running on my cluster
>>>> anymore but the logs appeared every 5 minutes until I restarted this
>>>> jobmanager service.
>>>>
>>>> This job was just an example, it happens to other jobs too.
>>>>
>>>> It's just INFO logs but it does not look healthy either.
>>>>
>>>> Thanks & Best
>>>> Peter
>>>>
>>>


Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
Just by looking through the code, it appears that these logs could be
produced while stopping the job. The ResourceManager sends a confirmation
of the JobMaster being disconnected at the end back to the JobMaster. If
the JobMaster is still around to process the request, it would try to
reconnect (I'd consider that a bug because the JobMaster is in shutdown
mode already and wouldn't need to re-establish the connection). This
message would otherwise have been swallowed if the JobMaster had already
been terminated.

The only explanation I can come up with right now (without having any logs)
is that stopping the JobMaster didn't finish for some reason. For that it
would be helpful to look at the logs to see whether there is some other
issue that prevents the JobMaster from stopping entirely.

On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl 
wrote:

> ...if possible it would be good to get debug rather than only info logs.
> Did you encounter anything odd in the TaskManager logs as well? Sharing
> those might be of value, too.
>
> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl 
> wrote:
>
>> Hi Peter,
>> thanks for sharing. That doesn't sound right. Could you provide the entire
>> jobmanager logs?
>>
>> Best,
>> Matthias
>>
>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott 
>> wrote:
>>
>>> Hi Flink-Users,
>>>
>>> I am not sure if this does something to my cluster or not. But since
>>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>>
>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>>
>>> as many times as number of parallelisms (here 10 times). These logs are
>>> triggered every 5 minutes.
>>>
>>> Then they are followed by:
>>>
>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>>> failed.
>>>
>>> also 10 log entries.
>>>
>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>>> it was a long-running sql streaming job, started on Apr 13th on a
>>> standalone cluster. After some recovery attempts it finally failed (using
>>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>>> logs started to appear. Now there was no other job running on my cluster
>>> anymore but the logs appeared every 5 minutes until I restarted this
>>> jobmanager service.
>>>
>>> This job was just an example, it happens to other jobs too.
>>>
>>> It's just INFO logs but it does not look healthy either.
>>>
>>> Thanks & Best
>>> Peter
>>>
>>


Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
...if possible it would be good to get debug rather than only info logs.
Did you encounter anything odd in the TaskManager logs as well? Sharing
those might be of value, too.

On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl 
wrote:

> Hi Peter,
> thanks for sharing. That doesn't sound right. Could you provide the entire
> jobmanager logs?
>
> Best,
> Matthias
>
> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott 
> wrote:
>
>> Hi Flink-Users,
>>
>> I am not sure if this does something to my cluster or not. But since
>> updating to Flink 1.15 (atm rc4) I find the following logs:
>>
>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>> for job 5566648d9b1aac6c1a1b78187fd56975.
>>
>> as many times as number of parallelisms (here 10 times). These logs are
>> triggered every 5 minutes.
>>
>> Then they are followed by:
>>
>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
>> failed.
>>
>> also 10 log entries.
>>
>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975),
>> it was a long-running sql streaming job, started on Apr 13th on a
>> standalone cluster. After some recovery attempts it finally failed (using
>> the failover strategy) on the 20th Apr (yesterday) for good. Then those
>> logs started to appear. Now there was no other job running on my cluster
>> anymore but the logs appeared every 5 minutes until I restarted this
>> jobmanager service.
>>
>> This job was just an example, it happens to other jobs too.
>>
>> It's just INFO logs but it does not look healthy either.
>>
>> Thanks & Best
>> Peter
>>
>


Re: Jobmanager trying to be registered for Zombie Job

2022-04-22 Thread Matthias Pohl
Hi Peter,
thanks for sharing. That doesn't sound right. Could you provide the entire
jobmanager logs?

Best,
Matthias

On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott  wrote:

> Hi Flink-Users,
>
> I am not sure if this does something to my cluster or not. But since
> updating to Flink 1.15 (atm rc4) I find the following logs:
>
> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684...@akka.tcp://
> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
> 5566648d9b1aac6c1a1b78187fd56975.
>
> as many times as number of parallelisms (here 10 times). These logs are
> triggered every 5 minutes.
>
> Then they are followed by:
>
> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
> @akka.tcp://fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2
> failed.
>
> also 10 log entries.
>
> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975), it
> was a long-running sql streaming job, started on Apr 13th on a standalone
> cluster. After some recovery attempts it finally failed (using the failover
> strategy) on the 20th Apr (yesterday) for good. Then those logs started to
> appear. Now there was no other job running on my cluster anymore but the
> logs appeared every 5 minutes until I restarted this jobmanager service.
>
> This job was just an example, it happens to other jobs too.
>
> It's just INFO logs but it does not look healthy either.
>
> Thanks & Best
> Peter
>


Re: Adjusted frame length exceeds 2147483647

2022-03-18 Thread Matthias Pohl
One other pointer: Martijn mentioned in FLINK-24923 [1] that tools like
Nessus could generate traffic while scanning for ports. It's just the size
of the request that is suspicious.

[1] https://issues.apache.org/jira/browse/FLINK-24923

On Thu, Mar 17, 2022 at 5:29 PM Ori Popowski  wrote:

> This issue did not recur, so it may have been a network issue.
>
> On Thu, Mar 17, 2022 at 6:12 PM Matthias Pohl  wrote:
>
>> Hi Ori,
>> that looks odd. The message seems to exceed the maximum size
>> of 2147483647 bytes (2GB). I couldn't find anything similar in the ML or in
>> Jira that supports a bug in Flink. Could it be that there was some network
>> issue?
>>
>> Matthias
>>
>> On Tue, Mar 15, 2022 at 6:52 AM Ori Popowski  wrote:
>>
> I have been running a production job for at least 1 year, and today I got
> this error:
>>>
>>>
>>> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
>>> Adjusted frame length exceeds 2147483647: 2969686273 - discarded
>>> (connection to
>>> 'flink-session-playback-prod-1641716499-sw-6q8p.c.data-prod-292614.internal/
>>> 10.208.65.38:40737')
>>>
>>> Nothing was changed in the code for a long time. What's causing this
>>> error and how to fix it? I am running Flink 1.10.3 on YARN.
>>>
>>> This is the full stack trace:
>>>
>>> 2022-03-15 03:22:13
>>> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Adjusted frame length exceeds 2147483647: 2969686273 - discarded (connection to 'flink-session-playback-prod-1641716499-sw-6q8p.c.data-prod-292614.internal/10.208.65.38:40737')
>>> at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:165)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:831)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:376)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
>>> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)

Re: how to set kafka sink ssl properties

2022-03-17 Thread Matthias Pohl
Could you share more details on what's not working? Is the
ssl.truststore.location accessible from the Flink nodes?

Matthias

On Thu, Mar 17, 2022 at 4:00 PM HG  wrote:

> Hi all,
> I am probably not the smartest, but I cannot find how to set SSL properties
> for a Kafka sink.
> My assumption was that it would be just like the Kafka Consumer
>
> KafkaSource source = KafkaSource.builder()
> .setProperties(kafkaProps)
> .setProperty("ssl.truststore.type", trustStoreType)
> .setProperty("ssl.truststore.password", trustStorePassword)
> .setProperty("ssl.truststore.location", trustStoreLocation)
> .setProperty("security.protocol", securityProtocol)
> .setProperty("partition.discovery.interval.ms", 
> partitionDiscoveryIntervalMs)
> .setProperty("commit.offsets.on.checkpoint", 
> commitOffsetsOnCheckpoint)
> .setGroupId(inputGroupId)
> .setClientIdPrefix(clientId)
> .setTopics(kafkaInputTopic)
> .setDeserializer(KafkaRecordDeserializationSchema.of(new 
> JSONKeyValueDeserializationSchema(fetchMetadata)))
> 
> .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
> .build();
>
>
> But that seems not to be the case.
>
> Any quick pointers?
>
> Regards Hans-Peter
>

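For reference: on Flink 1.14+ the newer KafkaSink builder exposes the same
setProperty hook as KafkaSource, so the SSL settings can be passed to the
sink analogously. A minimal sketch (broker list, topic, and truststore
values are placeholders, and the truststore file must be readable on every
node running TaskManagers):

KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers(bootstrapServers)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        // Kafka client security settings, analogous to the consumer side:
        .setProperty("security.protocol", "SSL")
        .setProperty("ssl.truststore.type", trustStoreType)
        .setProperty("ssl.truststore.location", trustStoreLocation)
        .setProperty("ssl.truststore.password", trustStorePassword)
        .build();

For the older FlinkKafkaProducer, the same keys can instead be put into the
Properties object passed to its constructor.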

Re: Adjusted frame length exceeds 2147483647

2022-03-17 Thread Matthias Pohl
Hi Ori,
that looks odd. The message seems to exceed the maximum size of 2147483647
bytes (2GB). I couldn't find anything similar in the ML or in Jira that
supports a bug in Flink. Could it be that there was some network issue?

Matthias

On Tue, Mar 15, 2022 at 6:52 AM Ori Popowski  wrote:

> I have been running a production job for at least 1 year, and today I got
> this error:
>
>
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> Adjusted frame length exceeds 2147483647: 2969686273 - discarded
> (connection to
> 'flink-session-playback-prod-1641716499-sw-6q8p.c.data-prod-292614.internal/
> 10.208.65.38:40737')
>
> Nothing was changed in the code for a long time. What's causing this error
> and how to fix it? I am running Flink 1.10.3 on YARN.
>
> This is the full stack trace:
>
> 2022-03-15 03:22:13
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Adjusted frame length exceeds 2147483647: 2969686273 - discarded (connection to 'flink-session-playback-prod-1641716499-sw-6q8p.c.data-prod-292614.internal/10.208.65.38:40737')
> at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:165)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268)
> at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:831)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:376)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
> at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
> at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
> at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.shaded.netty4.io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 2969686273 - discarded
> at org.apache.flink.shaded.netty4.io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:513)
> at org.apache.flink.shaded.netty4.io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:491)
> at org.apache.flink.shaded.netty4.io.netty.handler.codec.LengthFieldBasedFrameDecoder.exceededFrameLength(LengthFieldBasedFrameDecoder.java:378)
> at org.apache.flink.shaded.netty4.io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:421)
> at org.apache.flink.runtime.io.network.netty.NettyMessage$NettyMessageDecoder.decode(NettyMessage.java:214)
> 

Re: Flink failure rate restart does not work as expected

2022-03-01 Thread Matthias Pohl
The YARN node manager logs support my observation: The container exits with
a failure which, if I understand it correctly, should cause a container
restart on the YARN side. In HA mode, Flink expects the underlying resource
management to restart the Flink cluster in case of failure. This does not
seem to happen in your case. Is there a configuration issue in your YARN
cluster? Or does the container recovery usually work in failure cases for
you? I'm not that experienced with YARN deployments. I'm adding David to
this thread. He might have some additional insights.

Matthias

On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹  wrote:

> Unfortunately we didn't keep the logs properly; this happened too long ago.
> The YARN ResourceManager logs had been cleaned, and the broken machine had
> been reinstalled. We only found the YARN log of the JobManager on the YARN
> NodeManager; it may be useless. We will post the detailed logs to this
> thread when it happens again. It happens from time to time, roughly every
> two weeks, when one of our cluster machines goes down.
> ------
> *From:* Matthias Pohl 
> *Sent:* March 1, 2022, 17:57
> *To:* Alexander Preuß 
> *Cc:* 刘 家锹 ; user@flink.apache.org <
> user@flink.apache.org>
> *Subject:* Re: Flink failure rate restart does not work as expected
>
> Hi,
> I second Alex' observation - based on the logs it looks like the task
> restart functionality worked as expected: It tried to restart the tasks
> until it reached the limit of 4 attempts due to the missing TaskManager.
> The job-cluster shut down with an error code. At this point, YARN should
> pick it up and bring up a new JobManager based on the non-0 exit code of
> the Flink cluster. It would be interesting to see the YARN logs to figure
> out why the cluster failover didn't work.
>
> Best,
> Matthias
>
> On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
> alexanderpre...@ververica.com> wrote:
>
> Hi,
> at first glance it looks like the exception was thrown very rapidly, so
> it exceeded the maxFailuresPerInterval and the FailureRestartStrategy
> decided not to restart. Why do you think this is different from the
> expected behavior?
>
> Best,
> Alex
>
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹  wrote:
>
> Hi, all
> We have encountered a problem with FailureRateRestartStrategy which
> confuses us, and we don't know how to solve it. Here's the situation:
>
> Flink version: 1.10.1
> Development env: on Yarn
>
> FailureRateRestartStrategy: 
> failuresIntervalMS=6,backoffTimeMS=15000,maxFailuresPerInterval=4
>
> One of our Hadoop machines, which our job's taskmanager was running on, got
> stuck without responding. At that moment, the jobmanager received a
> heartbeat timeout exception, but after the exception was thrown 4 times in
> a very short time (about 10 ms apart), it hit the
> FailureRateRestartStrategy and the whole job quit; we got the message
> 'org.apache.flink.runtime.JobException: Recovery is suppressed by
> FailureRateRestartBackoffTimeStrategy'.
> As far as I understand from the documentation, the expected behavior was
> that the jobmanager should try to restart the job, which would bring up a
> new taskmanager on another machine, but it did not.
> We also did some testing: we started a new job and just killed the
> taskmanager, and it restarted as expected.
>
> So this confuses us a lot; if anyone knows what happened, we would be thankful.
>
> JobManager log and TaskManager log append below
>
>
>
> --
>
> Alexander Preuß | Junior Engineer - Data Intensive Systems
>
> alexanderpre...@ververica.com
>
> <https://www.ververica.com/>
>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
>
> Ververica GmbH
>
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>
> Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
> Jinwei (Kevin) Zhang
>
>

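For reference, the failure-rate strategy discussed in this thread maps to
the following flink-conf.yaml keys (a sketch with illustrative values
mirroring the settings above; check the docs of your Flink version for the
exact option names):

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 4
restart-strategy.failure-rate.failure-rate-interval: 1 min
restart-strategy.failure-rate.delay: 15 s

The same strategy can be set programmatically via
RestartStrategies.failureRateRestart(...) on the execution environment.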

Re: Flink failure rate restart does not work as expected

2022-03-01 Thread Matthias Pohl
Hi,
I second Alex' observation - based on the logs it looks like the task
restart functionality worked as expected: It tried to restart the tasks
until it reached the limit of 4 attempts due to the missing TaskManager.
The job-cluster shut down with an error code. At this point, YARN should
pick it up and bring up a new JobManager based on the non-0 exit code of
the Flink cluster. It would be interesting to see the YARN logs to figure
out why the cluster failover didn't work.

Best,
Matthias

On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <
alexanderpre...@ververica.com> wrote:

> Hi,
> at first glance it looks like the exception was thrown very rapidly, so
> it exceeded the maxFailuresPerInterval and the FailureRestartStrategy
> decided not to restart. Why do you think this is different from the
> expected behavior?
>
> Best,
> Alex
>
> On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹  wrote:
>
>> Hi, all
>> We have encountered a problem with FailureRateRestartStrategy which
>> confuses us, and we don't know how to solve it. Here's the situation:
>>
>> Flink version: 1.10.1
>> Development env: on Yarn
>>
>> FailureRateRestartStrategy: 
>> failuresIntervalMS=6,backoffTimeMS=15000,maxFailuresPerInterval=4
>>
>> One of our Hadoop machines, which our job's taskmanager was running on,
>> got stuck without responding. At that moment, the jobmanager received a
>> heartbeat timeout exception, but after the exception was thrown 4 times in
>> a very short time (about 10 ms apart), it hit the
>> FailureRateRestartStrategy and the whole job quit; we got the message
>> 'org.apache.flink.runtime.JobException: Recovery is suppressed by
>> FailureRateRestartBackoffTimeStrategy'.
>> As far as I understand from the documentation, the expected behavior was
>> that the jobmanager should try to restart the job, which would bring up a
>> new taskmanager on another machine, but it did not.
>> We also did some testing: we started a new job and just killed the
>> taskmanager, and it restarted as expected.
>>
>> So this confuses us a lot; if anyone knows what happened, we would be thankful.
>>
>> JobManager log and TaskManager log append below
>>
>
>
> --
>
> Alexander Preuß | Junior Engineer - Data Intensive Systems
>
> alexanderpre...@ververica.com
>
> 
>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
>
> Ververica GmbH
>
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>
> Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
> Jinwei (Kevin) Zhang
>
>


Re: [DISCUSS] Future of Per-Job Mode

2022-01-24 Thread Matthias Pohl
Hi all,
I agree with Xintong's comment: Reducing the number of deployment modes
would help users. There is a clearer distinction between session mode and
the two other deployment modes (i.e. application and job mode). The
difference between application and job mode is not that easy to grasp for
newcomers, I imagine. It would also help clean up some job-mode-specific
code segments in the source code.

It would be interesting to see whether there are other use-cases that are
missed in the Application mode (besides the ones already addressed by
Biao). I would second Xintong's proposal of deprecating the job-mode rather
soonish, making users aware of the plans around that deployment mode. That
might help encourage users to speak up in case they are not able to find a
solution to work around deprecation warnings.

I also agree with Xintong's assessment that dropping it should only be done
after we're sure that all relevant use cases are met also by other
deployment modes considering that (based on the comments above) it is a
widely used deployment mode.

Matthias

On Mon, Jan 24, 2022 at 10:00 AM Xintong Song  wrote:

> Sorry for joining the discussion late.
>
> I'm leaning towards deprecating the per-job mode soonish, and eventually
> dropping it in the long-term.
> - One less deployment mode makes it easier for users (especially
> newcomers) to understand. Deprecating the per-job mode sends the signal
> that it is legacy, not recommended, and in most cases users do not need to
> care about it.
> - For most (if not all) user demands that are satisfied by the per-job
> mode but not by the application mode, AFAICS, they can either be worked around
> or eventually addressed by the application mode. E.g., make application
> mode support shipping local dependencies.
> - I'm not sure about dropping the per-job mode soonish, as many users are
> still working with it. We'd better not force these users to migrate to the
> application mode when upgrading the Flink version.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Jan 21, 2022 at 4:30 PM Konstantin Knauf 
> wrote:
>
>> Thanks Thomas & Biao for your feedback.
>>
>> Any additional opinions on how we should proceed with per-job mode? As
>> you might have guessed, I am leaning towards proposing to deprecate per-job
>> mode.
>>
>> On Thu, Jan 13, 2022 at 5:11 PM Thomas Weise  wrote:
>>
>>> Regarding session mode:
>>>
>>> ## Session Mode
>>> * main() method executed in client
>>>
>>> Session mode also supports execution of the main method on Jobmanager
>>> with submission through the REST API. That's how Flink k8s operators like
>>> [1] work. It's actually an important capability because it allows for
>>> allocation of the cluster resources prior to taking down the previous
>>> job during upgrade when the goal is optimization for availability.
>>>
>>> Thanks,
>>> Thomas
>>>
>>> [1] https://github.com/lyft/flinkk8soperator
>>>
>>> On Thu, Jan 13, 2022 at 12:32 AM Konstantin Knauf 
>>> wrote:
>>> >
>>> > Hi everyone,
>>> >
>>> > I would like to discuss and understand if the benefits of having
>>> > Per-Job Mode in Apache Flink outweigh its drawbacks.
>>> >
>>> >
>>> > *# Background: Flink's Deployment Modes*
>>> > Flink currently has three deployment modes. They differ in the
>>> > following dimensions:
>>> > * main() method executed on Jobmanager or Client
>>> > * dependencies shipped by client or bundled with all nodes
>>> > * number of jobs per cluster & relationship between job and cluster
>>> > lifecycle
>>> > * (supported resource providers)
>>> >
>>> > ## Application Mode
>>> > * main() method executed on Jobmanager
>>> > * dependencies already need to be available on all nodes
>>> > * dedicated cluster for all jobs executed from the same main()-method
>>> > (Note: applications with more than one job currently still have
>>> > significant limitations, like missing high availability). Technically,
>>> > a session cluster dedicated to all jobs submitted from the same main()
>>> > method.
>>> > * supported by standalone, native kubernetes, YARN
>>> >
>>> > ## Session Mode
>>> > * main() method executed in client
>>> > * dependencies are distributed from and by the client to all nodes
>>> > * cluster is shared by multiple jobs submitted from different clients,
>>> > independent lifecycle
>>> > * supported by standalone, Native Kubernetes, YARN
>>> >
>>> > ## Per-Job Mode
>>> > * main() method executed in client
>>> > * dependencies are distributed from and by the client to all nodes
>>> > * dedicated cluster for a single job
>>> > * supported by YARN only
>>> >
>>> >
>>> > *# Reasons to Keep*
>>> > * There are use cases where you might need the combination of a single
>>> > job per cluster, but main() method execution in the client. This
>>> > combination is only supported by per-job mode.
>>> > * It currently exists. Existing users will need to migrate to either
>>> > session or application mode.
>>> >
>>> >
>>> > *# Reasons to Drop*
>>> > * With Per-Job Mode and Application Mode we 

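For readers comparing the modes discussed above, the Flink CLI selects them
via the execution target. A sketch of the common YARN invocations (paths
and jar names are placeholders; the targets exist as of Flink 1.11):

# Application mode: main() runs on the JobManager
./bin/flink run-application -t yarn-application ./my-job.jar

# Per-job mode (the mode under discussion): main() runs on the client
./bin/flink run -t yarn-per-job ./my-job.jar

# Session mode: submit against an existing session cluster
./bin/flink run ./my-job.jar
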
Re: Flink Kinesis Producer can't connect with AWS credentials

2022-01-07 Thread Matthias Pohl
I'm adding Danny to this thread. He might be able to help on this topic.

Best,
Matthias

On Mon, Jan 3, 2022 at 4:57 PM Daniel Vol  wrote:

> I definitely do, and you can see in my initial post that this is the first
> thing I tried, but I got warnings and it doesn't use the credentials I
> supplied. Though you are right that I did find a solution: using a
> credentialsProvider object and injecting the keys as Java system properties
> through:
> -yd "env.java.opts.taskmanager=-Daws.secretKey=xxx -Daws.accessKeyId=xxx"
> -yd "env.java.opts.jobmanager=-Daws.secretKey=xxx -Daws.accessKeyId=xxx"
>
> Though I do expect the producer to be able to take parameters as per the
> documentation (exactly as the consumer does), so it is probably a good idea
> to open a ticket for this behavior:
>
> val props = new Properties
>
> props.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, kinesisConfig.accessKeyId.get)
>
> props.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, 
> kinesisConfig.secretKey.get)
>
> [Window(EventTimeSessionWindows(180), EventTimeTrigger,
> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN
> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration - Property
> aws.credentials.provider.basic.secretkey ignored as there is no
> corresponding set method in KinesisProducerConfiguration
> [Window(EventTimeSessionWindows(180), EventTimeTrigger,
> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN
> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration - Property
> aws.region ignored as there is no corresponding set method in
> KinesisProducerConfiguration
> [Window(EventTimeSessionWindows(180), EventTimeTrigger,
> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN
> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration - Property
> aws.credentials.provider.basic.accesskeyid ignored as there is no
> corresponding set method in KinesisProducerConfiguration
>
> On Mon, Jan 3, 2022 at 5:34 PM Matthias Pohl 
> wrote:
>
>> Hi Daniel,
>> I'm assuming you already looked into the Flink documentation for this
>> topic [1]? I'm gonna add Fabian to this thread. Maybe, he's able to help
>> out here.
>>
>> Matthias
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-release-1.12/dev/connectors/kinesis.html#kinesis-producer
>>
>> On Fri, Dec 31, 2021 at 1:06 PM Daniel Vol  wrote:
>>
>>> Hi,
>>>
>>> I am trying to run Flink on GCP with the source and
>>> destination on Kinesis on AWS.
>>> I have configured the access key on AWS to be able to connect.
>>> I am running Flink 1.12.1
>>> In flink I use the following code (Scala 2.12.2)
>>>
>>> val props = new Properties
>>>
>>> props.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, 
>>> kinesisConfig.accessKeyId.get)
>>>
>>> props.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, 
>>> kinesisConfig.secretKey.get)
>>>
>>>
>>> It works just fine for connecting the consumer, but not the producer.
>>>
>>> In TaskManager stdout log I see the following:
>>>
>>> [Window(EventTimeSessionWindows(180), EventTimeTrigger, 
>>> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN  
>>> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration  - Property 
>>> aws.credentials.provider.basic.secretkey ignored as there is no 
>>> corresponding set method in KinesisProducerConfiguration
>>> [Window(EventTimeSessionWindows(180), EventTimeTrigger, 
>>> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN  
>>> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration  - Property 
>>> aws.region ignored as there is no corresponding set method in 
>>> KinesisProducerConfiguration
>>> [Window(EventTimeSessionWindows(180), EventTimeTrigger, 
>>> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN  
>>> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration  - Property 
>>> aws.credentials.provider.basic.accesskeyid ignored as there is no 
>>> corresponding set method in KinesisProducerConfiguration
>>>
>>> Then I tried a different approach: creating an AWSCredentialsProvider
>>> object with key + secret and adding it by:
>>>
>>> (as it has a setCredentialsProvider method)
>>>
>>> class CredentialsProvider(config: KinesisConfig) extends 
>>> AWSCredentialsProvider with Serializable {
>>>   override def getCredentials: AWSCredentials =
>>> new BasicAWSCredentials(config.accessKeyId.get, config.secretKey.get)

Re: Flink connection to remote server

2022-01-03 Thread Matthias Pohl
Hi Mariam,
a quick mailing list query and Jira query didn't reveal any pointers for
Flink with Milvus, unfortunately. But have you had a look at Flink's
AsyncIO API [1]? I haven't worked with it yet, but it sounds like
something that might help you access an external system.

Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/

On Mon, Jan 3, 2022 at 2:48 PM Mariam Walid 
wrote:

> Dear All,
>
> I have a question regarding contacting a remote server and receiving
> responses in Flink functions. What is the best approach to do so? Also, if
> other users have used Flink with a Milvus server: I have trouble running
> the job on a Flink cluster, although it works locally.
> I would really appreciate your help.
>
> Best regards,
> Mariam
>

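For reference, the Async I/O pattern mentioned above looks roughly like the
sketch below. MyAsyncClient is a hypothetical stand-in for a real
asynchronous client (e.g. a Milvus client); only the Flink API calls are
real:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

public class RemoteLookupFunction extends RichAsyncFunction<String, String> {

    // Hypothetical stand-in for a real async client.
    static class MyAsyncClient {
        CompletableFuture<String> query(String key) {
            return CompletableFuture.supplyAsync(() -> "result-for-" + key);
        }
    }

    private transient MyAsyncClient client;

    @Override
    public void open(Configuration parameters) {
        client = new MyAsyncClient(); // establish the remote connection here
    }

    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
        // Hand the result over to Flink when the client's future completes;
        // the operator thread is never blocked.
        client.query(input).whenComplete((response, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singleton(response));
            }
        });
    }
}

It is wired into a pipeline with a timeout and a cap on in-flight requests,
e.g. AsyncDataStream.unorderedWait(input, new RemoteLookupFunction(), 5,
TimeUnit.SECONDS, 100).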

Re: Flink Kinesis Producer can't connect with AWS credentials

2022-01-03 Thread Matthias Pohl
Hi Daniel,
I'm assuming you already looked into the Flink documentation for this topic
[1]? I'm gonna add Fabian to this thread. Maybe, he's able to help out here.

Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.12/dev/connectors/kinesis.html#kinesis-producer

On Fri, Dec 31, 2021 at 1:06 PM Daniel Vol  wrote:

> Hi,
>
> I am trying to run Flink on GCP with the source and
> destination on Kinesis on AWS.
> I have configured the access key on AWS to be able to connect.
> I am running Flink 1.12.1
> In flink I use the following code (Scala 2.12.2)
>
> val props = new Properties
>
> props.put(AWSConfigConstants.AWS_ACCESS_KEY_ID, kinesisConfig.accessKeyId.get)
>
> props.put(AWSConfigConstants.AWS_SECRET_ACCESS_KEY, 
> kinesisConfig.secretKey.get)
>
>
> It works just fine for connecting the consumer, but not the producer.
>
> In TaskManager stdout log I see the following:
>
> [Window(EventTimeSessionWindows(180), EventTimeTrigger, 
> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN  
> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration  - Property 
> aws.credentials.provider.basic.secretkey ignored as there is no corresponding 
> set method in KinesisProducerConfiguration
> [Window(EventTimeSessionWindows(180), EventTimeTrigger, 
> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN  
> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration  - Property 
> aws.region ignored as there is no corresponding set method in 
> KinesisProducerConfiguration
> [Window(EventTimeSessionWindows(180), EventTimeTrigger, 
> ScalaProcessWindowFunctionWrapper) -> Sink: Unnamed (1/1)#0] WARN  
> o.a.f.k.s.c.a.s.k.producer.KinesisProducerConfiguration  - Property 
> aws.credentials.provider.basic.accesskeyid ignored as there is no 
> corresponding set method in KinesisProducerConfiguration
>
> Then I tried a different approach: creating an AWSCredentialsProvider
> object with key + secret and adding it by:
>
> (as it has a setCredentialsProvider method)
>
> class CredentialsProvider(config: KinesisConfig) extends 
> AWSCredentialsProvider with Serializable {
>   override def getCredentials: AWSCredentials =
> new BasicAWSCredentials(config.accessKeyId.get, config.secretKey.get)
>
>   override def refresh(): Unit = {}
> }
>
> val credentialsProvider = new CredentialsProvider(kinesisConfig)
>
> producerConfig.put("CredentialsProvider", credentialsProvider)
>
> But then I get different exceptions saying that the process can't find the
> access key and secret key.
>
> [kpl-daemon-] ERROR o.a.f.k.s.c.a.services.kinesis.producer.KinesisProducer - Error in child process
> java.lang.RuntimeException: Error running child process
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:513)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.access$200(Daemon.java:63)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:135)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.kinesis.shaded.com.amazonaws.SdkClientException: 
> Unable to load AWS credentials from any provider in the chain: 
> [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from 
> environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and 
> AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), 
> SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java 
> system properties (aws.accessKeyId and aws.secretKey), 
> WebIdentityTokenCredentialsProvider: You must specify a value for roleArn and 
> roleSessionName, 
> org.apache.flink.kinesis.shaded.com.amazonaws.auth.profile.ProfileCredentialsProvider@4ec6449f:
>  profile file cannot be null, 
> org.apache.flink.kinesis.shaded.com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper@2a5774f5:
>  The requested metadata is not found at 
> http://169.254.169.254/latest/meta-data/iam/security-credentials/]
>
> It tries to get them either from environment variables or Java system properties.
>
> So I tried to add those as follows:
>
> AWS_ACCESS_KEY_ID=xx AWS_SECRET_ACCESS_KEY=xx flink run [options] app.jar 
> [options]
>
> I tried
>
> flink run [options] app.jar -DAWS_ACCESS_KEY_ID=xx -DAWS_SECRET_ACCESS_KEY=xx 
> [options]
>
> but neither way works.
>
> Any idea how I can solve it?
>
>


Re: JsonRowSerializationSchema unable to parse TIMESTAMP_LTZ fields

2022-01-03 Thread Matthias Pohl
For documentation purposes: Surendra started a discussion in FLINK-25411
[1].

[1] https://issues.apache.org/jira/browse/FLINK-25411

On Wed, Dec 22, 2021 at 9:51 AM Surendra Lalwani 
wrote:

>
> Hi Team,
>
> JsonRowSerializationSchema is unable to parse fields with type
> TIMESTAMP_LTZ; it seems this is not handled properly. While trying to
> fire a simple query, "SELECT current_timestamp FROM table_name", it gives
> an error that it could not serialize the row and asks to add the shaded
> Flink dependency for jsr-310. It seems the JavaTimeModule is not added in
> the serializer.
>
>

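For background on that error: Jackson can only serialize java.time types if
the jackson-datatype-jsr310 JavaTimeModule is registered on the
ObjectMapper. A minimal standalone sketch of the general fix (inside Flink,
the relevant ObjectMapper lives in the shaded Jackson package, so this is
not the exact code path):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;

import java.time.Instant;

public class JavaTimeJsonExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Without this line, serializing java.time values fails and Jackson
        // suggests adding the jsr-310 module.
        mapper.registerModule(new JavaTimeModule());
        System.out.println(mapper.writeValueAsString(Instant.now()));
    }
}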

Re: log4j2 upgrade requirement

2022-01-03 Thread Matthias Pohl
Hi Puneet,
Flink logs things like the job name, which can be specified by the user.
Hence, a user could (as far as I understand) add a job name containing
malicious content. This is where the Flink cluster's log4j version comes
into play. Therefore, it's not enough to provide only an updated log4j
dependency with your job uber jar.

Best,
Matthias

On Wed, Dec 22, 2021 at 12:57 PM Puneet Duggal 
wrote:

> Hi,
>
> Context: I am using Flink version 1.12.1 for real-time event processing.
> This Flink version uses log4j 2.12.1, but the jar that I am uploading uses
> 2.17.0.
>
> Now my assumption is that Flink, being generic in nature, does not log
> event-specific data; logging it is the responsibility of the user-specific
> code which is uploaded via the jar.
>
> Since the log4j vulnerability is caused by an attacker sending a malicious
> string which performs a lookup to the attacker's server, getting attacked
> by this string is only possible (in my case) if the malicious string is set
> as the value of a key which is then logged by my code. But my uber jar uses
> log4j version 2.17.0.
>
> So my doubt is whether there is any situation I am missing because of which
> I should upgrade the log4j version of the cluster as well, or whether just
> upgrading the log4j version of my jar should suffice.
>
> Thanks,
> Puneet Duggal
>


Re: Scala Case Class Serialization

2021-12-07 Thread Matthias Pohl
Hi Lars,
not sure about the out-of-the-box support for case classes with primitive
member types (could you refer to the section which made you conclude
this?). I haven't used Scala with Flink, yet. So maybe, others can give
more context.
But have you looked into using the TypeInfoFactory to define the schema [1]?

Best,
Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/types_serialization/#defining-type-information-using-a-factory

On Tue, Dec 7, 2021 at 10:03 AM Lars Skjærven  wrote:

> Hello,
> We're running Flink 1.14 with Scala, and we suspect that performance
> is suffering due to serialization of some Scala case classes. Specifically
> we're seeing that our Case Class "cannot be used as a POJO type because not
> all fields are valid POJO fields, and must be processed as GenericType",
> and that the case class "does not contain a setter for field X". I'm
> interpreting these log messages as performance warnings.
>
> A simple case class example we're writing to state that triggers the
> mentioned 'warnings':
> case class Progress(position: Int, eventTime: Int, alive: Boolean)
>
> My understanding of the docs is that case classes with primitive types
> should be supported "out of the box".
>
> Any tips on how to proceed?
>
> Kind regards,
> Lars
>

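For reference, the TypeInfoFactory approach from [1] looks roughly like the
Java sketch below, written against a hypothetical POJO equivalent of the
case class from the question. (For Scala case classes, the Scala API's
implicit TypeInformation derivation via import org.apache.flink.api.scala._
is usually the first thing to verify.)

import org.apache.flink.api.common.typeinfo.TypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInfoFactory;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;

// Java POJO equivalent of the case class from the question.
@TypeInfo(ProgressTypeInfoFactory.class)
public class Progress {
    public int position;
    public int eventTime;
    public boolean alive;
}

class ProgressTypeInfoFactory extends TypeInfoFactory<Progress> {
    @Override
    public TypeInformation<Progress> createTypeInfo(
            Type t, Map<String, TypeInformation<?>> genericParameters) {
        // Spell out the field types explicitly instead of relying on the
        // reflective POJO analysis that produced the GenericType warning.
        Map<String, TypeInformation<?>> fields = new HashMap<>();
        fields.put("position", Types.INT);
        fields.put("eventTime", Types.INT);
        fields.put("alive", Types.BOOLEAN);
        return Types.POJO(Progress.class, fields);
    }
}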

Re: Query regarding exceptions API(/jobs/:jobid/exceptions)

2021-11-30 Thread Matthias Pohl
Thanks for sharing this information. I verified that it's a bug in Flink.
The issue is that the Exceptions you're observing are happening while the
job is initialized. We're not setting the exception history properly in
that case.

Matthias

On Mon, Nov 29, 2021 at 2:08 PM Mahima Agarwal 
wrote:

> Hi Matthias,
>
> We have created a Jira ticket for this issue. Please find the Jira ID below:
>
> https://issues.apache.org/jira/browse/FLINK-25096
>
> Thanks
> Mahima
>
> On Mon, Nov 29, 2021 at 2:24 PM Matthias Pohl 
> wrote:
>
>> Thanks Mahima,
>> could you create a Jira ticket and, if possible, add the Flink logs? That
>> would make it easier to investigate the problem.
>>
>> Best,
>> Matthias
>>
>> On Sun, Nov 28, 2021 at 7:29 AM Mahima Agarwal 
>> wrote:
>>
>>> Thanks Matthias
>>>
>>> But we have observed that the 2 exceptions below are coming in
>>> root-exception but not in exceptionHistory:
>>>
>>> caused by: java.util.concurrent.CompletionException:
>>> java.lang.RuntimeException: java.io.FileNotFoundException: Cannot find
>>> checkpoint or savepoint file/directory
>>> 'C:\Users\abc\Documents\checkpoints\a737088e21206281db87f6492bcba074' on
>>> file system 'file'.
>>>
>>> Caused by: java.lang.IllegalStateException: Failed to rollback to
>>> checkpoint/savepoint
>>> file:/mnt/c/Users/abc/Documents/checkpoints/a737088e21206281db87f6492bcba074/chk-144.
>>> Thanks and Regards
>>> Mahima Agarwal
>>>
>>>
>>> On Fri, Nov 26, 2021, 13:19 Matthias Pohl 
>>> wrote:
>>>
>>>> Just to add a bit of context: The first-level members all-exceptions,
>>>> root-exception, truncated and timestamp have been around for a longer
>>>> time. The exceptionHistory was added in Flink 1.13. As part of this change,
>>>> the aforementioned members were deprecated (see [1]). We kept them for
>>>> backwards-compatibility reasons.
>>>>
>>>> That said, root-exception and all-exceptions are also represented in
>>>> the exceptionHistory.
>>>>
>>>> Matthias
>>>>
>>>> [1]
>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-exceptions
>>>>
>>>> On Thu, Nov 25, 2021 at 12:14 PM Chesnay Schepler 
>>>> wrote:
>>>>
>>>>> root-exception: The last exception that caused a job to fail.
>>>>> all-exceptions: All exceptions that occurred the last time a job
>>>>> failed. This is primarily useful for completed jobs.
>>>>> exception-history: Exceptions that previously caused a job to fail.
>>>>>
>>>>> On 25/11/2021 11:52, Mahima Agarwal wrote:
>>>>>
>>>>> Hi Team,
>>>>>
>>>>> Please find the query below regarding exceptions
>>>>> API(/jobs/:jobid/exceptions)
>>>>>
>>>>>
>>>>> In response to the above REST API:
>>>>>
>>>>>
>>>>> Users are getting 3 types of exceptions:
>>>>> 1. exceptionHistory
>>>>> 2. all-exceptions
>>>>> 3. root-exception
>>>>>
>>>>>
>>>>> What is the purpose of the above 3 exceptions?
>>>>>
>>>>>
>>>>> Any leads would be appreciated.
>>>>>
>>>>> Thanks
>>>>> Mahima
>>>>>
>>>>>


Re: Query regarding exceptions API(/jobs/:jobid/exceptions)

2021-11-29 Thread Matthias Pohl
Thanks Mahima,
could you create a Jira ticket and, if possible, add the Flink logs? That
would make it easier to investigate the problem.

Best,
Matthias

On Sun, Nov 28, 2021 at 7:29 AM Mahima Agarwal 
wrote:

> Thanks Matthias
>
> But we have observed that the 2 exceptions below are coming in
> root-exception but not in exceptionHistory:
>
> caused by: java.util.concurrent.CompletionException:
> java.lang.RuntimeException: java.io.FileNotFoundException: Cannot find
> checkpoint or savepoint file/directory
> 'C:\Users\abc\Documents\checkpoints\a737088e21206281db87f6492bcba074' on
> file system 'file'.
>
> Caused by: java.lang.IllegalStateException: Failed to rollback to
> checkpoint/savepoint
> file:/mnt/c/Users/abc/Documents/checkpoints/a737088e21206281db87f6492bcba074/chk-144.
> Thanks and Regards
> Mahima Agarwal
>
>
> On Fri, Nov 26, 2021, 13:19 Matthias Pohl  wrote:
>
>> Just to add a bit of context: The first-level members all-exceptions,
>> root-exception, truncated and timestamp have been around for a longer
>> time. The exceptionHistory was added in Flink 1.13. As part of this change,
>> the aforementioned members were deprecated (see [1]). We kept them for
>> backwards-compatibility reasons.
>>
>> That said, root-exception and all-exceptions are also represented in the
>> exceptionHistory.
>>
>> Matthias
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-exceptions
>>
>> On Thu, Nov 25, 2021 at 12:14 PM Chesnay Schepler 
>> wrote:
>>
>>> root-exception: The last exception that caused a job to fail.
>>> all-exceptions: All exceptions that occurred the last time a job failed.
>>> This is primarily useful for completed jobs.
>>> exception-history: Exceptions that previously caused a job to fail.
>>>
>>> On 25/11/2021 11:52, Mahima Agarwal wrote:
>>>
>>> Hi Team,
>>>
>>> Please find the query below regarding exceptions
>>> API(/jobs/:jobid/exceptions)
>>>
>>>
>>> In response to the above REST API:
>>>
>>>
>>> Users are getting 3 types of exceptions:
>>> 1. exceptionHistory
>>> 2. all-exceptions
>>> 3. root-exception
>>>
>>>
>>> What is the purpose of the above 3 exceptions?
>>>
>>>
>>> Any leads would be appreciated.
>>>
>>> Thanks
>>> Mahima
>>>
>>>


Re: Query regarding exceptions API(/jobs/:jobid/exceptions)

2021-11-25 Thread Matthias Pohl
Just to add a bit of context: The first-level members all-exceptions,
root-exceptions, truncated and timestamp have been around for a longer
time. The exceptionHistory was added in Flink 1.13. As part of this change,
the aforementioned members were deprecated (see [1]). We kept them for
backwards-compatibility reasons.

That said, root-exception and all-exceptions are also represented in the
exceptionHistory.

Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid-exceptions
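
For illustration, a minimal sketch of querying the endpoint (the host is a
placeholder and the payload shape is abbreviated, not literal output; the
field names match the ones discussed in this thread):

```
$ curl -s http://<jobmanager>:8081/jobs/<jobid>/exceptions
{
  "root-exception": "...",
  "timestamp": 1637884800000,
  "all-exceptions": [ ... ],
  "truncated": false,
  "exceptionHistory": {
    "entries": [ ... ],
    "truncated": false
  }
}
```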

On Thu, Nov 25, 2021 at 12:14 PM Chesnay Schepler 
wrote:

> root-exception: The last exception that caused a job to fail.
> all-exceptions: All exceptions that occurred the last time a job failed.
> This is primarily useful for completed jobs.
> exception-history: Exceptions that previously caused a job to fail.
>
> On 25/11/2021 11:52, Mahima Agarwal wrote:
>
> Hi Team,
>
> Please find the query below regarding exceptions
> API(/jobs/:jobid/exceptions)
>
>
> In the response of the above REST API:
>
>
> Users are getting 3 types of exceptions:
> 1. exceptionHistory
> 2. all-exceptions
> 3. root-exception
>
>
> What is the purpose of the above 3 exceptions?
>
>
> Any leads would be appreciated.
>
> Thanks
> Mahima
>
>
>


Re: Recommended metaspace memory config for 16GB hosts.

2021-11-23 Thread Matthias Pohl
Hi John,
the memory configuration depends entirely on your use case. It's hard to
judge from a distance here. You should monitor your memory usage and act
accordingly. Steadily increasing memory usage indicates a memory leak,
though, and you will run into issues the longer the job and the cluster
run. Ideally, this should be investigated.
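
If you want to narrow down where the memory goes, one hedged option is
JVM-level Native Memory Tracking (standard HotSpot flags, nothing
Flink-specific; the pid is a placeholder):

```
# add to the Flink JVM options, e.g. via env.java.opts
-XX:NativeMemoryTracking=summary

# then sample the running TaskManager periodically
$ jcmd <taskmanager-pid> VM.native_memory summary
```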

Matthias

On Wed, Nov 24, 2021 at 4:41 AM John Smith  wrote:

> Well the hosts have 16GB.
>
> If there is a "bug" with classloading... then for now I can only hope to
> increase the metaspace size so...
>
> If the host has 16GB
>
> Can I set the Java heap to say 12GB and the Metaspace to 2GB and leave 2GB
> for the OS?
> Or maybe 10GB for heap and 2GB for Meta which leaves 4GB for everything
> else including the OS?
>
> This is from my live taskmanager
>
> taskmanager.memory.flink.size: 10240m
> taskmanager.memory.jvm-metaspace.size: 1024m
> taskmanager.numberOfTaskSlots: 12
>
> Physical Memory: 15.7 GB
> JVM Heap Size: 4.88 GB
> Flink Managed Memory: 4.00 GB
>
> JVM (Heap/Non-Heap)
> Type     | Committed | Used    | Maximum
> Heap     | 4.88 GB   | 2.16 GB | 4.88 GB
> Non-Heap | 416 MB    | 404 MB  | 2.23 GB
> Total    | 5.28 GB   | 2.55 GB | 7.10 GB
>
> Outside JVM
> Type   | Count  | Used    | Capacity
> Direct | 32,836 | 1.01 GB | 1.01 GB
> Mapped | 0      | 0 B     | 0 B
>
>
>
> On Tue, 23 Nov 2021 at 02:23, Matthias Pohl 
> wrote:
>
>> In general, running out of memory in the Metaspace pool indicates some
>> bug related to the classloaders. Have you considered upgrading to new
>> versions of Flink and other parts of your pipeline? Otherwise, you might
>> want to create a heap dump and analyze that one [1]. This analysis might
>> reveal some pointers to what is causing the problem.
>>
>> Matthias
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/#analyzing-out-of-memory-problems
>>
>> On Mon, Nov 22, 2021 at 8:34 PM John Smith 
>> wrote:
>>
>>> Hi thanks. I know, I already mentioned that I put 1024, see config
>>> above. But my question is how much? I still get the message once in a
>>> while. It also seems that if a job restarts a few times it happens... My
>>> jobs aren't complicated. They use Kafka, some of them JDBC and the JDBC
>>> driver to push to DB. Right now I use Flink for ETL
>>>
>>> Kafka -> JSON Validation (Jackson) -> filter -> JDBC to database.
>>>
>>> On Mon, 22 Nov 2021 at 10:24, Matthias Pohl 
>>> wrote:
>>>
>>>> Hi John,
>>>> have you had a look at the memory model for Flink 1.10? [1] Based on
>>>> the documentation, you could try increasing the Metaspace size
>>>> independently of the Flink memory usage (i.e. flink.size). The heap size
>>>> is part of the overall Flink memory. I hope that helps.
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> [1]
>>>> https://nightlies.apache.org/flink/flink-docs-release-1.10/ops/memory/mem_detail.html
>>>>
>>>> On Mon, Nov 22, 2021 at 3:58 PM John Smith 
>>>> wrote:
>>>>
>>>>> Hi, has anyone seen this?
>>>>>
>>>>> On Tue, 16 Nov 2021 at 14:14, John Smith 
>>>>> wrote:
>>>>>
>>>>>> Hi running Flink 1.10
>>>>>>
>>>>>> I have
>>>>>> - 3 job nodes 8GB memory total
>>>>>> - jobmanager.heap.size: 6144m
>>>>>>
>>>>>> - 3 task nodes 16GB memory total
>>>>>> - taskmanager.memory.flink.size: 10240m
>>>>>> - taskmanager.memory.jvm-metaspace.size: 1024m <--- This still
>>>>>> causes metaspace errors once in a while; can I go higher, or do I need
>>>>>> to lower the 10GB above?
>>>>>>
>>>>>> The task nodes on the UI are reporting:
>>>>>> - Physical Memory: 15.7 GB
>>>>>> - JVM Heap Size: 4.88 GB <--- I'm guessing this is the currently used
>>>>>> heap size and not the max of 10GB set above?
>>>>>> - Flink Managed Memory: 4.00 GB
>>>>>>
>>>>>


Re: Recommended metaspace memory config for 16GB hosts.

2021-11-22 Thread Matthias Pohl
In general, running out of memory in the Metaspace pool indicates some bug
related to the classloaders. Have you considered upgrading to new versions
of Flink and other parts of your pipeline? Otherwise, you might want to
create a heap dump and analyze that one [1]. This analysis might reveal
some pointers to what is causing the problem.

Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/#analyzing-out-of-memory-problems
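
As a minimal sketch of capturing and inspecting such a dump with the
standard JDK tools (pid and paths are placeholders):

```
# heap dump of the affected TaskManager process
$ jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <pid>

# classloader statistics, often a good first look for Metaspace leaks (JDK 8)
$ jmap -clstats <pid>
```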

On Mon, Nov 22, 2021 at 8:34 PM John Smith  wrote:

> Hi thanks. I know, I already mentioned that I put 1024, see config above.
> But my question is how much? I still get the message once in a while. It
> also seems that if a job restarts a few times it happens... My jobs aren't
> complicated. They use Kafka, some of them JDBC and the JDBC driver to push
> to DB. Right now I use Flink for ETL
>
> Kafka -> JSON Validation (Jackson) -> filter -> JDBC to database.
>
> On Mon, 22 Nov 2021 at 10:24, Matthias Pohl 
> wrote:
>
>> Hi John,
>> have you had a look at the memory model for Flink 1.10? [1] Based on the
>> documentation, you could try increasing the Metaspace size independently of
>> the Flink memory usage (i.e. flink.size). The heap size is part of the
>> overall Flink memory. I hope that helps.
>>
>> Best,
>> Matthias
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-release-1.10/ops/memory/mem_detail.html
>>
>> On Mon, Nov 22, 2021 at 3:58 PM John Smith 
>> wrote:
>>
>>> Hi, has anyone seen this?
>>>
>>> On Tue, 16 Nov 2021 at 14:14, John Smith  wrote:
>>>
>>>> Hi running Flink 1.10
>>>>
>>>> I have
>>>> - 3 job nodes 8GB memory total
>>>> - jobmanager.heap.size: 6144m
>>>>
>>>> - 3 task nodes 16GB memory total
>>>> - taskmanager.memory.flink.size: 10240m
>>>> - taskmanager.memory.jvm-metaspace.size: 1024m <--- This still
>>>> causes metaspace errors once in a while; can I go higher, or do I need to
>>>> lower the 10GB above?
>>>>
>>>> The task nodes on the UI are reporting:
>>>> - Physical Memory: 15.7 GB
>>>> - JVM Heap Size: 4.88 GB <--- I'm guessing this is the currently used
>>>> heap size and not the max of 10GB set above?
>>>> - Flink Managed Memory: 4.00 GB
>>>>
>>>


Re: Flink on Native Kubernetes S3 checkpointing error

2021-11-22 Thread Matthias Pohl
Cool, thanks for the update.

Matthias

On Mon, Nov 22, 2021 at 6:42 PM bat man  wrote:

> Hi Matthias,
>
> Looks like the service account token volume projection was not working
> fine with the EKS version I was running. Upgraded the version and with the
> same configs now the s3 checkpointing is working fine.
> So, in short, on AWS use EKS v1.20+ for IAM Pod Identity Webhook.
>
> Thanks,
> Hemant
>
> On Mon, Nov 22, 2021 at 7:26 PM Matthias Pohl 
> wrote:
>
>> Hi bat man,
>> this feature seems to be tied to a certain AWS SDK version [1] which you
>> already considered. But I checked the version used in Flink 1.13.1 for the
>> s3 filesystem. It seems like the version that's used (1.11.788) is good
>> enough to provide this feature (which was added in 1.11.704):
>> ```
>> $ git checkout release-1.13.1
>> $ cd flink-filesystems/flink-s3-fs-base; mvn dependency:tree | grep
>> com.amazonaws:aws-java-sdk-s3
>> [INFO] +- com.amazonaws:aws-java-sdk-s3:jar:1.11.788:compile
>> ```
>>
>> Matthias
>>
>> [1]
>> https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html
>>
>> On Mon, Nov 22, 2021 at 8:04 AM bat man  wrote:
>>
>>> Hi,
>>>
>>> I am using Flink 1.13.1 with checkpointing (RocksDB) on S3 with native
>>> Kubernetes.
>>> Passing in this parameter to job -
>>>
>>>
>>> *-Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider*
>>> I am getting this error in job-manager logs -
>>>
>>> *Caused by: com.amazonaws.AmazonClientException: No AWS Credentials
>>> provided by WebIdentityTokenCredentialsProvider :
>>> com.amazonaws.SdkClientException: Unable to locate specified web identity
>>> token file: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
>>> <http://eks.amazonaws.com/serviceaccount/token> at
>>> org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:139)
>>> ~[?:?]*
>>>
>>> Describing the pod shows that that volume is mounted to the jobmanager
>>> pod.
>>> Is there anything specific that needs to be done? On the same EKS
>>> cluster, for testing, I ran a sample pod with the AWS CLI image and it's
>>> able to do *ls* on the S3 buckets.
>>> Is this related to the AWS SDK used in Flink 1.13.1? Shall I try with
>>> recent Flink versions?
>>>
>>> Any help would be appreciated.
>>>
>>> Thanks.
>>>
>>


Re: Flink CLI - pass command line arguments to a pyflink job

2021-11-22 Thread Matthias Pohl
Hi Kamil,
afaik, the parameter passing should work as normal by just appending the
arguments to the Flink job submission, similar to the Java job submission:
```
$ ./flink run --help
Action "run" compiles and runs a program.
  Syntax: run [OPTIONS] <jar-file> <arguments>
[...]
```
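
For a PyFlink job the same pattern should apply (a sketch; the script path
and options are made up, and it assumes the job reads its own arguments,
e.g. via sys.argv):

```
$ ./flink run --python /path/to/my_job.py --input /tmp/in --output /tmp/out
```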

Matthias

On Mon, Nov 22, 2021 at 3:58 PM Kamil ty  wrote:

> Hey,
>
> Looking at the examples at Command-Line Interface | Apache Flink
> 
>  I
> don't see an example of passing command line arguments to a pyflink job
> when deploying the job to a remote cluster with flink cli. Is this
> supported?
>
> Best Regards
> Kamil
>


Re: Recommended metaspace memory config for 16GB hosts.

2021-11-22 Thread Matthias Pohl
Hi John,
have you had a look at the memory model for Flink 1.10? [1] Based on the
documentation, you could try increasing the Metaspace size independently of
the Flink memory usage (i.e. flink.size). The heap size is part of the
overall Flink memory. I hope that helps.

Best,
Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.10/ops/memory/mem_detail.html
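
A hedged flink-conf.yaml sketch for a 16GB host under the 1.10 model
(numbers are illustrative; flink.size and the Metaspace are accounted
separately, so their sum plus overhead has to stay below physical memory):

```
taskmanager.memory.flink.size: 10240m         # heap + managed + network
taskmanager.memory.jvm-metaspace.size: 2048m  # tunable independently of flink.size
```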

On Mon, Nov 22, 2021 at 3:58 PM John Smith  wrote:

> Hi, has anyone seen this?
>
> On Tue, 16 Nov 2021 at 14:14, John Smith  wrote:
>
>> Hi running Flink 1.10
>>
>> I have
>> - 3 job nodes 8GB memory total
>> - jobmanager.heap.size: 6144m
>>
>> - 3 task nodes 16GB memory total
>> - taskmanager.memory.flink.size: 10240m
>> - taskmanager.memory.jvm-metaspace.size: 1024m <--- This still causes
>> metaspace errors once in a while; can I go higher, or do I need to lower
>> the 10GB above?
>>
>> The task nodes on the UI are reporting:
>> - Physical Memory: 15.7 GB
>> - JVM Heap Size: 4.88 GB <--- I'm guessing this is the currently used heap
>> size and not the max of 10GB set above?
>> - Flink Managed Memory: 4.00 GB
>>
>


Re: Table API Filesystem connector - disable interval rolling policy

2021-11-22 Thread Matthias Pohl
Hi Kamil,
by looking at the code I'd say that the only option you have is to increase
the parameter you already mentioned to a very high number. But I'm not sure
about the side effects. I'm gonna add Francesco to this thread. Maybe he
has better ideas on how to answer your question.
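
For reference, a hedged sketch of what that workaround could look like in
DDL (the table itself is made up; only the last option is the relevant one,
and note that files may still roll over via sink.rolling-policy.file-size,
which has its own default):

```
CREATE TABLE sink_table (
  id BIGINT,
  payload STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/out',
  'format' = 'json',
  'sink.rolling-policy.rollover-interval' = '365 d'
);
```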

Best,
Matthias

On Mon, Nov 22, 2021 at 10:32 AM Kamil ty  wrote:

> Hey all,
>
> I wanted to know if there is a way to disable the interval rolling policy
> in the Table API filesystem connector.
> From flink docs: FileSystem | Apache Flink
> 
> The key to change the interval: sink.rolling-policy.rollover-interval
> Is it possible to fully disable this rolling policy, or is the only
> solution to set a very big duration?
>
> Best Regards
> Kamil
>


Re: Flink on Native Kubernetes S3 checkpointing error

2021-11-22 Thread Matthias Pohl
Hi bat man,
this feature seems to be tied to a certain AWS SDK version [1] which you
already considered. But I checked the version used in Flink 1.13.1 for the
s3 filesystem. It seems like the version that's used (1.11.788) is good
enough to provide this feature (which was added in 1.11.704):
```
$ git checkout release-1.13.1
$ cd flink-filesystems/flink-s3-fs-base; mvn dependency:tree | grep
com.amazonaws:aws-java-sdk-s3
[INFO] +- com.amazonaws:aws-java-sdk-s3:jar:1.11.788:compile
```

Matthias

[1]
https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html
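
For context, a hedged sketch of what the EKS pod-identity webhook is
expected to inject into the pod once IRSA is set up (the role ARN is
hypothetical; the token path matches the one from your error message):

```
env:
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::123456789012:role/flink-checkpoints   # hypothetical
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
volumeMounts:
  - name: aws-iam-token
    mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
    readOnly: true
```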

On Mon, Nov 22, 2021 at 8:04 AM bat man  wrote:

> Hi,
>
> I am using Flink 1.13.1 with checkpointing (RocksDB) on S3 with native
> Kubernetes.
> Passing in this parameter to job -
>
>
> *-Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider*
> I am getting this error in job-manager logs -
>
> *Caused by: com.amazonaws.AmazonClientException: No AWS Credentials
> provided by WebIdentityTokenCredentialsProvider :
> com.amazonaws.SdkClientException: Unable to locate specified web identity
> token file: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
>  at
> org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:139)
> ~[?:?]*
>
> Describing the pod shows that that volume is mounted to the jobmanager pod.
> Is there anything specific that needs to be done? On the same EKS
> cluster, for testing, I ran a sample pod with the AWS CLI image and it's
> able to do *ls* on the S3 buckets.
> Is this related to the AWS SDK used in Flink 1.13.1? Shall I try with
> recent Flink versions?
>
> Any help would be appreciated.
>
> Thanks.
>


Re: Kubernetes HA: New jobs stuck in Initializing for a long time after a certain number of existing jobs are running

2021-11-22 Thread Matthias Pohl
Hi Joey,
that looks like a cluster configuration issue. The address
192.168.100.79:6123 is not accessible from the JobManager pod (see line
1224f in the provided JM logs):
   2021-11-19 04:06:45,049 WARN  akka.remote.transport.netty.NettyTransport
  [] - Remote connection to [null] failed with
java.net.NoRouteToHostException: No route to host
   2021-11-19 04:06:45,067 WARN  akka.remote.ReliableDeliverySupervisor
  [] - Association with remote system [akka.tcp://
flink@192.168.100.79:6123] has failed, address is now gated for [50] ms.
Reason: [Association failed with [akka.tcp://flink@192.168.100.79:6123]]
Caused by: [java.net.NoRouteToHostException: No route to host]

The TaskManagers are able to communicate with the JobManager pod and are
properly registered. The JobMaster, instead, tries to connect to the
ResourceManager (both running on the JobManager pod) but fails.
SlotRequests are triggered but never actually fulfilled. They are put in
the queue for pending SlotRequests. The timeout kicks in after trying to
reach the ResourceManager for some time. That's the
NoResourceAvailableException you are experiencing.
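
A quick way to confirm the routing problem from inside the cluster (a
sketch; it assumes a shell and nc are available in the JobManager image):

```
$ kubectl exec -it <jobmanager-pod> -- nc -zv 192.168.100.79 6123
```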

Matthias

On Fri, Nov 19, 2021 at 7:02 AM Joey L  wrote:

> Hi,
>
> I've set up a Flink 1.12.5 session cluster running on K8s with HA, and
> came across an issue with creating new jobs once the cluster has reached 20
> existing jobs. The first 20 jobs always gets initialized and start running
> within 5 - 10 seconds.
>
> Any new job submission is stuck in Initializing state for a long time (10
> - 30 mins), and eventually it goes to Running but the tasks are stuck in
> Scheduled state despite there being free task slots available. The
> Scheduled jobs will eventually start running, but the delay could be up to
> an hour. Interestingly, this issue doesn't occur once I remove the HA
> config.
>
> Each task manager is configured to have 4 task slots, and I can see via
> the Flink UI that the task managers are registered correctly. (Refer to
> attached screenshot).
>
> [image: Screen Shot 2021-11-19 at 3.08.11 pm.png]
>
> In the logs, I can see that jobs stuck in Scheduled throw this exception
> after 5 minutes (even though there are slots available):
>
> ```
> java.util.concurrent.CompletionException:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
> ```
>
> I've also attached the full job manager logs below.
>
> Any help/guidance would be appreciated.
>
> Thanks,
> Joey
>


Re: FlinkJobNotFoundException

2021-10-14 Thread Matthias Pohl
Hi Doug,
sorry for not being responsive the last two weeks. Other stuff kept me
busy. A few things to note on your issue: It looks like the job result is
requested while the job is executed in a synchronous way. Flink will try
to access the ArchivedExecutionGraphStore to get the job's result after it
finishes. By default, Flink's session mode relies on the
FileArchivedExecutionGraphStore, which is backed by a temporary folder. If
this temporary folder gets cleaned up in any way, Flink will not be able
to retrieve the job result once the number of jobs stored in this
ArchivedExecutionGraphStore exceeds the size of the internally used cache.
The same might be true if the underlying file is corrupted in any way.
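
The cache mentioned above is controlled by a couple of options (a hedged
sketch; the values shown are the documented defaults and are worth
double-checking for 1.9):

```
jobstore.cache-size: 52428800     # max size in bytes of the in-memory cache
jobstore.expiration-time: 3600    # seconds a finished job's graph is kept
```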

This brings up the following questions/action items:
- Could you provide the debug logs for this?
- Could you provide the Flink configuration you use to check the cache
configuration?
- Why do you submit two jobs with the same job ID? This might cause the
job's execution graph to be overwritten in the temporary folder and,
therefore, might lead to a corrupted file while writing the second job's
ExecutionGraph to disk. As a consequence another REST request that is sent
later on to the Flink cluster's REST endpoint with the same job ID should
succeed again. That would be something to try out.
- Could you monitor Flink's temporary folder and see whether there is some
cleanup happening by some other process?

Best,
Matthias

On Wed, Oct 13, 2021 at 3:12 PM Gusick, Doug S  wrote:

> Hi Matthias,
>
>
>
> Do you have any update here?
>
>
>
> Thank you,
>
> Doug
>
>
>
> *From:* Gusick, Doug S [Engineering]
> *Sent:* Thursday, October 7, 2021 9:03 AM
> *To:* Hailu, Andreas [Engineering] ;
> Matthias Pohl 
> *Cc:* user@flink.apache.org; Erai, Rahul [Engineering] <
> rahul.e...@ny.email.gs.com>
> *Subject:* RE: FlinkJobNotFoundException
>
>
>
> Hi Matthias,
>
>
>
> I just wanted to follow up here. Were you able to access the jobmanager
> log? If so, were you able to find anything around the issues we have been
> facing?
>
>
>
> Best,
>
> Doug
>
>
>
> *From:* Hailu, Andreas [Engineering] 
> *Sent:* Thursday, September 30, 2021 8:56 AM
> *To:* Matthias Pohl ; Gusick, Doug S
> [Engineering] 
> *Cc:* user@flink.apache.org; Erai, Rahul [Engineering] <
> rahul.e...@ny.email.gs.com>
> *Subject:* RE: FlinkJobNotFoundException
>
>
>
> Hi Matthias, the log file is quite large (21MB) so mailing it over in its
> entirety may have been a challenge. The file is available here [1], and
> we’re of course happy to share any relevant parts of it with the mailing
> list.
>
>
>
> I think since we’ve shared logs with you before, you weren’t sent over an
> additional welcome email :)
>
>
>
> [1]
> https://lockbox.gs.com/lockbox/folders/dc2ccacc-f2d2-4d66-a098-461b43e8b65f/
>
>
>
> *// *ah
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Thursday, September 30, 2021 2:57 AM
> *To:* Gusick, Doug S [Engineering] 
> *Cc:* user@flink.apache.org; Erai, Rahul [Engineering] <
> rahul.e...@ny.email.gs.com>
> *Subject:* Re: FlinkJobNotFoundException
>
>
>
> I didn't receive any email. But we'd rather not do individual support.
> Please share the logs on the mailing list. This way, anyone is able to
> participate in the discussion.
>
>
>
> Best,
> Matthias
>
>
>
> On Wed, Sep 29, 2021 at 8:12 PM Gusick, Doug S  wrote:
>
> Hi Matthias,
>
>
>
> Thank you for getting back. We have been looking into upgrading to a newer
> version, but have not completed full testing just yet.
>
>
>
> I was unable to find a previous error in the JM logs. You should have
> received an email with details to a “lockbox”. I have uploaded the job
> manager logs there. Please let me know if you need any more information.
>
>
>
> Thank you,
>
> Doug
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Wednesday, September 29, 2021 12:00 PM
> *To:* Gusick, Doug S [Engineering] 
> *Cc:* user@flink.apache.org; Erai, Rahul [Engineering] <
> rahul.e...@ny.email.gs.com>
> *Subject:* Re: FlinkJobNotFoundException
>
>
>
> Hi Doug,
>
> thanks for reaching out to the community. First of all, 1.9.2 is quite an
> old Flink version. You might want to consider upgrading to a newer version.
> The community only offers support for the two most-recent Flink versions.
> Newer versions might include fixes for your issue.
>
>
>
> But back to your actual problem: The logs you're providing only show that
> some job switched into FINISHED state. Is there some error showing up
> earlier in the logs which you might have missed? It would be helpful if you
> could share the complete JobManager logs to get a better understanding of
> what's going on.

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-30 Thread Matthias Pohl
Thanks for sharing. I was wondering why you don't use $PORT0 in your
command. And: Are the ports properly configured in the Marathon network
configuration [1]? But the error seems to be unrelated to that setting.
Other than that, I cannot see any other issue with the configuration.
Could it be that the HOST IP is blocked?

[1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports
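
For reference, a hedged sketch of the Marathon port section this refers to
(a BRIDGE-networked Docker app; names and container ports are illustrative,
and hostPort 0 lets Marathon assign the host ports that then show up as
$PORT0/$PORT1):

```
"container": {
  "type": "DOCKER",
  "docker": {
    "network": "BRIDGE",
    "portMappings": [
      { "containerPort": 8081, "hostPort": 0, "name": "ui" },
      { "containerPort": 6123, "hostPort": 0, "name": "rpc" }
    ]
  }
}
```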

On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas  wrote:

>
> Full appmaster log in debug mode is attached.
> My startup command was
> /opt/flink/bin/mesos-appmaster.sh \
>   -Drest.bind-port=8081 \
>   -Drest.port=8081 \
>   -Djobmanager.rpc.address=$HOST \
>   -Djobmanager.rpc.port=$PORT1 \
>   -Dmesos.resourcemanager.framework.user=flink \
>   -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>   -Dmesos.master=10.0.18.246:5050 \
>   -Dmesos.resourcemanager.tasks.cpus=4 \
>   -Dmesos.resourcemanager.tasks.container.type=docker \
>   -Dmesos.resourcemanager.tasks.container.image.name=
> docker.strava.com/strava/timeline-populator2:jv-mesos \
>   -Dtaskmanager.numberOfTaskSlots=4 ;
>
> where $PORT1 refers to my second host open port, mapped to 6123 on the
> Docker container (first port is mapped to 8081).
> I can see in the log that $HOST and $PORT1 resolve to the correct values, 
> 10.0.20.25
> and 31608
>
> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl 
> wrote:
>
>> ...and if possible, it would be helpful to provide debug logs as well.
>>
>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl 
>> wrote:
>>
>>> May you provide the entire JobManager logs so that we can see what's
>>> going on?
>>>
>>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas  wrote:
>>>
>>>> Thanks again, Matthias!
>>>>
>>>> Putting  -Djobmanager.rpc.address=$HOST and
>>>> -Djobmanager.rpc.port=$PORT0 as params for appmaster.sh
>>>> I see in the log that they seem to resolve to the correct values
>>>>
>>>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>>>
>>>> but a bit later the appmaster dies with this new error. It is unclear
>>>> what address it is trying to bind to. I added explicit params
>>>> -Drest.bind-port=8081 and
>>>>   -Drest.port=8081 in case jobmanager.rpc.port was somehow
>>>> interfering, but that didn't help.
>>>>
>>>> 2021-09-29 10:29:59.845 [main] INFO  
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting 
>>>> MesosSessionClusterEntrypoint down with application status FAILED. 
>>>> Diagnostics java.net.BindException: Cannot assign requested address
>>>>at java.base/sun.nio.ch.Net.bind0(Native Method)
>>>>at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>at 
>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>>>at 
>>>> org.apache.flin

Re: Start Flink cluster, k8s pod behavior

2021-09-30 Thread Matthias Pohl
Hi Qihua,
I guess looking into kubectl describe and the JobManager logs would help
in understanding what's going on.
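
For example (the pod name is taken from your output; --previous shows the
log of the crashed attempt):

```
$ kubectl describe pod job-manager-776dcf6dd-xzs8g
$ kubectl logs job-manager-776dcf6dd-xzs8g --previous
```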

Best,
Matthias

On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang  wrote:

> Hi,
> I deployed flink in session mode. I didn't run any jobs. I saw below logs.
> That is normal, same as Flink menual shows.
>
> + /opt/flink/bin/run-job-manager.sh
> Starting HA cluster with 1 masters.
> Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
> Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.
>
> But when I check kubectl, it shows the status is Completed. After a while,
> the status changed to CrashLoopBackOff, and the pod restarted.
> NAME                          READY   STATUS             RESTARTS   AGE
> job-manager-776dcf6dd-xzs8g   0/1     Completed          5          5m27s
>
> NAME                          READY   STATUS             RESTARTS   AGE
> job-manager-776dcf6dd-xzs8g   0/1     CrashLoopBackOff   5          7m35s
>
> Can anyone help me understand why?
> Why does Kubernetes regard this pod as completed and restart it? Should I
> configure something, either on the Flink side or the Kubernetes side? From
> the Flink manual, after the cluster is started, I can upload a jar to run
> the application.
>
> Thanks,
> Qihua
>


Re: FlinkJobNotFoundException

2021-09-30 Thread Matthias Pohl
I didn't receive any email. But we'd rather not do individual support. Please
share the logs on the mailing list. This way, anyone is able to participate
in the discussion.

Best,
Matthias

On Wed, Sep 29, 2021 at 8:12 PM Gusick, Doug S  wrote:

> Hi Matthias,
>
>
>
> Thank you for getting back. We have been looking into upgrading to a newer
> version, but have not completed full testing just yet.
>
>
>
> I was unable to find a previous error in the JM logs. You should have
> received an email with details to a “lockbox”. I have uploaded the job
> manager logs there. Please let me know if you need any more information.
>
>
>
> Thank you,
>
> Doug
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Wednesday, September 29, 2021 12:00 PM
> *To:* Gusick, Doug S [Engineering] 
> *Cc:* user@flink.apache.org; Erai, Rahul [Engineering] <
> rahul.e...@ny.email.gs.com>
> *Subject:* Re: FlinkJobNotFoundException
>
>
>
> Hi Doug,
>
> thanks for reaching out to the community. First of all, 1.9.2 is quite an
> old Flink version. You might want to consider upgrading to a newer version.
> The community only offers support for the two most-recent Flink versions.
> Newer versions might include fixes for your issue.
>
>
>
> But back to your actual problem: The logs you're providing only show that
> some job switched into FINISHED state. Is there some error showing up
> earlier in the logs which you might have missed? It would be helpful if you
> could share the complete JobManager logs to get a better understanding of
> what's going on.
>
>
>
> Best,
> Matthias
>
>
>
> On Wed, Sep 29, 2021 at 3:47 PM Gusick, Doug S  wrote:
>
> Hello,
>
>
>
> We are facing an issue with some of our applications that are submitting a
> high volume of jobs to Flink (we are using v1.9.2). We are observing that
> numerous jobs (in this case 44 out of 350+) fail with the same
> FlinkJobNotFoundException within a 45 second timeframe.
>
>
>
> From our client logs, this is the exception we can see:
>
>
>
> Calc Engine: Caused by: 
> org.apache.flink.runtime.rest.util.RestClientException: 
> [org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (d0991f0ae712a9df710aa03311a32c8c)]
>
> Calc Engine:   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>
> Calc Engine:   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>
> Calc Engine:   ... 3 more
>
>
>
>
>
> This is the first job to fail with the above exception. From the
> JobManager logs, we can see that the job goes to FINISHED State, and then
> we see the following exception:
>
>
>
> 2021-09-28 04:54:16,936 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink
> Java Job at Tue Sep 28 04:48:21 EDT 2021 (d0991f0ae712a9df710aa03311a32c8c)
> switched from state RUNNING to FINISHED.
>
> 2021-09-28 04:54:16,937 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job
> d0991f0ae712a9df710aa03311a32c8c reached globally terminal state FINISHED.
>
> 2021-09-28 04:54:16,939 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.jobmaster.JobMaster  - Stopping
> the JobMaster for job Flink Java Job at Tue Sep 28 04:48:21 EDT
> 2021(d0991f0ae712a9df710aa03311a32c8c).
>
> 2021-09-28 04:54:16,940 INFO  [flink-akka.actor.default-dispatcher-39]
> org.apache.flink.yarn.YarnResourceManager - Disconnect
> job manager
> 0...@akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392
> for job d0991f0ae712a9df710aa03311a32c8c from the resource manager.
>
> 2021-09-28 04:54:18,256 ERROR [flink-akka.actor.default-dispatcher-91]
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  -
> Exception occurred in REST handler:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (d0991f0ae712a9df710aa03311a32c8c)
>
>
>
> Here are the relevant logs from the TaskManager. We can see that the 
> JobLeaderService tries to reconnect to the job. Any ideas as to why it is 
> trying to reconnect?:
>
>
> 2021-09-28 04:54:13,382 INFO  [flink-akka.actor.def

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
...and if possible, it would be helpful to provide debug logs as well.

On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl 
wrote:

> May you provide the entire JobManager logs so that we can see what's going
> on?
>
> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas  wrote:
>
>> Thanks again, Matthias!
>>
>> Putting  -Djobmanager.rpc.address=$HOST and  -Djobmanager.rpc.port=$PORT0
>> as params for appmaster.sh
>> I see in the log that they seem to resolve to the correct values
>>
>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>
>> but a bit later the appmaster dies with this new error. It is unclear
>> what address it is trying to bind to. I added explicit params
>> -Drest.bind-port=8081 and
>>   -Drest.port=8081 in case jobmanager.rpc.port was somehow
>> interfering, but that didn't help.
>>
>> 2021-09-29 10:29:59.845 [main] INFO  
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting 
>> MesosSessionClusterEntrypoint down with application status FAILED. 
>> Diagnostics java.net.BindException: Cannot assign requested address
>>  at java.base/sun.nio.ch.Net.bind0(Native Method)
>>  at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>  at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>  at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>  at 
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>  at java.base/java.lang.Thread.run(Unknown Source)
>>
>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
>> wrote:
>>
>>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>
>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas  wrote:
>>>
>>>> Matthias, thanks for the suggestion! I changed my
>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0 which in the
>>>> log I see resolves properly to the host IP and port mapped to 8081
>>>>
>>>> 2021-09-29 07:58:05.452 [main] INFO
>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>>
>>>> which is very promising. But sadly a little bit later appmaster dies
>>>> with this errror:
>>>>
>>>> 2021-09-29 07:58:05.648 [main] INFO
>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>>> cluster services.
>>>> 2021-09-29 07:58:05.674 [main] INFO
>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
May you provide the entire JobManager logs so that we can see what's going
on?

On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas  wrote:

> Thanks again, Matthias!
>
> Putting  -Djobmanager.rpc.address=$HOST and  -Djobmanager.rpc.port=$PORT0
> as params for appmaster.sh
> I see in the log that they seem to resolve to the correct values
>
> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>
> but a bit later the appmaster dies with this new error. It is unclear what
> address it is trying to bind to. I added explicit params
> -Drest.bind-port=8081 and
>   -Drest.port=8081 in case jobmanager.rpc.port was somehow
> interfering, but that didn't help.
>
> 2021-09-29 10:29:59.845 [main] INFO  
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting 
> MesosSessionClusterEntrypoint down with application status FAILED. 
> Diagnostics java.net.BindException: Cannot assign requested address
>   at java.base/sun.nio.ch.Net.bind0(Native Method)
>   at java.base/sun.nio.ch.Net.bind(Unknown Source)
>   at java.base/sun.nio.ch.Net.bind(Unknown Source)
>   at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>   at 
> org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>   at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Unknown Source)
>
> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
> wrote:
>
>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>
>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas  wrote:
>>
>>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
>>> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
>>> properly to the host IP and port mapped to 8081
>>>
>>> 2021-09-29 07:58:05.452 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>
>>> which is very promising. But sadly a little bit later appmaster dies
>>> with this errror:
>>>
>>> 2021-09-29 07:58:05.648 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>> cluster services.
>>> 2021-09-29 07:58:05.674 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>> at
>>> org.apache.flink.runtime.clusterframework.Bootstrap

Re: FlinkJobNotFoundException

2021-09-29 Thread Matthias Pohl
Hi Doug,
thanks for reaching out to the community. First of all, 1.9.2 is quite an
old Flink version. You might want to consider upgrading to a newer version.
The community only offers support for the two most-recent Flink versions.
Newer versions might include fixes for your issue.

But back to your actual problem: The logs you're providing only show that
some job switched into FINISHED state. Is there some error showing up
earlier in the logs which you might have missed? It would be helpful if you
could share the complete JobManager logs to get a better understanding of
what's going on.

Best,
Matthias

On Wed, Sep 29, 2021 at 3:47 PM Gusick, Doug S  wrote:

> Hello,
>
>
>
> We are facing an issue with some of our applications that are submitting a
> high volume of jobs to Flink (we are using v1.9.2). We are observing that
> numerous jobs (in this case 44 out of 350+) fail with the same
> FlinkJobNotFoundException within a 45 second timeframe.
>
>
>
> From our client logs, this is the exception we can see:
>
>
>
> Calc Engine: Caused by: 
> org.apache.flink.runtime.rest.util.RestClientException: 
> [org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (d0991f0ae712a9df710aa03311a32c8c)]
>
> Calc Engine:   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>
> Calc Engine:   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>
> Calc Engine:   ... 3 more
>
>
>
>
>
> This is the first job to fail with the above exception. From the
> JobManager logs, we can see that the job goes to FINISHED State, and then
> we see the following exception:
>
>
>
> 2021-09-28 04:54:16,936 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink
> Java Job at Tue Sep 28 04:48:21 EDT 2021 (d0991f0ae712a9df710aa03311a32c8c)
> switched from state RUNNING to FINISHED.
>
> 2021-09-28 04:54:16,937 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job
> d0991f0ae712a9df710aa03311a32c8c reached globally terminal state FINISHED.
>
> 2021-09-28 04:54:16,939 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.jobmaster.JobMaster  - Stopping
> the JobMaster for job Flink Java Job at Tue Sep 28 04:48:21 EDT
> 2021(d0991f0ae712a9df710aa03311a32c8c).
>
> 2021-09-28 04:54:16,940 INFO  [flink-akka.actor.default-dispatcher-39]
> org.apache.flink.yarn.YarnResourceManager - Disconnect
> job manager
> 0...@akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392
> for job d0991f0ae712a9df710aa03311a32c8c from the resource manager.
>
> 2021-09-28 04:54:18,256 ERROR [flink-akka.actor.default-dispatcher-91]
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  -
> Exception occurred in REST handler:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (d0991f0ae712a9df710aa03311a32c8c)
>
>
>
> Here are the relevant logs from the TaskManager. We can see that the 
> JobLeaderService tries to reconnect to the job. Any ideas as to why it is 
> trying to reconnect?:
>
>
> 2021-09-28 04:54:13,382 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Receive slot 
> request b26c04706fd5aad03dfdca8691f1bf1c for job 
> d0991f0ae712a9df710aa03311a32c8c from resource manager with leader id 
> .
>
> 2021-09-28 04:54:13,383 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.JobLeaderService- Add job 
> d0991f0ae712a9df710aa03311a32c8c for job leader monitoring.
>
> 2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.JobLeaderService- Successful 
> registration at job manager 
> akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392 for job 
> d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Establish 
> JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor- Offer 
> reserved slots to the leader of job d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:13,405 INFO  [CHAIN DataSource (settl_delivery_type_code | 
> DistCp | Sourcing Files) -> FlatMap (settl_delivery_type_code | DistCp | 

Re: flink rest endpoint creation failure

2021-09-29 Thread Matthias Pohl
Hi Curt,
could you elaborate a bit more on your setup? Maybe provide the commands
you used to deploy the jobs and the Flink/YARN logs. What's puzzling me is
your
statement about "two JobManagers spinning up" and "everything's working
fine if two TaskManagers are running on different instances".
- When talking about Flink applications, you're talking about application
mode?
- I have the feeling you're mixing up JobManager and TaskManager in your
initial description. Could you clarify this?
- Actually, each of the Flink components (JobManager and TaskManager)
should run in its own YARN container. The way you describe it, it sounds
like Flink runs within one container?
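
Independent of those questions, a hedged sketch for the port collision you
suspect: give the REST server a range to bind to instead of a single port,
so two JobManagers landing on the same host can coexist (the range is
illustrative):

```
# flink-conf.yaml
rest.bind-port: 8081-8099   # the server binds the first free port in the range
```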

Best,
Matthias



On Thu, Sep 23, 2021 at 5:14 PM Curt Buechter  wrote:

> Thanks Robert,
> But, no, the rest.bind-port is not set to 35485 in the configuration.
> Other jobs use different ports, so it is getting set dynamically.
>
>
> #==
> # Rest & web frontend
>
> #==
>
> # The port to which the REST client connects to. If rest.bind-port has
> # not been specified, then the server will bind to this port as well.
> #
> #rest.port: 8081
>
> # The address to which the REST client will connect to
> #
> #rest.address: 0.0.0.0
>
> # Port range for the REST and web server to bind to.
> #
> #rest.bind-port: 8080-8090
>
> # The address that the REST & web server binds to
> #
> #rest.bind-address: 0.0.0.0
>
> # Flag to specify whether job submission is enabled from the web-based
> # runtime monitor. Uncomment to disable.
>
> #web.submit.enable: false
>
>
>
> On Wed, Sep 22, 2021 at 11:46 AM Curt Buechter 
> wrote:
>
>> Hi,
>> I'm getting an error that happens randomly when starting a flink
>> application.
>>
>> For context, this is running in YARN on AWS. This application is one that
>> converts from the Table API to the Stream API, so two flink
>> applications/jobmanagers are trying to start up. I think what happens is
>> that the rest api port is chosen, and is the same for both of the flink
>> apps. If YARN chooses two different instances for the two task managers,
>> they each work fine and start their rest api on the same port on their own
>> respective machine. But, if YARN chooses the same instance for both job
>> managers, they both try to start up the rest api on the same port on the
>> same machine, and I get the error.
>>
>> Here is the error:
>>
>> 2021-09-22 15:47:27,724 ERROR 
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Could not 
>> start cluster entrypoint YarnJobClusterEntrypoint.
>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
>> initialize the cluster entrypoint YarnJobClusterEntrypoint.
>>  at 
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at 
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>  [flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at 
>> org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:99)
>>  [flink-dist_2.12-1.13.2.jar:1.13.2]
>> Caused by: org.apache.flink.util.FlinkException: Could not create the 
>> DispatcherResourceManagerComponent.
>>  at 
>> org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:275)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at 
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:250)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at 
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at java.security.AccessController.doPrivileged(Native Method) 
>> ~[?:1.8.0_282]
>>  at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_282]
>>  at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>>  ~[hadoop-common-3.2.1-amzn-3.jar:?]
>>  at 
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at 
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  ... 2 more
>> Caused by: java.net.BindException: Could not start rest endpoint on any port 
>> in port range 35485
>>  at 
>> org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:234)
>>  ~[flink-dist_2.12-1.13.2.jar:1.13.2]
>>  at 
>> org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:172)
>>  

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
aRpcServiceUtils.java:92)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
> ... 2 common frames omitted
> Caused by: java.lang.IllegalArgumentException: null
> at
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
> ... 17 common frames omitted
>
>
>
> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl 
> wrote:
>
>> One thing that was puzzling me yesterday when reading your post: Have you
>> tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
>> played around with Mesos, I remember using HOST to resolve the host's IP
>> address instead of the host's name. It could be that the hostname itself
>> cannot be resolved to the right IP address. But I struggled to find proper
>> documentation to back that up. Only in the recipes section of the Marathon
>> docs [1], HOST was used as well.
>>
>> Matthias
>>
>> [1]
>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>
>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas  wrote:
>>
>>> Another update:  Looking more carefully in my appmaster log, I see the
>>> following
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>> Registering as new framework.
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>> -
>>>
>>> ---
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>> Info:
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Master
>>> URL: 10.0.18.246:5050
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>> Info:
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - ID:
>>> (none)
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Name:
>>> flink-test
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Failover
>>> Timeout (secs): 604800.0
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Role:
>>> *
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - 
>>> Capabilities:
>>> (none)
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - 
>>> Principal:
>>> (none)
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Host:
>>> 311dcf7fd77c
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Matthias Pohl
ult-dispatcher-3] DEBUG
> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
> (StoppedState -> ConnectingState) with data ()
>
> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
> resource manager started.
>
> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
> org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
> (Suspended -> Suspended) with data GatherData(List(),List())
>
> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect
> to Mesos; still trying...
>
> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
> heartbeat request.
>
> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
> heartbeat request.
>
>
>
>
> So why was the appmaster able to connect to the Mesos master to create the
> framework, but failed to connect later to do whatever it does next?
>
>
> One possible issue I see is that the framework is set with its web UI at
> http://311dcf7fd77c:8081, which cannot be resolved from the Mesos master.
> 311dcf7fd77c
> is the result of doing hostname on the Docker container, and the Mesos
> master cannot resolve that name. I could try to replace the Docker
> container hostname with the Docker host hostname, but the host port that
> gets mapped to 8081 on the container is a random port that I cannot know
> beforehand. Does the Mesos master try to reach Flink using that Web UI
> setting?
> Could this be the issue causing my connection problem, or is this a red
> herring and the problem is a different one?
>
>
> Thanks,
>
>
> Javier Vegas
>
>
>
>
>
>
>
>
> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas  wrote:
>
>> Thanks, Matthias!
>>
>> There are lots of apps deployed to the Mesos cluster; the task manager
>> itself is deployed to Mesos via Marathon. In the Mesos log I can see the
>> JobManager agent starting, but no error messages related to it. As you
>> say, TaskManagers don't even have the chance to get confused about
>> variables, since the JobManager cannot connect to the Mesos master to
>> tell it to start the TaskManagers.
>>
>> Thanks,
>>
>> Javier
>>
>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl 
>> wrote:
>>
>>> Hi Javier,
>>> I don't see anything that's configured in the wrong way based on the
>>> jobmanager logs you've provided. Have you been able to deploy other
>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>> anything? The variable resolution on the TaskManager side is a valid
>>> concern shared by Roman since it's easy to run into such an issue. But the
>>> JobManager logs indicate that the JobManager is not able to contact the
>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>> not coming up.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> No additional ports need to be open as far as I know.
>>>>
>>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>>
>>>> Please also make sure that the following gets executed before
>>>> mesos-appmaster.sh:
>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>> (as per the documentation you linked)
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas  wrote:
>>>> >
>>>> > I am trying to start Flink 1.13.2 on Mesos following the instructions
>>>> in
>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>> binaries.
>>>> >
>>>> > My entrypoint for the Docker image is:
>>>> >
>>>> >
>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>> >
>>>> >   -Djobmanager.rpc.address=$HOSTNAME \
>>>> >
>>>> >   -Dmesos.resourcemanager.framework.user=flink \
>>>> >

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Matthias Pohl
Hi Javier,
I don't see anything that's configured in the wrong way based on the
jobmanager logs you've provided. Have you been able to deploy other
applications to this Mesos cluster? Do the Mesos master logs reveal
anything? The variable resolution on the TaskManager side is a valid
concern shared by Roman since it's easy to run into such an issue. But the
JobManager logs indicate that the JobManager is not able to contact the
Mesos master. Hence, I'd assume that it's not related to the TaskManagers
not coming up.

Best,
Matthias

On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan  wrote:

> Hi,
>
> No additional ports need to be open as far as I know.
>
> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>
> Please also make sure that the following gets executed before
> mesos-appmaster.sh:
> export HADOOP_CLASSPATH=$(hadoop classpath)
> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
> (as per the documentation you linked)
>
> Regards,
> Roman
>
> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas  wrote:
> >
> > I am trying to start Flink 1.13.2 on Mesos following the instructions in
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
> and using Marathon to deploy a Docker image with both the Flink and my
> binaries.
> >
> > My entrypoint for the Docker image is:
> >
> >
> > /opt/flink/bin/mesos-appmaster.sh \
> >
> >   -Djobmanager.rpc.address=$HOSTNAME \
> >
> >   -Dmesos.resourcemanager.framework.user=flink \
> >
> >   -Dmesos.master=10.0.18.246:5050 \
> >
> >   -Dmesos.resourcemanager.tasks.cpus=6
> >
> >
> >
> > When mesos-appmaster.sh starts, in the stderr I see this:
> >
> >
> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
> >
> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent
> f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
> >
> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
> executor on 10.0.20.177
> >
> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
> >
> > WARNING: Your kernel does not support swap limit capabilities or the
> cgroup is not mounted. Memory limited without swap.
> >
> > WARNING: An illegal reflective access operation has occurred
> >
> > WARNING: Illegal reflective access by
> org.apache.hadoop.security.authentication.util.KerberosUtil
> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
> sun.security.krb5.Config.getInstance()
> >
> > WARNING: Please consider reporting this to the maintainers of
> org.apache.hadoop.security.authentication.util.KerberosUtil
> >
> > WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
> >
> > WARNING: All illegal access operations will be denied in a future release
> >
> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
> >
> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
> master@10.0.18.246:5050
> >
> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
> Attempting to register without authentication
> >
> >
> > where the "New master detected" line is promising.
> >
> > However, on the Flink UI I see only the jobmanager started, and there
> are no task managers.  Getting into the Docker container, I see this in the
> log:
> >
> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
> connect to Mesos; still trying...
> >
> >
> > I have verified that from the container I can access the Mesos container
> 10.0.18.246:5050
> >
> >
> > Does any other port besides the web UI port 5050 need to be open for
> mesos-appmaster to connect with the Mesos master?
> >
> >
> > In the appmaster log (attached) I see one exception that I don't know if
> they are related to the Mesos connection problem, one is
> >
> >
> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
> >
> > at
> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
> >
> > at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
> >
> > at org.apache.hadoop.util.Shell.(Shell.java:496)
> >
> > at
> org.apache.hadoop.util.StringUtils.(StringUtils.java:79)
> >
> > at
> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
> >
> > at
> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
> >
> > at
> org.apache.hadoop.security.SecurityUtil.(SecurityUtil.java:90)
> >
> > at
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
> >
> > at
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
> >
> > at
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
> >
> > at
> 

Re: hdfs lease issues on flink retry

2021-09-20 Thread Matthias Pohl
I don't know of any side effects of your approach. But another workaround I
saw was replacing the _0 suffix by something like "_" +
System.currentTimeMillis().
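
As a rough sketch (untested, and the variable name is illustrative), that
alternative inside the same open() method would look like:

String uniqueSuffix = "_" + System.currentTimeMillis();
TaskAttemptID taskAttemptID = TaskAttemptID.forName(
    "attempt___r_"
    + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
    + Integer.toString(taskNumber + 1)
    + uniqueSuffix);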

On Fri, Sep 17, 2021 at 8:38 PM Shah, Siddharth 
wrote:

> Hi Matthias,
>
>
>
> Thanks for looking into the issue and creating a ticket. I am thinking of
> having a workaround until the issue is fixed.
>
>
>
> What if I create the attempt directories with a random int by patching
> *HadoopOutputFormatBase*’s open() method?
>
>
>
> Original:
>
>
>
> TaskAttemptID taskAttemptID = TaskAttemptID.forName(
>     "attempt___r_"
>     + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
>     + Integer.toString(taskNumber + 1)
>     + "_0");
>
>
>
>
>
> Patched:
>
>
>
> int attemptRandomPrefix = new Random().nextInt(999);
>
> TaskAttemptID taskAttemptID = TaskAttemptID.forName(
>     "attempt__"
>     + String.format("%" + (4 - Integer.toString(attemptRandomPrefix).length()) + "s", " ").replace(" ", "0")
>     + Integer.toString(attemptRandomPrefix)
>     + "_r_"
>     + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
>     + Integer.toString(taskNumber + 1)
>     + "_0");
>
>
>
>
>
> So basically I am creating a directory named attempt__0123_r_0001_0 instead
> of attempt___r_0001_0. I have tested this on a handful of our jobs and it
> seems to be working fine. Just wanted to check for any downside of this
> change that I may not be aware of?
>
>
>
> Thanks,
>
> Siddharth
>
>
>
>
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Tuesday, September 07, 2021 5:06 AM
> *To:* Shah, Siddharth [Engineering] 
> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <
> andreas.ha...@ny.email.gs.com>
> *Subject:* Re: hdfs lease issues on flink retry
>
>
>
> Just for documentation purposes: I created FLINK-24147 [1] to cover this
> issue.
>
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-24147
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D24147=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=eLqB-T65EFJVVpR6QlSfRHIga7DPK3o8yJvw_OhnMvk=jTswrQDq0l9TalcFRAb297cz4EfsU-LznMJyB2uXDl0=LgUitz7kzpyweO3xqm7f19qxwbHh_LbQ-_M1zOxutpM=>
>
>
>
> On Thu, Aug 26, 2021 at 6:14 PM Matthias Pohl 
> wrote:
>
> I see - I should have checked my mailbox before answering. I received the
> email and was able to login.
>
>
>
> On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl 
> wrote:
>
> The link doesn't work, i.e. I'm redirected to a login page. It would be
> also good to include the Flink logs and make them accessible for everyone.
> This way others could share their perspective as well...
>
>
>
> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <
> siddharth.x.s...@gs.com> wrote:
>
> Hi Matthias,
>
>
>
> Thank you for responding and taking time to look at the issue.
>
>
>
>>> Uploaded the yarn logs here:
> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
> and have also requested read permissions for you. Please let us know if
> you’re not able to see the files.
>
>
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Thursday, August 26, 2021 9:47 AM
> *To:* Shah, Siddharth [Engineering] 
> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <
> andreas.ha...@ny.email.gs.com>
> *Subject:* Re: hdfs lease issues on flink retry
>
>
>
> Hi Siddharth,
>
> thanks for reaching out to the community. This might be a bug. Could you
> share your Flink and YARN logs? This way we could get a better
> understanding of what's going on.
>
>
>
> Best,
> Matthias
>
>
>
> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
> siddharth.x.s...@gs.com> wrote:
>
> Hi  Team,
>
>
>
> We are seeing transient failures in the jobs mostly requiring higher
> resources and using flink RestartStrategies
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ci.apache.org_projects_flink_flink-2Ddocs-2Drelease-2D1.13_docs_dev_execution_task-5Ffailure-5Frecovery_-23fixed-2Ddelay-2Drestart-2Dstrategy=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=eLqB-T65EFJVVpR6QlSfRHIga7DPK3o8yJvw_OhnMvk=qIClgDVq00Jp0qluJfWV-aGM7Sg7tAnr_I2yy4TtNaM=wL6-8B4mnGofyRWetXrTSw9FBSV-XTDnoHsPtz

Re: Fast serialization for Kotlin data classes

2021-09-16 Thread Matthias Pohl
True, that's a valid concern you raised here, Alexis. Thanks for pointing
that out.

On Thu, Sep 16, 2021 at 1:58 PM Alexis Sarda-Espinosa <
alexis.sarda-espin...@microfocus.com> wrote:

> Someone please correct me if I’m wrong but, until FLINK-16686 [1] is
> fixed, a class must be a POJO to be used in managed state with RocksDB,
> right? That’s not to say that the approach with TypeInfoFactory won’t work,
> just that even then it will mean none of the data classes can be used for
> managed state.
>
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-16686
>
>
>
> Regards,
>
> Alexis.
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Donnerstag, 16. September 2021 13:12
> *To:* Alex Cruise 
> *Cc:* Flink ML 
> *Subject:* Re: Fast serialization for Kotlin data classes
>
>
>
> Hi Alex,
>
> have you had a look at TypeInfoFactory? That might be the best way to come
> up with a custom serialization mechanism. See the docs [1] for further
> details.
>
>
>
> Best,
> Matthias
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/types_serialization/#defining-type-information-using-a-factory
>
>
>
> On Tue, Sep 14, 2021 at 8:33 PM Alex Cruise  wrote:
>
> Hi there,
>
>
>
> I appreciate the fact that Flink has built-in support for making POJO and
> Scala `case class` serialization faster, but in my project we use immutable
> Kotlin `data class`es (analogous to Scala `case class`es) extensively, and
> we'd really prefer not to make them POJOs, mostly for style/taste reasons
> (e.g. need a default constructor and setters, both are anathema!)
>
>
>
> Does anyone know of a good way for us to keep using idiomatic, immutable
> Kotlin data classes, but to get much faster serialization performance in
> Flink?
>
>
>
> Thanks!
>
>
>
> -0xe1a
>
>


Re: Fast serialization for Kotlin data classes

2021-09-16 Thread Matthias Pohl
Hi Alex,
have you had a look at TypeInfoFactory? That might be the best way to come
up with a custom serialization mechanism. See the docs [1] for further
details.

Best,
Matthias

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/serialization/types_serialization/#defining-type-information-using-a-factory
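
A minimal sketch of the wiring (all names are made up; the factory below
still falls back to generic Kryo serialization, so for real gains you would
return a custom TypeInformation backed by a hand-written TypeSerializer):

import org.apache.flink.api.common.typeinfo.TypeInfo
import org.apache.flink.api.common.typeinfo.TypeInfoFactory
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.common.typeinfo.Types
import java.lang.reflect.Type

// Hypothetical immutable Kotlin data class.
@TypeInfo(EventInfoFactory::class)
data class Event(val id: Long, val payload: String)

// Flink consults this factory whenever it extracts type information for Event.
class EventInfoFactory : TypeInfoFactory<Event>() {
    override fun createTypeInfo(
        t: Type,
        genericParameters: Map<String, TypeInformation<*>>
    ): TypeInformation<Event> =
        // Placeholder: returns a GenericTypeInfo directly (avoiding
        // re-extraction); swap in your custom TypeInformation here.
        Types.GENERIC(Event::class.java)
}

Registering the factory via the annotation means every use of Event in the
job picks it up automatically, without touching the job code itself.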

On Tue, Sep 14, 2021 at 8:33 PM Alex Cruise  wrote:

> Hi there,
>
> I appreciate the fact that Flink has built-in support for making POJO and
> Scala `case class` serialization faster, but in my project we use immutable
> Kotlin `data class`es (analogous to Scala `case class`es) extensively, and
> we'd really prefer not to make them POJOs, mostly for style/taste reasons
> (e.g. need a default constructor and setters, both are anathema!)
>
> Does anyone know of a good way for us to keep using idiomatic, immutable
> Kotlin data classes, but to get much faster serialization performance in
> Flink?
>
> Thanks!
>
> -0xe1a
>


Re: [ANNOUNCE] Flink mailing lists archive service has migrated to Apache Archive service

2021-09-15 Thread Matthias Pohl
Thanks Leonard for the announcement. I guess that is helpful.

@Robert is there any way we can change the default setting to something
else (e.g. greater than 0 days)? Only having the last month available as a
default is kind of annoying considering that the time setting is quite
hidden.

Matthias

PS: As a workaround, one could use the gte=0d parameter which is encoded in
the URL (e.g. if you use managed search engines in Chrome or Firefox's
bookmark keywords:
https://lists.apache.org/x/list.html?u...@flink.apache.org:gte=0d:%s). That
will make all posts available right away.

On Mon, Sep 6, 2021 at 3:16 PM JING ZHANG  wrote:

> Thanks Leonard for driving this.
> The information is helpful.
>
> Best,
> JING ZHANG
>
> Jark Wu  wrote on Monday, September 6, 2021 at 4:59 PM:
>
>> Thanks Leonard,
>>
>> I have seen many users complaining that the Flink mailing list doesn't
>> work (they were using Nabble).
>> I think this information would be very helpful.
>>
>> Best,
>> Jark
>>
>> On Mon, 6 Sept 2021 at 16:39, Leonard Xu  wrote:
>>
>>> Hi, all
>>>
>>> The mailing list archive service Nabble Archive was broken at the end of
>>> June, the Flink community has migrated the mailing lists archives[1] to
>>> Apache Archive service by commit[2], you can refer [3] to know more mailing
>>> lists archives of Flink.
>>>
>>> Apache Archive service is maintained by ASF thus the stability is
>>> guaranteed, it’s a web-based mail archive service which allows you to
>>> browse, search, interact, subscribe, unsubscribe, etc. with mailing lists.
>>>
>>> Apache Archive service shows mails of the last month by default, you can
>>> specify the date range to browse, search the history mails.
>>>
>>>
>>> Hope it would be helpful.
>>>
>>> Best,
>>> Leonard
>>>
>>> [1] The Flink mailing lists in Apache archive service
>>> dev mailing list archives:
>>> https://lists.apache.org/list.html?d...@flink.apache.org
>>> user mailing list archives :
>>> https://lists.apache.org/list.html?u...@flink.apache.org
>>> user-zh mailing list archives :
>>> https://lists.apache.org/list.html?user-zh@flink.apache.org
>>> [2]
>>> https://github.com/apache/flink-web/commit/9194dda862da00d93f627fd315056471657655d1
>>> [3] https://flink.apache.org/community.html#mailing-lists
>>
>>


Re: KafkaSource builder and checkpointing with parallelism > kafka partitions

2021-09-15 Thread Matthias Pohl
Hi Lars,
I guess you are looking
for execution.checkpointing.checkpoints-after-tasks-finish.enabled [1].
This configuration parameter is going to be introduced in the upcoming
Flink 1.14 release.

Best,
Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#execution-checkpointing-checkpoints-after-tasks-finish-enabled
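
For reference, this is how it would look in flink-conf.yaml once you are on
1.14 (the option name is taken from the linked docs):

execution.checkpointing.checkpoints-after-tasks-finish.enabled: true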

On Wed, Sep 15, 2021 at 11:26 AM Lars Skjærven  wrote:

> Using KafkaSource builder with a job parallelism larger than the number of
> kafka partitions, the job is unable to checkpoint.
>
> With a job parallelism of 4, 3 of the tasks are marked as FINISHED for the
> kafka topic with one partition. For this reason checkpointing seems to be
> disabled.
>
> When using FlinkKafkaConsumer (instead of KafkaSource builder) we don't
> see this behavior, and all 4 tasks have status RUNNING.
>
> Is there any way of using the KafkaSource builder and getting the same
> behavior as FlinkKafkaConsumer for the number of tasks being used?
>
> Code with KafkaSource.builder:
>
> val metadataSource = KafkaSource.builder[Metadata]()
>   .setBootstrapServers("kafka-server")
>   .setGroupId("my-group")
>   .setTopics("my-topic")
>   .setDeserializer(new MetadataDeserializationSchema)
>   .setStartingOffsets(OffsetsInitializer.earliest())
>   .build()
>
> Code with FlinkKafkaConsumer:
> val metadataSource = new FlinkKafkaConsumer[Metadata](
>   "my-topic",
>   new MetadataDeserializationSchema,
>   "my-server)
>   .setStartFromEarliest()
>
> Thanks in advance,
> Lars
>


Re: hdfs lease issues on flink retry

2021-09-07 Thread Matthias Pohl
Just for documentation purposes: I created FLINK-24147 [1] to cover this
issue.

[1] https://issues.apache.org/jira/browse/FLINK-24147

On Thu, Aug 26, 2021 at 6:14 PM Matthias Pohl 
wrote:

> I see - I should have checked my mailbox before answering. I received the
> email and was able to login.
>
> On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl 
> wrote:
>
>> The link doesn't work, i.e. I'm redirected to a login page. It would be
>> also good to include the Flink logs and make them accessible for everyone.
>> This way others could share their perspective as well...
>>
>> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <
>> siddharth.x.s...@gs.com> wrote:
>>
>>> Hi Matthias,
>>>
>>>
>>>
>>> Thank you for responding and taking time to look at the issue.
>>>
>>>
>>>
>>> Uploaded the yarn logs here:
>>> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
>>> and have also requested read permissions for you. Please let us know if
>>> you’re not able to see the files.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Matthias Pohl 
>>> *Sent:* Thursday, August 26, 2021 9:47 AM
>>> *To:* Shah, Siddharth [Engineering] 
>>> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <
>>> andreas.ha...@ny.email.gs.com>
>>> *Subject:* Re: hdfs lease issues on flink retry
>>>
>>>
>>>
>>> Hi Siddharth,
>>>
>>> thanks for reaching out to the community. This might be a bug. Could you
>>> share your Flink and YARN logs? This way we could get a better
>>> understanding of what's going on.
>>>
>>>
>>>
>>> Best,
>>> Matthias
>>>
>>>
>>>
>>> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
>>> siddharth.x.s...@gs.com> wrote:
>>>
>>> Hi  Team,
>>>
>>>
>>>
>>> We are seeing transient failures in the jobs mostly requiring higher
>>> resources and using flink RestartStrategies
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ci.apache.org_projects_flink_flink-2Ddocs-2Drelease-2D1.13_docs_dev_execution_task-5Ffailure-5Frecovery_-23fixed-2Ddelay-2Drestart-2Dstrategy=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=eLqB-T65EFJVVpR6QlSfRHIga7DPK3o8yJvw_OhnMvk=qIClgDVq00Jp0qluJfWV-aGM7Sg7tAnr_I2yy4TtNaM=wL6-8B4mnGofyRWetXrTSw9FBSV-XTDnoHsPtzU7h7c=>
>>> [1]. Upon checking the yarn logs we have observed hdfs lease issues when
>>> flink retry happens. The job originally fails for the first try with 
>>> PartitionNotFoundException
>>> or NoResourceAvailableException, but on retry it seems from the yarn logs
>>> that the lease for the temp sink directory is not yet released by the
>>> node from the previous try.
>>>
>>>
>>>
>>> Initial Failure log message:
>>>
>>>
>>>
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Could not allocate enough slots to run the job. Please make sure that the
>>> cluster has enough resources.
>>>
>>> at
>>> org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>
>>> at
>>> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>
>>> at
>>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>
>>>
>>>
>>>
>>>
>>> Retry fail

Re: FLINK-14316 happens on version 1.13.2

2021-09-07 Thread Matthias Pohl
Hi Xiangyu,
thanks for reaching out to the community. Could you share the entire
TaskManager and JobManager logs with us? That might help investigating
what's going on.

Best,
Matthias

On Fri, Sep 3, 2021 at 10:07 AM Xiangyu Su  wrote:

> Hi Yun,
> Thanks a lot.
> I am running a test, and facing the "Job Leader lost leadership..." issue
> and also the checkpointing timeout at the same time; not sure whether
> those 2 things are related to each other.
> Regarding your question:
> 1. GC looks ok.
> 2. It seems like once the "Job Leader lost leadership..." error happens,
> the Flink job can not successfully get restarted.
> And here are, e.g., some logs from one job failure:
> ---
> 2021-09-02 20:41:11,345 WARN  org.apache.flink.runtime.taskmanager.Task
>  [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18
> (9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with
> failure cause: org.apache.flink.util.FlinkException: Disconnect from
> JobManager responsible for ec6fd88643747aafac06ee906e421a96.
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181)
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189)
> at java.util.Optional.ifPresent(Optional.java:159)
> at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.Exception: Job leader for job id
> ec6fd88643747aafac06ee906e421a96 lost leadership.
> ... 24 more
>
> ---
> 2021-09-02 20:47:22,388 ERROR
> org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] -
> Authentication failed
> 2021-09-02 20:47:22,388 INFO
>  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
> Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/
> 10.168.175.10:2181
> 2021-09-02 20:47:22,388 WARN
>  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
> SASL configuration failed: javax.security.auth.login.LoginException: No
> JAAS configuration section named 'Client' was found in specified JAAS
> configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue
> connection to Zookeeper server without SASL authentication, if Zookeeper
> server allows it.
> at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at 

Re: Savepoint failure along with JobManager crash

2021-08-31 Thread Matthias Pohl
Hi Prasanna,
thanks for reaching out to the community. What you're experiencing is that
the savepoint was created but the job itself ended up in an inconsistent
state with Executions being cancelled instead of being finished. This
should have triggered a global failover resulting in a job restart. The
savepoint itself should be available, though.
It would be interesting to investigate in the logs how you ended up there.
Would you be able to share the entire JobManager logs and the TaskManager
logs?

Best,
Matthias


On Tue, Aug 31, 2021 at 10:26 AM Prasanna kumar <
prasannakumarram...@gmail.com> wrote:

> Hi ,
>
> We have a Publisher job which reads from a Kafka source (parallelism 4) and
> writes to SNS through an asyncIO operator (parallelism 10). Flink version
> 1.12.2.
>
> During deployment, I stopped this job using a savepoint. Immediately, I saw
> that all slots except 2 got released. Following are the logs from as soon
> as the job was stopped (refer to LOG 1 below).
>
> Even after half an hour those 2 slots were not released. So I manually
> cancelled the job. Then we received the following error (refer to the
> attached screenshot).
>
> After 5-10 min of me cancelling, the leader job manager crashed.
>
> Also attached are screenshots of metrics for JM JVM classes loaded, JM JVM
> heap used (max available is 15 GB) and JM JVM GC count (using young
> generation GC).
>
> Questions:
> 1) There was no state involved in the job. Does it generally take more than
> half an hour to generate a savepoint?
> 2) Is it because the JobManager was going into some kind of pre-crash state
> that the savepoint took this much time?
> 3) What precautions should be taken before taking a savepoint and
> restarting a job?
> 4) How do we find out if a JobManager is in a good state? All the metrics
> were green till it crashed.
> 5) I have also attached the last 30 days of the JM JVM class-load metric;
> we are running just 2 jobs throughout, but I see the count had increased to
> 80k+. If a job is killed, would the loaded classes not be cleaned up? Or is
> it just a metric showing the number loaded so far historically, while at
> the back end they are cleaned up?
>
> LOG 1:
>
> 2021-08-25 14:48:45.539 
> 2021-08-25 14:48:44,532 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [d1de73724ef75e9b5b45d29b0dd70f5e].
> 2021-08-25 14:48:45.539 
> 2021-08-25 14:48:44,532 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [e1987aed51d29ec929d87fd483bd9771].
> 2021-08-25 14:48:45.539 
> 2021-08-25 14:48:44,532 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [8afef84967f04061721b2402d6031f03].
> 2021-08-25 14:48:45.539 
> 2021-08-25 14:48:44,532 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [f22183076bdf7fcba2497b99f5300c15].
> 2021-08-25 14:48:45.539 
> 2021-08-25 14:48:44,531 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [f8b0d455cde18d84b08b8e2779311bb8].
> 2021-08-25 14:48:45.539 
> 2021-08-25 14:48:44,532 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [f34615c7c002d3a0cf6679643982635d].
> 2021-08-25 14:48:44.531 
> 2021-08-25 14:48:44,531 INFO
>  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [] -
> Releasing idle slot [233cbae4c99dd6bf939ca8f405836df4].
> 2021-08-25 14:47:34.537 
> 2021-08-25 14:47:33,520 INFO  org.apache.flink.runtime.taskmanager.Task
>  [] - Freeing task resources for Map -> async wait operator
> -> SNS_SINK (6/10)#0 (5a5338ea27890cccfb73c0f7c23aa94e).
> 2021-08-25 14:47:34.537 
> 2021-08-25 14:47:33,520 INFO  org.apache.flink.runtime.taskmanager.Task
>  [] - Map -> async wait operator -> SNS_SINK (6/10)#0
> (5a5338ea27890cccfb73c0f7c23aa94e) switched from RUNNING to FINISHED.
> 2021-08-25 14:47:34.537 
> 2021-08-25 14:47:33,504 INFO
>  org.apache.flink.runtime.taskexecutor.TaskExecutor   [] -
> Un-registering task and sending final execution state FINISHED to
> JobManager for task Source: KAFKA-SOURCE (4/4)#0
> 49fbfeefbec6d6c1bf66df64d2b395f3.
> 2021-08-25 14:47:34.537 
> 2021-08-25 14:47:33,520 INFO
>  org.apache.flink.runtime.taskexecutor.TaskExecutor   [] -
> Un-registering task and sending final execution state FINISHED to
> JobManager for task Map -> async wait operator -> SNS_SINK (6/10)#0
> 5a5338ea27890cccfb73c0f7c23aa94e.
> 2021-08-25 14:47:33.524 
> 2021-08-25 14:47:33,524 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Map ->
> async wait operator -> SNS_SINK (9/10) (d88fe1bf433b9db5d5b5fa6a628da558)
> switched from RUNNING to FINISHED.
> 2021-08-25 14:47:33.524 
> 2021-08-25 14:47:33,524 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Map ->
> async wait operator -> SNS_SINK (1/10) (3db9c21c8932c334588ebc0b5e2b877c)
> switched from RUNNING to 

Re: Table API demo problem

2021-08-31 Thread Matthias Pohl
I missed the point that it's the purpose of the walkthrough to have the
functionality implemented by the user. So, FLINK-24076 is actually
not valid. I initially thought of it as some kind of demo implementation.
Sorry for the confusion.

On Tue, Aug 31, 2021 at 11:15 AM Matthias Pohl 
wrote:

> Hi Manraj,
> the error messages about libjemalloc.so are caused by Flink 1.13.1 that
> has been published with the wrong architecture accidentally. I created
> FLINK-24075 [1] to cover this issue. As a workaround, you could upgrade the
> base image to Flink 1.13.2 until the Flink 1.13.1 images are republished.
> But: The flink-playground Table API walkthrough is not ready to be used.
> The job submission itself fails. I created FLINK-24076 [2]. Thanks for
> reporting the issues.
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-24075
> [2] https://issues.apache.org/jira/browse/FLINK-24076
>
> On Mon, Aug 30, 2021 at 6:39 PM Tatla, Manraj  wrote:
>
>> Hello everyone,
>>
>>
>>
>> I am learning Flink because at work we need stateful real time
>> computations in a bot detection system.  This weekend, I have had much
>> difficulty in getting the real time reporting API tutorial working.
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/try-flink/table_api/
>>
>> In particular, every time I run docker-compose up -d, it does not work.
>> Diagnosing this, I found the jobmanager service failing due to out of
>> resource exceptions.  I have inspected my system resources, and found no
>> cpu, memory, or disk issues.  I see a ton of error messages saying
>> libjemalloc.so cannot be preloaded.  I am running on mac, and that seems to
>> be a linux file.
>>
>>
>>
>> Does anyone know the problem?
>>
>>
>>
>> I apologize if this is a trivial issue and for the second email.
>>
>>
>>
>> -Manraj
>>
>


Re: Table API demo problem

2021-08-31 Thread Matthias Pohl
Hi Manraj,
the error messages about libjemalloc.so are caused by Flink 1.13.1 that has
been published with the wrong architecture accidentally. I created
FLINK-24075 [1] to cover this issue. As a workaround, you could upgrade the
base image to Flink 1.13.2 until the Flink 1.13.1 images are republished.
But: The flink-playground Table API walkthrough is not ready to be used.
The job submission itself fails. I created FLINK-24076 [2]. Thanks for
reporting the issues.
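
As a sketch, the upgrade boils down to bumping the base image in the
walkthrough's Dockerfile (the exact tag name is an assumption; check Docker
Hub for the available 1.13.2 variants):

FROM flink:1.13.2-scala_2.12-java8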

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24075
[2] https://issues.apache.org/jira/browse/FLINK-24076

On Mon, Aug 30, 2021 at 6:39 PM Tatla, Manraj  wrote:

> Hello everyone,
>
>
>
> I am learning Flink because at work we need stateful real time
> computations in a bot detection system.  This weekend, I have had much
> difficulty in getting the real time reporting API tutorial working.
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/try-flink/table_api/
>
> In particular, every time I run docker-compose up -d, it does not work.
> Diagnosing this, I found the jobmanager service failing due to out of
> resource exceptions.  I have inspected my system resources, and found no
> cpu, memory, or disk issues.  I see a ton of error messages saying
> libjemalloc.so cannot be preloaded.  I am running on mac, and that seems to
> be a linux file.
>
>
>
> Does anyone know the problem?
>
>
>
> I apologize if this is a trivial issue and for the second email.
>
>
>
> -Manraj
>


Re: Queries regarding Flink upgrade strategies

2021-08-27 Thread Matthias Pohl
Thanks for clarifying that, Amit. Rolling updates with JobManagers and
TaskManagers coming from different Flink versions in the same Flink cluster
are not supported.

@Yang Wang  Do you have any recommendations you
could share in this regard?

Best,
Matthias

On Fri, Aug 27, 2021 at 2:44 PM Amit Bhatia 
wrote:

> Hi Matthias,
>
> What you mention is a little tricky. When we create a new cluster it will
> have its own volume (PVC), so sending savepoint/checkpoint data from the
> volume (PVC) of the older cluster to the newer cluster is a manual task.
> Also, I am not sure if the savepoint/checkpoint data needs to be copied to
> the newer Flink cluster before Flink starts. This approach is more like a
> blue/green upgrade strategy.
>
> I wanted to understand if Flink supports rollingUpdate, where we update
> TaskManager and JobManager pods one by one, and its impact when, during the
> upgrade, JobManager & TaskManager pods are on different versions. Also the
> impact of the recreate strategy in the same context.
>
> Regards,
> Amit
>
> On Fri, Aug 27, 2021 at 5:32 PM Matthias Pohl 
> wrote:
>
>> The upgrade approach mentioned in my previous answer should also work in
>> the context of k8s and pods: Creating a Flink cluster having the newer
>> version should be done before migrating the job using a savepoint. But
>> maybe, I misunderstand your question. Do you have something in mind where
>> you upgrade each pod individually, i.e. operating TaskManagers and
>> JobManagers with different Flink versions in the same Flink cluster?
>>
>> Best,
>> Matthias
>>
>> On Fri, Aug 27, 2021 at 11:05 AM Amit Bhatia 
>> wrote:
>>
>>> Hi Matthias,
>>>
>>> Thanks for the information but this upgrade is looking like on native
>>> (physical/virtual) deployment.
>>> I want to understand the upgrade strategies on kubernetes deployments
>>> where Flink is running in pods. If you could help in that area it would be
>>> great.
>>>
>>> Regards,
>>> Amit Bhatia
>>>
>>> On Thu, Aug 26, 2021 at 5:25 PM Matthias Pohl 
>>> wrote:
>>>
>>>> Hi Amit,
>>>> upgrading Flink versions means that you should stop your jobs with a
>>>> savepoint first. A new cluster with the new Flink version can be deployed
>>>> next. Then, this cluster can be used to start the jobs from the previously
>>>> created savepoints. Each job should pick up the work from where it stopped.
>>>> See [1] for further details on how to upgrade Flink.
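>>>>
>>>> As a minimal sketch of that flow (paths and the job id are placeholders):
>>>>
>>>> flink stop -p s3://<bucket>/savepoints <jobId>
>>>> # deploy a cluster running the new Flink version, then:
>>>> flink run -s s3://<bucket>/savepoints/savepoint-<id> -d <job>.jar
>>>>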
>>>> I'm not sure about any Helm-specifics here. But I'm gonna pull Austin
>>>> into the thread. He might have more insights to share.
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#upgrading-the-flink-framework-version
>>>>
>>>> On Thu, Aug 26, 2021 at 9:10 AM Amit Bhatia 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are using Flink 1.13.2 with Kubernetes HA solution provided by
>>>>> flink. We have created a deployment for JobManager and TaskManager with
>>>>> option to deploy multiple replicas and the same is bundled in a single 
>>>>> helm
>>>>> chart.
>>>>> So we have below queries regarding Flink upgrade strategies, kindly
>>>>> help us to answer below queries:
>>>>>
>>>>> 1) What upgrade strategies are supported by Flink
>>>>> (RollingUpdate/Recreate) and which one is recommended for production use?
>>>>>
>>>>> 2) During Flink upgrade from version A to version B, if we are using
>>>>> rollingUpdate then at some point of time multiple versions of Flink JMs &
>>>>> TMs might be running so does that can cause any corruption/failure for
>>>>> running Jobs ?
>>>>>
>>>>> 3) During Flink upgrade from version A to version B, If we use
>>>>> recreate then at some point of time if all JMs gets updated to a new
>>>>> version and TMs are still updating which means TMs are running with
>>>>> different versions then will this cause any corruption/failure for running
>>>>> Jobs?
>>>>>
>>>>> Regards,
>>>>> Amit Bhatia
>>>>>
>>>>


Re: Queries regarding Flink upgrade strategies

2021-08-27 Thread Matthias Pohl
The upgrade approach mentioned in my previous answer should also work in
the context of k8s and pods: Creating a Flink cluster having the newer
version should be done before migrating the job using a savepoint. But
maybe, I misunderstand your question. Do you have something in mind where
you upgrade each pod individually, i.e. operating TaskManagers and
JobManagers with different Flink versions in the same Flink cluster?

Best,
Matthias

On Fri, Aug 27, 2021 at 11:05 AM Amit Bhatia 
wrote:

> Hi Matthias,
>
> Thanks for the information but this upgrade is looking like on native
> (physical/virtual) deployment.
> I want to understand the upgrade strategies on kubernetes deployments
> where Flink is running in pods. If you could help in that area it would be
> great.
>
> Regards,
> Amit Bhatia
>
> On Thu, Aug 26, 2021 at 5:25 PM Matthias Pohl 
> wrote:
>
>> Hi Amit,
>> upgrading Flink versions means that you should stop your jobs with a
>> savepoint first. A new cluster with the new Flink version can be deployed
>> next. Then, this cluster can be used to start the jobs from the previously
>> created savepoints. Each job should pick up the work from where it stopped.
>> See [1] for further details on how to upgrade Flink.
>> I'm not sure about any Helm-specifics here. But I'm gonna pull Austin
>> into the thread. He might have more insights to share.
>>
>> Best,
>> Matthias
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#upgrading-the-flink-framework-version
>>
>> On Thu, Aug 26, 2021 at 9:10 AM Amit Bhatia 
>> wrote:
>>
>>> Hi,
>>>
>>> We are using Flink 1.13.2 with Kubernetes HA solution provided by flink.
>>> We have created a deployment for JobManager and TaskManager with option to
>>> deploy multiple replicas and the same is bundled in a single helm chart.
>>> So we have below queries regarding Flink upgrade strategies, kindly help
>>> us to answer below queries:
>>>
>>> 1) What upgrade strategies are supported by Flink
>>> (RollingUpdate/Recreate) and which one is recommended for production use?
>>>
>>> 2) During Flink upgrade from version A to version B, if we are using
>>> rollingUpdate then at some point of time multiple versions of Flink JMs &
>>> TMs might be running so does that can cause any corruption/failure for
>>> running Jobs ?
>>>
>>> 3) During Flink upgrade from version A to version B, If we use recreate
>>> then at some point of time if all JMs gets updated to a new version and TMs
>>> are still updating which means TMs are running with different versions then
>>> will this cause any corruption/failure for running Jobs?
>>>
>>> Regards,
>>> Amit Bhatia
>>>
>>


Re: hdfs lease issues on flink retry

2021-08-26 Thread Matthias Pohl
I see - I should have checked my mailbox before answering. I received the
email and was able to login.

On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl 
wrote:

> The link doesn't work, i.e. I'm redirected to a login page. It would be
> also good to include the Flink logs and make them accessible for everyone.
> This way others could share their perspective as well...
>
> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <
> siddharth.x.s...@gs.com> wrote:
>
>> Hi Matthias,
>>
>>
>>
>> Thank you for responding and taking time to look at the issue.
>>
>>
>>
>> Uploaded the yarn logs here:
>> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
>> and have also requested read permissions for you. Please let us know if
>> you’re not able to see the files.
>>
>>
>>
>>
>>
>> *From:* Matthias Pohl 
>> *Sent:* Thursday, August 26, 2021 9:47 AM
>> *To:* Shah, Siddharth [Engineering] 
>> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <
>> andreas.ha...@ny.email.gs.com>
>> *Subject:* Re: hdfs lease issues on flink retry
>>
>>
>>
>> Hi Siddharth,
>>
>> thanks for reaching out to the community. This might be a bug. Could you
>> share your Flink and YARN logs? This way we could get a better
>> understanding of what's going on.
>>
>>
>>
>> Best,
>> Matthias
>>
>>
>>
>> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
>> siddharth.x.s...@gs.com> wrote:
>>
>> Hi  Team,
>>
>>
>>
>> We are seeing transient failures in the jobs mostly requiring higher
>> resources and using flink RestartStrategies
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ci.apache.org_projects_flink_flink-2Ddocs-2Drelease-2D1.13_docs_dev_execution_task-5Ffailure-5Frecovery_-23fixed-2Ddelay-2Drestart-2Dstrategy=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=eLqB-T65EFJVVpR6QlSfRHIga7DPK3o8yJvw_OhnMvk=qIClgDVq00Jp0qluJfWV-aGM7Sg7tAnr_I2yy4TtNaM=wL6-8B4mnGofyRWetXrTSw9FBSV-XTDnoHsPtzU7h7c=>
>> [1]. Upon checking the yarn logs we have observed hdfs lease issues when
>> flink retry happens. The job originally fails for the first try with 
>> PartitionNotFoundException
>> or NoResourceAvailableException, but on retry it seems from the yarn logs
>> that the lease for the temp sink directory is not yet released by the
>> node from the previous try.
>>
>>
>>
>> Initial Failure log message:
>>
>>
>>
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Could not allocate enough slots to run the job. Please make sure that the
>> cluster has enough resources.
>>
>> at
>> org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>
>> at
>> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>>
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>
>>
>>
>>
>>
>> Retry failure log message:
>>
>>
>>
>> Caused by: 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>>  
>> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt___r_03_0/partMapper-r-3.snappy.parquet
>>  for client 10.51.63.226 already exists
>>
>> at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>>
>> at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:270

Re: hdfs lease issues on flink retry

2021-08-26 Thread Matthias Pohl
The link doesn't work, i.e. I'm redirected to a login page. It would also
be good to include the Flink logs and make them accessible for everyone.
This way others could share their perspective as well...

On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <
siddharth.x.s...@gs.com> wrote:

> Hi Matthias,
>
>
>
> Thank you for responding and taking time to look at the issue.
>
>
>
> Uploaded the yarn logs here:
> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
> and have also requested read permissions for you. Please let us know if
> you’re not able to see the files.
>
>
>
>
>
> *From:* Matthias Pohl 
> *Sent:* Thursday, August 26, 2021 9:47 AM
> *To:* Shah, Siddharth [Engineering] 
> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <
> andreas.ha...@ny.email.gs.com>
> *Subject:* Re: hdfs lease issues on flink retry
>
>
>
> Hi Siddharth,
>
> thanks for reaching out to the community. This might be a bug. Could you
> share your Flink and YARN logs? This way we could get a better
> understanding of what's going on.
>
>
>
> Best,
> Matthias
>
>
>
> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
> siddharth.x.s...@gs.com> wrote:
>
> Hi  Team,
>
>
>
> We are seeing transient failures in the jobs mostly requiring higher
> resources and using flink RestartStrategies
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ci.apache.org_projects_flink_flink-2Ddocs-2Drelease-2D1.13_docs_dev_execution_task-5Ffailure-5Frecovery_-23fixed-2Ddelay-2Drestart-2Dstrategy=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=eLqB-T65EFJVVpR6QlSfRHIga7DPK3o8yJvw_OhnMvk=qIClgDVq00Jp0qluJfWV-aGM7Sg7tAnr_I2yy4TtNaM=wL6-8B4mnGofyRWetXrTSw9FBSV-XTDnoHsPtzU7h7c=>
> [1]. Upon checking the yarn logs we have observed hdfs lease issues when
> flink retry happens. The job originally fails for the first try with 
> PartitionNotFoundException
> or NoResourceAvailableException, but on retry it seems from the yarn logs
> that the lease for the temp sink directory is not yet released by the
> node from the previous try.
>
>
>
> Initial Failure log message:
>
>
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate enough slots to run the job. Please make sure that the
> cluster has enough resources.
>
> at
> org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
> at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
>
>
>
>
> Retry failure log message:
>
>
>
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>  
> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt___r_03_0/partMapper-r-3.snappy.parquet
>  for client 10.51.63.226 already exists
>
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>
> at 
> org.apache.hado

Re: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden)

2021-08-26 Thread Matthias Pohl
Hi Jonas,
have you included the s3 credentials in the Flink config file like it's
described in [1]? I'm not sure about this hive.s3.use-instance-credentials
being a valid configuration parameter.

Best,
Matthias

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/filesystems/s3/#configure-access-credentials
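
If you end up configuring static credentials instead, a minimal sketch for
flink-conf.yaml (with IAM instance roles, no keys should be needed at all):

s3.access-key: <your-access-key>
s3.secret-key: <your-secret-key>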

On Thu, Aug 26, 2021 at 3:43 PM jonas eyob  wrote:

> Hey,
>
> I am setting up HA on a standalone Kubernetes Flink application job
> cluster.
> Flink (1.12.5) is used and I am using S3 as the storage backend
>
> * The JobManager shortly fails after starts with the following errors
> (apologies in advance for the length), and I can't understand what's going
> on.
> * First I thought it may be due to missing Delete privileges of the IAM
> role and updated that, but the problem persists.
> * The S3 bucket configured s3:///recovery is empty.
>
> configmap.yaml
> flink-conf.yaml: |+
> jobmanager.rpc.address: {{ $fullName }}-jobmanager
> jobmanager.rpc.port: 6123
> jobmanager.memory.process.size: 1600m
> taskmanager.numberOfTaskSlots: 2
> taskmanager.rpc.port: 6122
> taskmanager.memory.process.size: 1728m
> blob.server.port: 6124
> queryable-state.proxy.ports: 6125
> parallelism.default: 2
> scheduler-mode: reactive
> execution.checkpointing.interval: 10s
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 10
> high-availability:
> org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> kubernetes.cluster-id: {{ $fullName }}
> high-availability.storageDir: s3://-flink-{{ .Values.environment
> }}/recovery
> hive.s3.use-instance-credentials: true
> kubernetes.namespace: {{ $fullName }} # The namespace that will be used
> for running the jobmanager and taskmanager pods
>
> role.yaml
> kind: Role
> apiVersion: rbac.authorization.k8s.io/v1
> metadata:
> name: {{ $fullName }}
> namespace: {{ $fullName }}
> labels:
> app: {{ $appName }}
> chart: {{ template "thoros.chart" . }}
> release: {{ .Release.Name }}
> heritage: {{ .Release.Service }}
>
> rules:
> - apiGroups: [""]
> resources: ["configmaps"]
> verbs: ["create", "edit", "delete", "watch", "get", "list", "update"]
>
> aws IAM policy
> {
> "Version": "2012-10-17",
> "Statement": [
> {
> "Action": [
> "s3:ListBucket",
> "s3:Get*",
> "s3:Put*",
> "s3:Delete*"
> ],
> "Resource": [
> "arn:aws:s3:::-flink-dev/*"
> ],
> "Effect": "Allow"
> }
> ]
> }
>
> *Error-log:*
> 2021-08-26 13:08:43,439 INFO  org.apache.beam.runners.flink.FlinkRunner
>  [] - Executing pipeline using FlinkRunner.
> 2021-08-26 13:08:43,444 WARN  org.apache.beam.runners.flink.FlinkRunner
>  [] - For maximum performance you should set the
> 'fasterCopy' option. See more at
> https://issues.apache.org/jira/browse/BEAM-11146
> 2021-08-26 13:08:43,451 INFO  org.apache.beam.runners.flink.FlinkRunner
>  [] - Translating pipeline to Flink program.
> 2021-08-26 13:08:43,456 INFO
>  org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment [] - Found
> unbounded PCollection. Switching to streaming execution.
> 2021-08-26 13:08:43,461 INFO
>  org.apache.beam.runners.flink.FlinkExecutionEnvironments [] - Creating
> a Streaming Environment.
> 2021-08-26 13:08:43,462 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: jobmanager.rpc.address, thoros-jobmanager
> 2021-08-26 13:08:43,462 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: jobmanager.rpc.port, 6123
> 2021-08-26 13:08:43,462 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: jobmanager.memory.process.size, 1600m
> 2021-08-26 13:08:43,463 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: taskmanager.numberOfTaskSlots, 2
> 2021-08-26 13:08:43,463 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: taskmanager.rpc.port, 6122
> 2021-08-26 13:08:43,463 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: taskmanager.memory.process.size, 1728m
> 2021-08-26 13:08:43,463 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: blob.server.port, 6124
> 2021-08-26 13:08:43,464 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: queryable-state.proxy.ports, 6125
> 2021-08-26 13:08:43,464 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration property: parallelism.default, 2
> 2021-08-26 13:08:43,465 INFO
>  org.apache.flink.configuration.GlobalConfiguration   [] - Loading
> configuration 

Re: hdfs lease issues on flink retry

2021-08-26 Thread Matthias Pohl
Hi Siddharth,
thanks for reaching out to the community. This might be a bug. Could you
share your Flink and YARN logs? This way we could get a better
understanding of what's going on.

Best,
Matthias

On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
siddharth.x.s...@gs.com> wrote:

> Hi  Team,
>
>
>
> We are seeing transient failures, mostly in jobs requiring higher
> resources and using Flink RestartStrategies [1]. Upon checking the YARN
> logs, we have observed HDFS lease issues when the Flink retry happens. The
> job originally fails on the first try with PartitionNotFoundException or
> NoResourceAvailableException, but on retry it seems from the YARN logs
> that the lease for the temp sink directory has not yet been released by
> the node from the previous try.
>
>
>
> Initial Failure log message:
>
>
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate enough slots to run the job. Please make sure that the
> cluster has enough resources.
>
> at
> org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
> at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
>
>
>
>
> Retry failure log message:
>
>
>
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>  
> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt___r_03_0/partMapper-r-3.snappy.parquet
>  for client 10.51.63.226 already exists
>
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>
>
>
>
>
>
>
> I could verify that it's the same nodes from the previous try holding the
> lease, and I checked multiple jobs by matching IP addresses. Ideally, we
> want an internal retry to succeed, since there will be thousands of jobs
> running at a time and it would be hard to retry them manually.
>
>
>
> This is our current restart config:
>
> executionEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,
> Time.of(10, TimeUnit.SECONDS)));
>
>
>
> Is it possible to resolve leases before a retry? Or is it possible to have
> different sink directories (incrementing an attempt id somewhere) for every
> retry, so that we avoid lease issues entirely? Or do you have any other
> suggestions for resolving this?
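>
> One manual fallback we could script before resubmitting is the HDFS
> client's recoverLease call (a sketch only, assuming forced lease recovery
> is acceptable; the path below is a placeholder):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> public class LeaseRecovery {
>     public static void main(String[] args) throws Exception {
>         // Placeholder: the temp sink file still holding a lease from the
>         // previous attempt.
>         Path stuck = new Path("/user/.../temp/data/partMapper-r-3.snappy.parquet");
>         FileSystem fs = FileSystem.get(stuck.toUri(), new Configuration());
>         if (fs instanceof DistributedFileSystem) {
>             // recoverLease returns true if the lease was recovered immediately.
>             boolean recovered = ((DistributedFileSystem) fs).recoverLease(stuck);
>             System.out.println("Recovered immediately: " + recovered);
>         }
>     }
> }
>
> We would much prefer an out-of-the-box retry, though.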
>
>
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>
>
>
>
>
> Thanks,
>
> Siddharth
>
>
>
> --
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices
>


Re: Disabling autogenerated uid/hash doesn't work when using file source

2021-08-26 Thread Matthias Pohl
Hi Vishal,
you're right: the FileSource itself doesn't provide these methods. But you
could get them through the DataStreamSource (which
implements SingleOutputStreamOperator and provides these two methods
[1,2]). It is returned by StreamExecutionEnvironment.fromSource [3].
fromSource takes the FileSource as a parameter.
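
A minimal sketch of that wiring (the format, path, and uid values here are
placeholders):

FileSource<String> source =
    FileSource.forRecordStreamFormat(new TextLineFormat(), new Path("/path/to/input"))
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
    .uid("file-source-uid")   // explicit uid, like on any other operator
    .name("file-source");

With an explicit uid set this way, the job should start even with
pipeline.auto-generate-uids set to false.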

Best,
Matthias

PS: Thanks to Chesnay for the hint.

[1]
https://javadoc.io/doc/org.apache.flink/flink-streaming-java_2.12/latest/org/apache/flink/streaming/api/datastream/SingleOutputStreamOperator.html#uid-java.lang.String-
[2]
https://javadoc.io/doc/org.apache.flink/flink-streaming-java_2.12/latest/org/apache/flink/streaming/api/datastream/SingleOutputStreamOperator.html#setUidHash-java.lang.String-
[3]
https://javadoc.io/doc/org.apache.flink/flink-streaming-java_2.12/latest/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html

On Wed, Aug 25, 2021 at 9:28 AM Vishal Surana  wrote:

> I set names and uids for all my Flink operators and have explicitly
> disabled auto-generation of uids to enforce the same practice across my
> team. However, when using a file source, there's no option to provide one,
> so the job fails to start unless we enable auto-generation. Am
> I doing something wrong?


Re: Flink Avro Timestamp Precision Issue

2021-08-26 Thread Matthias Pohl
Hi Akshay,
thanks for reaching out to the community. There was a similar question on
the mailing list earlier this month [1]. Unfortunately, it just doesn't
seem to be supported yet. The feature request was already created with
FLINK-23589 [2].

Best,
Matthias

[1]
https://lists.apache.org/thread.html/r463f748358202d207e4bf9c7fdcb77e609f35bbd670dbc5278dd7615%40%3Cuser.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/FLINK-23589

On Thu, Aug 26, 2021 at 11:07 AM Akshay Agarwal 
wrote:

> Hi everyone,
>
> We are trying out Flink 1.13.1 with Kafka topics backed by Avro, but we
> are facing an issue when creating a Table SQL source: Avro doesn't support
> timestamp precision greater than 3. I don't understand why Flink doesn't
> support timestamps with precision greater than 3 (exception
> )
> since Avro (1.10.0) does support microsecond precision.
> Our Kafka records contain a timestamp like yyyy-MM-DD'T'HH-mm-ss.SS'Z'.
> For now I read the records through a custom UDF which drops the sub-second
> offset, but I wanted to know if there is a better way to handle this and
> the reason why it isn't supported. It would be a great help for us to know
> about it so we can build on that.
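>
> For reference, the truncating UDF looks roughly like this (a simplified
> sketch, not our exact code; it assumes the raw value is an ISO-8601
> string):
>
> import java.sql.Timestamp;
> import java.time.Instant;
> import java.time.temporal.ChronoUnit;
> import org.apache.flink.table.annotation.DataTypeHint;
> import org.apache.flink.table.functions.ScalarFunction;
>
> public class TruncateToMillis extends ScalarFunction {
>     public @DataTypeHint("TIMESTAMP(3)") Timestamp eval(String raw) {
>         // Parse the ISO-8601 value and drop everything below millisecond
>         // precision so that the result fits into TIMESTAMP(3).
>         Instant parsed = Instant.parse(raw);
>         return Timestamp.from(parsed.truncatedTo(ChronoUnit.MILLIS));
>     }
> }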
>
> Regards,
> Akshay Agarwal
>
> [image: https://grofers.com] 


Re: Kinesis Producer not working with Flink 1.11.2

2021-08-26 Thread Matthias Pohl
Hi Sanket,
have you considered reaching out to the Kinesis community? I might be wrong
but it looks like a Kinesis issue.

Best,
Matthias

On Tue, Aug 24, 2021 at 7:13 PM Sanket Agrawal 
wrote:

> Hi,
>
>
>
> We are trying to use Kinesis along with Flink (1.11.2) and JDK 11 on an EMR
> cluster (6.2). When the application starts, we get the error below:
>
>
>
> 2021-08-24 12:46:27.980 [INFO] [CONFIGURATION_BROADC] {App-N=app_name,
> App-V=3.0.0} {}
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer:881
> - Extracting binaries to /tmp/amazon-kinesis-producer-native-binaries
>
> 2021-08-24 12:46:28.011 [WARN] [kpl-daemon-0003] {App-N=app_name,
> App-V=3.0.0} {}
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.LogInputStreamReader:479
> -
> /tmp/amazon-kinesis-producer-native-binaries/kinesis_producer_11BBB09B74B2C545674C3F227551D80BA4F64AA7:
> /tmp/amazon-kinesis-producer-native-binaries/kinesis_producer_11BBB09B74B2C545674C3F227551D80BA4F64AA7:
> cannot execute binary file
>
> 2021-08-24 12:46:28.007 [ERROR] [kpl-daemon-] {App-N=app_name,
> App-V=3.0.0} {}
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer:152
> - Error in child process
>
> java.lang.RuntimeException: Child process exited with code 126
>
> at
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:533)
> [app_name.jar:?]
>
> at
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.fatalError(Daemon.java:509)
> [app_name.jar:?]
>
> at
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.startChildProcess(Daemon.java:487)
> [app_name.jar:?]
>
> at
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.access$100(Daemon.java:63)
> [app_name.jar:?]
>
> at
> org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon$1.run(Daemon.java:133)
> [app_name.jar:?]
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
>
> at java.lang.Thread.run(Thread.java:829) [?:?]
>
>
>
> Used libraries and respective versions are as follows:
>
>    1. flink-connector-kinesis_2.12 - 1.11.2
>    2. aws-java-sdk-kinesis - 1.11.880
>    3. amazon-kinesis-client - 2.2.9
>
>
>
> Any help on this would be really helpful.
>
>
>
> Thanks,
>
> Sanket
>
>
>


Re: checkpoints/.../shared cleanup

2021-08-26 Thread Matthias Pohl
Hi Alexey,
thanks for reaching out to the community. I have a question: What do you
mean by "the shared subfolder still grows"? As far as I understand, the
shared folder contains the state of incremental checkpoints. If you cancel
the corresponding job and start a new job from one of the retained
incremental checkpoints, the shared folder of the previous job still needs
to be around, since it contains the state. The new job
would then create its own shared subfolder. Any new incremental checkpoints
will write their state into the new job's shared subfolder while still
relying on shared state of the previous job for older data. The RocksDB
Backend is in charge of consolidating the incremental state.
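
For illustration, the layout under the checkpoint directory typically looks
like this (simplified; names abbreviated):

<checkpoint-dir>/<job-id>/
    chk-<n>/      metadata of the n-th retained checkpoint
    shared/       incremental state, possibly referenced by multiple checkpoints
    taskowned/    state owned by the task managers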

Hence, you should be careful with removing the shared folder in case you're
planning to restart the job later on.

I'm adding Seth to this thread. He might have more insights and/or correct
my limited knowledge of the incremental checkpoint process.

Best,
Matthias

On Wed, Aug 25, 2021 at 1:39 AM Alexey Trenikhun  wrote:

> Hello,
> I use incremental checkpoints, not externalized ones. Should the content of
> checkpoint/.../shared be removed when I cancel the job (or cancel with a
> savepoint)? It looks like in our case shared continues to grow...
>
> Thanks,
> Alexey
>

