Re: UDF:Type is not supported: ANY

2020-08-05 Thread Benchao Li
The code looks fine to me; that is exactly how we use it.
Which version of Flink are you on, and how did you register the UDF?

zilong xiao wrote on Thu, Aug 6, 2020 at 12:06 PM:

> I have tried writing it that way, but it did not seem to work. My code is below; could you take a look and tell me whether it should work?
>
> public class Json2Map extends ScalarFunction {
>
>private static final Logger LOG =
> LoggerFactory.getLogger(Json2Map.class);
>
>private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
>
>public Json2Map(){}
>
>    public Map<String, String> eval(String param) {
>       Map<String, String> result = new HashMap<>();
>   try {
>  if (param == null) {
> return result;
>  }
>  result = OBJECT_MAPPER.readValue(param, Map.class);
>  // iterating the map and calling toString on each entry did not seem to work either
>  //for(Object obj : tmp.keySet()){
>  // result.put(String.valueOf(obj), String.valueOf(tmp.get(obj)));
>  //}
>  LOG.info("result is: {}", result);
>   } catch (JsonProcessingException e){
>  LOG.error("failed to convert json to map, param is: {}", param,
> e);
>   }
>   return result;
>}
>
>
>@Override
>    public TypeInformation<Map<String, String>> getResultType(Class<?>[] signature) {
>   return Types.MAP(Types.STRING, Types.STRING);
>}
>
> }
>
>
> Benchao Li wrote on Thu, Aug 6, 2020 at 11:04 AM:
>
> > You can have the UDF return a Map type directly, for example:
> >
> > public class String2Map extends ScalarFunction {
> >
> >    public Map<String, String> eval(String param) throws Exception {
> >       Map<String, String> map = new HashMap<>();
> >   // ...
> >   return map;
> >}
> >
> >@Override
> >    public TypeInformation<Map<String, String>> getResultType(Class<?>[] signature) {
> >   return Types.MAP(Types.STRING, Types.STRING);
> >}
> >
> > }
> >
> >
> > zilong xiao wrote on Thu, Aug 6, 2020 at 10:24 AM:
> >
> > > Thanks for the answer, Benchao. For Java users the Map container is also widely used. Right now some of our users need to pull the value of a given key out of a JSON string; the current approach is a getValueFromJson UDF, which has to deserialize the JSON into a Map and call get(key) on every access. Is there a way to let a UDF return a Map type directly, so that we do not have to deserialize and then look up the value every single time?
> > >
> > > Benchao Li wrote on Wed, Aug 5, 2020 at 11:49 PM:
> > >
> > > > Hi zilong,
> > > >
> > > > For the ARRAY type in SQL, the corresponding legacy type should be Types.PRIMITIVE_ARRAY or Types.OBJECT_ARRAY;
> > > > any other type information is treated as the ANY type.
> > > > This should have nothing to do with Java generics; the implementation simply never treated Types.LIST(Types.STRING) as the SQL ARRAY type.
> > > > Supporting List as the data behind ARRAY should only land in 1.12 [1].
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-18417
> > > >
> > > > zilong xiao wrote on Mon, Aug 3, 2020 at 8:23 PM:
> > > >
> > > > > That does not work; I tried it. Even iterating over the map and calling toString on every entry fails. It is probably caused by how Java generics work. I wonder how the community folks see this problem.
> > > > >
> > > > > godfrey he wrote on Mon, Aug 3, 2020 at 7:50 PM:
> > > > >
> > > > > > Try changing Map to Map<String, String>.
> > > > > >
> > > > > > zilong xiao wrote on Mon, Aug 3, 2020 at 4:56 PM:
> > > > > >
> > > > > > > For now the List case can be worked around by using an array instead, but I could not get Map to run successfully.
> > > > > > >
> > > > > > > zilong xiao wrote on Mon, Aug 3, 2020 at 10:43 AM:
> > > > > > >
> > > > > > > > While writing Flink SQL jobs recently I found that the built-in functions do not quite cover our needs. Very common ones such as Json2Array and Json2Map do not seem to be provided out of the box, so they have to be implemented as UDFs. When I tried implementing them with Jackson, SQL validation kept failing with `Type is not supported: ANY`. My guess was that this is related to Java generics: since Java generics are erased to Object at runtime, could that be what triggers the exception? Flink itself has a string-to-container function, STR_TO_MAP, and its implementation is written in Scala, so I am not sure whether the exception really comes from generics. If it does, how should Json2Array / Json2Map UDFs be written in Java? Any guidance from more experienced folks would be appreciated.
> > > > > > > >
> > > > > > > > The UDF code is as follows:
> > > > > > > >
> > > > > > > > public class Json2List extends ScalarFunction {
> > > > > > > >
> > > > > > > >    private static final Logger LOG = LoggerFactory.getLogger(Json2List.class);
> > > > > > > >
> > > > > > > >    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
> > > > > > > >       .configure(JsonParser.Feature.ALLOW_SINGLE_QUOTES, true)
> > > > > > > >       .configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
> > > > > > > >
> > > > > > > >    public Json2List(){}
> > > > > > > >
> > > > > > > >    public List<String> eval(String param) {
> > > > > > > >       List<String> result = new ArrayList<>();
> > > > > > > >       try {
> > > > > > > >          List<Map<String, Object>> list = OBJECT_MAPPER.readValue(param, List.class);
> > > > > > > >          for(Map<String, Object> map : list){
> > > > > > > >             result.add(OBJECT_MAPPER.writeValueAsString(map));
> > > > > > > >          }
> > > > > > > >          return result;
> > > > > > > >       } catch (JsonProcessingException e){
> > > > > > > >          LOG.error("failed to convert json to array, param is: {}", param, e);
> > > > > > > >       }
> > > > > > > >       return result;
> > > > > > > >    }
> > > > > > > >
> > > > > > > >    @Override
> > > > > > > >    public TypeInformation<List<String>> getResultType(Class<?>[] signature) {
> > > > > > > >       return Types.LIST(Types.STRING);
> > > > > > > >    }
> > > > > > > >
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best,
> > > > Benchao Li
> > > >
> > >
> >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>


-- 

Best,
Benchao Li
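
For reference, a minimal self-contained sketch of the approach discussed in this thread, assuming Flink 1.10's legacy type extraction and Jackson on the classpath (the class, function and column names below are illustrative, not taken from the thread): the UDF returns Map<String, String> and declares its result type explicitly in getResultType, so the planner maps it to SQL MAP<STRING, STRING> instead of falling back to ANY.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.functions.ScalarFunction;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToMapFunction extends ScalarFunction {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns an empty map for null or malformed input instead of failing the job.
    public Map<String, String> eval(String json) {
        Map<String, String> result = new HashMap<>();
        if (json == null) {
            return result;
        }
        try {
            result = MAPPER.readValue(json, new TypeReference<Map<String, String>>() {});
        } catch (Exception e) {
            // swallow parse errors and fall through to the empty map
        }
        return result;
    }

    // Without this override the Map is extracted as a generic type and surfaces as ANY in SQL.
    @Override
    public TypeInformation<Map<String, String>> getResultType(Class<?>[] signature) {
        return Types.MAP(Types.STRING, Types.STRING);
    }
}

Registered with something like tableEnv.registerFunction("json_to_map", new JsonToMapFunction()), the map entries can then be read in SQL with the bracket accessor, e.g. json_to_map(payload)['user_id'], which avoids re-parsing the JSON once per key.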


Re: flink1.10.1/1.11.1: state keeps growing after SQL GROUP BY and time-window operations

2020-08-05 Thread Congxian Qiu
Hi
I do not see the attachments on my side; I am not sure whether it is a mail-client issue or something else. Could you double-check on your end whether the attachments went out?

Best,
Congxian


op <520075...@qq.com> wrote on Thu, Aug 6, 2020 at 10:36 AM:

> Thanks; the screenshots and configuration are in the attachments.
> I will try configuring the RocksDB StateBackend.
>
>
> ---------- Original message ----------
> *From:* "user-zh" ;
> *Sent:* Wednesday, August 5, 2020, 5:43 PM
> *To:* "user-zh";
> *Subject:* Re: flink1.10.1/1.11.1: state keeps growing after SQL GROUP BY and time-window operations
>
> Hi
>   The RocksDB StateBackend only needs a little configuration in flink-conf [1].
>
>   Also, some of the information in your previous two mails puzzles me. Could you paste the flink-conf you are currently using, a screenshot of the checkpoint UI, and a screenshot of the checkpoint
> directory on HDFS?
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E8%AE%BE%E7%BD%AE-state-backend
>
> Best,
> Congxian
>
>
> op <520075...@qq.com> wrote on Wed, Aug 5, 2020 at 4:03 PM:
>
> > Hi, the TTL configuration is
> > val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
> > val tableEnv = StreamTableEnvironment.create(bsEnv, settings)
> > val tConfig = tableEnv.getConfig
> > tConfig.setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1450))
> >
> >
> >   1) Currently three jobs all show this behaviour.
> >   2) The cluster has no RocksDB environment at the moment.
> > Thanks
> > ---------- Original message ----------
> > From: "user-zh" <qcx978132...@gmail.com>
> > Sent: Wednesday, August 5, 2020, 3:30 PM
> > To: "user-zh"
> > Subject: Re: flink1.10.1/1.11.1: state keeps growing after SQL GROUP BY and time-window operations
> >
> >
> >
> > Hi op
> >  This situation is rather odd. I would like to confirm:
> >  1) Do all of your jobs see the checkpoint size growing continuously, or only this type of job?
> >  2) Have you tried the RocksDBStateBackend (full and incremental)? How did it behave?
> >
> >  Also, how are the other TTL-related options configured?
> >
> > In principle a checkpoint is just a snapshot of the state, so if the checkpoints keep getting bigger, the state keeps getting bigger.
> > Best,
> > Congxian
> >
> >
> > op <520075...@qq.com> wrote on Wed, Aug 5, 2020 at 2:46 PM:
> >
> > > Hi, I am using the FsStateBackend. Raising the interval to 5 minutes makes no difference; each checkpoint takes around 300 ms, and our business data volume is basically the same every day.
> > > The idle-state retention time is set to 1440 minutes, so in theory the state size should level off after running for about a day. However, the job has now been running for 5 days,
> > > and the size of the checkpoint shared directory keeps growing. I have also confirmed that the GROUP BY keys only occur on the day being processed, i.e. that day's state becomes idle once the day has passed,
> > > so after 5 days of running the cleanup condition should have been met.
> >  ---------- Original message ----------
> >  From: "user-zh" <qcx978132...@gmail.com>
> >  Sent: Monday, August 3, 2020, 5:50 PM
> >  To: "user-zh"
> >  Subject: Re: flink1.10.1/1.11.1: state keeps growing after SQL GROUP BY and time-window operations
> >
> >  Hi
> >    Could you lengthen the checkpoint interval a bit and see whether it stabilizes? Judging from the data volume of the shared
> >  directory, it grows at first and then stays roughly flat.
> >  Note that Checkpointed Data Size is the size of the increment [1], not the size of the whole
> >  checkpoint; if a lot of data changes between checkpoints, this value gets larger.
> >
> >  [1]
> >
> > https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E5%A2%9E%E9%87%8F%E5%BF%AB%E7%85%A7
> >
> > Best,
> >  Congxian
> > 
> > 
> >  op <520075...@qq.com> wrote on Mon, Aug 3, 2020 at 2:18 PM:
> >
> >  > Same question here. I am also seeing the state grow and grow. I am on version 1.11.0, storing checkpoints on HDFS with a checkpoint interval of 3 minutes.
> >  > The logic is a GROUP BY on event day and id with a dozen or so aggregation metrics. It has been running for about 7 days and the state keeps increasing, even though an idle-state retention time is set and the watermark appears to be advancing normally:
> >  > tConfig.setIdleStateRetentionTime(Time.minutes(1440),
> >  > Time.minutes(1440+10))
> >  >
> >  > ---------- Original message ----------
> >  > From: "user-zh" <384939...@qq.com>
> >  > Sent: Monday, August 3, 2020, 1:50 PM
> >  > To: "user-zh"
> >  > Subject: Re: flink1.10.1/1.11.1: state keeps growing after SQL GROUP BY and time-window operations
> >  >
> >  > Hi,
> >  > I switched back to incremental mode and collected some new data:
> >  > 1. Processing rate: 3000 records per second, in a test environment; the load is stable with almost no fluctuation.
> >  > 2. The checkpoint interval is set to 5 seconds.
> >  > 3. This job currently uses a one-minute window.
> >  > 4. Parallelism is 1, running in on-yarn mode.
> >  >
> >  > Right after startup it looks like this:
> >  > <http://apache-flink.147419.n8.nabble.com/file/t793/6.png>
> >  >
> >  > 18 minutes later:
> >  > <http://apache-flink.147419.n8.nabble.com/file/t793/9.png>
> >  >
> >  > Checkpoint settings:
> >  > <http://apache-flink.147419.n8.nabble.com/file/t793/conf.png>
> >  >
> >  > Size on HDFS:
> >  > <http://apache-flink.147419.n8.nabble.com/file/t793/hdfs.png>
> >  >
> >  > The size shown in the UI:
> >  > <
> >
> 
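
A minimal sketch of the two knobs discussed in this thread, assuming Flink 1.11 with the flink-statebackend-rocksdb dependency on the classpath; the checkpoint URI and intervals are placeholders. The same state backend choice can instead be made cluster-wide in flink-conf.yaml (state.backend: rocksdb, state.backend.incremental: true, state.checkpoints.dir: ...), which is what the reply above refers to.

import org.apache.flink.api.common.time.Time;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class IdleStateRetentionSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental RocksDB checkpoints: only the delta since the previous checkpoint is uploaded.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(3 * 60 * 1000);

        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(
                env, EnvironmentSettings.newInstance().inStreamingMode().build());

        // GROUP BY keys that are neither read nor written for 1440 minutes become eligible for
        // cleanup; cleanup is guaranteed to have happened by 1450 minutes (the max retention has
        // to be somewhat larger than the min).
        tableEnv.getConfig().setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1450));

        // ... register the Kafka source and the sinks and submit the GROUP BY / window query here ...
    }
}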

Re: UDF:Type is not supported: ANY

2020-08-05 Thread zilong xiao
I have tried writing it that way, but it did not seem to work. My code is below; could you take a look and tell me whether it should work?

public class Json2Map extends ScalarFunction {

   private static final Logger LOG = LoggerFactory.getLogger(Json2Map.class);

   private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

   public Json2Map(){}

   public Map<String, String> eval(String param) {
      Map<String, String> result = new HashMap<>();
      try {
         if (param == null) {
            return result;
         }
         result = OBJECT_MAPPER.readValue(param, Map.class);
         // iterating the map and calling toString on each entry did not seem to work either
         //for(Object obj : tmp.keySet()){
         // result.put(String.valueOf(obj), String.valueOf(tmp.get(obj)));
         //}
         LOG.info("result is: {}", result);
      } catch (JsonProcessingException e){
         LOG.error("failed to convert json to map, param is: {}", param, e);
      }
      return result;
   }


   @Override
   public TypeInformation<Map<String, String>> getResultType(Class<?>[] signature) {
      return Types.MAP(Types.STRING, Types.STRING);
   }

}


Benchao Li wrote on Thu, Aug 6, 2020 at 11:04 AM:

[snip]
>
> --
>
> Best,
> Benchao Li
>


flink1.11 es connector

2020-08-05 Thread Dream-底限
hi
We would like to use Elasticsearch as a temporal (lookup) table, but Flink does not provide an ES source connector for that, so we would need to implement one ourselves. Does using ES as a temporal table sound like a sensible idea to you? (PS: the reason we are not going with HBase is that designing the rowkey, and possibly maintaining secondary indexes, is a hassle; that said, HBase is still one of the options we are evaluating.)
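
As a rough, hedged sketch of what a hand-rolled Elasticsearch lookup could look like (this is not an existing Flink connector): a user-defined TableFunction that fetches one document per key over HTTP and is joined with LATERAL TABLE. The host, index and the choice to return the raw JSON document as a string are placeholder assumptions; a production version would need batching, caching, error handling and ideally a proper LookupTableSource for FOR SYSTEM_TIME AS OF semantics.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

public class EsLookupFunction extends TableFunction<Row> {

    // e.g. "http://es-host:9200/my_index/_doc/" (placeholder)
    private final String baseUrl;

    public EsLookupFunction(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // One blocking HTTP GET per input key; emits (key, raw JSON document).
    public void eval(String id) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(baseUrl + id).openConnection();
        conn.setConnectTimeout(1000);
        conn.setReadTimeout(1000);
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            String doc = scanner.hasNext() ? scanner.next() : "";
            collect(Row.of(id, doc));
        } finally {
            conn.disconnect();
        }
    }

    @Override
    public TypeInformation<Row> getResultType() {
        return Types.ROW(Types.STRING, Types.STRING);
    }
}

Used roughly as: register it with tableEnv.registerFunction("es_lookup", new EsLookupFunction("http://es-host:9200/my_index/_doc/")) and join with SELECT ... FROM events, LATERAL TABLE(es_lookup(events.device_id)) AS t(k, doc).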


Re: Seeking advice: implementing real-time alerting with Flink

2020-08-05 Thread ????????
(Message body garbled by the archive's encoding; the recoverable fragments discuss using Flink CEP, GroupPattern and within/waiting behaviour for this use case, note that CEP has been available since Flink 1.7, and link to:)
https://developer.aliyun.com/article/738451

---------- Original message ----------
From: "user-zh"

> https://blog.csdn.net/zhangjun5965/article/details/106573528
>
> samuel@ubtrobot.com

Re: UDF:Type is not supported: ANY

2020-08-05 Thread Benchao Li
You can have the UDF return a Map type directly, for example:

public class String2Map extends ScalarFunction {

   public Map<String, String> eval(String param) throws Exception {
      Map<String, String> map = new HashMap<>();
  // ...
  return map;
   }

   @Override
   public TypeInformation<Map<String, String>> getResultType(Class<?>[] signature) {
  return Types.MAP(Types.STRING, Types.STRING);
   }

}


zilong xiao wrote on Thu, Aug 6, 2020 at 10:24 AM:

[snip]


-- 

Best,
Benchao Li


Re: Seeking advice: implementing real-time alerting with Flink

2020-08-05 Thread Jun Zhang
You can use broadcast state. I wrote an article about this that you can use as a reference; just swap the source for one that re-reads the rule configuration from MySQL every few seconds.

https://blog.csdn.net/zhangjun5965/article/details/106573528

samuel@ubtrobot.com  wrote on Thu, Aug 6, 2020 at 10:26 AM:

> We need a real-time alerting feature, and after some research Flink looks like a good fit for implementing it, but a few points are still unclear to me. Any guidance from the experts would be appreciated, thanks!
>
> The alerting has two parts:
>    One is the alert rule definitions. They are stored in MySQL in JSON form, e.g.
> {"times":5}  --- raise an alert when an event occurs more than 5 times;
> {"temperature": 80} --- raise an alert when the temperature exceeds 80.
>    The other is the alert evaluation:
>   1) the reported data is written to Kafka;
>   2) Flink reads the data from Kafka, computes over tumbling windows, and produces an alert whenever a rule is satisfied.
>
>
> The problems I am running into:
> 1. When a rule changes, how can the change take effect promptly?
> 2. If this is implemented with Flink CEP, can multiple rules be active against the same source at the same time within a single job?
> 3. Is there a best practice for this kind of feature?
>
> I hope someone can help. Thank you!
>
>
>
>
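
A minimal sketch of the broadcast-state pattern Jun Zhang suggests above, assuming the rules are re-read from MySQL every few seconds by a custom source (the polling source itself is not shown) and the events come from Kafka; Rule and Event are placeholder POJOs, and only the {"temperature": 80} style of rule is evaluated here.

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicAlerting {

    // Broadcast state: rule name -> threshold, e.g. "temperature" -> 80.
    static final MapStateDescriptor<String, Integer> RULES =
            new MapStateDescriptor<>("rules", Types.STRING, Types.INT);

    public static DataStream<String> wire(DataStream<Event> events, DataStream<Rule> ruleUpdates) {
        BroadcastStream<Rule> broadcastRules = ruleUpdates.broadcast(RULES);

        return events
                .connect(broadcastRules)
                .process(new BroadcastProcessFunction<Event, Rule, String>() {

                    @Override
                    public void processElement(Event event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                        // Every event sees the latest broadcast rule without restarting the job.
                        Integer threshold = ctx.getBroadcastState(RULES).get("temperature");
                        if (threshold != null && event.temperature > threshold) {
                            out.collect("ALERT: device " + event.deviceId + " temperature " + event.temperature + " > " + threshold);
                        }
                    }

                    @Override
                    public void processBroadcastElement(Rule rule, Context ctx, Collector<String> out) throws Exception {
                        // A rule update replaces the previous threshold for that rule name.
                        ctx.getBroadcastState(RULES).put(rule.name, rule.threshold);
                    }
                });
    }

    public static class Event { public long deviceId; public int temperature; }
    public static class Rule { public String name; public int threshold; }
}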


Re: flink1.10.1/1.11.1: state keeps growing after SQL GROUP BY and time-window operations

2020-08-05 Thread op
Thanks; the screenshots and configuration are in the attachments.
I will try configuring the RocksDB StateBackend.

---------- Original message ----------
From: "user-zh"

> The RocksDB StateBackend only needs a little configuration in flink-conf [1].
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E8%AE%BE%E7%BD%AE-state-backend
>
> Best,
> Congxian
>
> op <520075...@qq.com> wrote on Wed, Aug 5, 2020 at 4:03 PM:
>
>  Hi, the TTL configuration is
>  val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
>  val tableEnv = StreamTableEnvironment.create(bsEnv, settings)
>  val tConfig = tableEnv.getConfig
>  tConfig.setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1450))
>
>  1) Currently three jobs all show this behaviour.
>  2) The cluster has no RocksDB environment at the moment.
>
> [snip]


Seeking advice: implementing real-time alerting with Flink

2020-08-05 Thread samuel....@ubtrobot.com
We need a real-time alerting feature, and after some research Flink looks like a good fit for implementing it, but a few points are still unclear to me. Any guidance would be appreciated, thanks!

The alerting has two parts:
   One is the alert rule definitions. They are stored in MySQL in JSON form, e.g.
{"times":5}  --- raise an alert when an event occurs more than 5 times;
{"temperature": 80} --- raise an alert when the temperature exceeds 80.
   The other is the alert evaluation:
  1) the reported data is written to Kafka;
  2) Flink reads the data from Kafka, computes over tumbling windows, and produces an alert whenever a rule is satisfied.


The problems I am running into:
1. When a rule changes, how can the change take effect promptly?
2. If this is implemented with Flink CEP, can multiple rules be active against the same source at the same time within a single job?
3. Is there a best practice for this kind of feature?

I hope someone can help. Thank you!


Re: UDF:Type is not supported: ANY

2020-08-05 Thread zilong xiao
Thanks for the answer, Benchao. For Java users the Map container is also widely used. Right now some of our users need to pull the value of a given key out of a JSON string; the current approach is a getValueFromJson UDF, which has to deserialize the JSON into a Map and call get(key) on every access. Is there a way to let a UDF return a Map type directly, so that we do not have to deserialize and then look up the value every single time?

Benchao Li wrote on Wed, Aug 5, 2020 at 11:49 PM:

[snip]
>
>
> --
>
> Best,
> Benchao Li
>


Re: Re: Re: FLINK SQL view data reuse question

2020-08-05 Thread godfrey he
The sql-client does not support this yet. As for supporting statement sets as pure SQL text,
the community has already agreed on the syntax, so support should follow step by step.

kandy.wang wrote on Wed, Aug 5, 2020 at 10:43 PM:

>
>
>
>
>
>
> @godfrey
> The StatementSet submission approach you describe is not supported when submitting jobs through the sql-client, right? Could that be added?
>
>
>
>
>
>
>
>
>
>
>
> On 2020-08-04 19:36:56, "godfrey he" wrote:
> >Call StatementSet#explain() and print the result to check whether reuse fails because the Deduplicate nodes have different digests.
> >
> >kandy.wang wrote on Tue, Aug 4, 2020 at 6:21 PM:
> >
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> @godfrey
> >> Thanks. I just tried it: for source -> Deduplicate ->
> >> GlobalGroupAggregate, the source is indeed reused, but the Deduplicate is not. In theory both the source and
> >> the Deduplicate belong to the view's logic, so both should be reused. It feels like not enough is being shared yet.
> >>
> >>
> >> On 2020-08-04 17:26:02, "godfrey he" wrote:
> >> >The blink planner can optimize a multi-sink query so that duplicated computation is reused as much as possible.
> >> >In 1.11, submit the job with a StatementSet; before 1.11, use sqlUpdate/insertInto + execute.
> >> >
> >> >kandy.wang wrote on Tue, Aug 4, 2020 at 5:20 PM:
> >> >
> >> >> A question about FLINK SQL views:
> >> >> create view order_source
> >> >>
> >> >> as
> >> >>
> >> >> select order_id, order_goods_id, user_id,...
> >> >>
> >> >> from (
> >> >>
> >> >> ..  proctime,row_number() over(partition by order_id,
> >> >> order_goods_id order by proctime desc) as rownum
> >> >>
> >> >> from hive.temp_dw.dm_trd_order_goods/*+ OPTIONS('
> >> properties.group.id'='flink_etl_kafka_hbase',
> >> >> 'scan.startup.mode'='latest-offset') */
> >> >>
> >> >> ) where  rownum = 1 and  price > 0;
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> insert into hive.temp_dw.day_order_index select rowkey,
> ROW(cast(saleN
> >> as
> >> >> BIGINT),)
> >> >>
> >> >> from
> >> >>
> >> >> (
> >> >>
> >> >> select order_date as rowkey,
> >> >>
> >> >> sum(amount) as saleN,
> >> >>
> >> >> from order_source
> >> >>
> >> >> group by order_date
> >> >>
> >> >> );
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> insert into hive.temp_dw.day_order_index select rowkey,
> ROW(cast(saleN
> >> as
> >> >> BIGINT))
> >> >>
> >> >> from
> >> >>
> >> >> (
> >> >>
> >> >> select order_hour as rowkey,sum(amount) as saleN,
> >> >>
> >> >>
> >> >>
> >> >> from order_source
> >> >>
> >> >> group by order_hour
> >> >>
> >> >> );
> >> >> The question: the same view, with the same consumer group but different sinks, produces 2 jobs, which means the 2 jobs end up sharing one consumer
> >> >> group. The resulting jobs are:  a.  order_source -> sink 1    b.  order_source -> sink 2
> >> >>
> >> >> The intent was to use the view order_source
> >> >> (the view has to deduplicate the order data) to reuse a single full copy of the source data, and underneath to reuse the same state. How can that be achieved?
> >> >>
> >> >>
> >>
>
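
A minimal, self-contained sketch of the StatementSet submission godfrey he refers to above, assuming Flink 1.11 with the blink planner; the datagen/print tables are simplified stand-ins for the thread's Kafka source and real sinks, and the view omits the ROW_NUMBER deduplication for brevity. Putting both INSERTs into one StatementSet yields a single job, and explain() shows whether the source (and Deduplicate) operators are actually shared.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class MultiSinkViewReuse {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        tableEnv.executeSql("CREATE TABLE orders (order_id INT, amount INT) WITH ('connector' = 'datagen')");
        tableEnv.executeSql("CREATE TABLE sink_day (total BIGINT) WITH ('connector' = 'print')");
        tableEnv.executeSql("CREATE TABLE sink_hour (total BIGINT) WITH ('connector' = 'print')");

        // Stand-in for the order_source view from the thread.
        tableEnv.executeSql("CREATE VIEW order_source AS SELECT order_id, amount FROM orders");

        StatementSet set = tableEnv.createStatementSet();
        set.addInsertSql("INSERT INTO sink_day SELECT CAST(SUM(amount) AS BIGINT) FROM order_source");
        set.addInsertSql("INSERT INTO sink_hour SELECT CAST(SUM(amount) AS BIGINT) FROM order_source");

        // One optimized plan for both sinks; the printed plan shows which operators are reused.
        System.out.println(set.explain());
        set.execute();
    }
}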


Re: Deploying PyFlink on ARM / CentOS 7

2020-08-05 Thread 琴师
(Message body garbled by the archive's encoding; the recoverable quoted content is Xingbo Huang's reply, explaining that pyflink-shell.sh only launches the Python shell and pointing to:)
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/python_shell.html

> Best,
> Xingbo
>
> 琴师 <1129656...@qq.com> wrote on Thu, Aug 6, 2020 at 9:11 AM:
>
>  Hi:
>  I did a full build of flink-1.11.1-src.tgz with mvn clean package -DskipTest, and I see flink-1.11.1/build-target/bin/pyflink-shell.sh. What is this script for, and does using it amount to having PyFlink?
>
> [snip]

Re: Deploying PyFlink on ARM / CentOS 7

2020-08-05 Thread Xingbo Huang
Hi,
Definitely not. That script only launches the Python shell. Keep in mind that what you compiled with mvn is just the Java code; it does not contain PyFlink's Python code at all. The pyflink packages that your Python job imports
will not be resolvable by Python, so the job simply cannot run. If you want to try the Python shell, see the documentation [1].

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/python_shell.html

Best,
Xingbo

琴师 <1129656...@qq.com> wrote on Thu, Aug 6, 2020 at 9:11 AM:

> Hi:
>   I did a full build of flink-1.11.1-src.tgz with mvn clean package
> -DskipTest, and I see flink-1.11.1/build-target/bin/pyflink-shell.sh. What is this script for, and does using it amount to having PyFlink?
>
>
> ---------- Original message ----------
> From: "琴师" <1129656...@qq.com>
> Sent: Tuesday, August 4, 2020, 10:19 AM
> To: "user-zh"
> Subject: Re: Deploying PyFlink on ARM / CentOS 7
>
>
>
> Indeed, I tried to compile pyarrow but could not get it to finish, and it is not the only third-party package affected.
> What a pity.
>
>
>
>
> ---------- Original message ----------
> From: "user-zh" <hxbks...@gmail.com>
> Sent: Tuesday, August 4, 2020, 10:09 AM
> To: "user-zh"
> Subject: Re: Deploying PyFlink on ARM / CentOS 7
>
>
>
> Hello,
>
> PyFlink depends on the pyarrow package, and pyarrow does not publish a wheel for ARM. So when you install it, pip downloads the pyarrow source package and tries to build it from source, which is where the cmake failure in your error message comes from.
>
> The only feasible route right now is to try to build and install pyarrow by hand. However, pyarrow is not tested on the ARM architecture, and newer pyarrow releases do not even publish a source package, so I rather doubt you will manage to build and install pyarrow successfully.
>
> Best,
> Xingbo
>
> 琴师 <1129656...@qq.com> wrote on Tue, Aug 4, 2020 at 9:57 AM:
>
>  Hello:
>  I am trying to deploy PyFlink on CentOS on an ARM architecture, with Python 3.5.9 (3.7.1 was also tried). Neither attempt could be completed; both get stuck at the pyarrow step. Is this a platform limitation, or a problem with my deployment? If it is my deployment, is there a known solution?
>  ERROR: Command errored out with exit status 1:
>   command: /usr/local/python3/bin/python3.5 /usr/local/python3/lib/python3.5/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpfda_qhew
>       cwd: /tmp/pip-install-r4b5m7u0/pyarrow
>   Complete output (428 lines):
>   running bdist_wheel
>   running build
>   running build_py
>   creating build
>   creating build/lib.linux-aarch64-3.5
>   creating build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/jvm.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/benchmark.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/_generated_version.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/plasma.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/hdfs.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/cuda.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/orc.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/filesystem.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/parquet.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/flight.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/types.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/ipc.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/fs.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/compat.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/json.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/__init__.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/csv.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/util.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/pandas_compat.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/serialization.py -> build/lib.linux-aarch64-3.5/pyarrow
>   copying pyarrow/feather.py -> build/lib.linux-aarch64-3.5/pyarrow
>   creating build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_jvm.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_deprecations.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_pandas.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_feather.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_gandiva.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_fs.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_json.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_ipc.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_plasma_tf_op.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_misc.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/test_scalars.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/conftest.py -> build/lib.linux-aarch64-3.5/pyarrow/tests
>   copying pyarrow/tests/strategies.py ->
> 

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-05 Thread jincheng sun
Hi David, Thank you for sharing the problems with the current document, and
I agree with you as I also got the same feedback from Chinese users. I am
often contacted by users to ask questions such as whether PyFlink supports
"Java UDF" and whether PyFlink supports "xxxConnector". The root cause of
these problems is that our existing documents are based on Java users (text
and API mixed part). Since Python is newly added from 1.9, many document
information is not friendly to Python users. They don't want to look for
Python content in unfamiliar Java documents. Just yesterday, there were
complaints from Chinese users about where is all the document entries of
 Python API. So, have a centralized entry and clear document structure,
which is the urgent demand of Python users. The original intention of FLIP
is do our best to solve these user pain points.

Hi Xingbo and Wei Thank you for sharing PySpark's status on document
optimization. You're right. PySpark already has a lot of Python user
groups. They also find that Python user community is an important position
for multilingual support. The centralization and unification of Python
document content will reduce the learning cost of Python users, and good
document structure and content will also reduce the Q & A burden of the
community, It's a once and for all job.

Hi Seth, I wonder if your concerns have been resolved through the previous
discussion?

Anyway, the principle of FLIP is that in python document should only
include Python specific content, instead of making a copy of the Java
content. And would be great to have you to join in the improvement for
PyFlink (Both PRs and Review PRs).

Best,
Jincheng


Wei Zhong wrote on Wed, Aug 5, 2020 at 5:46 PM:

> Hi Xingbo,
>
> Thanks for your information.
>
> I think the PySpark's documentation redesigning deserves our attention. It
> seems that the Spark community has also begun to treat the user experience
> of Python documentation more seriously. We can continue to pay attention to
> the discussion and progress of the redesigning in the Spark community. It
> is so similar to our working that there should be some ideas worthy for us.
>
> Best,
> Wei
>
>
> On Aug 5, 2020, at 15:02, Xingbo Huang wrote:
>
> Hi,
>
> I found that the spark community is also working on redesigning pyspark
> documentation[1] recently. Maybe we can compare the difference between our
> document structure and its document structure.
>
> [1] https://issues.apache.org/jira/browse/SPARK-31851
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html
>
> Best,
> Xingbo
>
> David Anderson wrote on Wed, Aug 5, 2020 at 3:17 AM:
>
>> I'm delighted to see energy going into improving the documentation.
>>
>> With the current documentation, I get a lot of questions that I believe
>> reflect two fundamental problems with what we currently provide:
>>
>> (1) We have a lot of contextual information in our heads about how Flink
>> works, and we are able to use that knowledge to make reasonable inferences
>> about how things (probably) work in cases we aren't so familiar with. For
>> example, I get a lot of questions of the form "If I use  will
>> I still have exactly once guarantees?" The answer is always yes, but they
>> continue to have doubts because we have failed to clearly communicate this
>> fundamental, underlying principle.
>>
>> This specific example about fault tolerance applies across all of the
>> Flink docs, but the general idea can also be applied to the Table/SQL and
>> PyFlink docs. The guiding principles underlying these APIs should be
>> written down in one easy-to-find place.
>>
>> (2) The other kind of question I get a lot is "Can I do  with ?"
>> E.g., "Can I use the JDBC table sink from PyFlink?" These questions can be
>> very difficult to answer because it is frequently the case that one has to
>> reason about why a given feature doesn't seem to appear in the
>> documentation. It could be that I'm looking in the wrong place, or it could
>> be that someone forgot to document something, or it could be that it can in
>> fact be done by applying a general mechanism in a specific way that I
>> haven't thought of -- as in this case, where one can use a JDBC sink from
>> Python if one thinks to use DDL.
>>
>> So I think it would be helpful to be explicit about both what is, and
>> what is not, supported in PyFlink. And to have some very clear organizing
>> principles in the documentation so that users can quickly learn where to
>> look for specific facts.
>>
>> Regards,
>> David
>>
>>
>> On Tue, Aug 4, 2020 at 1:01 PM jincheng sun 
>> wrote:
>>
>>> Hi Seth and David,
>>>
>>> I'm very happy to have your reply and suggestions. I would like to share
>>> my thoughts here:
>>>
>>> The main motivation we want to refactor the PyFlink doc is that we want
>>> to make sure that the Python users could find all they want starting from
>>> the PyFlink documentation mainpage. That’s, the PyFlink documentation

Re: Deploying PyFlink on ARM / CentOS 7

2020-08-05 Thread 琴师
Hi:
  I did a full build of flink-1.11.1-src.tgz with mvn clean package -DskipTest, and I see flink-1.11.1/build-target/bin/pyflink-shell.sh. What is this script for, and does using it amount to having PyFlink?


---------- Original message ----------
From: "琴师" <1129656...@qq.com>
Sent: Tuesday, August 4, 2020, 10:19 AM
To: "user-zh"

Re: Issue with single job yarn flink cluster HA

2020-08-05 Thread Ken Krugler
Hi Dinesh,

Did updating to Flink 1.10 resolve the issue?

Thanks,

— Ken

> Hi Andrey,
> Sure We will try to use Flink 1.10 to see if HA issues we are facing is fixed 
> and update in this thread.
> 
> Thanks,
> Dinesh
> 
> On Thu, Apr 2, 2020 at 3:22 PM Andrey Zagrebin  > wrote:
> Hi Dinesh,
> 
> Thanks for sharing the logs. There were couple of HA fixes since 1.7, e.g. 
> [1] and [2].
> I would suggest to try Flink 1.10.
> If the problem persists, could you also find the logs of the failed Job 
> Manager before the failover?
> 
> Best,
> Andrey
> 
> [1] https://jira.apache.org/jira/browse/FLINK-14316 
> 
> [2] https://jira.apache.org/jira/browse/FLINK-11843 
> 
> On Tue, Mar 31, 2020 at 6:49 AM Dinesh J  > wrote:
> Hi Yang,
> I am attaching one full jobmanager log for a job which I reran today. This a 
> job that tries to read from savepoint.
> Same error message "leader election onging" is displayed. And this stays the 
> same even after 30 minutes. If I leave the job without yarn kill, it stays 
> the same forever.
> Based on your suggestions till now, I guess it might be some zookeeper 
> problem. If that is the case, what can I lookout for in zookeeper to figure 
> out the issue?
> 
> Thanks,
> Dinesh


[snip]

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
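
For context, the ZooKeeper-based HA that this leader election belongs to is configured in flink-conf.yaml; a sketch with placeholder hosts and paths (not the poster's actual configuration), plus the YARN restart-attempt setting that per-job clusters rely on:

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /flink
high-availability.storageDir: hdfs:///flink/ha/
yarn.application-attempts: 10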



Flink Prometheus MetricsReporter Question

2020-08-05 Thread Avijit Saha
Hi,

Have a general question about Flink support for Prometheus metrics. We
already have a Prometheus setup in our cluster with ServiceMonitor-s
monitoring ports like 8080 etc. for scraping metrics.

In a setup like this, if we deploy Flink Job managers/Task managers in the
cluster, is there any need to have the PrometheusReporter configured as
well? How does that coordinate with existing Prometheus ServiceMonitors if
present?

Is the  PrometheusReporter based on "pull" model so that it can pull
metrics from Flink and send to some Prometheus host system?

Thanks
Avijit
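
For reference, the PrometheusReporter is pull-based: it starts an HTTP endpoint on every JobManager and TaskManager that Prometheus scrapes, so an existing ServiceMonitor only needs to be pointed at that port in addition to the application ports; without a metrics reporter, Flink does not expose its metrics in Prometheus format. (There is also a separate PrometheusPushGatewayReporter for push setups.) A flink-conf.yaml sketch, with an example port range:

metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249-9260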


Re: Flink CPU load metrics in K8s

2020-08-05 Thread Bajaj, Abhinav
Thanks Roman for providing the details.

I also made more observations that have increased my confusion about this topic.
To ease the calculations, I deployed a test cluster, this time providing 1 CPU
in K8s (with docker) for all the taskmanager containers.

When I check the taskmanager CPU load, the value is in the order of 
"0.002158428663932657".
Assuming that the underlying JVM recognizes 1 CPU allocated to the docker 
container, this values means % CPU usage in ball park of 0.21%.

However, if I look at the K8s metrics(formula below) for this container – it 
turns out in the ball park of 10-16%.
There is no other process running in the container apart from the flink 
taskmanager.

The order of these two values of CPU % usage is different.

Am I comparing the right metrics here?
How are folks running Flink on K8s monitoring the CPU load?

~ Abhi

% CPU usage from K8s metrics
sum(rate(container_cpu_usage_seconds_total{pod=~"my-taskmanagers-*", 
container="taskmanager"}[5m])) by (pod)
/ sum(container_spec_cpu_quota{pod=~"my-taskmanager-pod-*", 
container="taskmanager"}
/ container_spec_cpu_period{pod=~"my-taskmanager-pod-*", 
container="taskmanager"}) by (pod)

From: Roman Grebennikov 
Date: Tuesday, August 4, 2020 at 12:42 AM
To: "user@flink.apache.org" 
Subject: Re: Flink CPU load metrics in K8s


Hi,

JVM.CPU.Load is just a wrapper (MetricUtils.instantiateCPUMetrics) on top of 
OperatingSystemMXBean.getProcessCpuLoad (see 
https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad())

Usually it looks weird if you have multiple CPU cores. For example, if you have 
a job with a single slot 100% utilizing a single CPU core on a 8 core machine, 
the JVM.CPU.Load will be 1.0/8.0 = 0.125. It's also a point-in-time snapshot of 
current CPU usage, so if you're collecting your metrics every minute, and the 
job has spiky workload within this minute (like it's idle almost always and 
once in a minute it consumes 100% CPU for one second), so you have a chance to 
completely miss this from the metrics.

As for me personally, JVM.CPU.Time is more clear indicator of CPU usage, which 
is always increasing amount of milliseconds CPU spent executing your code. And 
it will also catch CPU usage spikes.

Roman Grebennikov | g...@dfdx.me


On Mon, Aug 3, 2020, at 23:34, Bajaj, Abhinav wrote:

Hi,



I am trying to understand the CPU Load metrics reported by Flink 1.7.1 running 
with openjdk 1.8.0_212 on K8s.



After deploying the Flink Job on K8s, I tried to get CPU Load metrics following 
this 
documentation.

curl 
localhost:8081/taskmanagers/7737ac33b311ea0a696422680711597b/metrics?get=Status.JVM.CPU.Load,Status.JVM.CPU.Time

[{"id":"Status.JVM.CPU.Load","value":"0.0023815194093831865"},{"id":"Status.JVM.CPU.Time","value":"2326000"}]



The value of the CPU load looks odd to me.



What is the unit and scale of this value?

How does Flink determine this value?



Appreciate your time and help here.

~ Abhinav Bajaj
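
A small probe of the underlying JMX value, for sanity-checking the numbers discussed above; it assumes a HotSpot/OpenJDK JVM, where Status.JVM.CPU.Load is backed by com.sun.management.OperatingSystemMXBean#getProcessCpuLoad, a point-in-time fraction in [0, 1] of all CPUs the JVM believes it has. If the JVM sees the node's cores rather than the container's CPU limit, this fraction and a container_cpu_usage_seconds_total based percentage will differ roughly by that core ratio, which is worth verifying for your JVM and container runtime versions.

import java.lang.management.ManagementFactory;

public class CpuLoadProbe {
    public static void main(String[] args) throws InterruptedException {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        for (int i = 0; i < 5; i++) {
            // getProcessCpuLoad(): recent CPU usage of this JVM process; getProcessCpuTime(): total CPU ns.
            System.out.printf("availableProcessors=%d processCpuLoad=%.4f processCpuTimeNs=%d%n",
                    Runtime.getRuntime().availableProcessors(),
                    os.getProcessCpuLoad(),
                    os.getProcessCpuTime());
            Thread.sleep(1000);
        }
    }
}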





Re: Two Queries and a Kafka Topic

2020-08-05 Thread Marco Villalobos
Hi Theo,

Thank you.

I just read the State Processor API in an effort to understand Option 1, it 
seems though I can just use a KeyedProcessFunction that loads the data just 
once (maybe on the "open" method), and serialize the values into MapState and 
use it from that point on.

Another option in the documentation is the CheckpointedFunction type, though it was not clear to me from the documentation how to use it.

My data shares a common key, so this might be doable in KeyedProcessFunction.

Is that what you're suggesting?

Again, Thank you.

Marco A, Villalobos


> On Aug 5, 2020, at 3:52 AM, Theo Diefenthal 
>  wrote:
> 
> Hi Marco,
> 
> In general, I see three solutions here you could approach: 
> 
> 1. Use the StateProcessorAPI: You can run a program with the 
> stateProcessorAPI that loads the data from JDBC and stores it into a Flink 
> SavePoint. Afterwards, you start your streaming job from that savepoint which 
> will load its state and within find all the data from JDBC stored already. 
> 2. Load from master, distribute with the job: When you build up your 
> jobgraph, you could execute the JDBC queries and put the result into some 
> Serializable class which in turn you plug in a an operator in your stream 
> (e.g. a map stage). The class along with all the queried data will be 
> serialized and deserialized on the taskmanagers (Usually, I use this for 
> configuration parameters, but it might be ok in this case as well)
> 3. Load from TaskManager: In your map-function, if the very first event is 
> received, you can block processing and synchronously load the data from JDBC 
> (So each Taskmanager performs the JDBC query itself). You then keep the data 
> to be used for all subsequent map steps. 
> 
> I think, option 3 is the easiest to be implemented while option 1 might be 
> the most elegant way in my opinion. 
> 
> Best regards
> Theo
> 
> Von: "Marco Villalobos" 
> An: "Leonard Xu" 
> CC: "user" 
> Gesendet: Mittwoch, 5. August 2020 04:33:23
> Betreff: Re: Two Queries and a Kafka Topic
> 
> Hi Leonard,
> 
> First, Thank you.
> 
> I am currently trying to restrict my solution to Apache Flink 1.10 because 
> its the current version supported by Amazon EMR.
> i am not ready to change our operational environment to solve this.
> 
> Second, I am using the DataStream API.  The Kafka Topic is not in a table, it 
> is in a DataStream.
> 
> The SQL queries are literally from a PostgresSQL database, and only need to 
> be run exactly once in the lifetime of the job.
> 
> I am struggling to determine where this happens.
> 
> JDBCInputFormat seems to query the SQL table repetitively, and also 
> connecting streams and aggregating into one object is very complicated.
> 
> Thus, I am wondering what is the right approach.  
> 
> Let me restate the parameters.
> 
> SQL Query One = data in PostgreSQL (200K records) that is used for business 
> logic.
> SQL Query Two = data in PostgreSQL (1000 records) that is used for business 
> logic.
> Kafka Topic One = unlimited data-stream that uses the data-stream api and 
> queries above to write into multiple sinks
> 
> ASCII diagram:
>
> [SQL Query One] ---> [Aggregate to Map] ------------------+
>                                                           |
>                                                           v
> Kafka [Kafka Topic One] ---> [Keyed Process Function (Query One Map, Query Two Map)] ---> [Multiple Sinks]
>                                                           ^
>                                                           |
> [SQL Query Two] ---> [Aggregate to Map] ------------------+
>
>
> Maybe my graph above helps.  You see, I need Query One and Query Two only 
> ever execute once.  After that the information they provide are used to 
> correctly process the Kafka Topic.
> 
> I'll take a deep further to try and understand what you said, thank you, but 
> JDBCInputFormat seem to repetitively query the database.  Maybe I need to 
> write a RichFunction or AsyncIO function and cache the results in state after 
> that.
> 
> 
> 
> On Aug 4, 2020, at 6:25 PM, Leonard Xu  > wrote:
> 
> Hi, Marco
> 
> If I need SQL Query One and SQL Query Two to happen just one time,
> 
> Looks like you want to reuse this kafka table in one job, It’s supported to 
> execute multiple query in one sql job in Flink 1.11. 
> You can use `StatementSet`[1] to add SQL Query one and SQL query Two in a 
> single SQL job[1].
> 
> 
> Best
> Leonard 
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/insert.html#run-an-insert-statement
>  
> 
> 
> 
> On Aug 5, 2020, at 04:34, Marco Villalobos wrote:
> 
> Lets say that I have:
> 
> SQL Query One from data in PostgreSQL (200K records).
> SQL Query Two from data in PostgreSQL (1000 records).
> and Kafka Topic One.
> 
> Let's also say that main data from this Flink job arrives in Kafka Topic One.
> 
> If I need SQL Query One and SQL Query Two to 
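
A minimal sketch of the "load once per task in open()" idea (Theo's option 3 / the RichFunction approach Marco mentions), assuming the reference data fits comfortably in memory; the JDBC URL, credentials and queries are placeholders. Each parallel instance runs the PostgreSQL queries once when it opens and then only reads the in-memory map while processing the Kafka stream. If the data must live in Flink managed state instead, the KeyedProcessFunction + MapState variant from the discussion is the alternative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichWithReferenceData extends RichMapFunction<String, String> {

    private transient Map<String, String> referenceData;

    @Override
    public void open(Configuration parameters) throws Exception {
        referenceData = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://db-host:5432/mydb", "user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT key_col, value_col FROM reference_table_one")) {
            while (rs.next()) {
                referenceData.put(rs.getString("key_col"), rs.getString("value_col"));
            }
        }
        // Run the second (1000-row) query the same way and merge it into the map or a second map.
    }

    @Override
    public String map(String kafkaRecord) {
        // Enrich/route each Kafka record using the data loaded once in open().
        String extra = referenceData.getOrDefault(kafkaRecord, "unknown");
        return kafkaRecord + "," + extra;
    }
}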

Re: Handle idle kafka source in Flink 1.9

2020-08-05 Thread bat man
Hello Arvid,

Thanks for the suggestion/reference and my apologies for the late reply.

With this I am able to process the data with some topics not having regular
data. Obviously, late data is being handheld as in side-output and has a
process for it.
One challenge is handling the back-fill: when I run the job on old data, the
watermark (with maxOutOfOrderness set to 10 minutes) causes the older data to be
filtered out as late. To handle this I am thinking of running the side-input with
maxOutOfOrderness stretched back to the oldest data, while the regular job keeps
the normal setting.

Thanks,
Hemant

On Thu, Jul 30, 2020 at 2:41 PM Arvid Heise  wrote:

> Hi Hemant,
>
> sorry for the late reply.
>
> You can just create your own watermark assigner and either copy the
> assigner from Flink 1.11 or take the one that we use in our trainings [1].
>
> [1]
> https://github.com/ververica/flink-training-troubleshooting/blob/master/src/main/java/com/ververica/flinktraining/solutions/troubleshoot/TroubledStreamingJobSolution2.java#L129-L187
>
> On Thu, Jul 23, 2020 at 8:48 PM bat man  wrote:
>
>> Thanks Niels for a great talk. You have covered two of my pain areas -
>> slim and broken streams. Since I am dealing with device data from on-prem
>> data centers. The first option of generating fabricated watermark events is
>> fine, however as mentioned in your talk how are you handling forwarding it
>> to the next stream(next kafka topic) after enrichment. Have you got any
>> solution for this?
>>
>> -Hemant
>>
>> On Thu, Jul 23, 2020 at 12:05 PM Niels Basjes  wrote:
>>
>>> Have a look at this presentation I gave a few weeks ago.
>>> https://youtu.be/bQmz7JOmE_4
>>>
>>> Niels Basjes
>>>
>>> On Wed, 22 Jul 2020, 08:51 bat man,  wrote:
>>>
 Hi Team,

 Can someone share their experiences handling this.

 Thanks.

 On Tue, Jul 21, 2020 at 11:30 AM bat man  wrote:

> Hello,
>
> I have a pipeline which consumes data from a Kafka source. Since, the
> partitions are partitioned by device_id in case a group of devices is down
> some partitions will not get normal flow of data.
> I understand from documentation here[1] in flink 1.11 one can declare
> the source idle -
> WatermarkStrategy.>forBoundedOutOfOrderness(
> Duration.ofSeconds(20)).withIdleness(Duration.ofMinutes(1));
>
> How can I handle this in 1.9, since I am using aws emr and emr doesn't
> have any release with the latest flink version.
>
> One way I could think of is to trigger watermark generation every 10
> minutes or so using Periodic watermarks. However, this will not be full
> proof, are there any better way to handle this more dynamically.
>
> [1] -
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/event_timestamps_watermarks.html#watermark-strategies-and-the-kafka-connector
>
> Thanks,
> Hemant
>
>
>
> --
>
> Arvid Heise | Senior Java Developer
>
> 
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng
>
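
A sketch of the "fabricated watermark" idea for Flink 1.9 (the linked training repository is the authoritative example; this is a simplified variant with placeholder constants): a periodic assigner that normally derives the watermark from the max event timestamp, but once no events have arrived for maxIdleTimeMs it lets the watermark advance from the wall clock so that idle partitions or slim topics do not hold downstream windows back.

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class IdleAwareWatermarkAssigner<T> implements AssignerWithPeriodicWatermarks<T> {

    private final long maxOutOfOrdernessMs;
    private final long maxIdleTimeMs;

    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEventProcTime = Long.MIN_VALUE;

    public IdleAwareWatermarkAssigner(long maxOutOfOrdernessMs, long maxIdleTimeMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.maxIdleTimeMs = maxIdleTimeMs;
    }

    @Override
    public long extractTimestamp(T element, long previousElementTimestamp) {
        // Assumes timestamps were already attached upstream (e.g. Kafka record timestamps).
        maxTimestamp = Math.max(maxTimestamp, previousElementTimestamp);
        lastEventProcTime = System.currentTimeMillis();
        return previousElementTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        if (maxTimestamp == Long.MIN_VALUE) {
            return null; // nothing seen yet, emit no watermark
        }
        long now = System.currentTimeMillis();
        long basis = maxTimestamp;
        if (now - lastEventProcTime > maxIdleTimeMs) {
            // Source looks idle: fabricate progress from the wall clock instead of stalling.
            basis = Math.max(basis, now - maxIdleTimeMs);
        }
        return new Watermark(basis - maxOutOfOrdernessMs);
    }
}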


Re: Status of a job when a kafka source dies

2020-08-05 Thread Piotr Nowojski
Hi Nick,

Could you elaborate more, what event and how would you like Flink to
handle? Is there some kind of Kafka's API that can be used to listen to
such kind of events? Becket, do you maybe know something about this?

As a side note Nick, can not you configure some timeouts [1] in the
KafkaConsumer? Like `request.timeout.ms` or `consumer.timeout.ms`? But as I
wrote before, that would be more a question to Kafka guys.

Piotrek

[1] http://kafka.apache.org/20/documentation/

śr., 5 sie 2020 o 19:58 Nick Bendtner  napisał(a):

> +user group.
>
> On Wed, Aug 5, 2020 at 12:57 PM Nick Bendtner  wrote:
>
>> Thanks Piotr but shouldn't this event be handled by the
>> FlinkKafkaConsumer since the poll happens inside the FlinkKafkaConsumer.
>> How can I catch this event in my code since I don't have control over the
>> poll.
>>
>> Best,
>> Nick.
>>
>> On Wed, Aug 5, 2020 at 12:14 PM Piotr Nowojski 
>> wrote:
>>
>>> Hi Nick,
>>>
>>> What Aljoscha was trying to say is that Flink is not trying to do any
>>> magic. If `KafkaConsumer` - which is being used under the hood of
>>> `FlinkKafkaConsumer` connector - throws an exception, this
>>> exception bubbles up causing the job to failover. If the failure is handled
>>> by the `KafkaConsumer` silently, that's what's happening. As we can in the
>>> TM log that you attached, the latter seems to be happening - note that the
>>> warning is logged by "org.apache.kafka.clients.NetworkClient" package, so
>>> that's not the code we (Flink developers) control.
>>>
>>> If you want to change this behaviour, unless someone here on this
>>> mailing list just happens to know the answer, the better place to ask such
>>> a question on the Kafka mailing list. Maybe there is some way to configure
>>> this.
>>>
>>> And sorry I don't know much about neither the KafkaConsumer nor the
>>> KafkaBrokers configuration :(
>>>
>>> Piotrek
>>>
>>> wt., 4 sie 2020 o 22:04 Nick Bendtner  napisał(a):
>>>
 Hi,
 I don't observe this behaviour though, we use flink 1.7.2 . I stopped
 kafka and zookeeper on all broker nodes. On the flink side, I see the
 messages in the log ( data is obfuscated) . There are no error logs. The
 kafka consumer properties are

 1. "bootstrap.servers"

 2. "zookeeper.connect

 3. "auto.offset.reset"

 4. "group.id"

 5."security.protocol"


 The flink consumer starts consuming data as soon as the kafka comes
 back up. So I want to know in what scenario/kafka consumer config will the
 job go to failed state after a finite number of restart attempts from
 checkpoint.


 TM log.
 2020-08-04 19:50:55,539 WARN  org.apache.kafka.clients.NetworkClient
  - [Consumer clientId=consumer-5,
 groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1003 (
 yyyrspapd036.xxx.com/ss.mm.120.124:9093) could not be established.
 Broker may not be available.
 2020-08-04 19:50:55,540 WARN  org.apache.kafka.clients.NetworkClient
  - [Consumer clientId=consumer-4,
 groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1002 (
 yyyrspapd037.xxx.com/ss.mm.120.125:9093) could not be established.
 Broker may not be available.
 2020-08-04 19:50:55,791 WARN  org.apache.kafka.clients.NetworkClient
  - [Consumer clientId=consumer-4,
 groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1004 (
 yyyrspapd035.xxx.com/ss.mm.120.123:9093) could not be established.
 Broker may not be available.
 2020-08-04 19:50:55,791 WARN  org.apache.kafka.clients.NetworkClient
  - [Consumer clientId=consumer-6,
 groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1003 (
 yyyrspapd036.xxx.com/ss.mm.120.124:9093) could not be established.
 Broker may not be available.

 Best,
 Nick

 On Mon, Jul 20, 2020 at 10:27 AM Aljoscha Krettek 
 wrote:

> Hi,
>
> Flink doesn't do any special failure-handling or retry logic, so it’s
> up
> to how the KafkaConsumer is configured via properties. In general
> Flink
> doesn’t try to be smart: when something fails an exception fill bubble
> up that will fail this execution of the job. If checkpoints are
> enabled
> this will trigger a restore, this is controlled by the restart
> strategy.
> If that eventually gives up the job fill go to “FAILED” and stop.
>
> This is the relevant section of the docs:
>
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html
>
> Best,
> Aljoscha
>
> On 15.07.20 17:42, Nick Bendtner wrote:
> > Hi guys,
> > I want to know what is the default behavior of Kafka source when a
> kafka
> > cluster goes down during streaming. Will the job status go to
> failing or is
> > the exception caught and there is a back off before the source tries
> to

Re: Status of a job when a kafka source dies

2020-08-05 Thread Nick Bendtner
+user group.

On Wed, Aug 5, 2020 at 12:57 PM Nick Bendtner  wrote:

> Thanks Piotr but shouldn't this event be handled by the FlinkKafkaConsumer
> since the poll happens inside the FlinkKafkaConsumer. How can I catch this
> event in my code since I don't have control over the poll.
>
> Best,
> Nick.
>
> On Wed, Aug 5, 2020 at 12:14 PM Piotr Nowojski 
> wrote:
>
>> Hi Nick,
>>
>> What Aljoscha was trying to say is that Flink is not trying to do any
>> magic. If `KafkaConsumer` - which is being used under the hood of
>> `FlinkKafkaConsumer` connector - throws an exception, this
>> exception bubbles up causing the job to failover. If the failure is handled
>> by the `KafkaConsumer` silently, that's what's happening. As we can in the
>> TM log that you attached, the latter seems to be happening - note that the
>> warning is logged by "org.apache.kafka.clients.NetworkClient" package, so
>> that's not the code we (Flink developers) control.
>>
>> If you want to change this behaviour, unless someone here on this mailing
>> list just happens to know the answer, the better place to ask such a
>> question on the Kafka mailing list. Maybe there is some way to configure
>> this.
>>
>> And sorry I don't know much about neither the KafkaConsumer nor the
>> KafkaBrokers configuration :(
>>
>> Piotrek
>>
>> On Tue, Aug 4, 2020 at 22:04, Nick Bendtner wrote:
>>
>>> Hi,
>>> I don't observe this behaviour though, we use flink 1.7.2 . I stopped
>>> kafka and zookeeper on all broker nodes. On the flink side, I see the
>>> messages in the log ( data is obfuscated) . There are no error logs. The
>>> kafka consumer properties are
>>>
>>> 1. "bootstrap.servers"
>>>
>>> 2. "zookeeper.connect
>>>
>>> 3. "auto.offset.reset"
>>>
>>> 4. "group.id"
>>>
>>> 5."security.protocol"
>>>
>>>
>>> The flink consumer starts consuming data as soon as the kafka comes back
>>> up. So I want to know in what scenario/kafka consumer config will the job
>>> go to failed state after a finite number of restart attempts from
>>> checkpoint.
>>>
>>>
>>> TM log.
>>> 2020-08-04 19:50:55,539 WARN  org.apache.kafka.clients.NetworkClient
>>>- [Consumer clientId=consumer-5,
>>> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1003 (
>>> yyyrspapd036.xxx.com/ss.mm.120.124:9093) could not be established.
>>> Broker may not be available.
>>> 2020-08-04 19:50:55,540 WARN  org.apache.kafka.clients.NetworkClient
>>>- [Consumer clientId=consumer-4,
>>> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1002 (
>>> yyyrspapd037.xxx.com/ss.mm.120.125:9093) could not be established.
>>> Broker may not be available.
>>> 2020-08-04 19:50:55,791 WARN  org.apache.kafka.clients.NetworkClient
>>>- [Consumer clientId=consumer-4,
>>> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1004 (
>>> yyyrspapd035.xxx.com/ss.mm.120.123:9093) could not be established.
>>> Broker may not be available.
>>> 2020-08-04 19:50:55,791 WARN  org.apache.kafka.clients.NetworkClient
>>>- [Consumer clientId=consumer-6,
>>> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1003 (
>>> yyyrspapd036.xxx.com/ss.mm.120.124:9093) could not be established.
>>> Broker may not be available.
>>>
>>> Best,
>>> Nick
>>>
>>> On Mon, Jul 20, 2020 at 10:27 AM Aljoscha Krettek 
>>> wrote:
>>>
 Hi,

 Flink doesn't do any special failure-handling or retry logic, so it’s
 up
 to how the KafkaConsumer is configured via properties. In general Flink
 doesn’t try to be smart: when something fails an exception fill bubble
 up that will fail this execution of the job. If checkpoints are enabled
 this will trigger a restore, this is controlled by the restart
 strategy.
 If that eventually gives up the job fill go to “FAILED” and stop.

 This is the relevant section of the docs:

 https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html

 Best,
 Aljoscha

 On 15.07.20 17:42, Nick Bendtner wrote:
 > Hi guys,
 > I want to know what is the default behavior of Kafka source when a
 kafka
 > cluster goes down during streaming. Will the job status go to failing
 or is
 > the exception caught and there is a back off before the source tries
 to
 > poll for more events ?
 >
 >
 > Best,
 > Nick.
 >




Re: Status of a job when a kafka source dies

2020-08-05 Thread Piotr Nowojski
Hi Nick,

What Aljoscha was trying to say is that Flink is not trying to do any
magic. If the `KafkaConsumer` - which is used under the hood of the
`FlinkKafkaConsumer` connector - throws an exception, this exception
bubbles up and causes the job to failover. If the failure is handled by the
`KafkaConsumer` silently, then that's what happens. As we can see in the TM
log that you attached, the latter seems to be happening - note that the
warning is logged by the "org.apache.kafka.clients.NetworkClient" package,
so that's not code we (Flink developers) control.

If you want to change this behaviour, unless someone here on this mailing
list happens to know the answer, the better place to ask such a question is
the Kafka mailing list. Maybe there is some way to configure this.

And sorry, I don't know much about either the KafkaConsumer or the Kafka
brokers' configuration :(

Piotrek

On Tue, Aug 4, 2020 at 22:04, Nick Bendtner wrote:

> Hi,
> I don't observe this behaviour though, we use flink 1.7.2 . I stopped
> kafka and zookeeper on all broker nodes. On the flink side, I see the
> messages in the log ( data is obfuscated) . There are no error logs. The
> kafka consumer properties are
>
> 1. "bootstrap.servers"
>
> 2. "zookeeper.connect
>
> 3. "auto.offset.reset"
>
> 4. "group.id"
>
> 5."security.protocol"
>
>
> The flink consumer starts consuming data as soon as the kafka comes back
> up. So I want to know in what scenario/kafka consumer config will the job
> go to failed state after a finite number of restart attempts from
> checkpoint.
>
>
> TM log.
> 2020-08-04 19:50:55,539 WARN  org.apache.kafka.clients.NetworkClient
>  - [Consumer clientId=consumer-5,
> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1003 (
> yyyrspapd036.xxx.com/ss.mm.120.124:9093) could not be established. Broker
> may not be available.
> 2020-08-04 19:50:55,540 WARN  org.apache.kafka.clients.NetworkClient
>  - [Consumer clientId=consumer-4,
> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1002 (
> yyyrspapd037.xxx.com/ss.mm.120.125:9093) could not be established. Broker
> may not be available.
> 2020-08-04 19:50:55,791 WARN  org.apache.kafka.clients.NetworkClient
>  - [Consumer clientId=consumer-4,
> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1004 (
> yyyrspapd035.xxx.com/ss.mm.120.123:9093) could not be established. Broker
> may not be available.
> 2020-08-04 19:50:55,791 WARN  org.apache.kafka.clients.NetworkClient
>  - [Consumer clientId=consumer-6,
> groupId=flink-AIP-QA-Audit-Consumer] Connection to node 1003 (
> yyyrspapd036.xxx.com/ss.mm.120.124:9093) could not be established. Broker
> may not be available.
>
> Best,
> Nick
>
> On Mon, Jul 20, 2020 at 10:27 AM Aljoscha Krettek 
> wrote:
>
>> Hi,
>>
>> Flink doesn't do any special failure-handling or retry logic, so it’s up
>> to how the KafkaConsumer is configured via properties. In general Flink
>> doesn’t try to be smart: when something fails an exception fill bubble
>> up that will fail this execution of the job. If checkpoints are enabled
>> this will trigger a restore, this is controlled by the restart strategy.
>> If that eventually gives up the job fill go to “FAILED” and stop.
>>
>> This is the relevant section of the docs:
>>
>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html
>>
>> Best,
>> Aljoscha
>>
>> On 15.07.20 17:42, Nick Bendtner wrote:
>> > Hi guys,
>> > I want to know what is the default behavior of Kafka source when a kafka
>> > cluster goes down during streaming. Will the job status go to failing
>> or is
>> > the exception caught and there is a back off before the source tries to
>> > poll for more events ?
>> >
>> >
>> > Best,
>> > Nick.
>> >
>>
>>


Re: Kafka transaction error lead to data loss under end to end exact-once

2020-08-05 Thread Piotr Nowojski
Hi Lu,

In this case, judging from the quite fragmented log/error message that you
posted, the job has failed, so Flink indeed detected some issue, and that
probably means data loss in Kafka (in such a case you could probably
recover some of the lost records by reading from Kafka in `read_uncommitted`
mode, but that can lead to data duplication).

However, a very similar error can be logged by Flink as a WARN during
recovery. In that case it can mean either:
- data loss because of timeouts (keep in mind that the Kafka transaction
timeout must cover: checkpoint interval + downtime during the failure +
time to restart and recover the Flink job), or
- the transaction was already committed, just before the failure happened,

and there is unfortunately no way to distinguish those two cases using the
Kafka API.
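
For reference, a rough sketch of raising the transaction timeout on the
producer side (untested; the exact FlinkKafkaProducer constructor overloads
differ between Flink versions, and the topic, servers and values here are
placeholders):

   Properties producerProps = new Properties();
   producerProps.setProperty("bootstrap.servers", "broker:9092");
   // Must not exceed transaction.max.timeout.ms on the brokers (broker default: 15 min)
   // and should cover: checkpoint interval + downtime during the failure + recovery time.
   producerProps.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));

   FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
         "output-topic",
         new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
         producerProps,
         FlinkKafkaProducer.Semantic.EXACTLY_ONCE);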

Piotrek


On Wed, Aug 5, 2020 at 10:17, Khachatryan Roman wrote:

> Hi Lu,
>
> AFAIK, it's not going to be fixed. As you mentioned in the first email,
> Kafka should be configured so that it's transaction timeout is less than
> your max checkpoint duration.
>
> However, you should not only change transaction.timeout.ms in producer
> but also transaction.max.timeout.ms on your brokers.
> Please refer to
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/connectors/kafka.html#caveats
>
> Regards,
> Roman
>
>
> On Wed, Aug 5, 2020 at 12:24 AM Lu Niu  wrote:
>
>> Hi, Khachatryan
>>
>> Thank you for the reply. Is that a problem that can be fixed? If so, is
>> the fix on roadmap? Thanks!
>>
>> Best
>> Lu
>>
>> On Tue, Aug 4, 2020 at 1:24 PM Khachatryan Roman <
>> khachatryan.ro...@gmail.com> wrote:
>>
>>> Hi Lu,
>>>
>>> Yes, this error indicates data loss (unless there were no records in the
>>> transactions).
>>>
>>> Regards,
>>> Roman
>>>
>>>
>>> On Mon, Aug 3, 2020 at 9:14 PM Lu Niu  wrote:
>>>
 Hi,

 We are using end to end exact-once flink + kafka and
 encountered belowing exception which usually came after checkpoint 
 failures:
 ```














 *Caused by: org.apache.kafka.common.errors.ProducerFencedException:
 Producer attempted an operation with an old epoch. Either there is a newer
 producer with the same transactionalId, or the producer's transaction has
 been expired by the broker.2020-07-28 16:27:51,633 INFO
  org.apache.flink.runtime.executiongraph.ExecutionGraph- Job xxx
 (f08fc4b1edceb3705e2cb134a8ece73d) switched from state RUNNING to
 FAILING.java.lang.RuntimeException: Error while confirming checkpoint at
 org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1219) at
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at
 java.util.concurrent.FutureTask.run(FutureTask.java:266) at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)Caused by:
 org.apache.flink.util.FlinkRuntimeException: Committing one of transactions
 failed, logging first encountered failure at
 org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.notifyCheckpointComplete(TwoPhaseCommitSinkFunction.java:295)
 at
 org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
 at
 org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:842)
 at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1214) ... 5
 more*
 ```
 We did some end to end tests and noticed whenever such a thing happens,
 there will be a data loss.

 Referring to several related questions, I understand I need to increase
 `transaction.timeout.ms`  because:
 ```
 *Semantic.EXACTLY_ONCE mode relies on the ability to commit
 transactions that were started before taking a checkpoint, after recovering
 from the said checkpoint. If the time between Flink application crash and
 completed restart is larger than Kafka’s transaction timeout there will be
 data loss (Kafka will automatically abort transactions that exceeded
 timeout time).*
 ```

 But I want to confirm with the community that:
 *Does an exception like this will always lead to data loss? *

 I asked because we get this exception sometimes even when the
 checkpoint succeeds.

 Setup:
 Flink 1.9.1

 Best
 Lu

>>>


Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Eleanore Jin
Hi Yang and Till,

Thanks a lot for the help! I have a similar question to the one Till
mentioned: if we do not fail the Flink pods when the restart strategy is
exhausted, it might be hard to monitor such failures. Today I get alerts if
the k8s pods are restarted or in a crash loop, but if that is no longer the
case, how can we handle the monitoring? In production, I have hundreds of
small flink jobs running (2-8 TM pods) doing stateless processing, and it is
really hard for us to expose an ingress for each JM REST endpoint to
periodically query the job status of each flink job.

Thanks a lot!
Eleanore

On Wed, Aug 5, 2020 at 4:56 AM Till Rohrmann  wrote:

> You are right Yang Wang.
>
> Thanks for creating this issue.
>
> Cheers,
> Till
>
> On Wed, Aug 5, 2020 at 1:33 PM Yang Wang  wrote:
>
>> Actually, the application status shows in YARN web UI is not determined
>> by the jobmanager process exit code.
>> Instead, we use "resourceManagerClient.unregisterApplicationMaster" to
>> control the final status of YARN application.
>> So although jobmanager exit with zero code, it still could show failed
>> status in YARN web UI.
>>
>> I have created a ticket to track this improvement[1].
>>
>> [1]. https://issues.apache.org/jira/browse/FLINK-18828
>>
>>
>> Best,
>> Yang
>>
>>
>> Till Rohrmann wrote on Wed, Aug 5, 2020 at 3:56 PM:
>>
>>> Yes for the other deployments it is not a problem. A reason why people
>>> preferred non-zero exit codes in case of FAILED jobs is that this is easier
>>> to monitor than having to take a look at the actual job result. Moreover,
>>> in the YARN web UI the application shows as failed if I am not mistaken.
>>> However, from a framework's perspective, a FAILED job does not mean that
>>> Flink has failed and, hence, the return code could still be 0 in my opinion.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang  wrote:
>>>
 Hi Eleanore,

 Yes, I suggest to use Job to replace Deployment. It could be used
 to run jobmanager one time and finish after a successful/failed completion.

 However, using Job still could not solve your problem completely. Just
 as Till said, When a job exhausts the restart strategy, the jobmanager
 pod will terminate with non-zero exit code. It will cause the K8s
 restarting it again. Even though we could set the resartPolicy and
 backoffLimit,
 this is not a clean and correct way to go. We should terminate the
 jobmanager process with zero exit code in such situation.

 @Till Rohrmann  I just have one concern. Is it a
 special case for K8s deployment? For standalone/Yarn/Mesos, it seems that
 terminating with
 non-zero exit code is harmless.


 Best,
 Yang

 Eleanore Jin wrote on Tue, Aug 4, 2020 at 11:54 PM:

> Hi Yang & Till,
>
> Thanks for your prompt reply!
>
> Yang, regarding your question, I am actually not using k8s job, as I
> put my app.jar and its dependencies under flink's lib directory. I have 1
> k8s deployment for job manager, and 1 k8s deployment for task manager, and
> 1 k8s service for job manager.
>
> As you mentioned above, if flink job is marked as failed, it will
> cause the job manager pod to be restarted. Which is not the ideal
> behavior.
>
> Do you suggest that I should change the deployment strategy from using
> k8s deployment to k8s job? In case the flink program exit with non-zero
> code (e.g. exhausted number of configured restart), pod can be marked as
> complete hence not restarting the job again?
>
> Thanks a lot!
> Eleanore
>
> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang 
> wrote:
>
>> @Till Rohrmann  In native mode, when a Flink
>> application terminates with FAILED state, all the resources will be 
>> cleaned
>> up.
>>
>> However, in standalone mode, I agree with you that we need to rethink
>> the exit code of Flink. When a job exhausts the restart
>> strategy, we should terminate the pod and do not restart again. After
>> googling, it seems that we could not specify the restartPolicy
>> based on exit code[1]. So maybe we need to return a zero exit code to
>> avoid restarting by K8s.
>>
>> [1].
>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>
>> Best,
>> Yang
>>
>> Till Rohrmann wrote on Tue, Aug 4, 2020 at 3:48 PM:
>>
>>> @Yang Wang  I believe that we should
>>> rethink the exit codes of Flink. In general you want K8s to restart a
>>> failed Flink process. Hence, an application which terminates in state
>>> FAILED should not return a non-zero exit code because it is a valid
>>> termination state.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang 
>>> wrote:
>>>
 Hi Eleanore,

 I think you are using K8s resource "Job" to 

Re: UDF:Type is not supported: ANY

2020-08-05 Thread Benchao Li
Hi zilong,

The SQL ARRAY type corresponds to the legacy types Types.PRIMITIVE_ARRAY or
Types.OBJECT_ARRAY; type information of any other kind is treated as the ANY
type.
This should have nothing to do with generics; the implementation simply does
not map Types.LIST(Types.STRING) to SQL's ARRAY type.
Support for using a List as ARRAY data is only planned for 1.12 [1].

[1] https://issues.apache.org/jira/browse/FLINK-18417
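
For illustration, a rough sketch of the array-based workaround (returning
String[] so that the result maps to the SQL ARRAY type; untested, and the
class name is just an example):

   public class Json2Array extends ScalarFunction {

      private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

      public String[] eval(String param) {
         try {
            List<Map<String, Object>> list =
                  OBJECT_MAPPER.readValue(param, new TypeReference<List<Map<String, Object>>>() {});
            String[] result = new String[list.size()];
            for (int i = 0; i < list.size(); i++) {
               // keep each element as its JSON string representation
               result[i] = OBJECT_MAPPER.writeValueAsString(list.get(i));
            }
            return result;
         } catch (Exception e) {
            return new String[0];
         }
      }

      @Override
      public TypeInformation<?> getResultType(Class<?>[] signature) {
         // OBJECT_ARRAY(STRING) is one of the legacy types that map to SQL ARRAY<STRING>
         return Types.OBJECT_ARRAY(Types.STRING);
      }
   }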

zilong xiao wrote on Mon, Aug 3, 2020 at 8:23 PM:

> That doesn't work, I tried it. Even iterating over the map and calling
> toString on every entry fails. I suspect the cause is Java's generics
> mechanism (type erasure); I wonder how the community sees this problem.
>
> godfrey he wrote on Mon, Aug 3, 2020 at 7:50 PM:
>
> > Try changing the type parameters of the Map you use
> >
> > zilong xiao wrote on Mon, Aug 3, 2020 at 4:56 PM:
> >
> > > For now an array works as a replacement for List, but Map still cannot
> > > be made to run successfully
> > >
> > > zilong xiao wrote on Mon, Aug 3, 2020 at 10:43 AM:
> > >
> > > > While writing Flink SQL jobs recently I found that the built-in
> > > > functions do not quite cover my needs; very common helpers such as
> > > > Json2Array and Json2Map do not seem to be provided, so they have to
> > > > be implemented as UDFs. When I tried to implement them with Jackson,
> > > > SQL validation always fails with `Type is not supported: ANY`. My
> > > > guess is that this is related to Java generics: since generics are
> > > > erased to Object at runtime, could that be what triggers the
> > > > exception? Flink itself has a string-to-container function,
> > > > STR_TO_MAP, whose implementation is written in Scala, so I am not
> > > > sure whether the exception is really caused by generics. If it is,
> > > > how should a Json2Array / Json2Map UDF be written in Java? Any
> > > > guidance is appreciated.
> > > >
> > > > The UDF code is as follows:
> > > >
> > > > public class Json2List extends ScalarFunction {
> > > >
> > > >    private static final Logger LOG = LoggerFactory.getLogger(Json2List.class);
> > > >
> > > >    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper()
> > > >       .configure(JsonParser.Feature.ALLOW_SINGLE_QUOTES, true)
> > > >       .configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
> > > >
> > > >    public Json2List(){}
> > > >
> > > >    public List<String> eval(String param) {
> > > >       List<String> result = new ArrayList<>();
> > > >       try {
> > > >          List<Map<String, Object>> list = OBJECT_MAPPER.readValue(param, List.class);
> > > >          for (Map<String, Object> map : list) {
> > > >             result.add(OBJECT_MAPPER.writeValueAsString(map));
> > > >          }
> > > >          return result;
> > > >       } catch (JsonProcessingException e) {
> > > >          LOG.error("failed to convert json to array, param is: {}", param, e);
> > > >       }
> > > >       return result;
> > > >    }
> > > >
> > > >    @Override
> > > >    public TypeInformation<List<String>> getResultType(Class[] signature) {
> > > >       return Types.LIST(Types.STRING);
> > > >    }
> > > > }
> > > >
> > > >
> > >
> >
>


-- 

Best,
Benchao Li


Re: Flink Mysql sink: sharding databases/tables by time

2020-08-05 Thread Leonard Xu
Hi

As I understand it, besides being able to specify the table name, the key point
is creating the tables automatically in the database. There is a related JDBC
issue I have been following [1], but the code has not been merged yet, so for
now there is no good solution. The Elasticsearch connector does support a
similar feature; if the data can live in ES you could give that a try.

Best,
Leonard
[1] https://issues.apache.org/jira/browse/FLINK-16294 



> On Aug 5, 2020, at 20:36, 张健 wrote:
> 
> 
> 
> Hi all:
> 
> 
> I would like to ask: does the Flink MySQL sink currently have any
> configuration option for sharding databases/tables by time?
> 
> 
> My requirement is that the source data goes through a simple ETL step and is
> written to MySQL for real-time business queries. The business only queries
> the current day's data (historical data is handled by a separate offline
> module), so I would like each day's data to be written to its own table
> (named like xx_MMDD). But the Flink JDBC sink currently requires a fixed
> table name. Is there a simple way to implement this, or a better approach
> for such a scenario?
> 
> 
> Thanks.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> --
> 
> 张健



Re: Re: Re: FLINK SQL view data reuse question

2020-08-05 Thread kandy.wang






@godfrey
The StatementSet submission approach you mentioned is not supported when
submitting jobs through sql-client, right? Could support for it be added?
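
For reference, a rough sketch of that StatementSet API in the 1.11 Table API
(untested; the sink tables and queries are just placeholders):

   TableEnvironment tEnv = TableEnvironment.create(
         EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

   StatementSet stmtSet = tEnv.createStatementSet();
   stmtSet.addInsertSql("INSERT INTO day_sink SELECT order_date, SUM(amount) FROM order_source GROUP BY order_date");
   stmtSet.addInsertSql("INSERT INTO hour_sink SELECT order_hour, SUM(amount) FROM order_source GROUP BY order_hour");

   // Print the optimized plan to see which parts (source, Deduplicate, ...) are shared.
   System.out.println(stmtSet.explain());

   // Submit both INSERTs as a single job so the common parts of the view can be reused.
   stmtSet.execute();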











On 2020-08-04 19:36:56, "godfrey he" wrote:
>Call StatementSet#explain() and print the result to check whether reuse
>fails because the Deduplicate nodes have different digests.
>
>kandy.wang wrote on Tue, Aug 4, 2020 at 6:21 PM:
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> @godfrey
>> Thanks. I just tried it: for source -> Deduplicate -> GlobalGroupAggregate,
>> the source is indeed reused, but the Deduplicate node is not. In theory both
>> source and Deduplicate belong to the view's logic and should be reused as
>> well; it feels like not enough is being reused.
>>
>>
>> On 2020-08-04 17:26:02, "godfrey he" wrote:
>> >The blink planner can optimize multi-sink queries so that common
>> >computation is reused as much as possible.
>> >In 1.11, submit the job with StatementSet; before 1.11, use
>> >sqlUpdate/insertInto + execute.
>> >
>> >kandy.wang wrote on Tue, Aug 4, 2020 at 5:20 PM:
>> >
>> >> A question about FLINK SQL views:
>> >> create view order_source
>> >>
>> >> as
>> >>
>> >> select order_id, order_goods_id, user_id,...
>> >>
>> >> from (
>> >>
>> >> ..  proctime,row_number() over(partition by order_id,
>> >> order_goods_id order by proctime desc) as rownum
>> >>
>> >> from hive.temp_dw.dm_trd_order_goods/*+ OPTIONS('
>> properties.group.id'='flink_etl_kafka_hbase',
>> >> 'scan.startup.mode'='latest-offset') */
>> >>
>> >> ) where  rownum = 1 and  price > 0;
>> >>
>> >>
>> >>
>> >>
>> >> insert into hive.temp_dw.day_order_index select rowkey, ROW(cast(saleN
>> as
>> >> BIGINT),)
>> >>
>> >> from
>> >>
>> >> (
>> >>
>> >> select order_date as rowkey,
>> >>
>> >> sum(amount) as saleN,
>> >>
>> >> from order_source
>> >>
>> >> group by order_date
>> >>
>> >> );
>> >>
>> >>
>> >>
>> >>
>> >> insert into hive.temp_dw.day_order_index select rowkey, ROW(cast(saleN
>> as
>> >> BIGINT))
>> >>
>> >> from
>> >>
>> >> (
>> >>
>> >> select order_hour as rowkey,sum(amount) as saleN,
>> >>
>> >>
>> >>
>> >> from order_source
>> >>
>> >> group by order_hour
>> >>
>> >> );
>> >> Question: the same view, the same consumer group, and different sinks
>> >> produce 2 jobs, which means the 2 jobs share one consumer group. The
>> >> resulting jobs are: a. order_source -> sink 1  b. order_source -> sink 2
>> >>
>> >>
>> >> The intent was to reuse the same full source data through the view
>> >> order_source (the view has to deduplicate the order data), so that the
>> >> same state can be reused underneath. How can this be achieved?
>> >>
>> >>
>>


Flink Mysql sink: sharding databases/tables by time

2020-08-05 Thread 张健



Hi all:


I would like to ask: does the Flink MySQL sink currently have any
configuration option for sharding databases/tables by time?


My requirement is that the source data goes through a simple ETL step and is
written to MySQL for real-time business queries. The business only queries the
current day's data (historical data is handled by a separate offline module),
so I would like each day's data to be written to its own table (named like
xx_MMDD). But the Flink JDBC sink currently requires a fixed table name. Is
there a simple way to implement this, or a better approach for such a scenario?


Thanks.










--

张健

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Till Rohrmann
You are right Yang Wang.

Thanks for creating this issue.

Cheers,
Till

On Wed, Aug 5, 2020 at 1:33 PM Yang Wang  wrote:

> Actually, the application status shows in YARN web UI is not determined by
> the jobmanager process exit code.
> Instead, we use "resourceManagerClient.unregisterApplicationMaster" to
> control the final status of YARN application.
> So although jobmanager exit with zero code, it still could show failed
> status in YARN web UI.
>
> I have created a ticket to track this improvement[1].
>
> [1]. https://issues.apache.org/jira/browse/FLINK-18828
>
>
> Best,
> Yang
>
>
> Till Rohrmann wrote on Wed, Aug 5, 2020 at 3:56 PM:
>
>> Yes for the other deployments it is not a problem. A reason why people
>> preferred non-zero exit codes in case of FAILED jobs is that this is easier
>> to monitor than having to take a look at the actual job result. Moreover,
>> in the YARN web UI the application shows as failed if I am not mistaken.
>> However, from a framework's perspective, a FAILED job does not mean that
>> Flink has failed and, hence, the return code could still be 0 in my opinion.
>>
>> Cheers,
>> Till
>>
>> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang  wrote:
>>
>>> Hi Eleanore,
>>>
>>> Yes, I suggest to use Job to replace Deployment. It could be used to run
>>> jobmanager one time and finish after a successful/failed completion.
>>>
>>> However, using Job still could not solve your problem completely. Just
>>> as Till said, When a job exhausts the restart strategy, the jobmanager
>>> pod will terminate with non-zero exit code. It will cause the K8s
>>> restarting it again. Even though we could set the resartPolicy and
>>> backoffLimit,
>>> this is not a clean and correct way to go. We should terminate the
>>> jobmanager process with zero exit code in such situation.
>>>
>>> @Till Rohrmann  I just have one concern. Is it a
>>> special case for K8s deployment? For standalone/Yarn/Mesos, it seems that
>>> terminating with
>>> non-zero exit code is harmless.
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> Eleanore Jin wrote on Tue, Aug 4, 2020 at 11:54 PM:
>>>
 Hi Yang & Till,

 Thanks for your prompt reply!

 Yang, regarding your question, I am actually not using k8s job, as I
 put my app.jar and its dependencies under flink's lib directory. I have 1
 k8s deployment for job manager, and 1 k8s deployment for task manager, and
 1 k8s service for job manager.

 As you mentioned above, if flink job is marked as failed, it will cause
 the job manager pod to be restarted. Which is not the ideal behavior.

 Do you suggest that I should change the deployment strategy from using
 k8s deployment to k8s job? In case the flink program exit with non-zero
 code (e.g. exhausted number of configured restart), pod can be marked as
 complete hence not restarting the job again?

 Thanks a lot!
 Eleanore

 On Tue, Aug 4, 2020 at 2:49 AM Yang Wang  wrote:

> @Till Rohrmann  In native mode, when a Flink
> application terminates with FAILED state, all the resources will be 
> cleaned
> up.
>
> However, in standalone mode, I agree with you that we need to rethink
> the exit code of Flink. When a job exhausts the restart
> strategy, we should terminate the pod and do not restart again. After
> googling, it seems that we could not specify the restartPolicy
> based on exit code[1]. So maybe we need to return a zero exit code to
> avoid restarting by K8s.
>
> [1].
> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>
> Best,
> Yang
>
> Till Rohrmann wrote on Tue, Aug 4, 2020 at 3:48 PM:
>
>> @Yang Wang  I believe that we should
>> rethink the exit codes of Flink. In general you want K8s to restart a
>> failed Flink process. Hence, an application which terminates in state
>> FAILED should not return a non-zero exit code because it is a valid
>> termination state.
>>
>> Cheers,
>> Till
>>
>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang 
>> wrote:
>>
>>> Hi Eleanore,
>>>
>>> I think you are using K8s resource "Job" to deploy the jobmanager.
>>> Please set .spec.template.spec.restartPolicy = "Never" and
>>> spec.backoffLimit = 0.
>>> Refer here[1] for more information.
>>>
>>> Then, when the jobmanager failed because of any reason, the K8s job
>>> will be marked failed. And K8s will not restart the job again.
>>>
>>> [1].
>>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> Eleanore Jin wrote on Tue, Aug 4, 2020 at 12:05 AM:
>>>
 Hi Till,

 Thanks for the reply!

 I manually deploy as per-job mode [1] and I am using Flink 1.8.2.
 Specifically, I build a custom docker image, which I copied the app jar
 (not uber jar) and 


Re: flink sql eos

2020-08-05 Thread Leonard Xu
Hi

> Currently only Kafka implements TwoPhaseCommitSinkFunction, and the Kafka
> DDL has no property for setting the Semantic to EXACTLY_ONCE

Besides Kafka, the filesystem connector also supports EXACTLY_ONCE; for the
Kafka DDL this is supported in 1.12 [1]


> When global EXACTLY_ONCE is enabled and all connectors in use support
> EXACTLY_ONCE, can the whole application achieve end-to-end exactly-once
> consistency?

Yes.
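
For illustration, the "global EXACTLY_ONCE" part in code (a minimal sketch;
the checkpoint interval is just a placeholder):

   StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
   // Exactly-once checkpoints are the prerequisite on the framework side.
   env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
   StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
   // Queries registered on tEnv now run with exactly-once checkpoints;
   // end-to-end exactly-once additionally requires every sink in use to
   // support it (e.g. the Kafka and filesystem connectors mentioned above).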

Best,
Leonard
[1] https://issues.apache.org/jira/browse/FLINK-15221 


Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Yang Wang
Actually, the application status shown in the YARN web UI is not determined by
the jobmanager process exit code.
Instead, we use "resourceManagerClient.unregisterApplicationMaster" to
control the final status of the YARN application.
So even though the jobmanager exits with a zero code, it could still show a
failed status in the YARN web UI.

I have created a ticket to track this improvement[1].

[1]. https://issues.apache.org/jira/browse/FLINK-18828


Best,
Yang


Till Rohrmann wrote on Wed, Aug 5, 2020 at 3:56 PM:

> Yes for the other deployments it is not a problem. A reason why people
> preferred non-zero exit codes in case of FAILED jobs is that this is easier
> to monitor than having to take a look at the actual job result. Moreover,
> in the YARN web UI the application shows as failed if I am not mistaken.
> However, from a framework's perspective, a FAILED job does not mean that
> Flink has failed and, hence, the return code could still be 0 in my opinion.
>
> Cheers,
> Till
>
> On Wed, Aug 5, 2020 at 9:30 AM Yang Wang  wrote:
>
>> Hi Eleanore,
>>
>> Yes, I suggest to use Job to replace Deployment. It could be used to run
>> jobmanager one time and finish after a successful/failed completion.
>>
>> However, using Job still could not solve your problem completely. Just as
>> Till said, When a job exhausts the restart strategy, the jobmanager
>> pod will terminate with non-zero exit code. It will cause the K8s
>> restarting it again. Even though we could set the resartPolicy and
>> backoffLimit,
>> this is not a clean and correct way to go. We should terminate the
>> jobmanager process with zero exit code in such situation.
>>
>> @Till Rohrmann  I just have one concern. Is it a
>> special case for K8s deployment? For standalone/Yarn/Mesos, it seems that
>> terminating with
>> non-zero exit code is harmless.
>>
>>
>> Best,
>> Yang
>>
>> Eleanore Jin wrote on Tue, Aug 4, 2020 at 11:54 PM:
>>
>>> Hi Yang & Till,
>>>
>>> Thanks for your prompt reply!
>>>
>>> Yang, regarding your question, I am actually not using k8s job, as I put
>>> my app.jar and its dependencies under flink's lib directory. I have 1 k8s
>>> deployment for job manager, and 1 k8s deployment for task manager, and 1
>>> k8s service for job manager.
>>>
>>> As you mentioned above, if flink job is marked as failed, it will cause
>>> the job manager pod to be restarted. Which is not the ideal behavior.
>>>
>>> Do you suggest that I should change the deployment strategy from using
>>> k8s deployment to k8s job? In case the flink program exit with non-zero
>>> code (e.g. exhausted number of configured restart), pod can be marked as
>>> complete hence not restarting the job again?
>>>
>>> Thanks a lot!
>>> Eleanore
>>>
>>> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang  wrote:
>>>
 @Till Rohrmann  In native mode, when a Flink
 application terminates with FAILED state, all the resources will be cleaned
 up.

 However, in standalone mode, I agree with you that we need to rethink
 the exit code of Flink. When a job exhausts the restart
 strategy, we should terminate the pod and do not restart again. After
 googling, it seems that we could not specify the restartPolicy
 based on exit code[1]. So maybe we need to return a zero exit code to
 avoid restarting by K8s.

 [1].
 https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code

 Best,
 Yang

 Till Rohrmann wrote on Tue, Aug 4, 2020 at 3:48 PM:

> @Yang Wang  I believe that we should
> rethink the exit codes of Flink. In general you want K8s to restart a
> failed Flink process. Hence, an application which terminates in state
> FAILED should not return a non-zero exit code because it is a valid
> termination state.
>
> Cheers,
> Till
>
> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang 
> wrote:
>
>> Hi Eleanore,
>>
>> I think you are using K8s resource "Job" to deploy the jobmanager.
>> Please set .spec.template.spec.restartPolicy = "Never" and
>> spec.backoffLimit = 0.
>> Refer here[1] for more information.
>>
>> Then, when the jobmanager failed because of any reason, the K8s job
>> will be marked failed. And K8s will not restart the job again.
>>
>> [1].
>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>
>>
>> Best,
>> Yang
>>
>> Eleanore Jin wrote on Tue, Aug 4, 2020 at 12:05 AM:
>>
>>> Hi Till,
>>>
>>> Thanks for the reply!
>>>
>>> I manually deploy as per-job mode [1] and I am using Flink 1.8.2.
>>> Specifically, I build a custom docker image, which I copied the app jar
>>> (not uber jar) and all its dependencies under /flink/lib.
>>>
>>> So my question is more like, in this case, if the job is marked as
>>> FAILED, which causes k8s to restart the pod, this seems not help at all,
>>> what are the suggestions for such scenario?
>>>
>>> 

Re: Two Queries and a Kafka Topic

2020-08-05 Thread Theo Diefenthal
Hi Marco, 

In general, I see three solutions you could take here: 

1. Use the State Processor API: You can run a program with the State Processor 
API that loads the data from JDBC and stores it into a Flink savepoint. 
Afterwards, you start your streaming job from that savepoint, which will load 
its state and therefore already contain all the data from JDBC. 
2. Load on the master, distribute with the job: When you build up your job 
graph, you could execute the JDBC queries and put the result into some 
Serializable class which you then plug into an operator in your stream (e.g. a 
map stage). The class, along with all the queried data, will be serialized and 
deserialized on the taskmanagers. (Usually I use this for configuration 
parameters, but it might be ok in this case as well.) 
3. Load on the TaskManager: In your map function, when the very first event is 
received, you can block processing and synchronously load the data from JDBC 
(so each TaskManager performs the JDBC query itself). You then keep the data to 
be used for all subsequent map calls; a sketch follows below. 

I think option 3 is the easiest to implement, while option 1 might be the most 
elegant way, in my opinion. 
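
To make option 3 concrete, a rough, untested sketch of a RichMapFunction that 
loads the reference data once per task in open() and reuses it for every record 
(the JDBC URL, credentials, query and column names are placeholders): 

   public class ReferenceDataMapper extends RichMapFunction<String, String> {

      private transient Map<String, String> referenceData;

      @Override
      public void open(Configuration parameters) throws Exception {
         // Each parallel task loads the reference data once, before processing any events.
         referenceData = new HashMap<>();
         try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://host:5432/db", "user", "password");
              Statement stmt = conn.createStatement();
              ResultSet rs = stmt.executeQuery("SELECT k, v FROM reference_table")) {
            while (rs.next()) {
               referenceData.put(rs.getString("k"), rs.getString("v"));
            }
         }
      }

      @Override
      public String map(String value) {
         // Use the pre-loaded reference data in the business logic; a plain lookup here.
         return referenceData.getOrDefault(value, value);
      }
   }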

Best regards 
Theo 


Von: "Marco Villalobos"  
An: "Leonard Xu"  
CC: "user"  
Gesendet: Mittwoch, 5. August 2020 04:33:23 
Betreff: Re: Two Queries and a Kafka Topic 

Hi Leonard, 

First, Thank you. 

I am currently trying to restrict my solution to Apache Flink 1.10 because it's 
the current version supported by Amazon EMR. 
I am not ready to change our operational environment to solve this. 

Second, I am using the DataStream API. The Kafka Topic is not in a table, it is 
in a DataStream. 

The SQL queries are literally from a PostgresSQL database, and only need to be 
run exactly once in the lifetime of the job. 

I am struggling to determine where this happens. 

JDBCInputFormat seems to query the SQL table repetitively, and also connecting 
streams and aggregating into one object is very complicated. 

Thus, I am wondering what is the right approach. 

Let me restate the parameters. 

SQL Query One = data in PostgreSQL (200K records) that is used for business 
logic. 
SQL Query Two = data in PostgreSQL (1000 records) that is used for business 
logic. 
Kafka Topic One = unlimited data-stream that uses the data-stream api and 
queries above to write into multiple sinks 

ASCII Diagram: 

[ SQL Query One ] ---> [Aggregate to Map] 

Kafka ---> [Kafka Topic One] ---> [Keyed Process Function (Query One Map, Query Two Map)] ---> [Multiple Sinks] 

[ SQL Query Two ] ---> [Aggregate to Map] 


Maybe my graph above helps. You see, I need Query One and Query Two only ever 
execute once. After that the information they provide are used to correctly 
process the Kafka Topic. 

I'll take a deep further to try and understand what you said, thank you, but 
JDBCInputFormat seem to repetitively query the database. Maybe I need to write 
a RichFunction or AsyncIO function and cache the results in state after that. 






On Aug 4, 2020, at 6:25 PM, Leonard Xu <xbjt...@gmail.com> wrote: 

Hi, Marco 



If I need SQL Query One and SQL Query Two to happen just one time, 



Looks like you want to reuse this kafka table in one job, It’s supported to 
execute multiple query in one sql job in Flink 1.11. 
You can use `StatementSet`[1] to add SQL Query one and SQL query Two in a 
single SQL job[1]. 


Best 
Leonard 
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/insert.html#run-an-insert-statement 




On Aug 5, 2020, at 04:34, Marco Villalobos <mvillalo...@kineteque.com> wrote: 

Lets say that I have: 

SQL Query One from data in PostgreSQL (200K records). 
SQL Query Two from data in PostgreSQL (1000 records). 
and Kafka Topic One. 

Let's also say that main data from this Flink job arrives in Kafka Topic One. 

If I need SQL Query One and SQL Query Two to happen just one time, when the job 
starts up, and afterwards maybe store it in Keyed State or Broadcast State, but 
it's not really part of the stream, then what is the best practice for 
supporting that in Flink 

The Flink job needs to stream data from Kafka Topic One, aggregate it, and 
perform computations that require all of the data in SQL Query One and SQL 
Query Two to perform its business logic. 

I am using Flink 1.10. 

I supposed to query the database before the Job I submitted, and then pass it 
on as parameters to a function? 
Or am I supposed to use JDBCInputFormat for both queries and create two 
streams, and somehow connect or broadcast both of them two the main stream that 
uses Kafka Topic One? 

I would appreciate guidance. Please. Thank you. 

Sincerely, 

Marco A. Villalobos 




Re: The bytecode of the class does not match the source code

2020-08-05 Thread Chesnay Schepler
Well of course these differ; on the left you have the decompiled 
bytecode, on the right the original source.


If these were the same you wouldn't need source jars.

On 05/08/2020 12:20, 魏子涵 wrote:
I'm sure the two versions match up. Following is the pic comparing 
codes in IDEA

https://img-blog.csdnimg.cn/20200805180232929.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0NTRE5yaG1t,size_16,color_FF,t_70






At 2020-08-05 16:46:11, "Chesnay Schepler"  wrote:

Please make sure you have loaded the correct source jar, and
aren't by chance still using the 1.11.0 source jar.

On 05/08/2020 09:57, 魏子涵 wrote:

Hi, everyone:
      I found  the
【org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl】 class
in【flink-runtime_2.11-1.11.1.jar】does not match the source code.
Is it a problem we need to fix(if it is, what should we do)? or
just let it go?










Re: The bytecode of the class does not match the source code

2020-08-05 Thread Jake

hi 魏子涵

The code IDEA decompiles does not match the Java source code; you can download 
the Java source code in IDEA.

/Volumes/work/maven_repository/org/apache/flink/flink-runtime_2.11/1.10.1/flink-runtime_2.11-1.10.1-sources.jar!/org/apache/flink/runtime/jobmaster/slotpool/SlotPoolImpl.java

Jake


> On Aug 5, 2020, at 6:20 PM, 魏子涵  wrote:
> 
> I'm sure the two versions match up. Following is the pic comparing codes in 
> IDEA
> https://img-blog.csdnimg.cn/20200805180232929.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0NTRE5yaG1t,size_16,color_FF,t_70
> 
> 
> 
> 
> 
> At 2020-08-05 16:46:11, "Chesnay Schepler"  wrote:
> 
> Please make sure you have loaded the correct source jar, and aren't by chance 
> still using the 1.11.0 source jar.
> 
> On 05/08/2020 09:57, 魏子涵 wrote:
>> Hi, everyone:
>>   I found  the 
>> 【org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl】 class 
>> in【flink-runtime_2.11-1.11.1.jar】does not match the source code. Is it a 
>> problem we need to fix(if it is, what should we do)? or just let it go?
>> 
>> 
>>  
> 
> 
> 
>  



?????? flink-1.11 ????????

2020-08-05 Thread kcz
 







Re:Re: The bytecode of the class does not match the source code

2020-08-05 Thread 魏子涵
I'm sure the two versions match up. Following is the pic comparing codes in IDEA
https://img-blog.csdnimg.cn/20200805180232929.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0NTRE5yaG1t,size_16,color_FF,t_70
















At 2020-08-05 16:46:11, "Chesnay Schepler"  wrote:

Please make sure you have loaded the correct source jar, and aren't by chance 
still using the 1.11.0 source jar.



On 05/08/2020 09:57, 魏子涵 wrote:

Hi, everyone:
  I found  the 【org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl】 
class in【flink-runtime_2.11-1.11.1.jar】does not match the source code. Is it a 
problem we need to fix(if it is, what should we do)? or just let it go?




 




Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-05 Thread Wei Zhong
Hi Xingbo,

Thanks for your information. 

I think the PySpark documentation redesign deserves our attention. It 
seems that the Spark community has also begun to take the user experience of 
its Python documentation more seriously. We can keep following the 
discussion and progress of that redesign in the Spark community. It is so 
similar to our own work that there should be some ideas worth borrowing.

Best,
Wei


> On Aug 5, 2020, at 15:02, Xingbo Huang wrote:
> 
> Hi,
> 
> I found that the spark community is also working on redesigning pyspark 
> documentation[1] recently. Maybe we can compare the difference between our 
> document structure and its document structure.
> 
> [1] https://issues.apache.org/jira/browse/SPARK-31851 
> 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html
>  
> 
> 
> Best,
> Xingbo
> 
> David Anderson <da...@alpinegizmo.com> wrote on Wed, Aug 5, 2020 at 3:17 AM:
> I'm delighted to see energy going into improving the documentation.
> 
> With the current documentation, I get a lot of questions that I believe 
> reflect two fundamental problems with what we currently provide:
> 
> (1) We have a lot of contextual information in our heads about how Flink 
> works, and we are able to use that knowledge to make reasonable inferences 
> about how things (probably) work in cases we aren't so familiar with. For 
> example, I get a lot of questions of the form "If I use  will I 
> still have exactly once guarantees?" The answer is always yes, but they 
> continue to have doubts because we have failed to clearly communicate this 
> fundamental, underlying principle. 
> 
> This specific example about fault tolerance applies across all of the Flink 
> docs, but the general idea can also be applied to the Table/SQL and PyFlink 
> docs. The guiding principles underlying these APIs should be written down in 
> one easy-to-find place. 
> 
> (2) The other kind of question I get a lot is "Can I do  with ?" E.g., 
> "Can I use the JDBC table sink from PyFlink?" These questions can be very 
> difficult to answer because it is frequently the case that one has to reason 
> about why a given feature doesn't seem to appear in the documentation. It 
> could be that I'm looking in the wrong place, or it could be that someone 
> forgot to document something, or it could be that it can in fact be done by 
> applying a general mechanism in a specific way that I haven't thought of -- 
> as in this case, where one can use a JDBC sink from Python if one thinks to 
> use DDL. 
> 
> So I think it would be helpful to be explicit about both what is, and what is 
> not, supported in PyFlink. And to have some very clear organizing principles 
> in the documentation so that users can quickly learn where to look for 
> specific facts.
> 
> Regards,
> David
> 
> 
> On Tue, Aug 4, 2020 at 1:01 PM jincheng sun  > wrote:
> Hi Seth and David,
> 
> I'm very happy to have your reply and suggestions. I would like to share my 
> thoughts here:
> 
> The main motivation we want to refactor the PyFlink doc is that we want to 
> make sure that the Python users could find all they want starting from the 
> PyFlink documentation mainpage. That’s, the PyFlink documentation should have 
> a catalogue which includes all the functionalities available in PyFlink. 
> However, this doesn’t mean that we will make a copy of the content of the 
> documentation in the other places. It may be just a reference/link to the 
> other documentation if needed. For the documentation added under PyFlink 
> mainpage, the principle is that it should only include Python specific 
> content, instead of making a copy of the Java content.
> 
> >>  I'm concerned that this proposal duplicates a lot of content that will 
> >> quickly get out of sync. It feels like it is documenting PyFlink 
> >> separately from the rest of the project.
> 
> Regarding the concerns about maintainability, as mentioned above, The goal of 
> this FLIP is to provide an intelligible entrance of Python API, and the 
> content in it should only contain the information which is useful for Python 
> users. There are indeed many agenda items that duplicate the Java documents 
> in this FLIP, but it doesn't mean the content would be copied from Java 
> documentation. i.e, if the content of the document is the same as the 
> corresponding Java document, we will add a link to the Java document. e.g. 
> the "Built-in functions" and "SQL". We only create a page for the Python-only 
> content, and then redirect to the Java document if there is something shared 
> with Java. e.g. "Connectors" and "Catalogs". If the document is Python-only 
> and already exists, we will move it from the old python 

Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations

2020-08-05 Thread Congxian Qiu
Hi
  The RocksDB state backend only needs a bit of configuration in flink-conf [1].

  Also, from your previous two emails some details are still unclear to me. Could you paste the flink-conf you are
currently using, a screenshot of the checkpoint UI, and a screenshot of the checkpoint directory on HDFS?

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E8%AE%BE%E7%BD%AE-state-backend

Best,
Congxian
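
For readers following the thread, a minimal sketch of trying the RocksDB state backend with incremental checkpoints
programmatically; it assumes the flink-statebackend-rocksdb dependency is on the classpath, the HDFS path and the
5-minute interval are placeholders, and the same effect can be configured cluster-wide in flink-conf via keys such as
state.backend and state.backend.incremental:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Second constructor argument enables incremental checkpoints; the HDFS path is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///checkpoints-data/my-job/", true));

        // A 5-minute checkpoint interval, purely illustrative.
        env.enableCheckpointing(5 * 60 * 1000);

        // ... build and execute the job ...
    }
}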


op <520075...@qq.com> wrote on Wed, Aug 5, 2020 at 4:03 PM:

> Hi, the TTL configuration is
> val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
> val tableEnv = StreamTableEnvironment.create(bsEnv, settings)
> val tConfig = tableEnv.getConfig
> tConfig.setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1450))
>
>
>   1) Currently 3 jobs all show this behavior
>   2) The cluster does not have a RocksDB environment at the moment
> Thanks
> ------------------ Original ------------------
> From: "user-zh" <qcx978132...@gmail.com>
> Sent: Wednesday, Aug 5, 2020, 3:30 PM
> To: "user-zh"
> Subject: Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations
>
>
>
> Hi op
>  This situation is rather strange. I would like to confirm:
>  1) Do all of your jobs see the checkpoint size growing continuously, or is it only this type of job?
>  2) Have you tried the RocksDBStateBackend (full and incremental)? What were the results?
>
>  Also, how are the other TTL-related options configured?
>
> In principle, a checkpoint is just a snapshot of the state, so if the checkpoints keep getting larger, there is more and more state.
> Best,
> Congxian
>
>
> op <520075...@qq.com> wrote on Wed, Aug 5, 2020 at 2:46 PM:
>
>> Hi, I am using the FsStateBackend. Increasing the interval to 5 minutes makes no difference; each checkpoint takes about 300ms,
>> and our business data volume is roughly the same every day.
>> The idle-state retention time is set to 1440 minutes, so in theory the state size should level off after one day of running,
>> but the job has now been running for 5 days and the size of the checkpoint shared directory keeps increasing.
>> I have also confirmed that the group-by keys only appear on the day they are processed, i.e. that day's state becomes
>> idle once the day is over, so 5 days of running should satisfy the cleanup condition.
>>
>>
>> ------------------ Original ------------------
>> From: "user-zh" <qcx978132...@gmail.com>
>> Sent: Monday, Aug 3, 2020, 5:50 PM
>> To: "user-zh"
>> Subject: Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations
>>
>>
>>
>> Hi
>>   Can you increase the checkpoint interval a bit and see whether it stays stable? Judging from the size of the shared
>> directory, it grows at first and then stays roughly flat. Note that "Checkpointed Data Size" is the incremental size [1],
>> not the size of the whole checkpoint; if a lot of data changes between checkpoints, this value becomes larger.
>>
>> [1]
>>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E5%A2%9E%E9%87%8F%E5%BF%AB%E7%85%A7
>>
>> Best,
>> Congxian
>>
>>
>> op <520075...@qq.com> wrote on Mon, Aug 3, 2020 at 2:18 PM:
>>
>>> Same question here. I am also seeing the state grow continuously, on 1.11.0, with checkpoints stored on HDFS and a
>>> 3-minute checkpoint interval. The logic groups by event day and id with a dozen or so aggregation metrics.
>>> After about 7 days of running the state keeps increasing, even though an idle-state retention time is set and the
>>> watermark is advancing normally:
>>> tConfig.setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1440+10))
>>>
>>> ------------------ Original ------------------
>>> From: "user-zh" <384939...@qq.com>
>>> Sent: Monday, Aug 3, 2020, 1:50 PM
>>> To: "user-zh"
>>> Subject: Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations
>>>
>>>
>>>
>>> hi:
>>> I switched back to incremental mode and collected some more data:
>>> 1. Processing speed: 3000 records per second, in a test environment with a stable, almost fluctuation-free load
>>> 2. The checkpoint interval is set to 5 seconds
>>> 3. The job currently uses a one-minute window
>>> 4. Parallelism is 1, running in on-yarn mode
>>>
>>> Right after startup:
>>> <http://apache-flink.147419.n8.nabble.com/file/t793/6.png>
>>>
>>> 18 minutes later:
>>> <http://apache-flink.147419.n8.nabble.com/file/t793/9.png>
>>>
>>> Checkpoint settings:
>>> <http://apache-flink.147419.n8.nabble.com/file/t793/conf.png>
>>>
>>> Size on HDFS:
>>> <http://apache-flink.147419.n8.nabble.com/file/t793/hdfs.png>
>>>
>>> Size shown on the web UI:
>>> <http://apache-flink.147419.n8.nabble.com/file/t793/checkpoinsts1.png>
>>>
>>>
>>> Congxian Qiu wrote
>>>> Hi 鱼子酱
>>>>    In the incremental-checkpoint mode, could you post a screenshot of how the checkpoint size evolves? If possible,
>>>> please also ls the size of the checkpoint directory on HDFS after each checkpoint completes.
>>>>    One more question still needs an answer: roughly what is your processing rate, and can you estimate how often the state is updated?
>>>>
>>>> Best,
>>>> Congxian
>>>>
>>>>
>>>> 鱼子酱 <
>>>> 384939718@
>>>>> wrote on Thu, Jul 30, 2020 at 10:43 AM:
>>>

Re: The bytecode of the class does not match the source code

2020-08-05 Thread Chesnay Schepler
Please make sure you have loaded the correct source jar, and aren't by 
chance still using the 1.11.0 source jar.


On 05/08/2020 09:57, 魏子涵 wrote:

Hi, everyone:
      I found  the 
【org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl】 class 
in【flink-runtime_2.11-1.11.1.jar】does not match the source code. Is it 
a problem we need to fix(if it is, what should we do)? or just let it go?







Re:Re: Problem writing to Hive

2020-08-05 Thread air23


Hi, thanks. Removing the version number does indeed work. The version I had configured is the same as the Hive version I installed, so I am not sure what caused the problem.


On 2020-08-05 15:59:06, "wldd" wrote:
>hi:
>1. Check whether the Hive version you set when configuring the hive catalog matches the Hive version you are actually using.
>2. You can also try not setting the Hive version when configuring the hive catalog.
>
>
>
>
>
>
>
>
>
>
>
>
>
>--
>
>Best,
>wldd
>
>
>
>
>
>On 2020-08-05 15:38:26, "air23" wrote:
>>Hi,
>>15:33:59,781 INFO  org.apache.flink.table.catalog.hive.HiveCatalog
>>- Created HiveCatalog 'myhive1'
>>Exception in thread "main" 
>>org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create 
>>Hive Metastore client
>>at 
>>org.apache.flink.table.catalog.hive.client.HiveShimV120.getHiveMetastoreClient(HiveShimV120.java:58)
>>at 
>>org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.createMetastoreClient(HiveMetastoreClientWrapper.java:240)
>>at 
>>org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.(HiveMetastoreClientWrapper.java:71)
>>at 
>>org.apache.flink.table.catalog.hive.client.HiveMetastoreClientFactory.create(HiveMetastoreClientFactory.java:35)
>>at org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:223)
>>at 
>>org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:191)
>>at 
>>org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:337)
>>at com.zt.kafka.KafkaTest4.main(KafkaTest4.java:73)
>>Caused by: java.lang.NoSuchMethodException: 
>>org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(org.apache.hadoop.hive.conf.HiveConf)
>>at java.lang.Class.getMethod(Class.java:1786)
>>at 
>>org.apache.flink.table.catalog.hive.client.HiveShimV120.getHiveMetastoreClient(HiveShimV120.java:54)
>>... 7 mor
>>
>>
>>
>>
>>What is causing this problem? The Metastore has already been started.
>>Thanks


flink sql eos

2020-08-05 Thread sllence
Hi everyone,

   Is it the case that Flink SQL currently has no way to enable global end-to-end exactly-once semantics (EOS)?

At the moment only Kafka implements TwoPhaseCommitSinkFunction, and the Kafka DDL has no property for setting
Semantic to EXACTLY_ONCE.

Could we support more transactional connectors, support enabling global end-to-end consistency at the Flink SQL
level, and verify for each connector whether it supports EXACTLY_ONCE?

If global EXACTLY_ONCE is enabled and every connector in use supports EXACTLY_ONCE, can the whole application
then achieve end-to-end exactly-once semantics?
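
For reference, the contract such a transactional connector would have to fulfil is Flink's TwoPhaseCommitSinkFunction.
A minimal sketch follows; MyTxn and the commented-out external-system calls are placeholders, not a real connector:

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// Placeholder for a handle to an open transaction in the external system.
class MyTxn {
}

public class MyTransactionalSink extends TwoPhaseCommitSinkFunction<String, MyTxn, Void> {

    public MyTransactionalSink() {
        super(new KryoSerializer<>(MyTxn.class, new ExecutionConfig()), VoidSerializer.INSTANCE);
    }

    @Override
    protected MyTxn beginTransaction() throws Exception {
        return new MyTxn(); // open a transaction in the external system
    }

    @Override
    protected void invoke(MyTxn txn, String value, Context context) throws Exception {
        // write the record inside the open transaction
    }

    @Override
    protected void preCommit(MyTxn txn) throws Exception {
        // flush; called while the checkpoint barrier is being processed
    }

    @Override
    protected void commit(MyTxn txn) {
        // commit the transaction once the checkpoint has completed
    }

    @Override
    protected void abort(MyTxn txn) {
        // roll the transaction back on failure or restore
    }
}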



Re: Kafka transaction error lead to data loss under end to end exact-once

2020-08-05 Thread Khachatryan Roman
Hi Lu,

AFAIK, it's not going to be fixed. As you mentioned in the first email,
Kafka should be configured so that its transaction timeout is less than
your max checkpoint duration.

However, you should not only change transaction.timeout.ms in the producer but
also transaction.max.timeout.ms on your brokers.
Please refer to
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/connectors/kafka.html#caveats

Regards,
Roman
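
To make the knobs concrete, here is a minimal sketch of an exactly-once producer setup; the topic name, bootstrap
servers and the 15-minute timeout are illustrative, the exact constructor overload and helper-class packages can
differ slightly between Flink versions, and the broker-side transaction.max.timeout.ms must be at least as large:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceKafkaSink {
    public static FlinkKafkaProducer<String> create() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");            // placeholder
        // Must be <= transaction.max.timeout.ms configured on the brokers.
        props.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));

        return new FlinkKafkaProducer<>(
                "my-topic",                                               // placeholder topic
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}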


On Wed, Aug 5, 2020 at 12:24 AM Lu Niu  wrote:

> Hi, Khachatryan
>
> Thank you for the reply. Is that a problem that can be fixed? If so, is
> the fix on roadmap? Thanks!
>
> Best
> Lu
>
> On Tue, Aug 4, 2020 at 1:24 PM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
>> Hi Lu,
>>
>> Yes, this error indicates data loss (unless there were no records in the
>> transactions).
>>
>> Regards,
>> Roman
>>
>>
>> On Mon, Aug 3, 2020 at 9:14 PM Lu Niu  wrote:
>>
>>> Hi,
>>>
>>> We are using end to end exact-once flink + kafka and
>>> encountered belowing exception which usually came after checkpoint failures:
>>> ```
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Caused by: org.apache.kafka.common.errors.ProducerFencedException:
>>> Producer attempted an operation with an old epoch. Either there is a newer
>>> producer with the same transactionalId, or the producer's transaction has
>>> been expired by the broker.2020-07-28 16:27:51,633 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph- Job xxx
>>> (f08fc4b1edceb3705e2cb134a8ece73d) switched from state RUNNING to
>>> FAILING.java.lang.RuntimeException: Error while confirming checkpoint at
>>> org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1219) at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266) at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)Caused by:
>>> org.apache.flink.util.FlinkRuntimeException: Committing one of transactions
>>> failed, logging first encountered failure at
>>> org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.notifyCheckpointComplete(TwoPhaseCommitSinkFunction.java:295)
>>> at
>>> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
>>> at
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:842)
>>> at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1214) ... 5
>>> more*
>>> ```
>>> We did some end to end tests and noticed whenever such a thing happens,
>>> there will be a data loss.
>>>
>>> Referring to several related questions, I understand I need to increase `
>>> transaction.timeout.ms`  because:
>>> ```
>>> *Semantic.EXACTLY_ONCE mode relies on the ability to commit transactions
>>> that were started before taking a checkpoint, after recovering from the
>>> said checkpoint. If the time between Flink application crash and completed
>>> restart is larger than Kafka’s transaction timeout there will be data loss
>>> (Kafka will automatically abort transactions that exceeded timeout time).*
>>> ```
>>>
>>> But I want to confirm with the community that:
>>> *Does an exception like this will always lead to data loss? *
>>>
>>> I asked because we get this exception sometimes even when the checkpoint
>>> succeeds.
>>>
>>> Setup:
>>> Flink 1.9.1
>>>
>>> Best
>>> Lu
>>>
>>


Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations

2020-08-05 Thread op
Hi, the TTL configuration is
val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(bsEnv, settings)
val tConfig = tableEnv.getConfig
tConfig.setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1450))


  1) Currently 3 jobs all show this behavior
  2) The cluster does not have a RocksDB environment at the moment
Thanks

------------------ Original ------------------
From: "user-zh" <qcx978132...@gmail.com>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E5%A2%9E%E9%87%8F%E5%BF%AB%E7%85%A7
> Best,
> Congxian
>
> op <520075...@qq.com> wrote on Mon, Aug 3, 2020 at 2:18 PM:


Re: Problem writing to Hive

2020-08-05 Thread wldd
hi:
1. Check whether the Hive version you set when configuring the hive catalog matches the Hive version you are actually using.
2. You can also try not setting the Hive version when configuring the hive catalog (a sketch follows below).
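
A minimal sketch of suggestion 2, assuming a HiveCatalog constructor without the version argument is available in the
Flink release in use; the catalog name, database and conf directory below are placeholders:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogWithoutVersion {
    public static void main(String[] args) {
        EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        // No hiveVersion argument: the catalog detects the Hive version from the
        // hive-exec jar on the classpath (all names and paths here are placeholders).
        HiveCatalog hive = new HiveCatalog("myhive1", "default", "/etc/hive/conf");

        tableEnv.registerCatalog("myhive1", hive);
        tableEnv.useCatalog("myhive1");
    }
}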













--

Best,
wldd





On 2020-08-05 15:38:26, "air23" wrote:
>Hi,
>15:33:59,781 INFO  org.apache.flink.table.catalog.hive.HiveCatalog 
>   - Created HiveCatalog 'myhive1'
>Exception in thread "main" 
>org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create 
>Hive Metastore client
>at 
>org.apache.flink.table.catalog.hive.client.HiveShimV120.getHiveMetastoreClient(HiveShimV120.java:58)
>at 
>org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.createMetastoreClient(HiveMetastoreClientWrapper.java:240)
>at 
>org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.(HiveMetastoreClientWrapper.java:71)
>at 
>org.apache.flink.table.catalog.hive.client.HiveMetastoreClientFactory.create(HiveMetastoreClientFactory.java:35)
>at org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:223)
>at 
>org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:191)
>at 
>org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:337)
>at com.zt.kafka.KafkaTest4.main(KafkaTest4.java:73)
>Caused by: java.lang.NoSuchMethodException: 
>org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(org.apache.hadoop.hive.conf.HiveConf)
>at java.lang.Class.getMethod(Class.java:1786)
>at 
>org.apache.flink.table.catalog.hive.client.HiveShimV120.getHiveMetastoreClient(HiveShimV120.java:54)
>... 7 mor
>
>
>
>
>What is causing this problem? The Metastore has already been started.
>Thanks


The bytecode of the class does not match the source code

2020-08-05 Thread 魏子涵
Hi, everyone:
  I found that the 【org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl】 
class in 【flink-runtime_2.11-1.11.1.jar】 does not match the source code. Is it a 
problem we need to fix (if it is, what should we do), or should we just let it go?

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Till Rohrmann
Yes, for the other deployments it is not a problem. A reason why people
preferred non-zero exit codes in case of FAILED jobs is that this is easier
to monitor than having to take a look at the actual job result. Moreover,
in the YARN web UI the application shows as failed if I am not mistaken.
However, from a framework's perspective, a FAILED job does not mean that
Flink has failed and, hence, the return code could still be 0 in my opinion.

Cheers,
Till
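
To make the exit-code discussion concrete, here is an illustrative mapping from a terminal job status to a process
exit code following the principle described above; it is not Flink's actual ApplicationStatus or its real exit-code
values, only a sketch of the proposed behaviour:

public class ExitCodes {
    // Illustrative only: status names mirror the discussion, the mapping is hypothetical.
    enum ApplicationStatus { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    static int processExitCode(ApplicationStatus status) {
        switch (status) {
            case SUCCEEDED:
            case FAILED:    // a FAILED job is still a valid termination of the framework
            case CANCELED:
                return 0;   // zero exit code, so K8s does not restart the pod
            default:
                return 1;   // only genuine framework failures produce a non-zero code
        }
    }
}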

On Wed, Aug 5, 2020 at 9:30 AM Yang Wang  wrote:

> Hi Eleanore,
>
> Yes, I suggest to use Job to replace Deployment. It could be used to run
> jobmanager one time and finish after a successful/failed completion.
>
> However, using Job still could not solve your problem completely. Just as
> Till said, When a job exhausts the restart strategy, the jobmanager
> pod will terminate with non-zero exit code. It will cause the K8s
> restarting it again. Even though we could set the restartPolicy and
> backoffLimit,
> this is not a clean and correct way to go. We should terminate the
> jobmanager process with zero exit code in such situation.
>
> @Till Rohrmann  I just have one concern. Is it a
> special case for K8s deployment? For standalone/Yarn/Mesos, it seems that
> terminating with
> non-zero exit code is harmless.
>
>
> Best,
> Yang
>
> Eleanore Jin wrote on Tue, Aug 4, 2020 at 11:54 PM:
>
>> Hi Yang & Till,
>>
>> Thanks for your prompt reply!
>>
>> Yang, regarding your question, I am actually not using k8s job, as I put
>> my app.jar and its dependencies under flink's lib directory. I have 1 k8s
>> deployment for job manager, and 1 k8s deployment for task manager, and 1
>> k8s service for job manager.
>>
>> As you mentioned above, if flink job is marked as failed, it will cause
>> the job manager pod to be restarted. Which is not the ideal behavior.
>>
>> Do you suggest that I should change the deployment strategy from using
>> k8s deployment to k8s job? In case the flink program exit with non-zero
>> code (e.g. exhausted number of configured restart), pod can be marked as
>> complete hence not restarting the job again?
>>
>> Thanks a lot!
>> Eleanore
>>
>> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang  wrote:
>>
>>> @Till Rohrmann  In native mode, when a Flink
>>> application terminates with FAILED state, all the resources will be cleaned
>>> up.
>>>
>>> However, in standalone mode, I agree with you that we need to rethink
>>> the exit code of Flink. When a job exhausts the restart
>>> strategy, we should terminate the pod and do not restart again. After
>>> googling, it seems that we could not specify the restartPolicy
>>> based on exit code[1]. So maybe we need to return a zero exit code to
>>> avoid restarting by K8s.
>>>
>>> [1].
>>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>>
>>> Best,
>>> Yang
>>>
> Till Rohrmann wrote on Tue, Aug 4, 2020 at 3:48 PM:
>>>
 @Yang Wang  I believe that we should
 rethink the exit codes of Flink. In general you want K8s to restart a
 failed Flink process. Hence, an application which terminates in state
 FAILED should not return a non-zero exit code because it is a valid
 termination state.

 Cheers,
 Till

 On Tue, Aug 4, 2020 at 8:55 AM Yang Wang  wrote:

> Hi Eleanore,
>
> I think you are using K8s resource "Job" to deploy the jobmanager.
> Please set .spec.template.spec.restartPolicy = "Never" and
> spec.backoffLimit = 0.
> Refer here[1] for more information.
>
> Then, when the jobmanager failed because of any reason, the K8s job
> will be marked failed. And K8s will not restart the job again.
>
> [1].
> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>
>
> Best,
> Yang
>
> Eleanore Jin wrote on Tue, Aug 4, 2020 at 12:05 AM:
>
>> Hi Till,
>>
>> Thanks for the reply!
>>
>> I manually deploy as per-job mode [1] and I am using Flink 1.8.2.
>> Specifically, I build a custom docker image, which I copied the app jar
>> (not uber jar) and all its dependencies under /flink/lib.
>>
>> So my question is more like, in this case, if the job is marked as
>> FAILED, which causes k8s to restart the pod, this seems not help at all,
>> what are the suggestions for such scenario?
>>
>> Thanks a lot!
>> Eleanore
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>
>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann 
>> wrote:
>>
>>> Hi Eleanore,
>>>
>>> how are you deploying Flink exactly? Are you using the application
>>> mode with native K8s support to deploy a cluster [1] or are you manually
>>> deploying a per-job mode [2]?
>>>
>>> I believe the problem might be that we terminate the Flink process
>>> with a non-zero exit code if the job reaches the 

Problem writing to Hive

2020-08-05 Thread air23
Hi,
15:33:59,781 INFO  org.apache.flink.table.catalog.hive.HiveCatalog  
 - Created HiveCatalog 'myhive1'
Exception in thread "main" 
org.apache.flink.table.catalog.exceptions.CatalogException: Failed to create 
Hive Metastore client
at 
org.apache.flink.table.catalog.hive.client.HiveShimV120.getHiveMetastoreClient(HiveShimV120.java:58)
at 
org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.createMetastoreClient(HiveMetastoreClientWrapper.java:240)
at 
org.apache.flink.table.catalog.hive.client.HiveMetastoreClientWrapper.(HiveMetastoreClientWrapper.java:71)
at 
org.apache.flink.table.catalog.hive.client.HiveMetastoreClientFactory.create(HiveMetastoreClientFactory.java:35)
at org.apache.flink.table.catalog.hive.HiveCatalog.open(HiveCatalog.java:223)
at 
org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:191)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.registerCatalog(TableEnvironmentImpl.java:337)
at com.zt.kafka.KafkaTest4.main(KafkaTest4.java:73)
Caused by: java.lang.NoSuchMethodException: 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(org.apache.hadoop.hive.conf.HiveConf)
at java.lang.Class.getMethod(Class.java:1786)
at 
org.apache.flink.table.catalog.hive.client.HiveShimV120.getHiveMetastoreClient(HiveShimV120.java:54)
... 7 mor




What is causing this problem? The Metastore has already been started.
Thanks

Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations

2020-08-05 Thread Congxian Qiu
Hi op
   This situation is rather strange. I would like to confirm:
   1) Do all of your jobs see the checkpoint size growing continuously, or is it only this type of job?
   2) Have you tried the RocksDBStateBackend (full and incremental)? What were the results?

   Also, how are the other TTL-related options configured?

In principle, a checkpoint is just a snapshot of the state, so if the checkpoints keep getting larger, there is more and more state.
Best,
Congxian


op <520075...@qq.com> wrote on Wed, Aug 5, 2020 at 2:46 PM:

> Hi, I am using the FsStateBackend. Increasing the interval to 5 minutes makes no difference; each checkpoint takes about 300ms,
> and our business data volume is roughly the same every day.
> The idle-state retention time is set to 1440 minutes, so in theory the state size should level off after one day of running,
> but the job has now been running for 5 days and the size of the checkpoint shared directory keeps increasing.
> I have also confirmed that the group-by keys only appear on the day they are processed, i.e. that day's state becomes
> idle once the day is over, so 5 days of running should satisfy the cleanup condition.
>
>
> ------------------ Original ------------------
> From: "user-zh" <qcx978132...@gmail.com>
> Sent: Monday, Aug 3, 2020, 5:50 PM
> To: "user-zh"
> Subject: Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations
>
>
>
> Hi
>  Can you increase the checkpoint interval a bit and see whether it stays stable? Judging from the size of the shared
> directory, it grows at first and then stays roughly flat. Note that "Checkpointed Data Size" is the incremental size [1],
> not the size of the whole checkpoint; if a lot of data changes between checkpoints, this value becomes larger.
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E5%A2%9E%E9%87%8F%E5%BF%AB%E7%85%A7
> Best,
> Congxian
>
>
> op <520075...@qq.com> wrote on Mon, Aug 3, 2020 at 2:18 PM:
>
>> Same question here. I am also seeing the state grow continuously, on 1.11.0, with checkpoints stored on HDFS and a
>> 3-minute checkpoint interval. The logic groups by event day and id with a dozen or so aggregation metrics.
>> After about 7 days of running the state keeps increasing, even though an idle-state retention time is set and the
>> watermark is advancing normally:
>> tConfig.setIdleStateRetentionTime(Time.minutes(1440), Time.minutes(1440+10))
>>
>>
>> ------------------ Original ------------------
>> From: "user-zh" <384939...@qq.com>
>> Sent: Monday, Aug 3, 2020, 1:50 PM
>> To: "user-zh"
>> Subject: Re: flink1.10.1/1.11.1 state keeps growing after SQL group-by and time-window operations
>>
>>
>>
>> hi:
>> I switched back to incremental mode and collected some more data:
>> 1. Processing speed: 3000 records per second, in a test environment with a stable, almost fluctuation-free load
>> 2. The checkpoint interval is set to 5 seconds
>> 3. The job currently uses a one-minute window
>> 4. Parallelism is 1, running in on-yarn mode
>>
>> Right after startup:
>>
>> 18 minutes later:
>>
>> Checkpoint settings:
>>
>> Size on HDFS:
>>
>> Size shown on the web UI:
>> <http://apache-flink.147419.n8.nabble.com/file/t793/checkpoinsts1.png>
>>
>>
>> Congxian Qiu wrote
>>> Hi 鱼子酱
>>>    In the incremental-checkpoint mode, could you post a screenshot of how the checkpoint size evolves? If possible,
>>> please also ls the size of the checkpoint directory on HDFS after each checkpoint completes.
>>>    One more question still needs an answer: roughly what is your processing rate, and can you estimate how often the state is updated?
>>>
>>> Best,
>>> Congxian
>>>
>>>
>>> 鱼子酱 <
>>> 384939718@
>>>> wrote on Thu, Jul 30, 2020 at 10:43 AM:
>>>
>>>> Thanks!
>>>>
>>>> In flink 1.11.1 I tried the two backends below. They have now been running for more than 20 hours,
>>>> and the state size fluctuates within a range instead of growing continuously.
>>>> StateBackend backend =new
>>>> RocksDBStateBackend("hdfs:///checkpoints-data/"+yamlReader.getValueByKey("jobName").toString()+"/",false);
>>>> StateBackend backend =new
>>>> FsStateBackend("hdfs:///checkpoints-data/"+yamlReader.getValueByKey("jobName").toString()+"/",false);
>>>>
>>>>
>>>> So it looks like there may be some problem on the RocksDBStateBackend incremental-mode side.
>>>> RocksDBStateBackend:
>>>> <http://apache-flink.147419.n8.nabble.com/file/t793/444.png>
>>>> FsStateBackend:
>>>> <http://apache-flink.147419.n8.nabble.com/file/t793/555.png>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-flink.147419.n8.nabble.com/
>>>>
>
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/


Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Yang Wang
Hi Eleanore,

Yes, I suggest using a Job instead of a Deployment. It can be used to run
the jobmanager once and finish after a successful/failed completion.

However, using a Job still would not solve your problem completely. Just as
Till said, when a job exhausts the restart strategy, the jobmanager
pod will terminate with a non-zero exit code, which will cause K8s to
restart it again. Even though we could set the restartPolicy and
backoffLimit,
this is not a clean and correct way to go. We should terminate the
jobmanager process with a zero exit code in such a situation.

@Till Rohrmann  I just have one concern. Is it a
special case for the K8s deployment? For standalone/Yarn/Mesos, it seems that
terminating with
a non-zero exit code is harmless.


Best,
Yang

Eleanore Jin wrote on Tue, Aug 4, 2020 at 11:54 PM:

> Hi Yang & Till,
>
> Thanks for your prompt reply!
>
> Yang, regarding your question, I am actually not using k8s job, as I put
> my app.jar and its dependencies under flink's lib directory. I have 1 k8s
> deployment for job manager, and 1 k8s deployment for task manager, and 1
> k8s service for job manager.
>
> As you mentioned above, if flink job is marked as failed, it will cause
> the job manager pod to be restarted. Which is not the ideal behavior.
>
> Do you suggest that I should change the deployment strategy from using k8s
> deployment to k8s job? In case the flink program exit with non-zero code
> (e.g. exhausted number of configured restart), pod can be marked as
> complete hence not restarting the job again?
>
> Thanks a lot!
> Eleanore
>
> On Tue, Aug 4, 2020 at 2:49 AM Yang Wang  wrote:
>
>> @Till Rohrmann  In native mode, when a Flink
>> application terminates with FAILED state, all the resources will be cleaned
>> up.
>>
>> However, in standalone mode, I agree with you that we need to rethink the
>> exit code of Flink. When a job exhausts the restart
>> strategy, we should terminate the pod and do not restart again. After
>> googling, it seems that we could not specify the restartPolicy
>> based on exit code[1]. So maybe we need to return a zero exit code to
>> avoid restarting by K8s.
>>
>> [1].
>> https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
>>
>> Best,
>> Yang
>>
>>> Till Rohrmann wrote on Tue, Aug 4, 2020 at 3:48 PM:
>>
>>> @Yang Wang  I believe that we should rethink the
>>> exit codes of Flink. In general you want K8s to restart a failed Flink
>>> process. Hence, an application which terminates in state FAILED should not
>>> return a non-zero exit code because it is a valid termination state.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang  wrote:
>>>
 Hi Eleanore,

 I think you are using K8s resource "Job" to deploy the jobmanager.
 Please set .spec.template.spec.restartPolicy = "Never" and
 spec.backoffLimit = 0.
 Refer here[1] for more information.

 Then, when the jobmanager failed because of any reason, the K8s job
 will be marked failed. And K8s will not restart the job again.

 [1].
 https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup


 Best,
 Yang

 Eleanore Jin wrote on Tue, Aug 4, 2020 at 12:05 AM:

> Hi Till,
>
> Thanks for the reply!
>
> I manually deploy as per-job mode [1] and I am using Flink 1.8.2.
> Specifically, I build a custom docker image, which I copied the app jar
> (not uber jar) and all its dependencies under /flink/lib.
>
> So my question is more like, in this case, if the job is marked as
> FAILED, which causes k8s to restart the pod, this seems not help at all,
> what are the suggestions for such scenario?
>
> Thanks a lot!
> Eleanore
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>
> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann 
> wrote:
>
>> Hi Eleanore,
>>
>> how are you deploying Flink exactly? Are you using the application
>> mode with native K8s support to deploy a cluster [1] or are you manually
>> deploying a per-job mode [2]?
>>
>> I believe the problem might be that we terminate the Flink process
>> with a non-zero exit code if the job reaches the ApplicationStatus.FAILED
>> [3].
>>
>> cc Yang Wang have you observed a similar behavior when running Flink
>> in per-job mode on K8s?
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>> [3]
>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>
>> On Fri, Jul 31, 2020 at 6:26 PM 

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-05 Thread Xingbo Huang
Hi,

I found that the spark community is also working on redesigning pyspark
documentation[1] recently. Maybe we can compare the difference between our
document structure and its document structure.

[1] https://issues.apache.org/jira/browse/SPARK-31851
http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html

Best,
Xingbo

David Anderson wrote on Wed, Aug 5, 2020 at 3:17 AM:

> I'm delighted to see energy going into improving the documentation.
>
> With the current documentation, I get a lot of questions that I believe
> reflect two fundamental problems with what we currently provide:
>
> (1) We have a lot of contextual information in our heads about how Flink
> works, and we are able to use that knowledge to make reasonable inferences
> about how things (probably) work in cases we aren't so familiar with. For
> example, I get a lot of questions of the form "If I use  will
> I still have exactly once guarantees?" The answer is always yes, but they
> continue to have doubts because we have failed to clearly communicate this
> fundamental, underlying principle.
>
> This specific example about fault tolerance applies across all of the
> Flink docs, but the general idea can also be applied to the Table/SQL and
> PyFlink docs. The guiding principles underlying these APIs should be
> written down in one easy-to-find place.
>
> (2) The other kind of question I get a lot is "Can I do  with ?"
> E.g., "Can I use the JDBC table sink from PyFlink?" These questions can be
> very difficult to answer because it is frequently the case that one has to
> reason about why a given feature doesn't seem to appear in the
> documentation. It could be that I'm looking in the wrong place, or it could
> be that someone forgot to document something, or it could be that it can in
> fact be done by applying a general mechanism in a specific way that I
> haven't thought of -- as in this case, where one can use a JDBC sink from
> Python if one thinks to use DDL.
>
> So I think it would be helpful to be explicit about both what is, and what
> is not, supported in PyFlink. And to have some very clear organizing
> principles in the documentation so that users can quickly learn where to
> look for specific facts.
>
> Regards,
> David
>
>
> On Tue, Aug 4, 2020 at 1:01 PM jincheng sun 
> wrote:
>
>> Hi Seth and David,
>>
>> I'm very happy to have your reply and suggestions. I would like to share
>> my thoughts here:
>>
>> The main motivation we want to refactor the PyFlink doc is that we want
>> to make sure that the Python users could find all they want starting from
>> the PyFlink documentation mainpage. That’s, the PyFlink documentation
>> should have a catalogue which includes all the functionalities available in
>> PyFlink. However, this doesn’t mean that we will make a copy of the content
>> of the documentation in the other places. It may be just a reference/link
>> to the other documentation if needed. For the documentation added under
>> PyFlink mainpage, the principle is that it should only include Python
>> specific content, instead of making a copy of the Java content.
>>
>> >>  I'm concerned that this proposal duplicates a lot of content that
>> will quickly get out of sync. It feels like it is documenting PyFlink
>> separately from the rest of the project.
>>
>> Regarding the concerns about maintainability, as mentioned above, The
>> goal of this FLIP is to provide an intelligible entrance of Python API, and
>> the content in it should only contain the information which is useful for
>> Python users. There are indeed many agenda items that duplicate the Java
>> documents in this FLIP, but it doesn't mean the content would be copied
>> from Java documentation. i.e, if the content of the document is the same as
>> the corresponding Java document, we will add a link to the Java document.
>> e.g. the "Built-in functions" and "SQL". We only create a page for the
>> Python-only content, and then redirect to the Java document if there is
>> something shared with Java. e.g. "Connectors" and "Catalogs". If the
>> document is Python-only and already exists, we will move it from the old
>> python document to the new python document, e.g. "Configurations". If the
>> document is Python-only and not exists before, we will create a new page
>> for it. e.g. "DataTypes".
>>
>> The main reason we create a new page for Python Data Types is that it is
>> only conceptually one-to-one correspondence with Java Data Types, but the
>> actual document content would be very different from Java DataTypes. Some
>> detailed difference are as following:
>>
>>
>>
>>   - The text in the Java Data Types document is written for JVM-based
>> language users, which is incomprehensible to users who only understand
>> python.
>>
>>   - Currently the Python Data Types does not support the "bridgedTo"
>> method, DataTypes.RAW, DataTypes.NULL and User Defined Types.
>>
>>   - The section "Planner 

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-05 Thread Xingbo Huang
Hi,

I found that the spark community is also working on redesigning pyspark
documentation[1] recently. Maybe we can compare the difference between our
document structure and its document structure.

[1] https://issues.apache.org/jira/browse/SPARK-31851
http://apache-spark-developers-list.1001551.n3.nabble.com/Need-some-help-and-contributions-in-PySpark-API-documentation-td29972.html

Best,
Xingbo

David Anderson  于2020年8月5日周三 上午3:17写道:

> I'm delighted to see energy going into improving the documentation.
>
> With the current documentation, I get a lot of questions that I believe
> reflect two fundamental problems with what we currently provide:
>
> (1) We have a lot of contextual information in our heads about how Flink
> works, and we are able to use that knowledge to make reasonable inferences
> about how things (probably) work in cases we aren't so familiar with. For
> example, I get a lot of questions of the form "If I use  will
> I still have exactly once guarantees?" The answer is always yes, but they
> continue to have doubts because we have failed to clearly communicate this
> fundamental, underlying principle.
>
> This specific example about fault tolerance applies across all of the
> Flink docs, but the general idea can also be applied to the Table/SQL and
> PyFlink docs. The guiding principles underlying these APIs should be
> written down in one easy-to-find place.
>
> (2) The other kind of question I get a lot is "Can I do  with ?"
> E.g., "Can I use the JDBC table sink from PyFlink?" These questions can be
> very difficult to answer because it is frequently the case that one has to
> reason about why a given feature doesn't seem to appear in the
> documentation. It could be that I'm looking in the wrong place, or it could
> be that someone forgot to document something, or it could be that it can in
> fact be done by applying a general mechanism in a specific way that I
> haven't thought of -- as in this case, where one can use a JDBC sink from
> Python if one thinks to use DDL.
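>
> To make that concrete, a rough sketch of the DDL route from PyFlink might
> look like the following (PyFlink 1.11+ assumed; the connector options,
> database URL, table and column names are purely illustrative, not taken
> from this thread):
>
> from pyflink.table import EnvironmentSettings, TableEnvironment
>
> # Assumes flink-connector-jdbc and a JDBC driver are on the classpath.
> settings = EnvironmentSettings.new_instance() \
>     .in_streaming_mode().use_blink_planner().build()
> t_env = TableEnvironment.create(settings)
>
> # The sink is declared with DDL rather than a dedicated Python API.
> t_env.execute_sql("""
>     CREATE TABLE jdbc_sink (
>         id BIGINT,
>         name STRING
>     ) WITH (
>         'connector' = 'jdbc',
>         'url' = 'jdbc:mysql://localhost:3306/mydb',
>         'table-name' = 'my_table',
>         'username' = 'user',
>         'password' = 'pass'
>     )
> """)
>
> # Any query can then write into it; "some_source" stands for a table
> # registered elsewhere in the same job.
> t_env.execute_sql("INSERT INTO jdbc_sink SELECT id, name FROM some_source")
>
> The same pattern applies to any connector exposed through DDL, which is
> part of why spelling this out in the Python docs would help.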
>
> So I think it would be helpful to be explicit about both what is, and what
> is not, supported in PyFlink. And to have some very clear organizing
> principles in the documentation so that users can quickly learn where to
> look for specific facts.
>
> Regards,
> David
>
>
> On Tue, Aug 4, 2020 at 1:01 PM jincheng sun 
> wrote:
>
>> Hi Seth and David,
>>
>> I'm very happy to see your replies and suggestions. I would like to share
>> my thoughts here:
>>
>> The main motivation for refactoring the PyFlink docs is to make sure that
>> Python users can find everything they need starting from the PyFlink
>> documentation main page. That is, the PyFlink documentation should have a
>> catalogue which covers all the functionality available in PyFlink.
>> However, this doesn't mean that we will copy content that lives in other
>> places; a reference/link to the other documentation may be enough. For
>> the documentation added under the PyFlink main page, the principle is
>> that it should only include Python-specific content, instead of copies of
>> the Java content.
>>
>> >>  I'm concerned that this proposal duplicates a lot of content that
>> will quickly get out of sync. It feels like it is documenting PyFlink
>> separately from the rest of the project.
>>
>> Regarding the concerns about maintainability, as mentioned above, the
>> goal of this FLIP is to provide an intelligible entry point to the Python
>> API, and its content should only contain information that is useful for
>> Python users. There are indeed many agenda items in this FLIP that
>> duplicate the Java documents, but that doesn't mean the content will be
>> copied from the Java documentation. If the content of a page is the same
>> as the corresponding Java document, we will simply add a link to the Java
>> document, e.g. "Built-in functions" and "SQL". We only create a page for
>> the Python-only content, and redirect to the Java document for anything
>> shared with Java, e.g. "Connectors" and "Catalogs". If a document is
>> Python-only and already exists, we will move it from the old Python
>> documentation to the new one, e.g. "Configurations". If a document is
>> Python-only and does not exist yet, we will create a new page for it,
>> e.g. "Data Types".
>>
>> The main reason we create a new page for Python Data Types is that it
>> corresponds to Java Data Types only conceptually; the actual content of
>> the document would be very different from the Java Data Types page. Some
>> of the differences in detail are as follows:
>>
>>
>>
>>   - The text in the Java Data Types document is written for users of
>> JVM-based languages and is hard to follow for users who only know
>> Python.
>>
>>   - Currently, Python Data Types do not support the "bridgedTo" method,
>> DataTypes.RAW, DataTypes.NULL, or user-defined types (see the short
>> sketch after this list).
>>
>>   - The section "Planner 
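>>
>> As a small, hedged illustration of that correspondence (assuming the
>> PyFlink 1.11 API; the field names below are made up), the Python side
>> declares the same composite types through pyflink.table.DataTypes:
>>
>> from pyflink.table import DataTypes
>>
>> # The composite types Java builds with DataTypes.ROW/ARRAY/MAP exist
>> # on the Python side as well ...
>> row_type = DataTypes.ROW([
>>     DataTypes.FIELD("id", DataTypes.BIGINT()),
>>     DataTypes.FIELD("tags", DataTypes.ARRAY(DataTypes.STRING())),
>>     DataTypes.FIELD("attrs", DataTypes.MAP(DataTypes.STRING(),
>>                                            DataTypes.STRING()))
>> ])
>>
>> # ... but bridgedTo(...), DataTypes.RAW and DataTypes.NULL have no
>> # Python counterpart yet, which is what the new page needs to spell out.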

Re: flink1.10.1/1.11.1 using SQL group by: the state size keeps growing

2020-08-05 Thread op
  
I'm using FsStateBackend with a checkpoint every 5 minutes, each taking about 300ms.
The idle state retention is set to 1440 minutes, but the checkpointed state shared
by the group by keys keeps growing and has not shrunk even after 5 days.




------ Original message ------
From: "user-zh"

https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/ops/state/state_backends.html#%E5%A2%9E%E9%87%8F%E5%BF%AB%E7%85%A7
Best,
Congxian


op <520075...@qq.com> wrote on 2020-08-03 at 2:18:

 I'm on 1.11.0 with checkpoints on HDFS, and the checkpoint has kept growing
 for 3 days. The query does a group by on day and id, with a 7-day watermark,
 and I have set
 tConfig.setIdleStateRetentionTime(Time.minutes(1440),
 Time.minutes(1440+10))




 ------ Original message ------
 From: "user-zh" <384939...@qq.com>
 Date: 2020-08-03 (Mon) 1:50
 To: "user-zh"

 http://apache-flink.147419.n8.nabble.com/file/t793/6.png