RE: Identifying a flink dashboard

2023-06-27 Thread Schwalbe Matthias
Good Morning Mike,

As a quick fix, sort of, you could use an Ingress on nginx-ingress (instead of
the port-forward) and add a sub_filter rule to patch the HTML response.
I use this to add a tag to the header, and with the Flink dashboard I
experience no glitches.
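
For instance, a minimal sketch of such an Ingress (hostname, title string,
service name, and port are assumptions; check the HTML your dashboard actually
serves for the exact match string):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flink-dashboard
  annotations:
    # sub_filter only rewrites uncompressed responses, so disable upstream gzip
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header Accept-Encoding "";
      sub_filter '<title>Apache Flink Web Dashboard</title>' '<title>my-cluster | Flink</title>';
      sub_filter_once on;
spec:
  ingressClassName: nginx
  rules:
    - host: flink.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: flink-jobmanager
                port:
                  number: 8081

For point 3 below, ingress-nginx can also restrict who may reach the Ingress,
e.g. via the nginx.ingress.kubernetes.io/whitelist-source-range annotation.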

As to point 3 … you don’t need to expose that Ingress to the internet, but
only on the node IP, so it becomes visible only within your network … there
are a number of ways of doing it.

I could elaborate a little more if you're interested.

Hope this helps

Thias


From: Mike Phillips 
Sent: Wednesday, June 28, 2023 3:47 AM
To: user@flink.apache.org
Subject: Re: Identifying a flink dashboard


G'day Alex,

Thanks!

1 - hmm maybe beyond my capabilities presently
2 - Yuck! :-) Will look at this
3 - Not possible: the dashboards are not accessible via the internet, so we use
kube and port forward; the URL looks like http://wobbegong:3/ and the port
changes
4 - I think this requires the dashboard to be internet accessible?

On Tue, 27 Jun 2023 at 17:21, Alexander Fedulov
<alexander.fedu...@gmail.com> wrote:
Hi Mike,

no, it is currently hard-coded
https://github.com/apache/flink/blob/master/flink-runtime-web/web-dashboard/src/app/app.component.html#L23

Your options are:
1. Contribute a change to make it configurable
2. Use some browser plugin that allows renaming page titles
3. Always use different ports and bookmark the URLs accordingly
4. Use an Ingress in k8s

Best,
Alex

On Tue, 27 Jun 2023 at 05:58, Mike Phillips
<mike.phill...@intellisense.io> wrote:
G'day all,

Not sure if this is the correct place but...
We have a number of flink dashboards and it is difficult to know what dashboard 
we are looking at.
Is there a configurable way to change the 'Apache Flink Dashboard' heading on 
the dashboard?
Or some other way of uniquely identifying what dashboard I am currently looking 
at?
Flink is running in k8s and we use kubectl port forwarding to connect to the
dashboard, so we can't identify it by the URL.

--
--
Kind Regards

Mike


--
--
Kind Regards

Mike


Re: Kafka source with idleness and alignment stops consuming

2023-06-27 Thread haishui
Hi,
I have the same trouble. This really is a bug; `shouldWaitForAlignment` needs
another change as well.


By the way, a source will be marked as idle when it has been waiting for
alignment for a long time. Is this a bug?

On 2023-06-27 23:25:38, "Alexis Sarda-Espinosa" wrote:

Hello,


I am currently evaluating idleness and alignment with Flink 1.17.1 and the 
externalized Kafka connector. My job has 3 sources whose watermark strategies 
are defined like this:


WatermarkStrategy.forBoundedOutOfOrderness(maxAllowedWatermarkDrift)
    .withIdleness(idleTimeout)
    .withWatermarkAlignment("group", maxAllowedWatermarkDrift,
        Duration.ofSeconds(1L))



The max allowed drift is currently 5 seconds, and my sources have an 
idleTimeout of 1, 1.5, and 5 seconds.


What I observe is that, when I restart the job, all sources publish messages, 
but then 2 of them are marked as idle and never resume. I found 
https://issues.apache.org/jira/browse/FLINK-31632, which should be fixed in 
1.17.1, but I don't think it's the same issue; my logs don't show negative
values:


2023-06-27 15:11:42,927 DEBUG 
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New 
reported watermark=Watermark @ 1687878696690 (2023-06-27 15:11:36.690) from 
subTaskId=1
2023-06-27 15:11:43,009 DEBUG 
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New 
reported watermark=Watermark @ 9223372036854775807 (292278994-08-17 
07:12:55.807) from subTaskId=0
2023-06-27 15:11:43,091 DEBUG 
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New 
reported watermark=Watermark @ 9223372036854775807 (292278994-08-17 
07:12:55.807) from subTaskId=0
2023-06-27 15:11:43,116 DEBUG 
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New 
reported watermark=Watermark @ 9223372036854775807 (292278994-08-17 
07:12:55.807) from subTaskId=0
2023-06-27 15:11:43,298 INFO  
org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 - 
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[0, 1]
2023-06-27 15:11:43,304 INFO  
org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 - 
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[0]
2023-06-27 15:11:43,306 INFO  
org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 - 
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[0]
2023-06-27 15:11:43,486 INFO  
org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 - 
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[]
2023-06-27 15:11:43,489 INFO  
org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 - 
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[]
2023-06-27 15:11:43,492 INFO  
org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 - 
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[]



Does anyone know if I'm missing something or this is really a bug?


Regards,
Alexis.

Re: RocksDB State Backend GET returns null intermittently

2023-06-27 Thread Prabhu Joseph
Thanks Hangxiang and Alex for the pointers. I added audit logs to
RocksDBValueState (GET call: value() and PUT call: update()) and found
nothing wrong on the RocksDB side. It never returns null on a GET for a
key that was PUT earlier. We then added audit logs to the CX application
and found that it keeps a cache (HashMap) on top of RocksDBValueState to
speed things up, and that is where the issue is. The application checks
the key in the cache first, and only if it is absent does it read from
RocksDBValueState. There is a race condition in their code where they
overwrite the RocksDBValueState entry with a new one that does not carry
the previous state, causing the issue.
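
To illustrate the pattern (a hypothetical sketch, not the actual application
code; the class name, state shape, and types are invented):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// A plain HashMap shadowing RocksDB-backed ValueState. The map is neither
// checkpointed nor kept consistent with every write path, so a code path
// that updates the state without going through the cache (or vice versa)
// can silently discard the previous state.
public class CachingAntiPattern extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<Long> state;
    private transient Map<String, Long> cache; // not fault-tolerant

    @Override
    public void open(Configuration parameters) {
        state = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        cache = new HashMap<>();
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out)
            throws Exception {
        Long current = cache.get(ctx.getCurrentKey());
        if (current == null) {
            current = state.value(); // fall back to RocksDB
        }
        long updated = (current == null ? 0L : current) + 1;
        // Bug pattern: if another code path writes state.update(...) with a
        // fresh entry that ignores `current`, the previous state is lost --
        // the kind of overwrite described above.
        state.update(updated);
        cache.put(ctx.getCurrentKey(), updated);
        out.collect(value + ":" + updated);
    }
}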

Sorry for the confusion; it turned out to be a problem on the Flink
Application side rather than the Framework side.




On Tue, Jun 27, 2023 at 2:53 PM Alexander Fedulov <
alexander.fedu...@gmail.com> wrote:

> Hi Prabhu,
>
> make sure that the key you use is the same for both records, and try to
> reproduce the issue with a parallelism of 1.
>
> Best,
> Alex
>
> On Sun, 25 Jun 2023 at 04:29, Hangxiang Yu  wrote:
>
>> Hi, Prabhu.
>>
>> This is a correctness issue. IIUC, It should not be related to the size
>> of the block cache, write buffer, or whether the bloom filter is enabled.
>>
>> Is your job a DataStream job? Does the job contain a custom Serializer?
>> You could check or share the logic of the Serializer, as this is one of the
>> main differences between RocksDBStateBackend and HashMapStateBackend
>> (HashMapStateBackend does not perform serialization and deserialization).
>>
>> On Wed, Jun 21, 2023 at 3:44 PM Prabhu Joseph 
>> wrote:
>>
>>> Hi,
>>>
>>> A RocksDB State Backend GET call on a key that was PUT into the state ~100 ms
>>> earlier intermittently returns null. The issue never happened with the
>>> HashMap state backend. We are trying to increase the block cache size and
>>> write buffer size, and to enable the bloom filter, as per the doc:
>>> https://flink.apache.org/2021/01/18/using-rocksdb-state-backend-in-apache-flink-when-and-how/
>>>
>>> Any ideas on what could be wrong or how to debug this?
>>>
>>> Thanks,
>>> Prabhu Joseph
>>>
>>
>>
>> --
>> Best,
>> Hangxiang.
>>
>


Re: Very long launch of the Flink application in BATCH mode

2023-06-27 Thread Shammon FY
Hi Brendan,

I think you may need to confirm at which stage the job is blocked: is the
client still submitting the job, is the resourcemanager scheduling it, or are
the tasks launching in the TM? You may need to provide more information to
help us figure out the issue.

Best,
Shammon FY

On Tuesday, June 27, 2023, Weihua Hu  wrote:

> Hi, Brendan
>
> Judging from the log, it looks like it is still invoking your main method.
> You can add more logs in the main method to figure out which part takes too
> long.
>
> Best,
> Weihua
>
>
> On Tue, Jun 27, 2023 at 5:06 AM Brendan Cortez <
> brendan.cortez...@gmail.com> wrote:
>
>> No, I'm using a collection source + 20 same JDBC lookups + Kafka sink.
>>
>> On Mon, 26 Jun 2023 at 19:17, Yaroslav Tkachenko 
>> wrote:
>>
>>> Hey Brendan,
>>>
>>> Do you use a file source by any chance?
>>>
>>> On Mon, Jun 26, 2023 at 4:31 AM Brendan Cortez <
>>> brendan.cortez...@gmail.com> wrote:
>>>
 Hi all!

 I'm trying to submit a Flink Job in Application Mode in the Kubernetes
 cluster.

 I see some problems when an application has a large number of operators
 (more than 20 identical operators) - it freezes for ~6 minutes after
 *2023-06-21 15:46:45,082 WARN
  org.apache.flink.connector.kafka.sink.KafkaSinkBuilder   [] - Property
 [transaction.timeout.ms] not specified.
 Setting it to PT1H*
  and until

 *2023-06-21 15:53:20,002 INFO
  org.apache.flink.streaming.api.graph.StreamGraphGenerator[] - Disabled
 Checkpointing. Checkpointing is not supported and not needed when executing
 jobs in BATCH mode.*(logs in attachment)

 When I set log.level=DEBUG, I see only this message each 10 seconds:
 *2023-06-21 14:51:30,921 DEBUG
 org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
 Trigger heartbeat request.*

 Please, could you help me understand the cause of this problem and how
 to fix it? I am using Flink 1.15.3.

 Thank you in advance!

 Best regards,
 Brendan Cortez.

>>>


Re: Identifying a flink dashboard

2023-06-27 Thread Mike Phillips
G'day Alex,

Thanks!

1 - hmm maybe beyond my capabilities presently
2 - Yuck! :-) Will look at this
3 - Not possible: the dashboards are not accessible via the internet, so we
use kube and port forward; the URL looks like http://wobbegong:3/ and the
port changes
4 - I think this requires the dashboard to be internet accessible?

On Tue, 27 Jun 2023 at 17:21, Alexander Fedulov 
wrote:

> Hi Mike,
>
> no, it is currently hard-coded
>
> https://github.com/apache/flink/blob/master/flink-runtime-web/web-dashboard/src/app/app.component.html#L23
>
> Your options are:
> 1. Contribute a change to make it configurable
> 2. Use some browser plugin that allows renaming page titles
> 3. Always use different ports and bookmark the URLs accordingly
> 4. Use an Ingress in k8s
>
> Best,
> Alex
>
> On Tue, 27 Jun 2023 at 05:58, Mike Phillips 
> wrote:
>
>> G'day all,
>>
>> Not sure if this is the correct place but...
>> We have a number of flink dashboards and it is difficult to know what
>> dashboard we are looking at.
>> Is there a configurable way to change the 'Apache Flink Dashboard'
>> heading on the dashboard?
>> Or some other way of uniquely identifying what dashboard I am currently
>> looking at?
>> Flink is running in k8s and we use kubectl port forwarding to connect to
>> the dashboard, so we can't identify it by the URL.
>>
>> --
>> --
>> Kind Regards
>>
>> Mike
>>
>

-- 
--
Kind Regards

Mike


Kafka source with idleness and alignment stops consuming

2023-06-27 Thread Alexis Sarda-Espinosa
Hello,

I am currently evaluating idleness and alignment with Flink 1.17.1 and the
externalized Kafka connector. My job has 3 sources whose watermark
strategies are defined like this:

WatermarkStrategy.forBoundedOutOfOrderness(maxAllowedWatermarkDrift)
    .withIdleness(idleTimeout)
    .withWatermarkAlignment("group", maxAllowedWatermarkDrift,
        Duration.ofSeconds(1L))

The max allowed drift is currently 5 seconds, and my sources have an
idleTimeout of 1, 1.5, and 5 seconds.
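
For context, this is roughly how the strategy is attached to each source (a
sketch only; the topic, broker, and group names are assumptions, and
maxAllowedWatermarkDrift / idleTimeout are the Durations described above --
all three sources share the alignment group "group"):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;

KafkaSource<String> source =
    KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("topic-a")
        .setGroupId("my-consumer-group")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

DataStream<String> streamA =
    env.fromSource(
        source,
        WatermarkStrategy.<String>forBoundedOutOfOrderness(maxAllowedWatermarkDrift)
            .withIdleness(idleTimeout)
            .withWatermarkAlignment("group", maxAllowedWatermarkDrift,
                Duration.ofSeconds(1L)),
        "source-a");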

What I observe is that, when I restart the job, all sources publish
messages, but then 2 of them are marked as idle and never resume. I found
https://issues.apache.org/jira/browse/FLINK-31632, which should be fixed in
1.17.1, but I don't think it's the same issue; my logs don't show negative
values:

2023-06-27 15:11:42,927 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New
reported watermark=Watermark @ 1687878696690 (2023-06-27 15:11:36.690) from
subTaskId=1
2023-06-27 15:11:43,009 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New
reported watermark=Watermark @ 9223372036854775807 (292278994-08-17
07:12:55.807) from subTaskId=0
2023-06-27 15:11:43,091 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New
reported watermark=Watermark @ 9223372036854775807 (292278994-08-17
07:12:55.807) from subTaskId=0
2023-06-27 15:11:43,116 DEBUG
org.apache.flink.runtime.source.coordinator.SourceCoordinator:609 - New
reported watermark=Watermark @ 9223372036854775807 (292278994-08-17
07:12:55.807) from subTaskId=0
2023-06-27 15:11:43,298 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 -
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[0, 1]
2023-06-27 15:11:43,304 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 -
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[0]
2023-06-27 15:11:43,306 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 -
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[0]
2023-06-27 15:11:43,486 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 -
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[]
2023-06-27 15:11:43,489 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 -
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[]
2023-06-27 15:11:43,492 INFO
 org.apache.flink.runtime.source.coordinator.SourceCoordinator:194 -
Distributing maxAllowedWatermark=1687878701690 to subTaskIds=[]

Does anyone know if I'm missing something or this is really a bug?

Regards,
Alexis.


Re: Very long launch of the Flink application in BATCH mode

2023-06-27 Thread Weihua Hu
Hi, Brendan

Judging from the log, it looks like it is still invoking your main method. You
can add more logs in the main method to figure out which part takes too long.
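
For example (a minimal sketch; the logger setup and job structure are
assumptions -- adapt it to your actual pipeline):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TimedMain {
    private static final Logger LOG = LoggerFactory.getLogger(TimedMain.class);

    public static void main(String[] args) throws Exception {
        long t0 = System.currentTimeMillis();
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        LOG.info("env created after {} ms", System.currentTimeMillis() - t0);

        // ... build the collection source, the 20 JDBC lookups, and the
        // Kafka sink here, logging the elapsed time after each step ...
        LOG.info("graph built after {} ms", System.currentTimeMillis() - t0);

        env.execute("timed-job");
    }
}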

Best,
Weihua


On Tue, Jun 27, 2023 at 5:06 AM Brendan Cortez 
wrote:

> No, I'm using a collection source + 20 same JDBC lookups + Kafka sink.
>
> On Mon, 26 Jun 2023 at 19:17, Yaroslav Tkachenko 
> wrote:
>
>> Hey Brendan,
>>
>> Do you use a file source by any chance?
>>
>> On Mon, Jun 26, 2023 at 4:31 AM Brendan Cortez <
>> brendan.cortez...@gmail.com> wrote:
>>
>>> Hi all!
>>>
>>> I'm trying to submit a Flink Job in Application Mode in the Kubernetes
>>> cluster.
>>>
>>> I see some problems when an application has a large number of operators
>>> (more than 20 identical operators) - it freezes for ~6 minutes after
>>> *2023-06-21 15:46:45,082 WARN
>>>  org.apache.flink.connector.kafka.sink.KafkaSinkBuilder   [] - Property
>>> [transaction.timeout.ms] not specified.
>>> Setting it to PT1H*
>>>  and until
>>>
>>> *2023-06-21 15:53:20,002 INFO
>>>  org.apache.flink.streaming.api.graph.StreamGraphGenerator[] - Disabled
>>> Checkpointing. Checkpointing is not supported and not needed when executing
>>> jobs in BATCH mode.*(logs in attachment)
>>>
>>> When I set log.level=DEBUG, I see only this message each 10 seconds:
>>> *2023-06-21 14:51:30,921 DEBUG
>>> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
>>> Trigger heartbeat request.*
>>>
>>> Please, could you help me understand the cause of this problem and how
>>> to fix it? I am using Flink 1.15.3.
>>>
>>> Thank you in advance!
>>>
>>> Best regards,
>>> Brendan Cortez.
>>>
>>


Re: RocksDB State Backend GET returns null intermittently

2023-06-27 Thread Alexander Fedulov
Hi Prabhu,

make sure that the key you use is the same for both records, and try to
reproduce the issue with a parallelism of 1.

Best,
Alex

On Sun, 25 Jun 2023 at 04:29, Hangxiang Yu  wrote:

> Hi, Prabhu.
>
> This is a correctness issue. IIUC, It should not be related to the size of
> the block cache, write buffer, or whether the bloom filter is enabled.
>
> Is your job a DataStream job? Does the job contain a custom Serializer?
> You could check or share the logic of the Serializer, as this is one of the
> main differences between RocksDBStateBackend and HashMapStateBackend
> (HashMapStateBackend does not perform serialization and deserialization).
>
> On Wed, Jun 21, 2023 at 3:44 PM Prabhu Joseph 
> wrote:
>
>> Hi,
>>
>> A RocksDB State Backend GET call on a key that was PUT into the state ~100 ms
>> earlier intermittently returns null. The issue never happened with the
>> HashMap state backend. We are trying to increase the block cache size and
>> write buffer size, and to enable the bloom filter, as per the doc:
>> https://flink.apache.org/2021/01/18/using-rocksdb-state-backend-in-apache-flink-when-and-how/
>>
>> Any ideas on what could be wrong or how to debug this?
>>
>> Thanks,
>> Prabhu Joseph
>>
>
>
> --
> Best,
> Hangxiang.
>


Re: Identifying a flink dashboard

2023-06-27 Thread Alexander Fedulov
Hi Mike,

no, it is currently hard-coded
https://github.com/apache/flink/blob/master/flink-runtime-web/web-dashboard/src/app/app.component.html#L23

Your options are:
1. Contribute a change to make it configurable
2. Use some browser plugin that allows renaming page titles
3. Always use different ports and bookmark the URLs accordingly
4. Use an Ingress in k8s

Best,
Alex

On Tue, 27 Jun 2023 at 05:58, Mike Phillips 
wrote:

> G'day all,
>
> Not sure if this is the correct place but...
> We have a number of flink dashboards and it is difficult to know what
> dashboard we are looking at.
> Is there a configurable way to change the 'Apache Flink Dashboard' heading
> on the dashboard?
> Or some other way of uniquely identifying what dashboard I am currently
> looking at?
> Flink is running in k8s and we use kubectl port forwarding to connect to
> the dashboard, so we can't identify it by the URL.
>
> --
> --
> Kind Regards
>
> Mike
>


Invalid AMRMToken from appattempt

2023-06-27 Thread 湘晗刚
Caused by: org.apache.hadoop.ipc.RemoteException: Invalid AMRMToken from
appattempt_162**_1178_1

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1563)
…
Flink 1.10 on yarn
Thanks in advance
湘晗刚