[jira] [Commented] (NIFI-12194) Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT securityProtocol

2023-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781812#comment-17781812
 ] 

ASF subversion and git services commented on NIFI-12194:


Commit 9a5a56e79eb26f0c6ccf4d7f6cd9a1fef308c2eb in nifi's branch 
refs/heads/support/nifi-1.x from Paul Grey
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=9a5a56e79e ]

NIFI-12194 Added Yield on Exceptions in Kafka Processors

- Catching KafkaException and yielding for publisher lease requests improves 
behavior when the Processor is unable to connect to Kafka Brokers

This closes #7955

Signed-off-by: David Handermann 
(cherry picked from commit 75c661bbbe56a7951974a701921af9da74dd0d68)


> Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT 
> securityProtocol
> ----------------------------------------------------------------------
>
> Key: NIFI-12194
> URL: https://issues.apache.org/jira/browse/NIFI-12194
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.21.0, 1.23.0
> Reporter: Peter Schmitzer
> Assignee: Paul Grey
> Priority: Major
> Attachments: image-2023-09-27-15-56-02-438.png
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> When starting the ConsumeKafka_2_6 processor with SASL mechanism GSSAPI and 
> securityProtocol PLAINTEXT (although SSL would be correct), the UI crashed and 
> NiFi was no longer accessible. Not only was the frontend inaccessible, but the 
> other processors in our flow also stopped performing well according to our 
> dashboards.
> We were able to reproduce this using the config described above.
> Our NiFi in preprod (where this was detected) runs in a Kubernetes cluster:
>  * version 1.21.0
>  * 3 nodes
>  * jvmMemory: 1536m
>  * 3G memory (limit)
>  * 400m cpu (request)
>  * zookeeper
> The logs show no unusual entries when the issue is triggered. 
> Inspecting the pod metrics, we found a spike in memory.
> The issue is a bit scary for us because a rather innocent config parameter in 
> a single processor can bring down our whole cluster.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NIFI-12194) Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT securityProtocol

2023-11-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781811#comment-17781811
 ] 

ASF subversion and git services commented on NIFI-12194:


Commit 75c661bbbe56a7951974a701921af9da74dd0d68 in nifi's branch 
refs/heads/main from Paul Grey
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=75c661bbbe ]

NIFI-12194 Added Yield on Exceptions in Kafka Processors

- Catching KafkaException and yielding for publisher lease requests improves 
behavior when the Processor is unable to connect to Kafka Brokers

This closes #7955

Signed-off-by: David Handermann 




[jira] [Commented] (NIFI-12194) Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT securityProtocol

2023-10-30 Thread Peter Schmitzer (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781213#comment-17781213
 ] 

Peter Schmitzer commented on NIFI-12194:


Hi [~pgrey], thank you very much for following up on this! If it works out as 
you describe, it is definitely a sufficient fix that mitigates the risk for us.



[jira] [Commented] (NIFI-12194) Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT securityProtocol

2023-10-30 Thread Paul Grey (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781104#comment-17781104
 ] 

Paul Grey commented on NIFI-12194:

Following up on this. I've checked nifi/main, and unfortunately the problem 
exists there as well.

It appears to be a combination of a problem in the Kafka client library, acting 
together with the NiFi handling of connection initialization for the processor. 
In this misconfiguration, the library attempts a large allocation of a direct 
byte buffer, which fails in the NiFi configurations with a small memory 
footprint. When the failure occurs, NiFi currently immediately retries the 
connection, which fails in the same way. The large memory allocation combined 
with the immediate retry starves the NiFi process of CPU cycles, causing 
instability.

It is not straightforward to detect the Kafka misconfiguration in NiFi. A more 
reasonable solution seems to be to improve NiFi behavior in general on a 
connection initialization failure.

There is an inbound fix that improves this behavior. On connection init 
failure, a NiFi processor API is invoked that yields CPU resources for a 
configurable amount of time (by default, 1 second). This does not prevent the 
problem, but hopefully preserves sufficient CPU to provide for stable UI 
interactivity (bulletin indicates error, processor can be stopped and 
configuration adjusted).
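
To illustrate why yielding helps, here is a minimal, self-contained sketch. It is not the actual NiFi change: the class, the `ConnectFailure` exception, and the `countAttempts` method are hypothetical names, and the scheduler is simulated with a virtual clock. In NiFi itself the yield would go through the processor API (`ProcessContext.yield()`) after catching `KafkaException`, as the commit message describes. The sketch only counts how many failing connection attempts occur in a fixed time window with and without a yield.

```java
// Hypothetical simulation, not NiFi code: compares immediate retry with
// yielding after a failed connection attempt, using virtual time.
class YieldOnFailureSketch {

    /** Hypothetical stand-in for a Kafka connection failure. */
    static class ConnectFailure extends RuntimeException {
        ConnectFailure(String message) {
            super(message);
        }
    }

    /** Simulated connection attempt that always fails (misconfigured protocol). */
    static void connect() {
        throw new ConnectFailure("unable to connect to brokers");
    }

    /**
     * Simulates a scheduler triggering a processor for windowMillis of virtual time.
     * Each attempt costs attemptMillis. After a failure the processor is not
     * triggered again until yieldMillis have passed (0 means immediate retry).
     * Returns the number of connection attempts made in the window.
     */
    static int countAttempts(long windowMillis, long attemptMillis, long yieldMillis) {
        long now = 0;
        int attempts = 0;
        while (now < windowMillis) {
            attempts++;
            now += attemptMillis;
            try {
                connect();
            } catch (ConnectFailure e) {
                // The fix in spirit: back off instead of retrying at once.
                now += yieldMillis;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // 10 s window, 10 ms per failed attempt (assumed values).
        System.out.println(countAttempts(10_000, 10, 0));     // prints 1000
        System.out.println(countAttempts(10_000, 10, 1_000)); // prints 10
    }
}
```

With the assumed numbers, a 1 second yield cuts the failing attempts in a 10 second window from 1000 to 10. Each attempt still fails, which matches the point above that the yield does not prevent the problem but leaves CPU for the rest of the process.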

The fix is directed at the main line. Once merged, it can be backported to the 
1.x line, and would be included in an upcoming NiFi 1.x release.

Thanks very much for raising a red flag here!  Definitely an unusual problem; 
hope the fix helps you, and those who might encounter the problem in the future.



[jira] [Commented] (NIFI-12194) Nifi fails when ConsumeKafka_2_6 processor is started with PLAINTEXT securityProtocol

2023-10-25 Thread Joe Witt (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-12194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779665#comment-17779665
 ] 

Joe Witt commented on NIFI-12194:

Slack thread: 
https://apachenifi.slack.com/archives/C0L9VCD47/p1698242200810429

Guillaume, 6 hours ago:
Hello,
I just experienced something bad in NiFi 1.20.0.
Let me know if that sounds normal to you and whether it may have been fixed in 
newer versions.
We're working on switching all our Kafka consumers to the latest version (2.0 to 
2.6 mainly) and also adding SSL authentication to the brokers.
On one processor, I made a mistake:
I used SSL brokers (so with port 9093) but forgot to configure the security 
protocol (left at PLAINTEXT) and the SSL context service (left at "No value").
What we observed:
Memory usage on all our cluster nodes increased from around 48% to 68% 
(and never went down, even after the fix).
We reproduced that on a nonprod cluster: the cluster went OOM in less than 1 
minute.
Questions:
Is it normal that the connection attempts lead to a global outage? Is it 
possible to set a maximum number of connection attempts?
Is it normal that the memory usage didn't go down after the fix? (Our cluster 
has been very stable in terms of memory usage for months now.)
Has the behaviour changed in the latest versions?
