RE: Nifi and Registry behind Citrix ADC.

2021-10-19 Thread Shawn Weeks
If you're authenticating with 2-way SSL you'll have to set up your load balancer 
to pass the TCP traffic straight through; otherwise NiFi doesn't see the user's 
cert. NiFi doesn't currently support getting the SSL cert name from an HTTP 
header the way some other systems do. Usually, if you're using an HTTP load 
balancer, you'd authenticate with SSO (SAML or OIDC) or LDAP (username/password).

Thanks
Shawn

From: Jens M. Kofoed 
Sent: Tuesday, October 19, 2021 1:16 AM
To: users@nifi.apache.org
Subject: Re: Nifi and Registry behind Citrix ADC.

Only if you want other ways to authenticate users. I have set up our NiFi 
systems to talk with our MS AD via LDAPS and defined different AD groups, which 
have different policy rules in NiFi. Some people can manage everything; others 
can only start/stop specific processors in specific process groups.
Using personal certificates is no problem; I have some admins who also use 
their personal certificates. But with certificates you have to add and manage 
users manually in NiFi. Users can of course be added to internal groups in 
NiFi, with policies configured on those groups.

Regards
Jens

On Tue, Oct 19, 2021 at 07:43, Jakobsson Stefan <stefan.jakobs...@scania.com> wrote:
We are currently authenticating with personal certificates; should we change 
that, then?

Stefan Jakobsson

Systems Manager  |  Scania IT, IKCA |  Scania CV AB
Phone: +46 8 553 527 27 Mobile: +46 7 008 834 76
Forskargatan 20, SE-151 87 Södertälje, Sweden
stefan.jakobs...@scania.com

From: Shawn Weeks <swe...@weeksconsulting.us>
Sent: October 18, 2021 21:35
To: users@nifi.apache.org
Subject: RE: Nifi and Registry behind Citrix ADC.

Unless you’re operating the LB in TCP Mode you’ll need to configure NiFi to use 
an alternative authentication method like SAML, LDAP, OIDC, etc. You’ll also 
need to make sure that your proxy is passing the various HTTP headers through 
to NiFi and that NiFi is expecting traffic from a proxy. If you look in the 
nifi-user.log and nifi-app.log there might be some hints about what it didn’t 
like.
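
For reference, a minimal sketch of the proxy-related entries in nifi.properties, 
using the external name from the error message further down the thread as a 
placeholder (host and port values are examples, adjust to your environment):

# host[:port] values NiFi will accept in the Host / X-ProxyHost headers coming from the ADC
nifi.web.proxy.host=nifiprod.oururl.com:443,nifiprod.oururl.com
# only needed if the ADC exposes NiFi under an extra path prefix
nifi.web.proxy.context.path=

The ADC also has to forward, not strip, the X-Forwarded-* / X-Proxy* headers it 
adds so that NiFi can build correct URLs behind the proxy.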

Thanks
Shawn

From: Jakobsson Stefan <stefan.jakobs...@scania.com>
Sent: Monday, October 18, 2021 2:26 PM
To: users@nifi.apache.org
Subject: RE: Nifi and Registry behind Citrix ADC.

Ahh, no, ADC as in application delivery and load balancing 😊

Stefan Jakobsson

Systems Manager  |  Scania IT, IKCA |  Scania CV AB
Phone: +46 8 553 527 27 Mobile: +46 7 008 834 76
Forskargatan 20, SE-151 87 Södertälje, Sweden
stefan.jakobs...@scania.com

From: Lehel Boér <lehel.b...@gmail.com>
Sent: October 18, 2021 15:03
To: users@nifi.apache.org
Subject: Re: Nifi and Registry behind Citrix ADC.

Hi Stefan,

Please disregard my prior response. The name misled me; I discovered that ADC 
is not the same as Active Directory.

Kind Regards,
Lehel Boér

On Mon, Oct 18, 2021 at 14:54, Lehel Boér <lehel.b...@gmail.com> wrote:
Hi Stefan,

Have you tried setting up NiFi with an LDAP provider? Here are a few useful 
links.

- https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.4.1.1/nifi-security/content/ldap_login_identity_provider.html
- https://pierrevillard.com/2017/01/24/integration-of-nifi-with-ldap
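
For what it's worth, a minimal sketch of the ldap-provider block in 
login-identity-providers.xml along the lines of those guides (all values are 
placeholders, trimmed to the essentials; the commented-out template shipped in 
the file lists the remaining timeout and TLS properties needed for LDAPS):

<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=nifi-svc,ou=service,dc=example,dc=com</property>
    <property name="Manager Password">changeit</property>
    <property name="Url">ldaps://ad.example.com:636</property>
    <property name="User Search Base">ou=users,dc=example,dc=com</property>
    <property name="User Search Filter">sAMAccountName={0}</property>
    <property name="Identity Strategy">USE_USERNAME</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>

plus, in nifi.properties, nifi.security.user.login.identity.provider=ldap-provider.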

Kind Regards,
Lehel Boér

On Mon, Oct 18, 2021 at 13:02, Jakobsson Stefan <stefan.jakobs...@scania.com> wrote:
Hello,

I have some issues trying to run NiFi and NiFi Registry behind an ADC. The 
reason is that NiFi needs to be reachable from AWS into our on-prem NiFi 
installation, due to demands from our IT security department.

Anyhow, I can connect to NiFi Registry on the server's IP address (i.e. 
x.x.x.x:9443/nifi-registry) without problems, but if I try to use the URL set 
up in the ADC, with 9443 redirected to the NiFi server's IP, we get an error saying:

This page isn’t working
nifiprod.oururl.com didn’t send any data.
ERR_EMPTY_RESPONSE

Does anyone have ideas about what I should start looking at? I set the 
https.host to 0.0.0.0 in nifi-registry.conf.
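
In case it helps to double-check, a minimal sketch of how the host/port entries 
usually look in nifi-registry.properties (values are placeholders for a default 
HTTPS-only setup):

nifi.registry.web.https.host=0.0.0.0
nifi.registry.web.https.port=9443
# leave the HTTP pair empty when running HTTPS only
nifi.registry.web.http.host=
nifi.registry.web.http.port=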

Stefan Jakobsson

Systems Manager  |  Scania IT, IKCA |  Scania CV AB
Phone: +46 8 553 527 27 Mobile: +46 7 008 834 76
Forskargatan 20, SE-151 87 Södertälje, Sweden
stefan.jakobs...@scania.com



Re: Does NiFi provide end-to-end exactly once semantics with kafka sink and replayable source?

2021-10-19 Thread Theo Diefenthal
Hi Mark, 

Thanks a lot for your explanations. 

Sad to hear that NiFi doesn't support Exactly-Once out-of-the-box, but totally 
understandable. 

I think your argumentation is not fully complete, though. I might be wrong 
here, but the following is my understanding: 
You write that Kafka EOS works in a way that a transaction is fully handled by 
the producer. The producer _atomically_ takes care of committing source offsets 
and committing produced messages (and in Kafka Streams it also commits internal 
streaming state). You conclude that Kafka EOS only applies when a given Kafka 
cluster is both source and destination of the data. That's the understanding I 
also got when I first read the Confluent blog series about exactly-once 
processing. What Confluent does in those posts is explicitly describe how 
exactly-once semantics are implemented in a Kafka Streams application (reading 
from Kafka and writing to the same Kafka). Of course, if they are in "their own 
world", controlling source, state and sink, they can be more efficient with 
transactions and exactly-once guarantees, and that's what they describe. But 
the transaction concept of Kafka applies to a broader range of tools (at least 
if one doesn't expect _atomic_ updates of the source read and the sink write 
simultaneously; I also don't know why someone would need that). 
When I read more about Apache Flink, I understood that the exactly-once 
semantics of Kafka are not bound to processing within the Kafka domain only. 
Kafka producers implement a two-phase-commit protocol and one can utilize that 
in one's own application. Apache Flink and Apache Spark both promise that. If 
one has a replayable source in Spark or Flink (like a Kafka topic) and a sink 
that is either idempotent or supports two-phase commits, both Flink and Spark 
provide exactly-once semantics on demand (of course, the processing job must be 
free of side effects). See [1] for instance. 

With the two-phase-commit sink, in the end, it comes down to managing your 
entire state in one system. In Kafka Streams, all the state is managed in Kafka 
(and committed to atomically). In Flink, it is managed in the configured state 
backend, which can be a standard filesystem, HDFS, S3, whatever... 
Let's take a look at a very simple example reading from a Kafka cluster and 
writing to another Kafka cluster, with a SQL DB in the middle as state store 
(of course that's not performant): 
1. I can read messages from Kafka and write them to a SQL DB. If I manage my 
consumer offsets myself within the SQL DB where I also store the Kafka records, 
I can easily make exactly-once guarantees: either I commit a transaction (with 
newly consumed records and offsets) atomically, or I abort it. If something 
fails, I redo the process (hence the need for a replayable source). 
2. More tricky: I read records from the DB, and within another DB transaction I 
write a batch of messages to Kafka (in a Kafka transaction). If Kafka succeeds 
with the pre-commit, I commit my SQL DB transaction with the required info 
(transactional id, producer id, transaction epoch). With some more 
communication with the DB, I eventually end up with exactly-once semantics from 
DB to Kafka (Kafka transactions with two-phase commit: if something crashes 
after the pre-commit, I can restart the producer, reassign the transactional 
id, resume the transaction with producer id and epoch, and retry the commit 
until it eventually succeeds). 
In summary, I designed end-to-end exactly once from one Kafka cluster to 
another Kafka cluster. [If we want to introduce some new naming here, we could 
call that eventual end-to-end exactly once :) Source and sink are not updated 
simultaneously (atomically) as in Kafka, but an intermediate store is utilized 
to track progress.] 
Of course, the devil is in the details and the implementation of such logic is 
very tricky. One has to take care of lots of possible failure cases, but 
technically it's possible. Whether that's technically possible within the 
current NiFi architecture is another question, which I can't answer as I don't 
have enough understanding of NiFi. For instance, I think that for two-phase 
commits one needs some kind of "coordinator". In Flink, that's the JobManager; 
in NiFi, I don't know if there is such a thing. I just wanted to add my two 
cents that, in general, Kafka can provide exactly-once semantics even when not 
operating within a single Kafka cluster only, but with some other compatible 
sources and/or sinks involved. 
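
For reference, a minimal sketch of the standard Kafka read-process-write 
transaction that the two-phase-commit discussion above refers to (plain Apache 
Kafka Java client; broker address, topic names and ids are placeholders, and 
error handling is reduced to the essentials):

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceCopy {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "copy-group");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "copy-tx-1"); // stable id enables fencing after a crash
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(List.of("in-topic"));
            producer.initTransactions(); // registers the transactional producer and fences older instances with the same id
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> rec : records) {
                        producer.send(new ProducerRecord<>("out-topic", rec.key(), rec.value()));
                        offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                    new OffsetAndMetadata(rec.offset() + 1));
                    }
                    // source offsets ride in the same transaction as the produced records
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // read_committed consumers never see the aborted batch
                    // a real application would also rewind the consumer to the last committed offsets here
                }
            }
        }
    }
}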

[1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html

Best regards 
Theo 


From: "markap14" 
To: "Theo Diefenthal" 
CC: "users" 
Sent: Sunday, October 3, 2021 16:04:26 
Subject: Re: Does NiFi provide end-to-end exactly once semantics with kafka 
sink and replayable source? 

Hey Th

Re: CryptographicHashContent calculates 2 different sha256 hashes on the same content

2021-10-19 Thread Mark Payne
Jens,

In the two provenance events - one showing a hash of dd4cc… and the other 
showing f6f0….
If you go to the Content tab, do they both show the same Content Claim? I.e., 
do the Input Claim / Output Claim show the same values for Container, Section, 
Identifier, Offset, and Size?
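
If it helps to cross-check outside NiFi, here is a minimal sketch that hashes a 
byte range of a file, assuming the default file-based content repository where 
a claim is a plain file and the flowfile's content is the [offset, offset + size) 
slice of it; the path, offset and size you pass in would come from the 
Container/Section/Identifier/Offset/Size fields mentioned above:

import java.io.RandomAccessFile;
import java.security.MessageDigest;

public class ClaimSliceHash {
    // usage: java ClaimSliceHash <claim-file> <offset> <size>
    public static void main(String[] args) throws Exception {
        long offset = Long.parseLong(args[1]);
        long size = Long.parseLong(args[2]);
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            raf.seek(offset);
            byte[] buf = new byte[64 * 1024];
            long remaining = size;
            while (remaining > 0) {
                int read = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (read < 0) break; // claim file shorter than expected
                sha256.update(buf, 0, read);
                remaining -= read;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) hex.append(String.format("%02x", b));
        System.out.println(hex); // compare against the value reported by CryptographicHashContent
    }
}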

Thanks
-Mark

On Oct 19, 2021, at 1:22 AM, Jens M. Kofoed <jmkofoed@gmail.com> wrote:

Dear NIFI Users

I have posted this mail to the developers mailing list and just want to inform 
all of you about a very odd behavior we are facing.
The background:
We have data going between 2 different NiFi systems which have no direct 
network access to each other. Therefore we calculate a SHA-256 hash value of 
the content at system 1 before the flowfile and data are combined and saved as 
a "flowfile-stream-v3" pkg file. The file is then transported to system 2, 
where the pkg file is unpacked and the flow can continue. To be sure about file 
integrity we calculate a new SHA-256 at system 2. But sometimes we see that the 
SHA-256 gets another value, which might suggest the file was corrupted. But 
recalculating the SHA-256 again gives a new hash value.



Tonight I had yet another file which didn't match the expected SHA-256 hash 
value. The content is a 1.7 GB file and the Event Duration to calculate the 
hash was "00:00:17.539".
I have created a retry loop, where the file goes to a Wait processor that 
delays it for 1 minute before sending it back to the CryptographicHashContent 
for a new calculation. After 3 retries the file goes to retries_exceeded and on 
to a disabled processor, just so it sits in a queue where I can look at it 
manually. This morning I rerouted the file from my retries_exceeded queue back 
to the CryptographicHashContent for a new calculation, and this time it 
calculated the correct hash value.

THIS CAN'T BE TRUE :-( :-( But it is. - Something very very strange is 
happening.


We are running NiFi 1.13.2 in a 3-node cluster on Ubuntu 20.04.02 with openjdk 
version "1.8.0_292", OpenJDK Runtime Environment (build 
1.8.0_292-8u292-b10-0ubuntu1~20.04-b10), OpenJDK 64-Bit Server VM (build 
25.292-b10, mixed mode). Each server is a VM with 4 CPUs and 8 GB RAM on VMware 
ESXi 7.0.2. Each NiFi node runs on a different physical VM host.
I have inspected different logs to see if I can find any correlation with what 
happened at the time the file was going through my loop, but there is no 
event/task at that exact time.

System 1:
At 10/19/2021 00:15:11.247 CEST my file is going through a 
CryptographicHashContent: SHA256 value: 
dd4cc7ef8dbc8d70528e8aa788581f0ab88d297c9c9f39b6b542df68952efd20
The file is exported as a "FlowFile Stream, v3" to System 2

SYSTEM 2:
At 10/19/2021 00:18:10.528 CEST the file is going through a 
CryptographicHashContent: SHA256 value: 
f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819

At 10/19/2021 00:19:08.996 CEST the file is going through the same 
CryptographicHashContent at system 2: SHA256 value: 
f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819
At 10/19/2021 00:20:04.376 CEST the file is going through the same 
CryptographicHashContent at system 2: SHA256 value: 
f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819
At 10/19/2021 00:21:01.711 CEST the file is going through the same 
CryptographicHashContent at system 2: SHA256 value: 
f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819

At 10/19/2021 06:07:43.376 CEST the file is going through the same 
CryptographicHashContent at system 2: SHA256 value: 
dd4cc7ef8dbc8d70528e8aa788581f0ab88d297c9c9f39b6b542df68952efd20


How on earth can this happen???

Kind Regards
Jens M. Kofoed




NiFi 1.12.1 content repository clean up issues

2021-10-19 Thread Bruno Gante
Hi all,

I have a cluster running NiFi 1.12.1 that processes a significant number of 
small flowfiles (around 3K/sec).

The system is performing quite well after some fine tuning, but I am facing an 
issue with the content repository, which does not seem to clean up properly. 
For example, it reached 100% usage on all 3 cluster nodes, each with a 500 GB 
content disk, in less than 3 days. Today I also tried moving the content repo 
into memory; performance is even better, but in terms of content repository 
cleanup it reached 12 GB of memory in just 6 hours. I do not have any old 
queued flowfiles that might block claim deletion. Each flowfile has an average 
lineage duration of 1-2 min.

Any advice or guidance would be really appreciated.

Below is my content repository config:


nifi.content.claim.max.appendable.size=10 MB
nifi.content.claim.max.flow.files=100
nifi.content.repository.directory.default=../content_repository
nifi.content.repository.archive.max.retention.period=1 hours
nifi.content.repository.archive.max.usage.percentage=80%
nifi.content.repository.archive.enabled=false
nifi.content.repository.always.sync=false

On top of that, does anyone know if there's a way to estimate the required 
content repository size from the read/write volume and content config?
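
As a rough back-of-envelope sketch only (the average content size per flowfile 
is an assumption here, nothing above states it): at ~3,000 flowfiles/sec and an 
assumed 10 KB of content each, the cluster writes about 30 MB/s. With a 1-2 
minute lineage duration that is only roughly 30 MB/s x 120 s ≈ 3.6 GB of live 
content, plus about 30 MB/s x 3,600 s ≈ 105 GB if the 1-hour archive were 
enabled. Note that with the claim settings above (up to 100 flowfiles / 10 MB 
per claim) a claim can only be released once every flowfile stored in it is 
gone, so the practical footprint can sit somewhat above the live-content figure.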

Thanks