Hi folks,

My company is using NiFi for several data-flow processes, and we have now received a requirement to do some fairly complex ETL over large files. To process those files we have some proprietary applications (mostly written in Python or Go) that run as Docker containers.
I don't think that porting those apps to NiFi processors would produce a good result, given each app's complexity. We would also like to keep using the NiFi queues so we can monitor overall progress as we already do (we run several other NiFi flows), so for now we are discarding solutions that submit files to an external queue such as SQS or RabbitMQ for consumption.

So far we have come up with a design that would:

1. Have a Kubernetes cluster of jobs periodically querying the NiFi queue for new flowfiles, pulling one when a file arrives.
2. Download the file content (which is already stored outside of NiFi) and process it.
3. Submit the result back to NiFi (using an HTTP listener processor) to trigger the subsequent NiFi processing.

For steps 1 and 2 we are considering two possible approaches:

A) Use a MiNiFi container together with the app container in a sidecar design. MiNiFi would connect to our NiFi cluster and handle the file download to a local volume for the app container to process.
B) Use the NiFi REST API to query and consume flowfiles from the queue.

One requirement is that, if needed, we could manually scale up the app cluster so that multiple containers consume more queued files in parallel.

Do you recommend one approach over the other (or a third approach)? Any pitfalls you can foresee? I would be really glad to hear your thoughts on this.

Best regards,
Eric
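P.S. In case it helps frame the question, below is the rough shape of the worker we had in mind for approach B. It is only a sketch: the base URL, connection ID, and ListenHTTP address are placeholders for our environment, and the queue-listing endpoint (`/flowfile-queues/{id}/listing-requests`) is actually asynchronous, so a real worker would have to poll the listing request until it finishes. One thing we are unsure about is that this listing endpoint seems intended for inspection and does not dequeue the flowfile, which may be an argument for the MiNiFi/site-to-site approach instead.

```python
# Sketch of an approach-B worker (assumptions: NIFI_API, CONNECTION_ID and
# LISTEN_HTTP are placeholders, not a working environment).
import urllib.request

NIFI_API = "http://nifi.example.com:8080/nifi-api"            # hypothetical NiFi host
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"        # hypothetical queue/connection id
LISTEN_HTTP = "http://nifi.example.com:9090/contentListener"  # ListenHTTP's default base path

def listing_request_url(base, connection_id):
    """Build the URL of NiFi's flowfile-queue listing endpoint (POST creates a listing request)."""
    return f"{base}/flowfile-queues/{connection_id}/listing-requests"

def pick_next_flowfile(listing):
    """Pick the first flowfile summary out of a listing-request entity, or None if the queue is empty.

    Note: in the real API the listing request is asynchronous; summaries only
    appear once the request's 'finished' flag is true, so callers may need to
    poll the request by id before calling this.
    """
    summaries = listing.get("listingRequest", {}).get("flowFileSummaries", [])
    return summaries[0] if summaries else None

def submit_result(result_bytes):
    """Step 3: POST the processed output back to the ListenHTTP processor."""
    req = urllib.request.Request(LISTEN_HTTP, data=result_bytes, method="POST")
    return urllib.request.urlopen(req)

# Intended loop (not executed here):
#   1. POST to listing_request_url(NIFI_API, CONNECTION_ID) and poll until finished
#   2. pick_next_flowfile(...) -> download its content from our external store -> process
#   3. submit_result(processed_bytes) to trigger the downstream NiFi flow
```

The app containers would run this loop continuously, so scaling the cluster up just means running more replicas against the same queue.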