Hi Folks,

My company uses NiFi for several data-flow processes, and we have now
received a requirement to do some fairly complex ETL over large files. To
process those files we have some proprietary applications (mostly written
in Python or Go) that run as Docker containers.

I don't think porting those apps to NiFi processors would produce a good
result, given each app's complexity.

We would also like to keep using the NiFi queues so we can monitor overall
progress as we already do (we run several other NiFi flows), so for now we
are discarding solutions that, for example, submit files to an external
queue like SQS or RabbitMQ for consumption.

So far we have come up with a solution that would:

   1. have a Kubernetes cluster of jobs periodically query the NiFi queue
   for new flowfiles and pull one when a file arrives;
   2. download the file content (which is already stored outside of NiFi)
   and process it;
   3. submit the result back to NiFi (using a ListenHTTP processor) to
   trigger the subsequent NiFi flow (see the sketch after this list).


For steps 1 and 2, so far we are considering two possible approaches:

A) Use a MiNiFi container together with the app container in a sidecar
design. MiNiFi would connect to our NiFi cluster and handle the file
download to a local volume, from which the app container would process the
files.
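
On the app side of A, we imagine little more than a loop watching the
shared volume (all paths below are made up, and a real version would have
to guard against picking up files MiNiFi is still writing, e.g. via a
write-then-rename convention):

    import shutil
    import time
    from pathlib import Path

    # Hypothetical directories on the volume shared between the MiNiFi
    # sidecar and the app container: MiNiFi drops downloaded files into
    # INCOMING, and the app leaves results in DONE for pickup.
    INCOMING = Path("/shared/incoming")
    DONE = Path("/shared/processed")

    def process(path: Path) -> None:
        # Placeholder for the proprietary app's actual work.
        DONE.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), DONE / path.name)

    if __name__ == "__main__":
        while True:
            for f in sorted(INCOMING.glob("*")):
                process(f)
            time.sleep(5)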

B) Use the NiFi REST API to query and consume flowfiles from the queue.
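
For B we were picturing something like the sketch below (Python, untested),
built on the queue-listing calls the NiFi UI itself uses; the base URL,
connection id and auth are placeholders. One concern we already have: the
listing is read-only, and as far as we can tell the REST API can only drop
a whole queue, not acknowledge a single flowfile, so the flow itself would
still need to route processed flowfiles out of the queue.

    import time

    import requests

    NIFI = "https://nifi.example.com/nifi-api"  # hypothetical base URL
    CONN = "<connection-uuid>"                  # id of the queued connection

    def list_queue(session):
        # Snapshot the queue (same call the UI's "List queue" uses; it
        # does NOT remove flowfiles from the queue).
        req = session.post(f"{NIFI}/flowfile-queues/{CONN}/listing-requests")
        req_id = req.json()["listingRequest"]["id"]
        # Poll until the listing is finished.
        while True:
            res = session.get(
                f"{NIFI}/flowfile-queues/{CONN}/listing-requests/{req_id}"
            ).json()["listingRequest"]
            if res["finished"]:
                break
            time.sleep(0.5)
        # Clean up the listing request on the server.
        session.delete(f"{NIFI}/flowfile-queues/{CONN}/listing-requests/{req_id}")
        return res.get("flowFileSummaries", [])

    def download(session, summary, dest_dir="/tmp"):
        # Stream one flowfile's content to disk; on a cluster, the node
        # holding the flowfile must be passed along.
        params = {}
        if summary.get("clusterNodeId"):
            params["clusterNodeId"] = summary["clusterNodeId"]
        url = f"{NIFI}/flowfile-queues/{CONN}/flowfiles/{summary['uuid']}/content"
        path = f"{dest_dir}/{summary['filename']}"
        with session.get(url, params=params, stream=True) as r:
            r.raise_for_status()
            with open(path, "wb") as out:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        return path

    if __name__ == "__main__":
        s = requests.Session()  # add auth (e.g. a bearer token) as needed
        for ff in list_queue(s):
            print("downloaded", download(s, ff))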

One requirement is that, if needed, we can manually scale the app cluster
up so that multiple containers consume queued files in parallel.

Do you guys recommend one over the other (or a third approach)? Any
pitfalls you can foresee?

I'd be really glad to hear your thoughts on this matter.

Best regards,

Eric
