Hi Eric,

Although my knowledge of MiNiFi, Python, and Go is limited, I wonder if the "nanofi" library could be used from the proprietary applications so that they fetch FlowFiles directly using the Site-to-Site protocol. That could be an interesting approach, and it would eliminate the need to store data on a local volume (mentioned in possible approach A).
https://github.com/apache/nifi-minifi-cpp/tree/master/nanofi
The latest MiNiFi (C++) version 0.6.0 was released recently.
https://cwiki.apache.org/confluence/display/MINIFI/Release+Notes

Thanks,
Koji

On Thu, Apr 11, 2019 at 2:28 AM Eric Chaves <[email protected]> wrote:
>
> Hi Folks,
>
> My company uses NiFi for several data-flow processes, and we have now
> received a requirement to do some fairly complex ETL over large files. To
> process those files we have some proprietary applications (mostly written
> in Python or Go) that run as Docker containers.
>
> I don't think that porting those apps to NiFi processors would produce a
> good result, due to each app's complexity.
>
> We would also like to keep using the NiFi queues so we can monitor overall
> progress as we already do (we run several other NiFi flows), so for now we
> are ruling out solutions that, for example, submit files to an external
> queue like SQS or RabbitMQ for consumption.
>
> So far we have come up with a solution that would:
>
> 1. Have a Kubernetes cluster of jobs periodically querying the NiFi queue
>    for new flowfiles and pulling one when a file arrives.
> 2. Download the file content (which is already stored outside of NiFi) and
>    process it.
> 3. Submit the result back to NiFi (using an HTTP listener processor) to
>    trigger subsequent NiFi processing.
>
> For steps 1 and 2 we are considering two possible approaches:
>
> A) Use a MiNiFi container together with the app container in a sidecar
>    design. MiNiFi would connect to our NiFi cluster and download files to
>    a local volume for the app container to process.
>
> B) Use the NiFi REST API to query and consume flowfiles on the queue.
>
> One requirement is that, if needed, we could manually scale up the app
> cluster to have multiple containers consuming more queued files in
> parallel.
>
> Do you recommend one over the other (or a third approach)? Any pitfalls
> you can foresee?
>
> I would be really glad to hear your thoughts on this matter.
>
> Best regards,
>
> Eric
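For approach A, the MiNiFi sidecar's flow is defined in a config.yml. Below is a rough sketch of what such a config might look like: a Remote Process Group pulls from an Output Port on the central NiFi cluster over Site-to-Site, and a PutFile processor writes the content to a volume shared with the app container. All URLs, ids, and names here are placeholders, and the exact key names should be checked against the config schema of the MiNiFi version you deploy:

```yaml
MiNiFi Config Version: 3
Flow Controller:
  name: sidecar-fetch
Processors:
- id: put-file-1
  name: PutFile
  class: org.apache.nifi.processors.standard.PutFile
  scheduling strategy: TIMER_DRIVEN
  scheduling period: 1 sec
  auto-terminated relationships list:
  - success
  - failure
  Properties:
    Directory: /shared/incoming        # volume mounted into the app container
Remote Process Groups:
- id: rpg-1
  name: central-nifi
  url: http://nifi.example.com:8080/nifi   # placeholder cluster URL
  timeout: 30 secs
  yield period: 5 sec
  Output Ports:
  - id: output-port-uuid               # id of the Output Port on the NiFi side
    name: to-etl-apps
    max concurrent tasks: 1
    use compression: false
Connections:
- id: conn-1
  name: port-to-putfile
  source id: output-port-uuid
  destination id: put-file-1
```

One nice property of this design for your scaling requirement: each sidecar pulls from the same Output Port, so adding more app pods naturally distributes the queued flowfiles across consumers.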
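For approach B, one worker iteration against the REST API could look roughly like the sketch below: create a listing request on the connection's queue, poll it until it finishes, download the content of the first queued flowfile, and finally POST the processed result to a ListenHTTP processor. The NiFi URL, connection id, and listener path are placeholders, not real endpoints of any cluster:

```python
import json
import time
import urllib.request

NIFI_URL = "http://nifi.example.com:8080/nifi-api"                # placeholder
CONNECTION_ID = "my-connection-uuid"                              # placeholder
LISTEN_HTTP_URL = "http://nifi.example.com:9090/contentListener"  # placeholder

def listing_url(base, connection_id, request_id=None):
    """Build the flowfile-queue listing-requests endpoint URL."""
    url = f"{base}/flowfile-queues/{connection_id}/listing-requests"
    return f"{url}/{request_id}" if request_id else url

def content_url(base, connection_id, flowfile_uuid):
    """Build the endpoint URL for downloading one flowfile's content."""
    return f"{base}/flowfile-queues/{connection_id}/flowfiles/{flowfile_uuid}/content"

def fetch_one_flowfile():
    # Ask NiFi to start listing the queue (the listing runs asynchronously).
    req = urllib.request.Request(listing_url(NIFI_URL, CONNECTION_ID), method="POST")
    with urllib.request.urlopen(req) as resp:
        request_id = json.load(resp)["listingRequest"]["id"]
    # Poll the listing request until NiFi reports it finished.
    while True:
        with urllib.request.urlopen(listing_url(NIFI_URL, CONNECTION_ID, request_id)) as resp:
            listing = json.load(resp)["listingRequest"]
        if listing["finished"]:
            break
        time.sleep(1)
    summaries = listing.get("flowFileSummaries", [])
    if not summaries:
        return None
    # Download the content of the first queued flowfile.
    with urllib.request.urlopen(content_url(NIFI_URL, CONNECTION_ID, summaries[0]["uuid"])) as resp:
        return resp.read()

def submit_result(data: bytes):
    # Hand the processed bytes back to a ListenHTTP processor,
    # which turns the POST body into a new flowfile.
    req = urllib.request.Request(LISTEN_HTTP_URL, data=data, method="POST")
    urllib.request.urlopen(req)
```

One caveat worth checking before committing to this route: the flowfile-queue listing endpoints are designed for queue inspection, so reading a flowfile's content this way does not remove it from the queue, and you would need some other mechanism to let the flowfile proceed (or be dropped) once a worker has claimed it. On a clustered NiFi you may also need to pass the `clusterNodeId` from the flowfile summary as a query parameter when fetching content.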
