Hi Joe,

Yes, a symlink is another option I was considering when I was trying to use GetFile. Thanks for your insights. I will update this mail chain when my entire workflow completes, so that it can be a reference for others :).
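For reference, a minimal sketch of the symlink trick in Java NIO (the paths are hypothetical; the idea is that NiFi reads and deletes the link while the original file stays put):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SymlinkFeeder {
        public static void main(String[] args) throws IOException {
            // The original file we want NiFi to ingest without moving it.
            Path source = Paths.get("/data/archive/file1.csv");
            // A symlink placed in the directory NiFi watches.
            Path link = Paths.get("/data/nifi-pickup/file1.csv");
            Files.createSymbolicLink(link, source);
            // GetFile can now read and delete the link; only the link is
            // removed, so NiFi needs no state about filenames or checksums.
        }
    }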
-Obaid

On Monday, January 4, 2016, Joe Witt <[email protected]> wrote:
> Obaid,
>
> You make a great point.
>
> I agree we will ultimately need to do more to make that very valid
> approach work easily. The downside is that it puts the onus on NiFi to
> keep track of a variety of potentially quite large state about the
> directory. One way to avoid that expense is if NiFi can pull a copy
> of, then delete, the source file. If you'd like to keep a copy around,
> I wonder if a good approach is to simply create a symlink to the
> original file you want NiFi to pull, but have the symlink in the NiFi
> pickup directory. NiFi is then free to read and delete, which means it
> simply pulls whatever shows up in that directory and doesn't have to
> keep state about filenames and checksums.
>
> I realize we still need to do what you're suggesting as well, but
> thought I'd run this by you.
>
> Joe
>
> On Sun, Jan 3, 2016 at 6:43 PM, obaidul karim <[email protected]> wrote:
> > Hi Joe,
> >
> > Consider a scenario where we need to feed some older files, and we
> > are using "mv" to feed files to the input directory (to reduce I/O we
> > may use "mv"). If we use "mv", the last modified date will not
> > change, and this is very common on a busy file collection system.
> >
> > However, I think I can still manage it by adding an additional
> > "touch" before moving the file into the target directory.
> >
> > So, my suggestion is to make the file selection criteria a
> > configurable option in the ListFile processor. Options could be last
> > modified date (the current behavior), unique file names, checksums,
> > etc.
> >
> > Thanks again man.
> > -Obaid
> >
> > On Monday, January 4, 2016, Joe Witt <[email protected]> wrote:
> >> Hello Obaid,
> >>
> >> The default behavior of the ListFile processor is to keep track of
> >> the last modified time of the files it lists. When you changed the
> >> name of the file, that didn't change the last modified time as
> >> tracked by the OS, but when you altered the content it did. A simple
> >> 'touch' on the file would do it too.
> >>
> >> I believe we could observe the last modified time of the directory
> >> in which the file lives to detect something like a rename. However,
> >> we'd not know which file was renamed, just that something was
> >> changed. So it would require keeping some potentially problematic
> >> state to deconflict, or requiring the user to run a duplicate
> >> detection process afterwards.
> >>
> >> So with that in mind, is the current behavior sufficient for your
> >> case?
> >>
> >> Thanks
> >> Joe
> >>
> >> On Sun, Jan 3, 2016 at 6:17 AM, obaidul karim <[email protected]> wrote:
> >> > Hi Joe,
> >> >
> >> > I am now exploring your solution, starting with the flow below:
> >> >
> >> > ListFile > FetchFile > CompressContent > PutFile
> >> >
> >> > All seems fine, except for some confusion about how ListFile
> >> > identifies new files.
> >> > In order to test, I renamed an already processed file, put it in
> >> > the input folder, and found that the file was not processed.
> >> > Then I randomly changed the content of the file and it was
> >> > immediately processed.
> >> >
> >> > My question is: what is the new-file selection criterion for
> >> > ListFile? Can I change it to file name only?
> >> >
> >> > Thanks in advance.
> >> >
> >> > -Obaid
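For later readers, a minimal sketch of that "mv" plus "touch" workaround in Java NIO (paths hypothetical): a move preserves the file's modification time, so we stamp a fresh one to make ListFile's last-modified tracking treat the file as new.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.attribute.FileTime;

    public class MoveAndTouch {
        public static void main(String[] args) throws IOException {
            Path src = Paths.get("/data/staging/old-file.csv");
            Path dst = Paths.get("/data/nifi-pickup/old-file.csv");
            // "mv" within the same filesystem: cheap on I/O, but it keeps
            // the old mtime, so ListFile would silently skip the file.
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
            // The "touch": give the file a current modification time.
            Files.setLastModifiedTime(dst,
                    FileTime.fromMillis(System.currentTimeMillis()));
        }
    }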
> >> >
> >> > On Fri, Jan 1, 2016 at 10:43 PM, Joe Witt <[email protected]> wrote:
> >> >> Hello Obaid,
> >> >>
> >> >> At 6 TB/day and an average size of 2-3 GB per dataset, you're
> >> >> looking at a sustained rate of 70+ MB/s and a pretty low
> >> >> transaction rate, so well within a good range to work with on a
> >> >> single system.
> >> >>
> >> >> "Is there any way to bypass writing flow files on disk or directly
> >> >> pass those files to HDFS as is?"
> >> >>
> >> >> There is no way to bypass NiFi taking a copy of that data, by
> >> >> design. NiFi is helping you formulate a graph of dataflow
> >> >> requirements from given source(s) through given processing steps,
> >> >> ultimately driving data into given destination systems. As a
> >> >> result, it takes on the challenge of handling the transactionality
> >> >> of each interaction and the buffering and backpressure needed to
> >> >> deal with the realities of different production/consumption
> >> >> patterns.
> >> >>
> >> >> "If the files in the spool directory are compressed (zip/gzip),
> >> >> can we store files on HDFS as uncompressed?"
> >> >>
> >> >> Certainly. Both of those formats (zip/gzip) are supported in NiFi
> >> >> out of the box. You simply run the data through the proper
> >> >> processor prior to the PutHDFS processor to unpack (zip) or
> >> >> decompress (gzip) as needed.
> >> >>
> >> >> "2.a Can we use our existing Java code for masking? If yes, then how?
> >> >> 2.b For this scenario we also want to bypass storing flow files on
> >> >> disk. Can we do it on the fly, masking and storing on HDFS?
> >> >> 2.c If the source files are compressed (zip/gzip), is there any
> >> >> issue with masking here?"
> >> >>
> >> >> You would build a custom NiFi processor that leverages your
> >> >> existing code. If your code is able to operate on an InputStream
> >> >> and write to an OutputStream, then it is very likely you'll be
> >> >> able to handle arbitrarily large objects with zero negative impact
> >> >> on the JVM heap as well. This is thanks to the fact that the data
> >> >> is kept in NiFi's repository with copy-on-write/pass-by-reference
> >> >> semantics and that the API exposes those streams to your code in a
> >> >> transactional manner.
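To make that concrete, here is a minimal sketch of such a processor. The class name MaskColumns, the mask() regex, and the relationship wiring are illustrative, and error handling and property descriptors are omitted; the key point is session.write with a stream-to-stream callback, so arbitrarily large files never sit in the heap.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class MaskColumns extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success").description("Masked flow files").build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session)
                throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // Stream content out of the repository and back: line in,
            // masked line out. Only one line at a time is held in memory.
            flowFile = session.write(flowFile, (in, out) -> {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8));
                BufferedWriter writer = new BufferedWriter(
                        new OutputStreamWriter(out, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(mask(line));
                    writer.newLine();
                }
                writer.flush();
            });
            session.transfer(flowFile, REL_SUCCESS);
        }

        // Hypothetical stand-in for your existing masking logic: blank
        // out long digit runs such as card numbers.
        private String mask(String line) {
            return line.replaceAll("\\d{12,19}", "****");
        }
    }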
> >> >>
> >> >> If you want the process of writing to HDFS to also do
> >> >> decompression and masking in one pass, you'll need to extend/alter
> >> >> the PutHDFS processor to do that. It is probably best to implement
> >> >> the flow using cohesive processors (grab files, decompress files,
> >> >> mask files, write to HDFS). Given how the repository construct in
> >> >> NiFi works, and given how caching in Linux works, it is very
> >> >> possible you'll be quite surprised by the throughput you'll see.
> >> >> Even then, you can optimize once you're sure you need to. The
> >> >> other thing to keep in mind here is that a flow that starts out as
> >> >> specific as this often turns into a great place to tap the stream
> >> >> of data to feed some new system or new algorithm with a different
> >> >> format or protocol. At that moment the benefits become even more
> >> >> obvious.
> >> >>
> >> >> Regarding the Flume processors in NiFi and their memory usage:
> >> >> NiFi offers a nice hosting mechanism for Flume processes and
> >> >> brings some of the benefits of NiFi's UI, provenance, and
> >> >> repository concept. However, we're still largely limited to the
> >> >> design assumptions one gets when building a Flume process, and
> >> >> that can be quite memory limiting. We see what we have today as a
> >> >> great way to help people transition their existing Flume flows
> >> >> into NiFi by leveraging their existing code, but we would
> >> >> recommend working to phase those out over time so that you can
> >> >> take full benefit of what NiFi brings over Flume.
> >> >>
> >> >> Thanks
> >> >> Joe
> >> >>
> >> >> On Fri, Jan 1, 2016 at 4:18 AM, obaidul karim <[email protected]> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I am new to NiFi and exploring it as an open source ETL tool.
> >> >> >
> >> >> > As per my understanding, flow files are stored on local disk and
> >> >> > contain the actual data.
> >> >> > If the above is true, let's consider the scenarios below:
> >> >> >
> >> >> > Scenario 1:
> >> >> > - In a spool directory we have terabytes (5-6 TB/day) of files
> >> >> > coming from external sources
> >> >> > - I want to push those files to HDFS as is, without any changes
> >> >> >
> >> >> > Scenario 2:
> >> >> > - In a spool directory we have terabytes (5-6 TB/day) of files
> >> >> > coming from external sources
> >> >> > - I want to mask some of the sensitive columns
> >> >> > - Then send one copy to HDFS and another copy to Kafka
> >> >> >
> >> >> > Questions for Scenario 1:
> >> >> > 1.a In that case those 5-6 TB will be written again on local
> >> >> > disk as flow files and will cause double I/O, which may
> >> >> > eventually cause slower performance due to an I/O bottleneck.
> >> >> > Is there any way to bypass writing flow files on disk, or to
> >> >> > pass those files directly to HDFS as is?
> >> >> > 1.b If the files in the spool directory are compressed
> >> >> > (zip/gzip), can we store the files on HDFS as uncompressed?
> >> >> >
> >> >> > Questions for Scenario 2:
> >> >> > 2.a Can we use our existing Java code for masking? If yes, then
> >> >> > how?
> >> >> > 2.b For this scenario we also want to bypass storing flow files
> >> >> > on disk. Can we do it on the fly, masking and storing on HDFS?
> >> >> > 2.c If the source files are compressed (zip/gzip), is there any
> >> >> > issue with masking here?
> >> >> >
> >> >> > In fact, I tried the above using Flume + Flume interceptors.
> >> >> > Everything worked fine with smaller files, but when source files
> >> >> > are greater than 50 MB Flume chokes :(.
> >> >> > So, I am exploring options in NiFi. Hope I will get some
> >> >> > guidance from you guys.
> >> >> >
> >> >> > Thanks in advance.
> >> >> > -Obaid
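On question 2.c above: gzip can be handled in the same streaming fashion, so decompressing and masking in one pass is possible if you do fold them into one processor. A standalone sketch of the stream shape (file paths and the masking regex are illustrative):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class GzipMaskDemo {
        // The same shape a StreamCallback body would take: decompress and
        // mask in one pass, never buffering the whole file in memory.
        static void maskGzip(InputStream in, OutputStream out) throws IOException {
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(in), StandardCharsets.UTF_8));
            BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(out, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.replaceAll("\\d{12,19}", "****"));
                writer.newLine();
            }
            writer.flush();
        }

        public static void main(String[] args) throws IOException {
            try (InputStream in = new FileInputStream("/data/in/sample.csv.gz");
                 OutputStream out = new FileOutputStream("/data/out/sample.csv")) {
                maskGzip(in, out);
            }
        }
    }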
