OOM errors can often show you the symptom more readily than the cause.

If you have SplitText after it then what Andrew mentioned is almost
certainly the cause.  If RouteText will meet the need, I think you'll
find it yields far better behavior.  The way I'd do what it sounds
like you're doing is:

ListFile
FetchFile
RouteText
PublishKafka (with the message demarcator set to whatever your
end-of-line bytes are)
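As a rough sketch of the key settings (property names recalled from the
NiFi docs, and the route name and pattern are made up for illustration,
so verify against your NiFi version):

```
ListFile      Input Directory: /path/to/logs
FetchFile     File to Fetch: ${absolute.path}/${filename}
RouteText     Routing Strategy: Route to each matching Property Name
              Matching Strategy: Matches Regular Expression
              errors (dynamic property): ^ERROR.*
PublishKafka  Topic Name: ${RouteText.Route}
              Message Demarcator: \n   (your end-of-line bytes)
```

RouteText writes the matched route name to the `RouteText.Route`
attribute, which PublishKafka can then use to pick the topic.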

This will be very efficient and use very little memory.
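For intuition only (this is not NiFi code, just a sketch of the
streaming idea), demarcator-based publishing amounts to splitting a
byte stream on the end-of-line bytes while buffering at most one
partial message:

```python
import io

def publish_demarcated(stream, send, demarcator=b"\n", chunk_size=8192):
    """Split a byte stream on a demarcator, handing each complete
    message to `send`.  Only one chunk plus one partial message is
    ever held in memory, regardless of the total input size."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # everything before the last demarcator is complete messages
        *messages, buf = buf.split(demarcator)
        for message in messages:
            send(message)
    if buf:
        send(buf)  # trailing data with no final demarcator

# usage: "publish" each log line by collecting it into a list
lines = []
publish_demarcated(io.BytesIO(b"INFO start\nERROR boom\nINFO done\n"), lines.append)
# lines == [b"INFO start", b"ERROR boom", b"INFO done"]
```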

Thanks
Joe

On Mon, Nov 14, 2016 at 8:32 AM, Raf Huys <[email protected]> wrote:
> Thanks for making this clear!
>
> I was distracted because I do have a `java.lang.OutOfMemoryError` on the
> GetFile processor itself (and a matching `bytes read` spike corresponding to
> the file size).
>
> On Mon, Nov 14, 2016 at 2:23 PM, Joe Witt <[email protected]> wrote:
>>
>> The pattern you want for this is
>>
>> 1) GetFile or (ListFile + FetchFile)
>> 2) RouteText
>> 3) PublishKafka
>>
>> As Andrew points out, GetFile and FetchFile do *not* read the file
>> contents into memory.  The whole point of NiFi's design in general is
>> to take advantage of the content repository rather than forcing
>> components to hold things in memory.  While they can elect to hold
>> things in memory, they don't have to, and the repository allows
>> reading from and writing to streams, all within a unit-of-work
>> transactional model.  There is a lot more to say on that topic, but
>> you can see a good bit about it in the docs.
>>
>> RouteText is the way to avoid the SplitText memory scenario, where
>> there are so many lines that even holding pointers/metadata about
>> those lines becomes problematic.  You can also do as Andrew points
>> out and split in chunks, which also works well.  RouteText will
>> likely yield higher performance overall, though, if it works for
>> your case.
>>
>> Thanks
>> Joe
>>
>> On Mon, Nov 14, 2016 at 8:11 AM, Andrew Grande <[email protected]> wrote:
>> > Neither GetFile nor FetchFile reads the file into memory; they only
>> > deal with the file handle and pass the contents via a handle to the
>> > content repository (NiFi streams the data in and reads it back as a
>> > stream).
>> >
>> > What you will face, however, is an issue with SplitText when you
>> > try to split the file in one transaction. This might fail depending
>> > on the allocated JVM heap and the file size. A recommended best
>> > practice in this case is to introduce a series of two SplitText
>> > processors: the first pass splits into chunks of e.g. 10,000 rows,
>> > the second into individual lines. Adjust for your expected file
>> > sizes and available memory.
>> >
>> > HTH,
>> > Andrew
>> >
>> > On Mon, Nov 14, 2016 at 7:23 AM Raf Huys <[email protected]> wrote:
>> >>
>> >> I would like to read in a large (several gigs) log file and route
>> >> every line to a (potentially different) Kafka topic.
>> >>
>> >> - I don't want this file to be held in memory
>> >> - I want it to be read once, not more
>> >>
>> >> Using `GetFile` takes the whole file into memory. Same with
>> >> `FetchFile`, as far as I can see.
>> >>
>> >> I also used an `ExecuteProcess` processor in which the file is
>> >> `cat`ed and which splits off a flowfile every millisecond. This
>> >> looked to be a somewhat streaming approach to the problem, but this
>> >> processor runs continuously (or cron-based), and as a consequence
>> >> the logfile is re-injected all the time.
>> >>
>> >> What's the typical NiFi approach for this? Tx
>> >>
>> >> Raf Huys
>
>
>
>
> --
> Mvg,
>
> Raf Huys
