Thanks, that's actually what I ended up doing. In case anyone comes along looking for this, the approach we used for development was: GetFile -> SplitText (50k lines per chunk) -> SplitText (1 line per flowfile) -> the rest of the flow.
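For anyone wondering why the intermediate 50k split matters, here is a rough standalone sketch of the same two-stage idea in plain Java. This is illustrative only, not NiFi or SplitText internals; the chunk size constant and the processChunk hook are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TwoStageSplit {

    // Mirrors the line count we used on the first SplitText.
    static final int CHUNK_SIZE = 50_000;

    public static void main(String[] args) throws IOException {
        // args[0]: path to the large input file (placeholder).
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> chunk = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_SIZE) {
                    processChunk(chunk);   // stage two happens per chunk
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                processChunk(chunk);       // trailing partial chunk
            }
        }
    }

    // Stage two: split one medium-size chunk into individual records.
    // Only one chunk is ever held at a time, never all of the records.
    static void processChunk(List<String> records) {
        for (String record : records) {
            // hand each record to the rest of the pipeline here
        }
    }
}

The point is the same one Andy makes below: you never materialize a million single-record units in one step.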
On Fri, Apr 7, 2017 at 1:11 PM, Andy LoPresto <[email protected]> wrote:

> Mike,
>
> Are the files a single coherent piece of information (e.g. a video file)
> or collections of smaller atomic units of data (e.g. CSV, JSON batches)?
> In the first case, it's important to ensure that the processors which
> deal with the content do so in a streaming manner so as not to exhaust
> your heap (and ensure any custom processors you develop do the same).
> With the latter, when splitting and merging these records, we generally
> propose a two-step approach, where a single giant file is split into
> medium-size flowfiles, and then each of these is split into individual
> records (e.g. 1 * 1MM -> 10 * 100K -> 10 * 100K * 1, as opposed to
> 1 * 1MM -> 1MM * 1).
>
> Other than that, be sure to follow the best practices for configuration
> in the Admin Guide [1] and read about performance expectations [2].
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
> [2] https://nifi.apache.org/docs/nifi-docs/html/overview.html#performance-expectations-and-characteristics-of-nifi
>
> Andy LoPresto
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On Apr 7, 2017, at 5:26 AM, Mike Thomsen <[email protected]> wrote:
>
> I have one flow that will have to handle files that are anywhere from
> 500 MB to several GB in size. The current plan is to store them in HDFS
> or S3 and then bring them down for processing in NiFi. Are there any
> suggestions on how to handle such large single files?
>
> Thanks,
>
> Mike
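Also, for anyone who ends up writing a custom processor for files this size: the streaming content access Andy mentions looks roughly like this against the ProcessSession API. A minimal sketch of the body of onTrigger() in a processor extending AbstractProcessor; REL_SUCCESS and the empty read loop are placeholders for your own relationship and per-byte logic:

import java.io.InputStream;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;

@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    // session.read() hands us the content as a stream, so only this 8 KB
    // buffer sits on the heap no matter how large the flowfile is.
    session.read(flowFile, (final InputStream in) -> {
        final byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            // inspect or consume the bytes incrementally here
        }
    });
    session.transfer(flowFile, REL_SUCCESS); // REL_SUCCESS: your own relationship
}

The thing to avoid is reading the whole content into a byte[] (or a String), which is what blows the heap on multi-GB flowfiles.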
