When we set “Include File Attributes” to false, does that in any way impact ListFile’s ability to track state and pick up new files?
On Fri, Mar 19, 2021 at 1:08 PM Mark Payne <[email protected]> wrote:

> It’s hard to say without knowing what’s taking so long. Is it simply
> crawling the directory structure that takes forever? If so, there’s
> not a lot that can be done, as accessing tons of files just tends to
> be slow. One way to verify this, on Linux, would be to run:
>
>     ls -laR
>
> I.e., a recursive listing of all files. Not sure what the analogous
> command would be on Windows.
>
> The “Track Performance” property of the processor can be used to
> understand more about the performance characteristics of the disk
> access. Set that to true and enable DEBUG logging for the processor.
>
> If there are heap concerns from generating a million FlowFiles, you
> can set a Record Writer on the processor so that only a single
> FlowFile gets created. That can then be split up using a tiered
> approach (SplitRecord to split into 10,000-record chunks, then
> another SplitRecord to split each 10,000-record chunk into 1-record
> chunks, and then EvaluateJsonPath, for instance, to pull the actual
> filename into an attribute). I suspect this is not the issue, with
> that much heap and given that it’s approximately 1 million files.
> But it may be a factor.
>
> Also, setting “Include File Attributes” to false can significantly
> improve performance, especially on a remote network drive or some
> specific types of drives/OS’s.
>
> Would recommend you play around with the above options to better
> understand the performance characteristics of your particular
> environment.
>
> Thanks
> -Mark
>
> On Mar 19, 2021, at 12:57 PM, Mike Sofen <[email protected]> wrote:
>
> I’ve built a document processing solution in NiFi, using the
> ListFile/FetchFile model hitting a large document repository on our
> Windows file server. It’s nearly a million files ranging in size from
> 100 KB to 300 MB, with file types of pdf, doc/docx, xls/xlsx, pptx,
> text, xml, rtf, png, tiff, and some specialized binary files. The
> million files are distributed across tens of thousands of folders.
>
> The challenge is, for an example subfolder that has 25k files in 11k
> folders totalling 17 GB, it took upwards of 30 minutes for a single
> ListFile to generate a list and send it downstream to the next
> processor. It’s running on a PC with the latest-gen Core i7 with
> 32 GB RAM and a 1 TB SSD – plenty of horsepower and speed. My
> bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.
>
> Is there any way to speed up ListFile?
>
> Also, is there any way to detect that a file is encrypted? I’m
> sending these for processing by Tika, and Tika generates an error
> when it receives an encrypted file (we have just a few of those, but
> enough to be annoying).
>
> Mike Sofen
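
A couple of concrete sketches of the suggestions above (untested, offered as starting points; the paths are placeholders).

To check whether the raw directory crawl is the bottleneck, time a recursive listing outside of NiFi. On Linux:

    time ls -laR /path/to/repo > /dev/null

A rough Windows analogue would be:

    dir /s /a D:\docs > NUL

or, in PowerShell:

    Measure-Command { Get-ChildItem -Path 'D:\docs' -Recurse -Force | Out-Null }

If these take on the same order of time as ListFile for the same subtree, the cost is in the filesystem or network share rather than in the processor itself.

To get the “Track Performance” output, enable DEBUG logging for the processor by adding a logger to conf/logback.xml; this assumes the stock processor class name:

    <logger name="org.apache.nifi.processors.standard.ListFile" level="DEBUG"/>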

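For the tiered-split approach, once ListFile is configured with a Record Writer (e.g. a JsonRecordSetWriter) and the listing has been split down to one record per FlowFile, EvaluateJsonPath can pull the name into an attribute with a path along the lines of:

    $.filename

(assuming the writer emits the listing’s filename field under that name; it is worth inspecting one record’s JSON to confirm the field names your writer actually produces).

On detecting encrypted files: Tika signals these with org.apache.tika.exception.EncryptedDocumentException, so one option is simply to let those parses fail and route the failures away from the success path. For PDFs specifically, a crude pre-filter (it will not catch every PDF variant) is to look for an /Encrypt entry in the raw file:

    grep -l "/Encrypt" *.pdf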