When we set “include file attributes” to False, does that in any way impact
ListFile’s ability to track and retrieve new files by state?

On Fri, Mar 19, 2021 at 1:08 PM Mark Payne <[email protected]> wrote:

> It’s hard to say without knowing what’s taking so long. Is it simply
> crawling the directory structure that takes forever? If so, there’s not a
> lot that can be done, as accessing tons of files just tends to be slow. One
> way to verify this, on Linux, would be to run:
>
> ls -laR
>
> I.e., a recursive listing of all files. Not sure what the analogous
> command would be on Windows.
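> For timing purposes, something like this might work (paths are
> placeholders, and the PowerShell line is only a rough guess at a Windows
> equivalent):
>
> time ls -laR /path/to/repo > /dev/null
>
> Measure-Command { Get-ChildItem -Recurse -Force C:\path\to\repo | Out-Null }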
>
> The “Track Performance” property of the processor can be used to
> understand more about the performance characteristics of the disk access.
> Set that to true and enable DEBUG logging for the processor.
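>
> DEBUG logging can be enabled in conf/logback.xml with a logger entry along
> these lines (the class name assumes the standard ListFile processor):
>
> <logger name="org.apache.nifi.processors.standard.ListFile" level="DEBUG"/>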
>
> If there are heap concerns from generating a million FlowFiles, you can
> set a Record Writer on the processor so that only a single FlowFile gets
> created. That can then be split up using a tiered approach (SplitRecord to
> split into 10,000-Record chunks, then another SplitRecord to split each
> 10,000-Record chunk into 1-Record chunks, and then EvaluateJsonPath, for
> instance, to pull the actual filename into an attribute). I suspect this is
> not the issue, with that much heap and given that it’s approximately 1
> million files. But it may be a factor.
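>
> A rough sketch of that tiered flow (the $.filename path is an assumption
> about the Record Writer’s schema; adjust it to the actual field name):
>
> ListFile (Record Writer: e.g. JsonRecordSetWriter)
>   -> SplitRecord (Records Per Split = 10000)
>   -> SplitRecord (Records Per Split = 1)
>   -> EvaluateJsonPath (filename = $.filename)
>   -> FetchFile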
>
> Also, setting the “Include File Attributes” property to false can
> significantly improve performance, especially on a remote network drive or
> some specific types of drives/OSes.
>
> Would recommend you play around with the above options to better
> understand the performance characteristics of your particular environment.
>
> Thanks
> -Mark
>
> On Mar 19, 2021, at 12:57 PM, Mike Sofen <[email protected]>
> wrote:
>
> I’ve built a document processing solution in NiFi, using the
> ListFile/FetchFile model hitting a large document repository on our Windows
> file server.  It’s nearly a million files ranging in size from 100 KB to
> 300 MB, with file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf,
> png, tiff and some specialized binary files.  The million files are
> distributed across tens of thousands of folders.
>
> The challenge is that, for an example subfolder with 25k files in 11k
> folders totalling 17 GB, it took upwards of 30 minutes for a single ListFile
> to generate a list and send it downstream to the next processor.  It’s
> running on a PC with a latest-gen Core i7, 32 GB of RAM and a 1 TB SSD –
> plenty of horsepower and speed.  My bootstrap.conf has java.arg.2=-Xms4g
> and java.arg.3=-Xmx16g.
>
> Is there any way to speed up ListFile?
>
> Also, is there any way to detect that a file is encrypted?  I’m sending
> these for processing by Tika and Tika generates an error when it receives
> an encrypted file (we have just a few of those, but enough to be annoying).
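>
> One idea I may try is probing each file with Tika up front and catching
> its EncryptedDocumentException; a minimal sketch (the class and method
> names here are mine):
>
> import org.apache.tika.exception.EncryptedDocumentException;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.sax.BodyContentHandler;
> import java.io.InputStream;
> import java.nio.file.Files;
> import java.nio.file.Path;
>
> public class EncryptionProbe {
>     // Returns true when Tika reports the file as password-protected.
>     static boolean isEncrypted(Path file) {
>         try (InputStream in = Files.newInputStream(file)) {
>             new AutoDetectParser().parse(
>                 in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
>             return false;
>         } catch (EncryptedDocumentException e) {
>             return true;   // thrown for encrypted/password-protected documents
>         } catch (Exception e) {
>             return false;  // other parse failures are not encryption
>         }
>     }
> }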
>
> Mike Sofen