When we set “Include File Attributes” to false, does that in any way impact ListFile’s ability to track state and pick up new files?
On Fri, Mar 19, 2021 at 1:08 PM Mark Payne <[email protected]> wrote:

> It’s hard to say without knowing what’s taking so long. Is it simply
> crawling the directory structure that takes forever? If so, there’s
> not a lot that can be done, as accessing tons of files just tends to
> be slow. One way to verify this, on Linux, would be to run:
>
>     ls -laR
>
> I.e., a recursive listing of all files. Not sure what the analogous
> command would be on Windows.
>
> The “Track Performance” property of the processor can be used to
> understand more about the performance characteristics of the disk
> access. Set that to true and enable DEBUG logging for the processor.
>
> If there are heap concerns from generating a million FlowFiles, you
> can set a Record Writer on the processor so that only a single
> FlowFile gets created. That can then be split up using a tiered
> approach (SplitRecord to split into 10,000-record chunks, then
> another SplitRecord to split each 10,000-record chunk into 1-record
> chunks, and then EvaluateJsonPath, for instance, to pull the actual
> filename into an attribute). I suspect this is not the issue, with
> that much heap and given that it’s approximately 1 million files.
> But it may be a factor.
>
> Also, setting “Include File Attributes” to false can significantly
> improve performance, especially on a remote network drive or some
> specific types of drives/OS’s.
>
> Would recommend you play around with the above options to better
> understand the performance characteristics of your particular
> environment.
>
> Thanks
> -Mark
>
> On Mar 19, 2021, at 12:57 PM, Mike Sofen <[email protected]> wrote:
>
> I’ve built a document processing solution in NiFi, using the
> ListFile/FetchFile model hitting a large document repository on our
> Windows file server. It’s nearly a million files ranging in size from
> 100 KB to 300 MB, with file types of pdf, doc/docx, xls/xlsx, pptx,
> text, xml, rtf, png, tiff, and some specialized binary files. The
> million files are distributed across tens of thousands of folders.
>
> The challenge is, for an example subfolder that has 25k files in 11k
> folders totalling 17 GB, it took upwards of 30 minutes for a single
> ListFile to generate a list and send it downstream to the next
> processor. It’s running on a PC with the latest-gen Core i7 with
> 32 GB RAM and a 1 TB SSD – plenty of horsepower and speed. My
> bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.
>
> Is there any way to speed up ListFile?
>
> Also, is there any way to detect that a file is encrypted? I’m
> sending these for processing by Tika, and Tika generates an error
> when it receives an encrypted file (we have just a few of those, but
> enough to be annoying).
>
> Mike Sofen
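
A couple of concrete sketches of the suggestions above (untested, offered as starting points; the paths are placeholders).

To check whether the raw directory crawl is the bottleneck, time a recursive listing outside of NiFi. On Linux:

    time ls -laR /path/to/repo > /dev/null

A rough Windows analogue would be:

    dir /s /a D:\docs > NUL

or, in PowerShell:

    Measure-Command { Get-ChildItem -Path 'D:\docs' -Recurse -Force | Out-Null }

If these take on the same order of time as ListFile for the same subtree, the cost is in the filesystem or network share rather than in the processor itself.

To get the “Track Performance” output, enable DEBUG logging for the processor by adding a logger to conf/logback.xml; this assumes the stock processor class name:

    <logger name="org.apache.nifi.processors.standard.ListFile" level="DEBUG"/>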

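For the tiered-split approach, once ListFile is configured with a Record Writer (e.g. a JsonRecordSetWriter) and the listing has been split down to one record per FlowFile, EvaluateJsonPath can pull the name into an attribute with a path along the lines of:

    $.filename

(assuming the writer emits the listing’s filename field under that name; it is worth inspecting one record’s JSON to confirm the field names your writer actually produces).

On detecting encrypted files: Tika signals these with org.apache.tika.exception.EncryptedDocumentException, so one option is simply to let those parses fail and route the failures away from the success path. For PDFs specifically, a crude pre-filter (it will not catch every PDF variant) is to look for an /Encrypt entry in the raw file:

    grep -l "/Encrypt" *.pdf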