I've built a document processing solution in NiFi, using the ListFile/FetchFile model against a large document repository on our Windows file server. It holds nearly a million files ranging in size from 100 KB to 300 MB, with file types including pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized binary files. The files are distributed across tens of thousands of folders.
The challenge: for an example subfolder that has 25k files in 11k folders totalling 17 GB, it took upwards of 30 minutes for a single ListFile to generate a listing and send it downstream to the next processor. NiFi is running on a PC with a latest-generation Core i7, 32 GB of RAM and a 1 TB SSD - plenty of horsepower and speed. My bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g. Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted? I'm sending these files to Tika for processing, and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen
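To make the encryption question concrete, here is a rough sketch of the kind of pre-check that could run before a file reaches Tika. It is a heuristic only, not something NiFi or Tika provides: the function name looks_encrypted and the tail_bytes parameter are made up for illustration. It relies on two assumptions: a password-protected docx/xlsx/pptx is stored in an OLE2 container rather than a plain ZIP archive, and an encrypted PDF references an /Encrypt dictionary near the end of the file. It can misfire, for example on a legacy .doc renamed to .docx or on a PDF whose trailer is not in the last 64 KB, and it does not cover the other file types at all.

```python
from pathlib import Path

# Magic bytes at the start of an OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Extensions that are normally plain ZIP archives when not password-protected.
OOXML_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def looks_encrypted(path, tail_bytes=65536):
    """Heuristic guess: True if the file is probably password-protected.

    Hypothetical helper for illustration; only OOXML and PDF are checked,
    every other file type returns False.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    with path.open("rb") as handle:
        head = handle.read(8)
        if suffix in OOXML_EXTENSIONS:
            # Assumption: an encrypted OOXML file is wrapped in an OLE2
            # container, so it starts with the OLE2 magic instead of "PK".
            return head.startswith(OLE2_MAGIC)
        if suffix == ".pdf" or head.startswith(b"%PDF-"):
            # Assumption: the trailer (or cross-reference stream dictionary)
            # holding the /Encrypt key sits within the last tail_bytes bytes.
            handle.seek(0, 2)
            size = handle.tell()
            handle.seek(max(0, size - tail_bytes))
            return b"/Encrypt" in handle.read()
    return False
```

Files flagged this way could be routed aside before the Tika step, with anything the check misses still handled through the existing error path.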

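One way to narrow down the ListFile timing is to measure a bare directory walk of the same subfolder outside NiFi. If the plain walk also takes many minutes, the cost is probably in enumerating the folders on the file server rather than in NiFi itself. The sketch below is only a diagnostic under that assumption; count_files and timed_listing are made-up names, and skipping unreadable folders is an arbitrary choice for the example.

```python
import os
import time


def count_files(root):
    """Walk root without recursion and return (file_count, dir_count)."""
    files = 0
    dirs = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        dirs += 1
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        files += 1
        except OSError:
            # Illustrative policy: skip folders that cannot be read.
            continue
    return files, dirs


def timed_listing(root):
    """Return (file_count, dir_count, elapsed_seconds) for a walk of root."""
    start = time.perf_counter()
    files, dirs = count_files(root)
    return files, dirs, time.perf_counter() - start
```

Calling timed_listing on the same path that ListFile is configured with, from the machine that runs NiFi, gives a baseline to compare against the 30 minutes observed.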