RE: [EXTERNAL] Re: speeding up ListFile

Mike Sofen Fri, 19 Mar 2021 18:54:32 -0700

Someone help me here: the 157 file listing averaged 46ms, so the total duration 
SHOULD have been 7.2 seconds, not nearly 4 minutes (227 seconds).  What could 
be going on for the other 220 seconds?  Something is amiss.
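
For reference, the arithmetic behind that question can be checked with a short Python sketch. The three figures are the ones transcribed from the debug message quoted later in this thread; the variable names are only illustrative, and the 227-second window is taken as the run's duration, as the message above assumes.

```python
# Figures transcribed from the ListFile debug message quoted in this thread.
operations = 157    # RETRIEVE_NEXT_FILE_FROM_OS operations reported
avg_ms = 46.229     # average time per operation, in milliseconds
window_s = 227      # reporting window in seconds (assumed to be the run's duration)

# Time explained by the tracked operation itself.
measured_s = operations * avg_ms / 1000.0
# Time in the window that the tracked operation does not explain.
unaccounted_s = window_s - measured_s
# Wall-clock cost per file if the whole window is attributed to the listing.
per_file_s = window_s / operations

print(f"measured by the tracked operation: {measured_s:.1f} s")
print(f"not accounted for:                 {unaccounted_s:.1f} s")
print(f"wall clock per file:               {per_file_s:.2f} s")
```

The gap between the first two numbers is what the question is about: the tracked OS calls explain only a few seconds of the window.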

Mike

From: Mike Sofen <[email protected]> 
Sent: Friday, March 19, 2021 3:47 PM
To: [email protected]
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Hopes dashed on the rocks of reality...dang.  I just retested my folder with 
25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), 
after clearing state, with the Include File Attributes set to false and it took 
the same amount of time to produce the listing – about 30 minutes.

For some reason my debug setting isn’t writing to the log file (I set debug 
from within the ListFile processor).  But it did pop up that red error square 
on the processor.  So to save time, I re-ran it for just a deep child 
folder that had 2 subfolders with a total of 157 files.  Here’s my 
transcription of the debug:

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there 
were 157 operations performed with an average time of 46.229 milliseconds; STD 
Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

To state the obvious, this tiny listing of 157 files averaged more than 1 
second per file of wall-clock time (227 s / 157 files).  My 25k trial, at about 
30 minutes, works out to roughly 70ms per file, which is still really slow.  
What might be going on with the “significant outliers”?  

Mike

From: Olson, Eric <[email protected]>
Sent: Friday, March 19, 2021 11:45 AM
To: [email protected]
Subject: RE: [EXTERNAL] Re: speeding up ListFile

I’ve observed the same thing. I’m also monitoring directories of large numbers 
of files and noticed this morning that ListFile took about 30 min to process 
one directory of about 800,000 files. This is under Linux, but the folder in 
question is a shared Windows network folder that has been mounted to the Linux 
machine. (I don’t know how that was done; it’s something my Linux admin set up 
for me.)

I just ran a quick test on a folder with about 75,000 files. ListFile with 
Include File Attributes set to false took about 10 s to emit the 75,000 
FlowFiles. ListFile including file attributes took about 70 s. At the OS level, 
“ls -lR | wc” takes 2 seconds.

Interestingly, in the downstream queue, the two sets of files have the same 
lineage duration. I guess that’s measured from when the ListFile processor was 
started.

From: Mark Payne <[email protected]>
Sent: Friday, March 19, 2021 12:08 PM
To: [email protected]
Subject: [EXTERNAL] Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling 
the directory structure that takes forever? If so, there’s not a lot that can 
be done, as accessing tons of files just tends to be slow. One way to verify 
this, on Linux, would be to run: 

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command 
would be on Windows.
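
One OS-independent way to get a comparable baseline is a small Python script that walks the tree and times it. This is only a sketch: `timed_listing` and its `include_attributes` flag are hypothetical names, and the code measures a plain directory walk, not what ListFile does internally. With the flag set it also makes one metadata lookup per file, which gives a rough feel for how much per-file attribute access costs on a given drive.

```python
import os
import time


def timed_listing(root, include_attributes=False):
    """Recursively list files under root; return (file_count, elapsed_seconds)."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if include_attributes:
                # One extra metadata lookup per file (size, timestamps, ...).
                try:
                    os.stat(os.path.join(dirpath, name))
                except OSError:
                    continue  # file vanished or is unreadable; skip it
            count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Calling `timed_listing(path)` and then `timed_listing(path, include_attributes=True)` on the same folder gives two numbers to set against the processor's own timings. Run each more than once, since the OS may cache directory entries after the first pass.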

The “Track Performance” property of the processor can be used to understand 
more about the performance characteristics of the disk access. Set that to true 
and enable DEBUG logging for the processor.

If there are heap concerns from generating a million FlowFiles, you can set a 
Record Writer on the processor so that only a single FlowFile gets created. 
That can then be split up using a tiered approach (SplitRecord to split into 
10,000 Record chunks, and then another SplitRecord to split each 10,000 Record 
chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull 
the actual filename into an attribute). I suspect this is not the issue, with 
that much heap and given that it’s approximately 1 million files. But it may be 
a factor.
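
The shape of that tiered split can be sketched in plain Python. This is not NiFi configuration, only an illustration of the two stages; `split_records`, `tiered_split` and the `filename` key used in the usage note are hypothetical names. Splitting in two steps means no single stage has to produce a million outputs from one input at once.

```python
from itertools import islice


def split_records(records, chunk_size):
    """Yield successive lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def tiered_split(records, first_tier=10_000):
    """First split into first_tier-sized chunks, then each chunk into single records."""
    for chunk in split_records(records, first_tier):
        # Second tier: each chunk is split again, one record at a time.
        for single in split_records(chunk, 1):
            yield single[0]
```

For example, 25,000 records of the form `{"filename": ...}` pass through the first tier as chunks of 10,000, 10,000 and 5,000, and come out of the second tier one record at a time in the original order.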

Also, setting the “Include File Attributes” to false can significantly improve 
performance, especially on a remote network drive, or some specific types of 
drives/OS’s.

Would recommend you play around with the above options to better understand the 
performance characteristics of your particular environment.

Thanks

-Mark

On Mar 19, 2021, at 12:57 PM, Mike Sofen <[email protected]> wrote:

I’ve built a document processing solution in NiFi, using the ListFile/FetchFile 
model hitting a large document repository on our Windows file server.  It’s 
nearly a million files ranging in size from 100 KB to 300 MB, with file types 
of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some 
specialized binary files.  The million files are distributed across tens of 
thousands of folders.

The challenge is that, for an example subfolder that has 25k files in 11k 
folders totalling 17 GB, it took upwards of 30 minutes for a single ListFile to 
generate a list and send it downstream to the next processor.  It’s running on 
a PC with a latest-gen Core i7, 32 GB of RAM and a 1 TB SSD – plenty of 
horsepower and speed.  My bootstrap.conf has java.arg.2=-Xms4g and 
java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?  

Also, is there any way to detect that a file is encrypted?  I’m sending these 
for processing by Tika and Tika generates an error when it receives an 
encrypted file (we have just a few of those, but enough to be annoying).
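
One possible pre-check, sketched in Python, rests on two file-format facts: an encrypted PDF carries an `/Encrypt` entry pointing at its encryption dictionary, and a password-protected docx/xlsx/pptx is stored as an OLE compound file rather than the usual ZIP archive. It is a heuristic, not a NiFi or Tika feature. `looks_encrypted` is a hypothetical name, the byte search can give a false positive if `/Encrypt` happens to appear in document text, and other formats are not covered at all.

```python
import os

OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE compound file header
OOXML_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def looks_encrypted(path):
    """Heuristic only: True if the file shows a common encryption marker."""
    ext = os.path.splitext(path)[1].lower()
    # Reads the whole file for simplicity; very large files may call for
    # reading only the head and tail instead.
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(b"%PDF-"):
        # Encrypted PDFs reference an encryption dictionary via /Encrypt.
        return b"/Encrypt" in data
    if ext in OOXML_EXTENSIONS:
        # Unencrypted OOXML starts with a ZIP signature; password-protected
        # files are wrapped in an OLE compound file instead.
        return data.startswith(OLE_MAGIC)
    return False
```

Files flagged this way could be routed around the Tika step. Anything the check misses would still surface as a Tika error, as it does today.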

Mike Sofen

