A few (untested, off-the-cuff) follow-up thoughts:

If you have a 32-bit JVM, doing "processedCount++" in multiple threads will blow up on you - long and double assignments are not atomic there: https://stackoverflow.com/questions/17481153/long-and-double-assignments-are-not-atomic-how-does-it-matter (And even on a 64-bit JVM, ++ is a read-modify-write, so concurrent updates can be lost.)

You should use something like LongAdder: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html

And peek might be your friend (https://www.baeldung.com/java-streams-peek-api); for the counting part, you could do something like:

streamyStuff.peek(unused -> longAdder.increment()).forEach(msgFile -> { /* do stuff with msgFile */ });
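Putting those pieces together, a minimal (untested) sketch - the root path and the "do stuff" step are placeholders:

///
import java.nio.file.*
import java.util.concurrent.atomic.LongAdder
import java.util.stream.Collectors

def processedCount = new LongAdder()

Files.walk(new File('/path/to/root').toPath())
    .collect(Collectors.toList())
    .parallelStream()
    .filter { Files.isRegularFile(it) }
    .peek { processedCount.increment() }   // thread-safe counting; no shared long to corrupt
    .forEach { msgFile ->
        // do stuff with msgFile (placeholder)
    }

println "Processed: ${processedCount.sum()}"
///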
*I* think it is a good idea to keep each step in a stream as "single-minded" as possible: it makes working with tools like debuggers (https://www.baeldung.com/intellij-debugging-java-streams) easier.

*I* really think you should experiment with the position of the filter... you may not need to worry about 'wasted' work if you have lots of threads "chipping in" to get the work done. Your way guarantees that only one thread can be dedicated to the filtering, so zero speedup is possible there. With the parallel arrangement, even if 50% of the filters reject, you might still get an approximate (#threads / 2) speedup.

This all seems so 'groovy' to me that I haven't looked for even-more-Groovy ways...

BOB

From: Merlin Beedell <mbeed...@cryoserver.com>
Sent: Friday, 13 May 2022 7:05 PM
To: users@groovy.apache.org
Subject: RE: Design pattern for processing a huge directory tree of files using GPars

Thank you Bob, that did work for me. Some Java syntax is new to me - like this .map(Path::toFile). Back to school again.

This is standard Java, which is pretty groovy already, but I wonder if this could be (or already has been) Groovy-ised in some way, e.g. to simplify the Files.walk(..).collect(..).parallelStream() chain.

I put the filter before the collect - on the assertion that it would be more efficient to skip unnecessary files before adding them to the parallel processing. In the following snippet I include a processedCount counter - and although this works, I am aware that altering things outside of the parallel process can be bad.

import java.nio.file.*
import java.util.regex.Pattern
import java.util.stream.*

long scanFolder (File directory, Pattern fileMatch) {
    long processedCount = 0
    Files.walk(directory.toPath(), 1)  // just walk the current directory, not subdirectories
        .filter(p -> Files.isRegularFile(p) && p.toString().matches(fileMatch))  // skip files that do not match a regex pattern
        .collect(Collectors.toList())
        .parallelStream()
        .map(Path::toFile)
        .forEach(msgFile -> {
            // <do stuff with msgFile>
            processedCount++
        })
    return processedCount
}

Merlin Beedell
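For comparison, an untested rework along the lines of Bob's follow-up above - the filter now runs inside the parallel stream and the counting uses a LongAdder instead of a shared long (the "do stuff" step is still a placeholder):

///
import java.nio.file.*
import java.util.concurrent.atomic.LongAdder
import java.util.regex.Pattern
import java.util.stream.Collectors

long scanFolder(File directory, Pattern fileMatch) {
    def processedCount = new LongAdder()
    Files.walk(directory.toPath(), 1)        // just the current directory
        .collect(Collectors.toList())        // materialise the paths first...
        .parallelStream()                    // ...then fan the rest of the work out across threads
        .filter { Files.isRegularFile(it) && it.toString().matches(fileMatch) }  // filtering is now parallel too
        .map { it.toFile() }
        .peek { processedCount.increment() } // thread-safe count
        .forEach { msgFile ->
            // <do stuff with msgFile>
        }
    return processedCount.sum()
}
///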
From: Bob Brown <b...@transentia.com.au>
Sent: 10 May 2022 09:19
To: users@groovy.apache.org
Subject: RE: Design pattern for processing a huge directory tree of files using GPars

If you are able to use a modern Java implementation, you can use pure-Java streams, e.g. https://stackoverflow.com/a/66044221

///
Files.walk(Paths.get("/path/to/root/directory"))  // create a stream of paths
    .collect(Collectors.toList())                 // collect paths into a list to better parallelize
    .parallelStream()                             // process this stream in multiple threads
    .filter(Files::isRegularFile)                 // filter out any non-files (such as directories)
    .map(Path::toFile)                            // convert Path to File object
    .sorted((a, b) -> Long.compare(a.lastModified(), b.lastModified()))  // sort files by date
    .limit(500)                                   // limit processing to 500 files (optional)
    .forEachOrdered(f -> {
        // do processing here
        System.out.println(f);
    });
///

Also read: https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams

Hope this helps some.

BOB

From: Merlin Beedell <mbeed...@cryoserver.com>
Sent: Monday, 9 May 2022 8:12 PM
To: users@groovy.apache.org
Subject: Design pattern for processing a huge directory tree of files using GPars

I am trying to process millions of files, spread over a tree of directories. At the moment I can collect the set of top-level directories into a list and then process these in parallel using GPars list processing (e.g. .eachParallel).

What would be more efficient is a 'parallel' version of the File-handling routines. For example, if I could write:

withPool() {
    directory.eachFileMatchParallel(FILES, ~/($fileMatch)/) { aFile ->
        ...

then I would be a very happy bunny!

I know I could copy the list of matching files into an array list and then use withPool { filesArray.eachParallel { ... } } - but this does not seem like an efficient solution - especially if there are several hundred thousand files in a directory.

What design pattern(s) might be better to consider using?

Merlin Beedell
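For reference, a minimal (untested) sketch of that collect-then-eachParallel workaround - the directory, the pattern, and the processing step are all placeholders:

///
import static groovyx.gpars.GParsPool.withPool
import groovy.io.FileType

def directory = new File('/path/to/dir')  // placeholder
def fileMatch = ~/.+\.msg/                // placeholder name pattern

def files = []
directory.eachFileMatch(FileType.FILES, fileMatch) { files << it }  // sequential collection

withPool {
    files.eachParallel { aFile ->
        // process aFile (placeholder)
    }
}
///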