Thank you Bob, that did work for me. Some of the Java syntax is new to me - like .map(Path::toFile). Back to school again. This is standard Java, which is pretty groovy already, but I wonder whether it could be (or already has been) groovy-ised in some way, e.g. to simplify the Files.walk(..).collect(..).parallelStream() chain.
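One Groovy-flavoured shape that might fit (an untested sketch on my part - it assumes GPars is on the classpath, and note that eachFileMatch filters on the file name rather than the full path):

///
import groovy.io.FileType
import java.util.regex.Pattern
import static groovyx.gpars.GParsPool.withPool

// Untested sketch: gather matching files with the Groovy JDK, then fan out with GPars.
// Note: eachFileMatch matches on the file *name*, not the full path.
long scanFolderGroovy(File directory, Pattern fileMatch) {
    List<File> matches = []
    directory.eachFileMatch(FileType.FILES, fileMatch) { matches << it }
    withPool {
        matches.eachParallel { File msgFile ->
            // <do stuff with msgFile>
        }
    }
    return matches.size()
}
///

That would also sidestep the counter question below, since the result is just matches.size(); in the stream version the same could be achieved by ending the pipeline with .mapToLong { msgFile -> /* do stuff */ 1L }.sum() instead of mutating a local variable.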
I put the filter before the collect, on the assumption that it is more efficient to skip unwanted files before they enter the parallel processing. In the following snippet I include a processedCount counter - and although this works, I am aware that mutating state outside of the parallel process can be bad.

///
import java.nio.file.*
import java.util.regex.Pattern
import java.util.stream.*

long scanFolder(File directory, Pattern fileMatch) {
    long processedCount = 0
    Files.walk(directory.toPath(), 1)    // just walk the current directory, not subdirectories
        .filter(p -> Files.isRegularFile(p) && p.toString().matches(fileMatch))  // skip files that do not match the regex pattern
        .collect(Collectors.toList())
        .parallelStream()
        .map(Path::toFile)
        .forEach(msgFile -> {
            // <do stuff with msgFile>
            processedCount++             // NB: mutating a local from a parallel stream - the caveat above
        })
    return processedCount
}
///

Merlin Beedell

From: Bob Brown <b...@transentia.com.au>
Sent: 10 May 2022 09:19
To: users@groovy.apache.org
Subject: RE: Design pattern for processing a huge directory tree of files using GPars

If you are able to use a modern Java implementation, you can use pure-Java streams, e.g.:

https://stackoverflow.com/a/66044221

///
Files.walk(Paths.get("/path/to/root/directory"))   // create a stream of paths
    .collect(Collectors.toList())                  // collect paths into a list to parallelize better
    .parallelStream()                              // process this stream in multiple threads
    .filter(Files::isRegularFile)                  // filter out any non-files (such as directories)
    .map(Path::toFile)                             // convert Path to File object
    .sorted((a, b) -> Long.compare(a.lastModified(), b.lastModified()))  // sort files by date
    .limit(500)                                    // limit processing to 500 files (optional)
    .forEachOrdered(f -> {
        // do processing here
        System.out.println(f);
    });
///

Also read: https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams

Hope this helps some.

BOB

From: Merlin Beedell <mbeed...@cryoserver.com>
Sent: Monday, 9 May 2022 8:12 PM
To: users@groovy.apache.org
Subject: Design pattern for processing a huge directory tree of files using GPars

I am trying to process millions of files, spread over a tree of directories. At the moment I can collect the set of top-level directories into a list and then process these in parallel using GPars list processing (e.g. .eachParallel). But what would be more efficient would be a 'parallel' version of the File-handling routines, for example:

withPool() {
    directory.eachFileMatchParallel(FILES, ~/($fileMatch)/) { aFile ->
        ...

then I would be a very happy bunny!

I know I could copy the list of matching files into an ArrayList and then use withPool { filesArray.eachParallel { ... } } - but this does not seem like an efficient solution, especially if there are several hundred thousand files in a directory.

What design pattern(s) might be better to consider using?

Merlin Beedell
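For comparison, here is one pattern along these lines, sketched for illustration only (untested; the process closure, queue size, and timeout are placeholders). It walks the tree lazily with Files.walk and hands each matching file to a bounded thread pool, so the pending work stays bounded however large the tree is:

///
import java.nio.file.*
import java.util.concurrent.*
import java.util.regex.Pattern

// Untested sketch: lazy directory walk feeding a bounded worker pool.
void scanTree(Path root, Pattern fileMatch, Closure process) {
    int threads = Runtime.runtime.availableProcessors()
    def pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<Runnable>(1024),      // bounded backlog of pending files
            new ThreadPoolExecutor.CallerRunsPolicy())   // walker runs the task itself when full
    def paths = Files.walk(root)                         // lazy stream over the whole tree
    try {
        paths.filter { Files.isRegularFile(it) && it.toString().matches(fileMatch) }
             .forEach { p -> pool.execute { process(p.toFile()) } }
    } finally {
        paths.close()
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.DAYS)              // placeholder timeout
}
///

The CallerRunsPolicy makes the walking thread run a task itself whenever the queue fills, which throttles the walk to the workers' pace instead of buffering millions of paths in memory.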