If you are able to use a modern Java implementation, you can use pure-Java 
streams, e.g.:

https://stackoverflow.com/a/66044221

///
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;

Files.walk(Paths.get("/path/to/root/directory")) // create a stream of paths (throws IOException)
    .collect(Collectors.toList()) // collect paths into a list to parallelize better
    .parallelStream() // process this stream in multiple threads
    .filter(Files::isRegularFile) // filter out any non-files (such as directories)
    .map(Path::toFile) // convert Path to File object
    .sorted((a, b) -> Long.compare(a.lastModified(), b.lastModified())) // sort files by date
    .limit(500) // limit processing to 500 files (optional)
    .forEachOrdered(f -> {
        // do processing here
        System.out.println(f);
    });
///
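A variation on the same idea, as a minimal self-contained sketch (the root path and class name here are hypothetical): Files.find applies the regular-file test during the walk itself, so directories never enter the stream, and try-with-resources closes the underlying directory handles when the traversal finishes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalkSketch {
    // Return the 'limit' oldest regular files under 'root', by last-modified time.
    public static List<File> oldestFiles(Path root, int limit) throws IOException {
        // Files.find filters while walking, so only regular files reach the stream
        try (Stream<Path> paths = Files.find(root, Integer.MAX_VALUE,
                (p, attrs) -> attrs.isRegularFile())) {
            return paths
                .collect(Collectors.toList())      // materialise for better parallel splitting
                .parallelStream()                  // fan out across the common pool
                .map(Path::toFile)
                .sorted(Comparator.comparingLong(File::lastModified))
                .limit(limit)
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small hypothetical tree to demonstrate the traversal
        Path root = Files.createTempDirectory("walk-sketch");
        Files.writeString(root.resolve("a.txt"), "a");
        Files.writeString(Files.createDirectory(root.resolve("sub")).resolve("b.txt"), "b");
        System.out.println(oldestFiles(root, 500).size()); // prints "2"
    }
}
```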

Also read: 
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams
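The trick that article describes can be sketched roughly as below (numbers and pool size are illustrative, not a recommendation): for IO-bound work you usually want more threads than the common pool's CPU-sized default, and running the terminal operation of a parallel stream inside a task submitted to a custom ForkJoinPool makes the stream use that pool. Note this pool-capturing behaviour is a long-standing implementation detail rather than a documented guarantee.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;

public class CustomPoolSketch {
    public static void main(String[] args) throws Exception {
        List<Integer> items = List.of(1, 2, 3, 4, 5);
        ForkJoinPool ioPool = new ForkJoinPool(32); // sized for IO-bound work (illustrative)
        try {
            // The parallel stream's tasks run in ioPool, not the common pool
            int sum = ioPool.submit(() ->
                items.parallelStream().mapToInt(i -> i * 2).sum()
            ).get();
            System.out.println(sum); // prints "30"
        } finally {
            ioPool.shutdown();
        }
    }
}
```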

Hope this helps some.

BOB


From: Merlin Beedell <mbeed...@cryoserver.com>
Sent: Monday, 9 May 2022 8:12 PM
To: users@groovy.apache.org
Subject: Design pattern for processing a huge directory tree of files using 
GPars

I am trying to process millions of files, spread over a tree of directories.  
At the moment I can collect the set of top-level directories into a list and 
then process these in parallel using GPars with list processing (e.g. 
.eachParallel).
But what would be more efficient is a 'parallel' version of the File-handling 
routines, for example:

    withPool() {
        directory.eachFileMatchParallel(FILES, ~/($fileMatch)/) { aFile -> ...

then I would be a very happy bunny!

I know I could copy the list of matching files into an ArrayList and then use 
withPool { filesArray.eachParallel { ... - but this does not seem like an 
efficient solution, especially if there are several hundred thousand files in 
a directory.

What design pattern(s) might be better to consider using?

Merlin Beedell
