A few (untested, off-the-cuff) follow-up thoughts:

Doing "processedCount++" from multiple threads will blow up on you on any JVM: 
++ is a read-modify-write sequence, so concurrent increments can silently be 
lost. On a 32-bit JVM it is even worse, since a plain long write is not even 
guaranteed to be atomic:
https://stackoverflow.com/questions/17481153/long-and-double-assignments-are-not-atomic-how-does-it-matter

You should use something like LongAdder: 
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html
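
A tiny (hypothetical, JDK-only) demo of the difference; the class name and 
counts are my own:

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

public class CounterRace {
    static long plain = 0;  // bumped with ++, an unsynchronised read-modify-write

    public static void main(String[] args) {
        LongAdder safe = new LongAdder();
        IntStream.range(0, 1_000_000).parallel().forEach(i -> {
            plain++;           // updates can be lost under contention
            safe.increment();  // thread-safe: per-thread cells
        });
        System.out.println("plain++  : " + plain);       // often less than 1000000
        System.out.println("LongAdder: " + safe.sum());  // always 1000000
    }
}
```

The plain++ total is nondeterministic, which is exactly the problem; LongAdder 
keeps per-thread cells and only reconciles them when you call sum().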

And peek might be your friend (https://www.baeldung.com/java-streams-peek-api); 
for the counting part, you could do something like:


streamyStuff.peek(unused -> longAdder.increment()).forEach(msgFile -> { /* <do stuff with msgFile> */ });

*I* think it is a good idea to keep each step in a stream as "single-minded" as 
possible: it makes working with tools like debuggers (see 
https://www.baeldung.com/intellij-debugging-java-streams) much easier.

*I* really think you should experiment with the position of the filter. You may 
not need to worry about 'wasted' work if you have lots of threads "chipping in" 
to get it done. Your way guarantees that only one thread can be dedicated to 
the filtering, so zero speedup is possible there. With the parallel 
arrangement, even if 50% of the filters reject, you might still get roughly a 
(#threads / 2) speedup.
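
Putting the two suggestions together, a minimal sketch (the temp-directory 
setup, regex and method name are my own inventions): filtering and counting 
both run inside the parallel section, with LongAdder doing the tallying.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

public class FilterInsideDemo {
    public static long scanFolder(Path dir, String regex) throws IOException {
        LongAdder processed = new LongAdder();
        Files.walk(dir, 1)                     // current directory only
             .collect(Collectors.toList())     // materialise for balanced parallel splits
             .parallelStream()
             .filter(Files::isRegularFile)     // the filtering now runs on many threads
             .filter(p -> p.getFileName().toString().matches(regex))
             .peek(p -> processed.increment()) // thread-safe counting
             .forEach(p -> { /* <do stuff with msgFile> */ });
        return processed.sum();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scan-demo");
        Files.createFile(dir.resolve("a.msg"));
        Files.createFile(dir.resolve("b.msg"));
        Files.createFile(dir.resolve("c.txt"));
        System.out.println(scanFolder(dir, ".*\\.msg")); // prints 2
    }
}
```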

This all seems so 'groovy' to me that I haven't looked for even-more-Groovy 
ways...

BOB

From: Merlin Beedell <mbeed...@cryoserver.com>
Sent: Friday, 13 May 2022 7:05 PM
To: users@groovy.apache.org
Subject: RE: Design pattern for processing a huge directory tree of files using 
GPars

Thank you Bob, that did work for me.
Some Java syntax is new to me - like this .map(Path::toFile). Back to school 
again.
This is standard Java, which is pretty groovy already, but I wonder if this 
could be (or already has been) groovy-ised in some way, e.g. to simplify the 
Files.walk(..).collect(..).parallelStream().
I put the filter before the collect, on the assumption that it would be more 
efficient to skip unnecessary files before adding them to the parallel 
processing.
In the following snippet I include a processedCount counter. Although this 
works, I am aware that altering things outside of the parallel process can be 
bad.

import java.nio.file.*
import java.util.regex.Pattern
import java.util.stream.*

long scanFolder (File directory, Pattern fileMatch)
{
    long processedCount = 0
    Files.walk(directory.toPath(), 1)  // just walk the current directory, not subdirectories
        .filter(p -> Files.isRegularFile(p) && p.toString().matches(fileMatch))  // skip files that do not match the regex pattern
        .collect(Collectors.toList())
        .parallelStream()
        .map(Path::toFile)
        .forEach(msgFile -> {
            <do stuff with msgFile>
            processedCount++
        })
    return processedCount
}

Merlin Beedell

From: Bob Brown <b...@transentia.com.au>
Sent: 10 May 2022 09:19
To: users@groovy.apache.org
Subject: RE: Design pattern for processing a huge directory tree of files using 
GPars

If you are able to use a modern Java implementation, you can use pure-Java 
streams, eg:

https://stackoverflow.com/a/66044221

///
Files.walk(Paths.get("/path/to/root/directory")) // create a stream of paths
    .collect(Collectors.toList()) // collect paths into a list to better parallelize
    .parallelStream() // process this stream in multiple threads
    .filter(Files::isRegularFile) // filter out any non-files (such as directories)
    .map(Path::toFile) // convert Path to File object
    .sorted((a, b) -> Long.compare(a.lastModified(), b.lastModified())) // sort files by date
    .limit(500) // limit processing to 500 files (optional)
    .forEachOrdered(f -> {
        // do processing here
        System.out.println(f);
    });
///

also read : 
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams

Hope this helps some.

BOB


From: Merlin Beedell <mbeed...@cryoserver.com>
Sent: Monday, 9 May 2022 8:12 PM
To: users@groovy.apache.org
Subject: Design pattern for processing a huge directory tree of files using 
GPars

I am trying to process millions of files, spread over a tree of directories.  
At the moment I can collect the set of top level directories into a list and 
then process these in parallel using GPars with list processing (e.g. 
.eachParallel).
But what would be more efficient would be a 'parallel' for the File handling 
routines, for example:

    withPool() {
        directory.eachFileMatchParallel(FILES, ~/($fileMatch)/) { aFile -> ...

then I would be a very happy bunny!

I know I could copy the list of matching files into an ArrayList and then use 
withPool { filesArray.eachParallel { ... - but this does not seem like an 
efficient solution, especially if there are several hundred thousand files in 
a directory.

What design pattern(s) might be better to consider using?

Merlin Beedell
