Re: [GRASS-user] Parallel processes

2015-10-21 Thread Glynn Clements

Dylan Beaudette wrote:

> My main motivation for asking this question was to determine instances
> where parallel operations in GRASS are _not_ safe. From my reading of
> the wiki, manual pages, and your recent comments on GRASS-dev, it
> would appear that the following operations may not be safe:
> 
> 1. region-altering

Commands which modify the WIND or VAR files cannot safely be executed
in parallel. Similarly, commands which overwrite maps or modify
database tables cannot safely be executed in parallel if they will (or
might) modify the same maps, tables, etc.

For commands which create new maps (or similar), you need to ensure
that they don't coincidentally choose the same name for their outputs.

In short, GRASS doesn't attempt to "lock" maps, files, or similar
entities which it modifies, beyond the fact that each GRASS "session"
locks its current mapset.
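
One way to keep parallel jobs from choosing the same output name is to derive each name from the job's index. A minimal sketch, assuming a hypothetical helper `job_output_name` (not a GRASS command) and the `testing` prefix used elsewhere in this thread; the GRASS call itself appears only in a comment:

```shell
#!/bin/sh
# Hypothetical helper (not part of GRASS): build a zero-padded map name
# from a prefix and a job index, so that no two jobs share an output.
job_output_name() {
    printf '%s_%03d\n' "$1" "$2"
}

# Generate the names that 30 jobs would use and count the distinct ones.
names=$(for i in $(seq 1 30); do job_output_name testing "$i"; done)
total=$(printf '%s\n' "$names" | wc -l | tr -d ' ')
unique=$(printf '%s\n' "$names" | sort -u | wc -l | tr -d ' ')
echo "$unique unique names out of $total"

# Each job would then run something like:
#   r.surf.gauss output="$(job_output_name testing "$i")"
```

This only covers output names; it does nothing about jobs that modify the region or the same database tables.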

> 2. calculations in the presence of a MASK
> 
> 3. reading "external" (r.external) GDAL sources (?)
> 
> 4. some mapcalc expressions

These should be safe if the parallelism is in the form of multiple
processes. The known or suspected issues with r.mapcalc are related to
using multiple threads within a single process.

-- 
Glynn Clements 
___
grass-user mailing list
grass-user@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-user

Re: [GRASS-user] Parallel processes

2015-10-21 Thread patrick s.

Dylan

A small side note on your issue. I also use GNU parallel for operations
that have to run on very large scales. I have never had problems with it
when running on different mapsets, i.e. I create a temporary mapset in the
scripts that are wrapped by the command. Storing the data back to a
main mapset or PostgreSQL allows me to delete the temporary mapsets at the
end of each process. Maybe a kind of hack, but it works well for me. Always
happy for feedback to optimize these ;-)
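
The per-job mapset idea could look roughly like the sketch below. It only writes a private session file for each job, assuming the usual `GISRC` layout (`GISDBASE`, `LOCATION_NAME` and `MAPSET` keys). The helper name `write_job_gisrc`, the paths and the `tmp_job_N` naming are hypothetical; creating the mapset itself and copying results back are left out.

```shell
#!/bin/sh
# Hypothetical helper: write a private GISRC file for one job, pointing it
# at its own temporary mapset. All paths and names are placeholders.
write_job_gisrc() {
    # $1 = GISDBASE, $2 = location, $3 = job id, $4 = directory for rc files
    rc="$4/gisrc_job_$3"
    {
        echo "GISDBASE: $1"
        echo "LOCATION_NAME: $2"
        echo "MAPSET: tmp_job_$3"
    } > "$rc"
    echo "$rc"
}

rcdir=$(mktemp -d)
rc=$(write_job_gisrc /data/grassdata mylocation 1 "$rcdir")
cat "$rc"

# A job would then export GISRC="$rc" before calling GRASS modules, so that
# it reads its own session settings rather than those of another job.
```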


Greetz,
Patrick

On 20.10.2015 19:09, grass-user-requ...@lists.osgeo.org wrote:

> Thank you Glynn, your advice confirms some empirical notes:
>
> 1. parallel processes that use data from external USB disks quickly
> saturate the capacity of the bus or mechanism of the drive
>
> 2. parallel processes that use data from an internal SSD can generally
> saturate all 8 cores of my Intel i7
>
>
> My main motivation for asking this question was to determine instances
> where parallel operations in GRASS are _not_ safe. From my reading of
> the wiki, manual pages, and your recent comments on GRASS-dev, it
> would appear that the following operations may not be safe:
>
> 1. region-altering
>
> 2. calculations in the presence of a MASK
>
> 3. reading "external" (r.external) GDAL sources (?)
>
> 4. some mapcalc expressions
>
> In order to simplify my testing, I have disabled pthread support and
> invoke "parallelization" via backgrounding or GNU parallel. My
> examples with GNU parallel stem from the tremendous (apparent) utility
> of this tool, in that most "bash for loops" can be directly converted
> into "smart" parallel jobs.
>
> Thanks,
> Dylan



Re: [GRASS-user] Parallel processes

2015-10-19 Thread Dylan Beaudette
On Mon, Oct 19, 2015 at 12:05 PM, Glynn Clements wrote:
>
> Dylan Beaudette wrote:
>
>> Are there any reasons to prefer sequential operations (that do not
>> alter the region) vs. parallel operations?
>
> Running additional jobs in parallel is only worthwhile if the
> resources which they would use (CPU, memory, I/O bandwidth) would
> otherwise be idle.
>
> Once you get to the point that a resource is saturated and jobs are
> contending for it, parallel execution will be less efficient than
> serial execution.
>
> Maybe the "parallel" command takes these factors into account
> sufficiently. If it only considers CPU cores (i.e. one job per core),
> you'd need to confirm that you aren't saturating I/O bandwidth or
> thrashing memory or CPU caches. Try running the same sequence of tasks
> with varying numbers of parallel jobs to determine the optimal value.
> Needless to say, this will vary according to the nature of the task
> (e.g. I/O-bound versus CPU-bound).
>

Thank you Glynn, your advice confirms some empirical notes:

1. parallel processes that use data from external USB disks quickly
saturate the capacity of the bus or mechanism of the drive

2. parallel processes that use data from an internal SSD can generally
saturate all 8 cores of my Intel i7


My main motivation for asking this question was to determine instances
where parallel operations in GRASS are _not_ safe. From my reading of
the wiki, manual pages, and your recent comments on GRASS-dev, it
would appear that the following operations may not be safe:

1. region-altering

2. calculations in the presence of a MASK

3. reading "external" (r.external) GDAL sources (?)

4. some mapcalc expressions

In order to simplify my testing, I have disabled pthread support and
invoke "parallelization" via backgrounding or GNU parallel. My
examples with GNU parallel stem from the tremendous (apparent) utility
of this tool, in that most "bash for loops" can be directly converted
into "smart" parallel jobs.
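
For reference, the backgrounding variant can be sketched as below: each job is started with `&`, its process ID is recorded, and `wait` collects the exit statuses. `run_job` is a hypothetical placeholder for a GRASS module call that writes to its own output map.

```shell
#!/bin/sh
# Placeholder workload (hypothetical name); in practice this would be a
# GRASS module call writing to an output map unique to the job.
run_job() {
    echo "job $1 done"
}

pids=""
for i in 1 2 3 4; do
    run_job "$i" &      # start the job in the background
    pids="$pids $!"     # remember its process ID
done

failed=0
for pid in $pids; do
    # wait returns the exit status of that particular job
    wait "$pid" || failed=$((failed + 1))
done
echo "failed jobs: $failed"
```

Unlike `parallel -j8`, this starts every job at once and puts no limit on the number of simultaneous processes.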

Thanks,
Dylan

Re: [GRASS-user] Parallel processes

2015-10-19 Thread Glynn Clements

Dylan Beaudette wrote:

> Are there any reasons to prefer sequential operations (that do not
> alter the region) vs. parallel operations?

Running additional jobs in parallel is only worthwhile if the
resources which they would use (CPU, memory, I/O bandwidth) would
otherwise be idle.

Once you get to the point that a resource is saturated and jobs are
contending for it, parallel execution will be less efficient than
serial execution.

Maybe the "parallel" command takes these factors into account
sufficiently. If it only considers CPU cores (i.e. one job per core),
you'd need to confirm that you aren't saturating I/O bandwidth or
thrashing memory or CPU caches. Try running the same sequence of tasks
with varying numbers of parallel jobs to determine the optimal value. 
Needless to say, this will vary according to the nature of the task
(e.g. I/O-bound versus CPU-bound).
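
A rough sketch of such a test, timing the same batch at several job counts. It uses `xargs -P` as a stand-in for GNU parallel; `TASK`, `NTASKS` and `time_batch` are hypothetical names, and the default `sleep` workload is only a placeholder for a real GRASS command. Timing is in whole seconds, so it is only meaningful for batches that run for a while.

```shell
#!/bin/sh
# Time the same batch of tasks at several levels of parallelism.
# TASK is a placeholder command; substitute the real workload.
TASK=${TASK:-'sleep 0.1'}
NTASKS=${NTASKS:-8}

time_batch() {
    # $1 = number of simultaneous jobs
    start=$(date +%s)
    seq 1 "$NTASKS" | xargs -P "$1" -I{} sh -c "$TASK" >/dev/null 2>&1
    end=$(date +%s)
    echo "jobs=$1 elapsed=$((end - start))s"
}

results=$(for j in 1 2 4 8; do time_batch "$j"; done)
echo "$results"
```

The job count beyond which the elapsed time stops dropping (or starts rising) suggests where some resource has become saturated.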

-- 
Glynn Clements 

[GRASS-user] Parallel processes

2015-10-15 Thread Dylan Beaudette
Hi,

Are there any reasons to prefer sequential operations (that do not
alter the region) vs. parallel operations?

For example:

# this
seq 1 30 | parallel -j8 --gnu --progress \
    r.surf.gauss --o --q output=testing_00{}

# vs.

# this
for map in $(seq 1 30)
do
    r.surf.gauss --o --q output="testing_00$map"
done


I have consulted the relevant page on the wiki:

https://grasswiki.osgeo.org/wiki/Parallel_GRASS_jobs

... and it does appear to discourage the first example above. Anyone
else have some examples of this kind of workflow and potential
caveats?

I have noticed that there can be a significant speedup when running
some tasks in parallel, especially when the source files are stored on
an SSD.

Thanks,
Dylan