Re: [postgis-users] Parallelisation provides powerful postgis performance perks (script + ppt slides) [x-posted: pgsql-performance]

2015-08-02 Thread Kuien Liu
Great, see you at FOSS4G

Cheers,
Kuien Liu

___
postgis-users mailing list
postgis-users@lists.osgeo.org
http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users

Re: [postgis-users] Parallelisation provides powerful postgis performance perks (script + ppt slides) [x-posted: pgsql-performance]

2015-07-23 Thread David Haynes
Hello,

I am a researcher at the University of Minnesota, US, working on a
project that uses PostGIS as our platform. I have been researching a number
of PostgreSQL platforms that claim to support parallelizing PostGIS or
geographic analysis. I am interested to learn more about the wrapper you
have written. Have you looked into other platforms such as pg_shard
(CitusDB) or Postgres-XL? We have not found them effective for actually
distributing a spatial query. However, I will test your code out in our use
case.

Looking briefly at the text you have written in the Quick Example, it
seems that you are distributing your query by an ID field. I am wondering
how your method would apply to raster datasets? Distributing geographic
data by an ID can get you into problems because of the dependency for
certain analytical functions.

This sounds great; I hope to hear back from you soon.


[postgis-users] Parallelisation provides powerful postgis performance perks (script + ppt slides) [x-posted: pgsql-performance]

2015-07-23 Thread Graeme B. Bell
Hi all,

Do you run map intersections between national-scale geometry maps in PostGIS
(or other long-running GIS operations)?
Do you hate that feeling of waiting all day/week for the query to complete?
Well, here's a solution for you.

1. For those that don't like par_psql (http://github.com/gbb/par_psql), this
alternative approach uses the GNU Parallel command to organise parallelism for
queries that usually take days to run. We saw up to 20x performance
improvements here: a day's work in one hour. It may give you a few ideas about
how to parallelise your own code with GNU Parallel.

https://github.com/gbb/fast_map_intersection


2. Also, I gave a talk at FOSS4G Como about these tools (and a few others), and
how to get better performance from your PostGIS database with parallelisation.
This may be helpful to people who are new to parallelisation / multi-core work
with Postgres for GIS.

http://graemebell.net/foss4gcomo.pdf  
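
A minimal Python sketch of the general pattern behind item 1: split the work
into independent statements and let a small pool of workers send them
concurrently. It is not taken from fast_map_intersection or par_psql. The names
run_jobs, fake_run_query and process_chunk are hypothetical, and the fake
runner stands in for whatever actually sends SQL to the server (psql driven by
GNU Parallel, a database driver, and so on).

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(jobs, run_one, workers=4):
    """Run each job through run_one using a pool of worker threads.

    Results are returned in the same order as the jobs were given.
    Threads are enough here because the heavy work would happen in the
    database server; the client only waits for each statement to finish.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, jobs))


def fake_run_query(sql):
    # Stand-in (assumption): a real runner would open its own connection
    # and execute the statement there.
    return f"done: {sql}"


if __name__ == "__main__":
    # Hypothetical per-chunk statements; process_chunk is not a real function.
    statements = [f"SELECT process_chunk({k})" for k in range(4)]
    for line in run_jobs(statements, fake_run_query):
        print(line)
```

Each statement needs its own database connection to run concurrently, which is
why the unit of work here is a whole statement rather than a row.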

Enjoy,

Graeme Bell.


[postgis-users] Parallelisation provides powerful postgis performance perks (script + ppt slides) [x-posted: pgsql-performance]

2015-07-23 Thread Graeme B. Bell
Hi there,

I don't have time to give out much advice just now; all I can say is 'read the
slides and see what you think' (http://graemebell.net/foss4gcomo.pdf),
particularly the par_psql slides.
You probably want to check out par_psql (and the slides I mentioned) more than
the fast_map_intersection code, which is just an example of metaprogramming to
get parallelism.

In my work here, I am not looking to get parallelism in a clever way on the 
server side (because generally speaking, I know much better than the server 
where the best parallelism opportunities are, how they're structured, and where 
things can't be parallelised). 

Also, the problem we have here isn't about making the DB scale out horizontally.
The problem is simply getting as much value as possible out of the 16-core /
32-core HT server we have here - it has a few SSDs and 128GB RAM. That's already
more than enough to take problem run time down by more than an order of
magnitude. If you have problems where you need improvements of 2-3 orders of
magnitude, sorry, I can't help much with these tools. I just want to take the
pretty much endless opportunities for super-easy parallelism you get with huge
geometry/raster data sets and make my everyday work a lot faster than on my
desktop.

For maps with dependency between data items - my colleague Lar Opsahl is doing
other work with parallel algorithms for topology maps, where he isolates the
data into two parts - things that don't overlap and things that do. Anything
that doesn't overlap is embarrassingly parallel and we can scale it up to about
20x quite easily with parallel tiles; anything that DOES overlap can still be
dealt with quite quickly (since it's usually 10% of the map).
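
A rough Python sketch of that two-way split, assuming each feature is reduced
to a bounding box (xmin, ymin, xmax, ymax). This is an illustration only, not
the algorithm described above; the function names are made up, and the
brute-force pairwise check is something a spatial index would replace on real
data.

```python
def boxes_overlap(a, b):
    """True if two boxes (xmin, ymin, xmax, ymax) share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_by_overlap(boxes):
    """Partition box indices into (isolated, overlapping).

    Isolated boxes touch nothing else and can be processed independently;
    overlapping boxes need a second pass that handles their dependencies.
    """
    overlapping = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_overlap(boxes[i], boxes[j]):
                overlapping.update((i, j))
    isolated = [i for i in range(len(boxes)) if i not in overlapping]
    return isolated, sorted(overlapping)


if __name__ == "__main__":
    boxes = [(0, 0, 1, 1), (2, 2, 3, 3), (2.5, 2.5, 4, 4), (10, 10, 11, 11)]
    isolated, overlapping = split_by_overlap(boxes)
    print(isolated)     # [0, 3]
    print(overlapping)  # [1, 2]
```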

> Distributing geographic data by an ID can get you into problems because of the
> dependency for certain analytical functions.

Unsure what you mean by this; I have never encountered any problems whatsoever
from using ID this way. In our work with large geometry sets we get great
scaling from this. We're very dependent on spatial indices as a magic box that
picks out only the bits we need (e.g. intersections). In fact, if we're
processing all the rows, then using the ID/modulo method is great because it
means incoming IO pages are being split between available processors in a
pretty balanced way.
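
As an illustration of the ID/modulo method, the Python sketch below builds one
statement per worker so that every row id lands in exactly one slice. The table
and column names (map_a, map_b, result_table, id, geom) are assumptions made
for the example, not a real schema; ST_Intersection and ST_Intersects are the
standard PostGIS functions.

```python
def modulo_queries(template, workers):
    """Return one SQL string per worker, each selecting a disjoint slice by id.

    The template must contain {n} (number of slices) and {k} (slice number).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [template.format(n=workers, k=k) for k in range(workers)]


# Hypothetical schema: two geometry tables joined on intersection.
TEMPLATE = (
    "INSERT INTO result_table "
    "SELECT a.id, ST_Intersection(a.geom, b.geom) "
    "FROM map_a a JOIN map_b b ON ST_Intersects(a.geom, b.geom) "
    "WHERE a.id % {n} = {k};"
)

if __name__ == "__main__":
    for sql in modulo_queries(TEMPLATE, 4):
        print(sql)
```

Each statement can then be run in its own connection. Because ids are spread
across slices by remainder, neighbouring rows on disk end up with different
workers, which matches the balanced IO behaviour described above.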

For rasters: well, PostGIS raster queries should parallelise much like any
other query with par_psql unless there are hidden internal locks I don't know
about or underlying IO contention. Otherwise, for GDAL stuff, take a look at
rbuild (http://github.com/gbb/rbuild). I wrote it as a kind of framework for
tiling/parallelism when we were processing vector maps via an intermediate
stage of rasters.

The main trick I've found with raster parallel processing was to use a sensible
size and number of tiles, good lossless compression and the tiniest datatype
possible, because when parallelising raster tiles, the IO will kill your
performance, not the compute cost. Also, parallelisation has its overheads, so
going beyond e.g. 400 tiles per map was counterproductive. I gave a presentation
here: http://graemebell.net/foss4g2013.pdf. Hope that helps with your raster
work.
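
A small Python sketch of the tiling arithmetic, assuming square tiles and a
budget of about 400 tiles per map as mentioned above. The function names are
hypothetical and no GDAL or PostGIS calls are involved; it only computes window
offsets and sizes in pixels.

```python
import math


def tile_size_for_budget(width, height, max_tiles=400):
    """Pick a square tile edge that keeps the tile count at or below max_tiles."""
    size = max(1, math.ceil(math.sqrt(width * height / max_tiles)))
    # Partial tiles at the edges can push the count over budget, so grow
    # the edge until the grid fits.
    while math.ceil(width / size) * math.ceil(height / size) > max_tiles:
        size += 1
    return size


def tile_windows(width, height, tile_size):
    """Yield (col_off, row_off, w, h) windows covering a width x height raster."""
    for row_off in range(0, height, tile_size):
        for col_off in range(0, width, tile_size):
            yield (
                col_off,
                row_off,
                min(tile_size, width - col_off),
                min(tile_size, height - row_off),
            )


if __name__ == "__main__":
    edge = tile_size_for_budget(10000, 10000)
    tiles = list(tile_windows(10000, 10000, edge))
    print(edge, len(tiles))  # 500 400
```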

The tools I'm making are simply handy tools for people who don't know parallel
programming, but want their GIS code to run 16x faster with very little work on
a single powerful server, or 4x faster on their local machine. They're not
going to scale you out horizontally over a million AWS instances, or get you
much more than 1 order of magnitude improvement in run time. But for most
people, a 4-32x improvement is still a huge improvement. For very large
projects, it won't be enough.

Graeme. 
