Re: [PERFORM] Issue for partitioning with extra check constraints

2010-10-01 Thread Tom Lane
Josh Berkus  writes:
>> [ shrug ... ]  We do not promise that the current partitioning scheme
>> scales to the number of partitions where this is likely to be an
>> interesting concern.

> Actually, you can demonstrate pretty significant response time delays on
> only 50 partitions.

And your point is?  The design center for the current setup is maybe 5
or 10 partitions.  We didn't intend it to be used for more partitions
than you might have spindles to spread the data across.

>> We're talking "wasted effort on a dead-end situation".  The time that
>> would go into this would be much better spent on real partitioning.

> That only applies if someone is working on "real partitioning".  Is anyone?

There is discussion going on, and even if there weren't, the argument
still applies.  Time spent on this band-aid would be time taken away
from a real solution.  In case you haven't noticed, we have very finite
amounts of manpower that's competent to do planner surgery.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Issue for partitioning with extra check constraints

2010-10-01 Thread Josh Berkus

> [ shrug ... ]  We do not promise that the current partitioning scheme
> scales to the number of partitions where this is likely to be an
> interesting concern.

Actually, you can demonstrate pretty significant response time delays on
only 50 partitions.
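
To see it yourself, a rough sketch along these lines will do (names are
made up and the loop needs 9.0's DO blocks; illustrative only):

CREATE TABLE parent (id int, created date);

DO $$
BEGIN
  FOR i IN 1..50 LOOP
    EXECUTE 'CREATE TABLE part_' || i ||
            ' (CHECK (id = ' || i || ')) INHERITS (parent)';
  END LOOP;
END $$;

-- with \timing on in psql, planning time grows with the partition count
EXPLAIN SELECT * FROM parent WHERE id = 25;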

> We're talking "wasted effort on a dead-end situation".  The time that
> would go into this would be much better spent on real partitioning.

That only applies if someone is working on "real partitioning".  Is anyone?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Issue for partitioning with extra check constraints

2010-10-01 Thread Tom Lane
Josh Berkus  writes:
> The issue is this: when a partitioned table is evaluated by the planner
> for constraint exclusion, it evaluates ALL check constraints on each
> partition, regardless of whether or not they include a referenced column
> in the query (and whether or not they relate to partitioning).

[ shrug ... ]  We do not promise that the current partitioning scheme
scales to the number of partitions where this is likely to be an
interesting concern.

*After* we have a real partitioning scheme, it might be worth worrying
about this sort of problem, if it's still a problem then.

> Are we talking "moderate", "hard" or "nearly impossible"?

We're talking "wasted effort on a dead-end situation".  The time that
would go into this would be much better spent on real partitioning.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Issue for partitioning with extra check constraints

2010-10-01 Thread Josh Berkus
Simon, Greg, etc.,

Just banged my nose against a major performance issue with CE &
partitioning, and was wondering if anyone had poked at it.

The issue is this: when a partitioned table is evaluated by the planner
for constraint exclusion, it evaluates ALL check constraints on each
partition, regardless of whether or not they include a referenced column
in the query (and whether or not they relate to partitioning).  If some
of those check constraints are expensive (like GIS functions) then this
can add considerably (on the order of 2ms per partition) to planning time.

If this is news to anyone, I have a nice test case.
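
A minimal sketch of the sort of schema involved (illustrative only, not
the actual test case; assumes PostGIS for the expensive constraint):

CREATE TABLE events (id int, created date, geom geometry);

-- each partition carries the cheap range CHECK used for exclusion plus
-- an expensive functional CHECK unrelated to the query below
CREATE TABLE events_2010_01 (
    CHECK (created >= '2010-01-01' AND created < '2010-02-01'),
    CHECK (ST_IsValid(geom))
) INHERITS (events);

-- only "created" is referenced, yet planning still examines the
-- ST_IsValid constraint on every partition
SET constraint_exclusion = on;
EXPLAIN SELECT * FROM events WHERE created = '2010-01-15';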

So ... how plausible is it to fix the planner so that it only evaluates
check constraints on a partition if there is a match of referenced
columns?  Are we talking "moderate", "hard" or "nearly impossible"?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] turn off caching for performance test

2010-10-01 Thread Willy-Bas Loos
> I found one query that did a seqscan anyway(with enable_seqscan off),
> because doing an index scan would be more than 1M points more
> expensive (to the planner).

Hmm, I guess that says it all :)
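
Worth noting: enable_seqscan = off doesn't forbid seqscans, it just adds
a huge cost penalty to them, so one can still win when every alternative
is costed even higher. An illustrative check (table name made up):

SET enable_seqscan = off;
EXPLAIN SELECT * FROM some_table WHERE unindexed_col = 42;
-- still a Seq Scan, with a visibly inflated cost estimate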
-- 
"Patriotism is the conviction that your country is superior to all
others because you were born in it." -- George Bernard Shaw

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] gist indexes for distance calculations

2010-10-01 Thread Marcelo Zabani
Thanks a lot everyone for all the info! It is all really helpful.


2010/10/1 Merlin Moncure 

>  On Thu, Sep 30, 2010 at 2:33 PM, Marcelo Zabani
> wrote:
> > Hi everyone. I have a question, and it's well beyond me to even
> > speculate about the inner workings of postgresql on this.
> >
> > I have a "places" table, and a "coordinates" column, of type POINT.
> >
> > If I want to find every place within, for example, a distance of 1 unit
> > from an arbitrary point, I'll do:
> >
> > CREATE INDEX ON places USING gist (circle(coordinates, 1));
> >
> > And then I'll fetch the desired rows like this:
> >
> > SELECT * FROM places WHERE circle(coordinates, 1) @> circle('(a,b)', 0);
> > (where (a,b) is an arbitrary point)
> >
> > I'd like to know how this index works, though, as it seems to me the
> > only way to have this kind of index work is to calculate the distance
> > of every point in a square of sides 2*1=2 units centered on (a, b).
> >
> > So, am I wrong to think it works like that? If it does work like that,
> > could I have instead two columns of type FLOAT (xcoordinate and
> > ycoordinate) and create traditional b-tree indexes on both of these,
> > and then do something like:
> >
> > SELECT * FROM places WHERE xcoordinate >= (a-1) AND xcoordinate <= (a+1)
> > AND ycoordinate >= (b-1) AND ycoordinate <= (b+1) AND
> > SQRT(POW(a-xcoordinate,2)+POW(b-ycoordinate,2)) <= 1;
> >
> > If you can also point me to where I can find this sort of information
> > (index utilization and planning, performance tuning), I'd be very
> > grateful.
>
> A quick heads up: It's possible, although it may not necessarily help,
> to further reduce distance calcs by drawing an inner bounding box of
> points that are confirmed good.  Your outer box is made by squaring
> the circle on lat/lon projection -- you can also calculate the biggest
> lat/lon 'rectangle' that completely fits inside the circle, and play
> with a query that looks something like this (pseudo sql):
>
> select * from points where (point inside good box) or (point inside
> possible box and dist(point, mypoint) < n);
>
> You get a reduction of dist calcs at the expense of a second gist
> lookup.  You can also, of course, do this on the application side, but
> what's the fun in that? :-).
>
> merlin



-- 
Marcelo Zabani
(19) 9341-0221


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Samuel Gendler
2010/10/1 Fabrício dos Anjos Silva 

>Craig,
>
>I agree with you. Not completely, but I do.
>
>I'm just stuck in a situation where I can't figure out what values to
> use for the parameters. I can't even think of a way to test and
> discover that.
>I followed Josh Berkus' GUC spreadsheet and some tutorials on PG wiki,
> but how do I test if my configuration is good or bad? I see in PG log that
> some queries have bad plans, but what should I do in order to tell PG to
> make better decisions? I tried different values with no success.

You can set different values for most configuration params on individual db
connections.  You can test different values for individual slow-running
queries.  Rather than disabling whole features in the entire database -
which may well make lots of other queries run less quickly - you can, at
the very least, just disable those features before running the queries that
are known to be slow and for which you could not find global values which
worked well.  Disable sequential scans just before running query x, or
boost work_mem to a very high value just for query y.  It is also possible
that you've simply outstripped your hardware's capability.  We had a
database with a number of tables containing tens of millions of rows and
queries which frequently required aggregating over whole tables.  Moving
from 8GB of RAM to 48GB of RAM (so that a large chunk of the db fits in
memory) and from 6 spindles to 12, and then just modifying the global
config to suit the new hardware gave us a huge performance boost that we
could never have gotten on the old hardware, no matter how much tuning of
individual queries we did.  I was actually able to drop all of the custom
config tweaks that we had on individual queries, though I'm sure I'll
eventually wind up adding some back - queries that aggregate over large
tables really benefit from a lot of work_mem - more than I want to
configure globally.
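
A sketch of what that looks like in practice (query and values are
illustrative only):

BEGIN;
SET LOCAL enable_seqscan = off;  -- discourage seqscans for this query only
SET LOCAL work_mem = '512MB';    -- generous sort/hash memory, just here
SELECT customer_id, sum(amount) FROM orders GROUP BY customer_id;
COMMIT;  -- both settings revert automatically at transaction end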


Re: [PERFORM] gist indexes for distance calculations

2010-10-01 Thread Merlin Moncure
On Thu, Sep 30, 2010 at 2:33 PM, Marcelo Zabani  wrote:
> Hi everyone. I have a question, and it's well beyond me to even speculate
> about the inner workings of postgresql on this.
>
> I have a "places" table, and a "coordinates" column, of type POINT.
>
> If I want to find every place within, for example, a distance of 1 unit from
> an arbitrary point, I'll do:
>
> CREATE INDEX ON places USING gist (circle(coordinates, 1));
>
> And then I'll fetch the desired rows like this:
>
> SELECT * FROM places WHERE circle(coordinates, 1) @> circle('(a,b)', 0);
> (where (a,b) is an arbitrary point)
>
> I'd like to know how this index works, though, as it seems to me the only
> way to have this kind of index work is to calculate the distance of every
> point in a square of sides 2*1=2 units centered on (a, b).
>
> So, am I wrong to think it works like that? If it does work like that, could
> I have instead two columns of type FLOAT (xcoordinate and ycoordinate) and
> create traditional b-tree indexes on both of these, and then do something
> like:
>
> SELECT * FROM places WHERE xcoordinate >= (a-1) AND xcoordinate <= (a+1) AND
> ycoordinate >= (b-1) AND ycoordinate <= (b+1) AND
> SQRT(POW(a-xcoordinate,2)+POW(b-ycoordinate,2)) <= 1;
>
> If you can also point me to where I can find this sort of information
> (index utilization and planning, performance tuning), I'd be very grateful.

A quick heads up: It's possible, although it may not necessarily help,
to further reduce distance calcs by drawing an inner bounding box of
points that are confirmed good.  Your outer box is made by squaring
the circle on lat/lon projection -- you can also calculate the biggest
lat/lon 'rectangle' that completely fits inside the circle, and play
with a query that looks something like this (pseudo sql):

select * from points where (point inside good box) or (point inside
possible box and dist(point, mypoint) < n);

You get a reduction of dist calcs at the expense of a second gist
lookup.  You can also, of course, do this on the application side, but
what's the fun in that? :-).
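
A slightly more concrete (untested) rendering of that, following the
thread's (a,b) placeholder convention; 0.71 is roughly sqrt(2)/2, the
half-side of the largest square inscribed in a unit circle:

SELECT * FROM places
WHERE coordinates <@ box '((a-0.71,b-0.71),(a+0.71,b+0.71))'  -- known good
   OR (circle(coordinates, 1) @> circle('(a,b)', 0)           -- candidate
       AND (coordinates <-> point '(a,b)') <= 1);             -- exact dist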

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Tom Lane
"Kevin Grittner"  writes:
> I agree with that, but I think there's an even more insidious issue
> here.  Biasing plans heavily toward using what is already in cache
> could have a destabilizing effect on performance.

Not to mention the destabilizing effect on the plans themselves.
Behavior like that would make EXPLAIN nearly useless, because the plan
you get would vary from moment to moment even when "nothing is
changing".  It's fairly clear that people don't actually want that.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Fabrício dos Anjos Silva
   Craig,

   I agree with you. Not completely, but I do.

   I'm just stuck in a situation where I can't figure out what values to use
for the parameters. I can't even think of a way to test and discover that.
   I followed Josh Berkus' GUC spreadsheet and some tutorials on PG wiki,
but how do I test if my configuration is good or bad? I see in PG log that
some queries have bad plans, but what should I do in order to tell PG to
make better decisions? I tried different values with no success.

   I understand that parameters have no "work everywhere" values. Each
database has its characteristics and each server has its HW specifications.

   Is there any automated test tool? I can compile a list of real-world
queries, and provide an exact copy of my db server just for testing. But how
do I do it? Write a bunch of scripts? Is there any serious tool that tries
different parameters, runs a load test, processes results and generates
reports?

   Again, thanks all of you for the replies.

   Cheers,

Fabrício dos Anjos Silva
LinkCom Soluções em T.I.



2010/10/1 Kevin Grittner 

> Craig Ringer  wrote:
>
> > Because the query often only wants a small subset of the data, and
> > whole relations are rarely fully cached, it's not enough to know
> > that "some of relation X is cached", it has to know if the cached
> > parts are the parts that'll be required, or at least an
> > approximation of that. It sounds horrendously complicated to keep
> > track of to me, and in the end it won't make query execution any
> > faster, it'll just potentially help the planner pick a better
> > plan. I wonder if that'd be worth the extra CPU time spent
> > managing the cache and cache content stats, and using those cache
> > stats when planning? It'd be an interesting experiment, but the
> > outcome is hardly obvious.
>
> I agree with that, but I think there's an even more insidious issue
> here.  Biasing plans heavily toward using what is already in cache
> could have a destabilizing effect on performance.  Let's say that
> some query or maintenance skews the cache toward some plan which is
> much slower when cached than another plan would be if cached.  Let's
> also postulate that this query runs very frequently.  It will always
> settle for what's fastest *this* time, not what would make for
> fastest performance if consistently used.  If it never chooses the
> plan which would run better if cached, the data used for that plan
> may never make it into cache, and you will limp along with the
> inferior plan forever.
>
> If you set the overall level of caching you expect, the optimizer
> will tend to wind up with data cached to support the optimal plans
> for that level of caching for the frequently run queries.
>
> -Kevin
>


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Pierre C
> It sounds horrendously complicated to keep track of to me, and in the
> end it won't make query execution any faster, it'll just potentially
> help the planner pick a better plan. I wonder if that'd be worth the
> extra CPU time spent managing the cache and cache content stats, and
> using those cache stats when planning? It'd be an interesting
> experiment, but the outcome is hardly obvious.


Well, suppose you pick an index scan, the only way to know which index  
(and heap) pages you'll need is to actually do the index scan... which  
isn't really something you'd do when planning. So you scan,


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Kevin Grittner
Craig Ringer  wrote:
 
> Because the query often only wants a small subset of the data, and
> whole relations are rarely fully cached, it's not enough to know
> that "some of relation X is cached", it has to know if the cached
> parts are the parts that'll be required, or at least an
> approximation of that. It sounds horrendously complicated to keep
> track of to me, and in the end it won't make query execution any
> faster, it'll just potentially help the planner pick a better
> plan. I wonder if that'd be worth the extra CPU time spent
> managing the cache and cache content stats, and using those cache
> stats when planning? It'd be an interesting experiment, but the
> outcome is hardly obvious.
 
I agree with that, but I think there's an even more insidious issue
here.  Biasing plans heavily toward using what is already in cache
could have a destabilizing effect on performance.  Let's say that
some query or maintenance skews the cache toward some plan which is
much slower when cached than another plan would be if cached.  Let's
also postulate that this query runs very frequently.  It will always
settle for what's fastest *this* time, not what would make for
fastest performance if consistently used.  If it never chooses the
plan which would run better if cached, the data used for that plan
may never make it into cache, and you will limp along with the
inferior plan forever.
 
If you set the overall level of caching you expect, the optimizer
will tend to wind up with data cached to support the optimal plans
for that level of caching for the frequently run queries.
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Craig Ringer

On 1/10/2010 7:12 PM, Fabrício dos Anjos Silva wrote:


> Thank you all for the replies.
>
> If PG does not know whether needed data is in memory, how does it
> estimate cost? There is a huge difference between access time in memory
> and in secondary storage. Not taking this into account results in almost
> "useless" estimates.


It's generally configured with the conservative assumption that data 
will have to come from disk.


Note that the query planner's job isn't to figure out how long the query 
will take. It's to compare various possible query plans and decide which 
will be fastest. There are certainly cases where knowing what's cached 
would help with this - for example: if an index is cached but the table 
data isn't, it's more likely to be worth using the index to reduce disk 
reads. But I don't know just how much difference it really makes.


Because the query often only wants a small subset of the data, and whole 
relations are rarely fully cached, it's not enough to know that "some of 
relation X is cached", it has to know if the cached parts are the parts 
that'll be required, or at least an approximation of that. It sounds 
horrendously complicated to keep track of to me, and in the end it won't 
make query execution any faster, it'll just potentially help the planner 
pick a better plan. I wonder if that'd be worth the extra CPU time spent 
managing the cache and cache content stats, and using those cache stats 
when planning? It'd be an interesting experiment, but the outcome is 
hardly obvious.


As you can see, I don't really agree that the planner's estimates are 
useless just because it's not very aware of the cache's current 
contents. It has a pretty good idea of the system's memory and how much 
of that can be used for cache, and knows how big various indexes and 
relations are. That seems to work pretty well.


If some kind of cache awareness was to be added, I'd be interested in 
seeing a "hotness" measure that tracked how heavily a given 
relation/index has been accessed and how much has been read from it 
recently. A sort of age-scaled blocks-per-second measure that includes 
both cached and uncached (disk) reads. This would let the planner know 
how likely parts of a given index/relation are to be cached in memory 
without imposing the cost of tracking the cache in detail. I'm still not 
sure it'd be all that useful, though...


> I am not saying that PG does a poor job, but I've been using it for 4
> years and from time to time I notice very poor estimates.


Most of the issues reported here, at least, are statistics issues, 
rather than lack of knowledge about cache status. The planner thinks 
it'll find (say) 2 tuples maching a filter, and instead finds 100,000, 
so it chooses a much less efficient join type. That sort of thing is 
really independent of the cache state.
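
The usual fix for that class of problem is a bigger statistics sample
for the skewed column (names invented here):

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
-- then compare estimated vs. actual rows:
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending';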



> Recently, I faced poor performance again, but this time because we
> started to work with larger tables (10M rows). This encouraged me to
> study PG tuning again, trying to understand how the planner works and
> trying to get the best of it. Unfortunately, it does not seem to be an
> easy task.


No argument there! Like any database there's a fair bit of black magic 
involved, and a whole lot of benchmarking. The key things are appropriate 
statistics targets (usually high), a reasonable random_page_cost and 
seq_page_cost, an appropriately sized effective_cache_size, and a 
reasonable work_mem.


"Reasonable" is hard to work out for work_mem, because Pg's work_mem 
limit is per-sort (etc) not per-query or per-backend. I understand that 
making it per-query is way, way harder than it sounds at face value, 
though, so we must make do.
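
To make the per-sort point concrete: a single query can claim several
multiples of work_mem, one per sort or hash node (names invented):

SET work_mem = '64MB';
EXPLAIN ANALYZE
SELECT customer_id, sum(amount) AS total
FROM orders
GROUP BY customer_id  -- hash aggregate: up to work_mem
ORDER BY total;       -- sort node: up to another work_mem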


--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] How does PG know if data is in memory?

2010-10-01 Thread Fabrício dos Anjos Silva
   Thank you all for the replies.

   If PG does not know whether needed data is in memory, how does it
estimate cost? There is a huge difference between access time in memory and
in secondary storage. Not taking this into account results in almost
"useless" estimates. I am not saying that PG does a pour job, but I've been
using it for 4 years and from time to time I notice very pour estimates.
After some testing 2 years ago, the only configuration I could manage to use
was to tell PG to avoid Seq Scan and Index Scans. I know that in many
situations these techniques are the best to choose, but when they are chosen
where they are not suitable, I get very bad plans.
   Recently, I faced poor performance again, but this time because we
started to work with larger tables (10M rows). This encouraged me to study PG
tuning again, trying to understand how the planner works and trying to get
the best of it. Unfortunately, it does not seem to be an easy task.
   If someone could point me to good books about PG tuning, I would
appreciate that. I found some yet-to-be-released books about PG 9. Any
comments about them?

   Thank you all.

Fabrício dos Anjos Silva
LinkCom Soluções em T.I.



On 29 September 2010 at 14:08, Samuel Gendler wrote:

>
>
> 2010/9/29 Fabrício dos Anjos Silva 
>
>
>>
>>When setting seq_page_cost and random_page_cost, do I have to consider
>> the probability that data will be in memory? Or does seq_page_cost mean
>> "sequential access on disk" and random_page_cost mean "random access on
>> disk"?
>>
>>
> The reason seq_page_cost and random_page_cost exist as parameters is so
> that you can inform the optimizer what the relative costs of those actions
> are, which is directly related to the expected size of the filesystem cache,
> ratio of total db size to available cache memory, and the performance of
> your disk i/o subsystems (and any other disk-related work the host may be
> doing).  effective_cache_size allows you to tell postgres how big you
> believe all available cache memory is - shared_buffers and OS cache.
>
> As to your question about increasing shared_buffers to be some significant
> proportion of available RAM - apparently, that is not a good idea.  I've
> seen advice that said you shouldn't go above 8GB for shared_buffers and I've
> also seen 12GB suggested as an upper limit, too.  On my host with 48GB of
> RAM, I didn't see much difference between 8GB and 12GB on a fairly wide
> variety of tests, so mine is set at 8GB with an effective_cache_size of 36GB.
>
>
>>I appreciate if someone could clear this out.
>>
>>Thanks!
>>
>> Fabrício dos Anjos Silva
>> LinkCom Soluções em T.I.
>>
>>
>
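
A sketch tying the quoted advice together -- session-level settings to
experiment with before touching postgresql.conf; the 36GB figure echoes
Samuel's example, the rest is illustrative:

SET seq_page_cost = 1.0;            -- baseline: sequential page fetch
SET random_page_cost = 2.0;         -- nearer 1.0 the more of the db is cached
SET effective_cache_size = '36GB';  -- shared_buffers plus expected OS cache
-- then re-run EXPLAIN on the problem queries and compare plans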