Re: [PERFORM] statistics target for columns in unique constraint?

2013-05-20 Thread Kevin Grittner
ach  wrote:

> One quick follow up since I'm expecting y'all might know:  Do the
> statistics targets actually speed performance on an index search
> itself; the actual lookup?  Or are they JUST to inform the planner
> towards the best pathway decision?

Since the statistics are just a random sampling and generally not
completely up-to-date, they really can't be used for anything other
than *estimating* relative costs in order to try to pick the best
plan.  Once a plan is chosen, its execution time is not influenced
by the statistics.  A higher statistics target can increase
planning time.  In a complex query with many joins and many indexes
on the referenced tables, the increase in planning time can be
significant.  I have seen cases where blindly increasing the
default statistics target resulted in planning time which was
longer than run time -- without any increase in plan quality.

Generally when something is configurable, it's because there can be
benefit to adjusting it.  If there was a single setting which could
not be materially improved upon for some cases, we wouldn't expose
a configuration option.  This is something which is not only
globally adjustable, you can override the setting for individual
columns -- again, we don't go to the trouble of supporting that
without a good reason.
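
For anyone finding this in the archives: the per-column override mentioned
above is done with ALTER TABLE. A minimal sketch (table and column names
are invented for illustration):

```sql
-- Override the statistics target for one column; the new target
-- takes effect at the next ANALYZE.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;

-- Setting it to -1 reverts the column to the cluster-wide
-- default_statistics_target.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS -1;
```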

-- 
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] statistics target for columns in unique constraint?

2013-05-18 Thread ach
Thanks guys!  I'm gonna try tuning the statistics back down to 10 on that
table and see what that does to the insertion rates.  Oh, and for Mark: not
to worry - I'd actually tuned the stats there up myself a while ago in an
experiment to see if -that- would've sped insertions some; back before I'd
had enough mileage on Postgres for it to have occurred to me that it might
just have been useless ;-)

One quick follow up since I'm expecting y'all might know:  Do the statistics
targets actually speed performance on an index search itself; the actual
lookup?  Or are they JUST to inform the planner towards the best pathway
decision?  In other words, if I have statistics set to 1000, say, in one
case, and the planner chose the exact same path it would have if they'd just
been set to 100, would the lookup return faster when the stats were at 1000?
Or would it actually take the same time either way?  My hunch is it's the
latter...








Re: [PERFORM] statistics target for columns in unique constraint?

2013-05-13 Thread Mark Kirkwood

On 14/05/13 10:10, Marti Raudsepp wrote:

On Mon, May 13, 2013 at 6:01 PM, ach  wrote:

what I'm wondering is, since
the unique constraint already covers the whole table and all rows in
entirety, is it really necessary for statistics to be set that high on
those?


AFAIK if there are exact-matching unique constraints/indexes for a
query's WHERE clause, the planner will deduce that the query only
returns 1 row and won't consult statistics at all.


Or does that only serve to slow down inserts to that table?


It doesn't slow down inserts directly. Tables are analyzed in the
background by autovacuum. However, I/O traffic from autovacuum analyze
may slow down inserts running concurrently.




A higher number in stats target means larger stats structures - which in 
turn means that the planning stage of *all* queries may be impacted - 
e.g. it takes up more memory, and is slightly slower as these larger 
structures are read, iterated over, freed etc.


So if your only access is via a defined unique key, then (as Marti 
suggests) - a large setting for stats target would seem to be unnecessary.


If you have access to a test environment I'd recommend you model the 
effect of reducing stats target down (back to the default of 100 or even 
to the old version default of 10).


A little paranoia - maybe switch on statement logging and ensure that 
there are no *other* ways this table is accessed... the fact that the 
number was cranked up from the default is a little suspicious!
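
In practice the two suggestions above might look like this (table and
column names are hypothetical; note that SET log_statement for a session
requires superuser):

```sql
-- Model the effect of a lower target in a test environment:
ALTER TABLE big_table ALTER COLUMN some_col SET STATISTICS 100;
ANALYZE big_table;

-- The paranoia step: log all statements for a while to confirm
-- nothing else queries this table by a non-unique-key path.
SET log_statement = 'all';
```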


Regards

Mark






Re: [PERFORM] statistics target for columns in unique constraint?

2013-05-13 Thread Marti Raudsepp
On Mon, May 13, 2013 at 6:01 PM, ach  wrote:
> what I'm wondering is, since
> the unique constraint already covers the whole table and all rows in
> entirety, is it really necessary for statistics to be set that high on
> those?

AFAIK if there are exact-matching unique constraints/indexes for a
query's WHERE clause, the planner will deduce that the query only
returns 1 row and won't consult statistics at all.

> Or does that only serve to slow down inserts to that table?

It doesn't slow down inserts directly. Tables are analyzed in the
background by autovacuum. However, I/O traffic from autovacuum analyze
may slow down inserts running concurrently.
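
A quick way to see this for yourself is to EXPLAIN a lookup on a unique
key; the planner should estimate one row regardless of the statistics
target (schema is illustrative):

```sql
CREATE TABLE t (id integer PRIMARY KEY, payload text);

-- An equality match on the full unique key is estimated at rows=1
-- without consulting the column histogram:
EXPLAIN SELECT * FROM t WHERE id = 42;
--  Index Scan using t_pkey on t  (cost=... rows=1 width=...)
```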

Regards,
Marti




Re: [PERFORM] Statistics and Multi-Column indexes

2011-07-15 Thread lars

On 07/10/2011 02:31 PM, Samuel Gendler wrote:
What about partitioning tables by tenant id and then maintaining 
indexes on each partition independent of tenant id, since constraint 
exclusion should handle filtering by tenant id for you?  That seems 
like a potentially more tolerable variant of #5.  How many tenants are 
we talking about?  I gather partitioning starts to become problematic 
when the number of partitions gets large.



I thought I had replied... Apparently I didn't.

The database can grow in two dimensions: the number of tenants and the 
number of rows per tenant.  We have many tenants with relatively little 
data and a few with a lot of data.  So the number of tenants is known 
ahead of time and might be in the 1000's.

-- Lars




Re: [PERFORM] Statistics and Multi-Column indexes

2011-07-11 Thread Kevin Grittner
lars  wrote:
 
> We are maintaining a large multi tenant database where *all*
> tables have a tenant-id and all indexes and PKs lead with the
> tenant-id. Statistics and counts for all the other columns are
> only really meaningful within the context of the tenant they
> belong to.
> 
> There appear to be five options for me:
> 1. Using single column indexes on all interesting columns and rely
> on PostgreSQLs bitmap indexes to combine them (which are pretty
> cool).
 
Those are cool -- when programmers are skeptical that they should
just say *what* they want and let the database figure out how to get
it, I like to point out that this option is available to the
planner, but not to application programming.  Of course, there are a
great many other reasons *I* find more compelling, but this one tends
to impress application programmers.
 
Assuming you keep the primary key as a multi-column index, this
seems like a good place to start.
 
> 2. Use multi column indexes and accept that sometimes Postgres
> picks the wrong index (because a non-tenant-id column might seem
> highly selective over the table, but it is not for a particular
> tenant - or vice versa).
 
If you have a lot of queries which access data based on matching
some set of columns, an occasional multicolumn index in addition to
the individual column index may be worth it.
 
You might want to avoid prepared statements, since these are planned
for the general case and can fall down badly for the extremes.
 
> 3. Use a functional index that combines multiple columns and only
> query via these, that causes statistics gathering for the
> expression. I.e. create index i on t((tenantid||column1)) and
> SELECT ... FROM t WHERE tenantid||column1 = '...'
 
I would hold off on that until I saw evidence of a specific need.
 
> 4. Play with n_distinct and/or set the statistics for the inner
> columns to some fixed values that lead to the plans that we want.
 
Try not to think in terms of "plans we want", but in terms of
modeling your costs so that, given your tables and indexes, the
PostgreSQL planner will do a good job of picking a fast plan.  You
normally need to tweak a few of the costing factors to match the
reality of your server and load.
 
> 5. Have a completely different schema and maybe a database per
> tenant.
 
> Are there any other options?
 
If most queries operate within a single tenant and you have less
than 100 tenants, you might think about partitioned tables.  Beyond
100, or if most queries need to look at many partitions, it becomes
problematic.
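 
For reference, partitioning at the time of writing means inheritance plus
constraint exclusion; a minimal per-tenant sketch (all names invented)
would look roughly like:

```sql
CREATE TABLE events (tenant_id integer NOT NULL, payload text);

-- One child table per tenant, each with a CHECK constraint so
-- constraint exclusion can skip irrelevant partitions:
CREATE TABLE events_t1 (CHECK (tenant_id = 1)) INHERITS (events);
CREATE TABLE events_t2 (CHECK (tenant_id = 2)) INHERITS (events);

-- With constraint_exclusion enabled, a query such as
--   SELECT * FROM events WHERE tenant_id = 2;
-- should only scan events_t2.
```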
 
> I don't think #2 is actually an option. We have some tenants with
> many (sometimes 100s) millions of rows per table, and picking the
> wrong index would be disastrous.
 
You can always drop counter-productive or unused indexes.  I think
it's best to look at indexes, as much as possible, as database
tuning options, rather than something with any semantic meaning.  If
you're doing things right, an index will never change the result of
any query, so you are free to try different tunings and see what's
fastest with your schema and your production load.
 
We have tables with 100s of millions of rows where everything is
indexed by county number.  The biggest county has about 20% of the
rows; the smallest about 0.05%.  We generally do pretty well with
(1) and (2).  We do occasionally find it useful to create an index
with a WHERE clause, though.  You might want to consider those.
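 
The county example translates directly into a partial index; a sketch
under assumed table and column names:

```sql
-- A partial index covers only the rows matching its WHERE clause,
-- so it stays small and is considered only for matching queries:
CREATE INDEX cases_big_county_filed_idx
    ON cases (filed_date)
    WHERE county_no = 13;   -- the big county, hypothetically

-- Usable for queries like:
--   SELECT ... FROM cases
--    WHERE county_no = 13 AND filed_date > '2011-01-01';
```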
 
-Kevin



Re: [PERFORM] Statistics and Multi-Column indexes

2011-07-10 Thread Samuel Gendler
On Sun, Jul 10, 2011 at 2:16 PM, lars  wrote:

> I know this has been discussed various times...
>
> We are maintaining a large multi tenant database where *all* tables have a
> tenant-id and all indexes and PKs lead with the tenant-id.
> Statistics and counts for all the other columns are only really meaningful
> within the context of the tenant they belong to.
>
> There appear to be five options for me:
> 1. Using single column indexes on all interesting columns and rely on
> PostgreSQLs bitmap indexes to combine them (which are pretty cool).
> 2. Use multi column indexes and accept that sometimes Postgres picks the
> wrong index (because a non-tenant-id
> column might seem highly selective over the table, but it is not for a
> particular tenant - or vice versa).
> 3. Use a functional index that combines multiple columns and only query via
> these, that causes statistics
> gathering for the expression.
> I.e. create index i on t((tenantid||column1)) and SELECT ... FROM t WHERE
> tenantid||column1 = '...'
> 4. Play with n_distinct and/or set the statistics for the inner columns to
> some fixed values that lead to the plans that we want.
> 5. Have a completely different schema and maybe a database per tenant.
>
>
What about partitioning tables by tenant id and then maintaining indexes on
each partition independent of tenant id, since constraint exclusion should
handle filtering by tenant id for you?  That seems like a potentially more
tolerable variant of #5.  How many tenants are we talking about?  I gather
partitioning starts to become problematic when the number of partitions gets
large.


Re: [PERFORM] Statistics use with functions

2009-05-08 Thread Tom Lane
Matthew Wakeling  writes:
> Ah, now I see it - I re-analysed, and found entries in pg_stats where 
> tablename is the name of the index. Now the query plans correctly and has 
> the right estimates. So, one needs to analyse AFTER creating indexes - 
> didn't know that.

Yes, for functional indexes it's helpful to do that.  Doesn't matter
for plain-old-plain-old indexes.
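
In other words, the sequence that matters for expression indexes is
roughly this (names invented; the expression's stats appear in pg_stats
under the *index* name):

```sql
CREATE TABLE t (distance text);
CREATE INDEX t_lower_distance_idx ON t (lower(distance));

-- Stats on the expression lower(distance) are gathered only when
-- the table is analyzed after the index already exists:
ANALYZE t;

-- Confirm the expression statistics were collected:
SELECT tablename, attname
  FROM pg_stats
 WHERE tablename = 't_lower_distance_idx';
```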

regards, tom lane



Re: [PERFORM] Statistics use with functions

2009-05-08 Thread Matthew Wakeling

On Fri, 8 May 2009, Tom Lane wrote:

In this case, however, you evidently have an index on lower(distance)
which should have caused ANALYZE to gather stats on the values of that
functional expression.  It looks like there might be something wrong
there --- can you look into pg_stats and see if there is such an entry
and if it looks sane?


What should I be looking for? I don't see anything obvious from this:

modmine-r9=# select attname from pg_stats where tablename = 
'geneflankingregion';

Ah, now I see it - I re-analysed, and found entries in pg_stats where 
tablename is the name of the index. Now the query plans correctly and has 
the right estimates. So, one needs to analyse AFTER creating indexes - 
didn't know that.


modmine-r9=# explain analyse SELECT * FROM geneflankingregion WHERE 
LOWER(distance) = '10.0kb' AND LOWER(direction) = 'upstream';

QUERY PLAN
-----------------------------------------------------------------
 Bitmap Heap Scan on geneflankingregion
   (cost=1197.19..11701.87 rows=45614 width=212)
   (actual time=18.336..153.825 rows=45502 loops=1)
   Recheck Cond: (lower(distance) = '10.0kb'::text)
   Filter: (lower(direction) = 'upstream'::text)
   ->  Bitmap Index Scan on geneflankingregion__distance_equals
 (cost=0.00..1185.78 rows=91134 width=0)
 (actual time=16.565..16.565 rows=91004 loops=1)
 Index Cond: (lower(distance) = '10.0kb'::text)
 Total runtime: 199.282 ms
(6 rows)

Matthew

--
It is better to keep your mouth closed and let people think you are a fool
than to open it and remove all doubt.  -- Mark Twain



Re: [PERFORM] Statistics use with functions

2009-05-08 Thread Tom Lane
Matthew Wakeling  writes:
> When I wrap the fields in the constraints in a LOWER() function, the 
> planner stops looking at the statistics and makes a wild guess, even 
> though it is very obvious from just looking what the result should be. 

Well, in general the planner can't assume anything about the statistics
of a function result, since it doesn't know how the function behaves.
In this case, however, you evidently have an index on lower(distance)
which should have caused ANALYZE to gather stats on the values of that
functional expression.  It looks like there might be something wrong
there --- can you look into pg_stats and see if there is such an entry
and if it looks sane?

> Also, why does the BitmapAnd say zero actual rows?

There isn't any reasonably-inexpensive way for EXPLAIN ANALYZE to
determine how many rows are represented by a bitmap result, so it
doesn't try.

regards, tom lane



Re: [PERFORM] Statistics issue

2008-05-31 Thread Tom Lane
Vlad Arkhipov <[EMAIL PROTECTED]> writes:
> EXPLAIN ANALYZE
> SELECT *
> FROM i
> WHERE d BETWEEN '2007-05-12' AND '2007-05-12'

> Index Scan using i_d on i  (cost=0.00..2.39 rows=1 width=402) (actual
> time=0.053..4.284 rows=1721 loops=1)
>   Index Cond: ((d >= '2007-05-12'::date) AND (d <= '2007-05-12'::date))
> Total runtime: 6.645 ms

> EXPLAIN ANALYZE
> SELECT *
> FROM i
> WHERE d = '2007-05-12'

> Index Scan using i_d on i  (cost=0.00..38.97 rows=1572 width=402)
> (actual time=0.044..4.250 rows=1721 loops=1)
>   Index Cond: (d = '2007-05-12'::date)
> Total runtime: 6.619 ms

Hmm, I wonder whether we shouldn't do something like this
http://archives.postgresql.org/pgsql-committers/2008-03/msg00128.php
for all range conditions, not just those made up by
prefix_selectivity().

regards, tom lane



Re: [PERFORM] statistics buffer is full

2006-04-02 Thread Qingqing Zhou

"Gábriel Ákos" <[EMAIL PROTECTED]> wrote:
>
> I've got this message while heavily inserting into a database. What should 
> I tune and how? It is postgresql 8.1.3.
>
> 2006-03-29 14:16:57.513 CEST:LOG:  statistics buffer is full
>

Since your server is under heavy load, the common trick is to increase 
PGSTAT_RECVBUFFERSZ in include/pgstat.h and recompile your server.

Regards,
Qingqing 



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Statistics not working??

2005-03-11 Thread Oleg Bartunov
On Fri, 11 Mar 2005, Hugo Ferreira wrote:
Hi there!
I think I may have a problem with the statistics in my PostgreSQL 8.0
running under Windows XP. When I view both pg_stat_all_tables and
pg_stat_all_indexes, all the numeric columns that should hold the
statistics are 0 (zero). My configuration file has the following:
stats_start_collector = true
stats_command_string = true
stats_reset_on_server_start = false
Any tip?
You need to define stats_block_level and/or stats_row_level
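
For reference, the relevant 8.0-era settings in postgresql.conf would
then look something like this (a sketch; check the 8.0 docs for the
exact defaults on your build):

```
# postgresql.conf (PostgreSQL 8.0-era statistics collector settings)
stats_start_collector = true
stats_command_string = true
stats_reset_on_server_start = false
stats_block_level = true    # populates block-level counters
stats_row_level = true      # populates row-level counters in pg_stat_*
```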

Thanks in advance,
Hugo Ferreira

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: [PERFORM] statistics

2004-04-08 Thread Jaime Casanova
Thanks for the answer.

I know the question was too primitive (it screams: I have no idea about 
databases).
But I simply didn't find the answer, and if I don't ask I won't learn.

Someday I will talk with Tom Lane about how to improve the planner, but until 
that day comes I have a lot of technical things to learn.




Re: [PERFORM] statistics

2004-04-07 Thread Paul Thomas
On 07/04/2004 22:05 Jaime Casanova wrote:
What are the statistics? Where can I view them? Where can I find info 
about their fields, and why are they valuable information for performance?

thanx in advance,

Jaime Casanova


OK. An idiot's guide to statistics by a full-time idiot...

Let's start with a simple premise. I'm an RDBMS (forget that I'm actually 
an idiot for a moment...) and I've been asked for

select * from foo where bar = 7;

How do I go about fulfilling the request in the most efficient manner? 
(i.e., ASAP!)

One way might be to read through the whole table and return only those 
rows which match the where criterion - a sequential scan on the table.

But wait a minute, there is an index on column bar. Could I use this 
instead? Well, of course I could use it, but I have to keep sight of the 
goal of returning the data ASAP, and I know that the act of reading 
index/reading table/... will have a performance penalty due to a lot more 
head movement on the disk. So how do I choose between a sequential scan 
and an index scan? Let's look at a couple of extreme scenarios:

1) all or virtually all of the bar columns are populated with the value 
7. In this case it would be more efficient to read sequentially through 
the table.

2) the opposite of (1) - very few of the bar columns have the value 7. In 
this case using the index could be a winner.

So, generalising, I need to be able to estimate whether doing a sequential 
scan is more efficient than an index scan, and this comes down to 2 factors:

a) the cost of moving the disk heads all over the place (random page cost)
b) the spread of values in the selecting column(s)

(a) is specified in postgresql.conf (see archives for much discussion 
about what the value should be...)
(b) is determined by the dastardly trick of actually sampling the data in 
the table!!! That's what analyze does. It samples your table(s) and uses 
the result to feed into its decision about when to flip between 
sequential and index scans.
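
The sampling step above can be seen directly (continuing with the
hypothetical foo/bar example):

```sql
-- Sample the table and look at what the planner learned:
ANALYZE foo;

SELECT attname, n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'foo' AND attname = 'bar';
-- most_common_freqs tells the planner roughly what fraction of rows
-- match bar = 7, which drives the seqscan-vs-indexscan choice.
```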

Hope this makes some kind of sense...

--
Paul Thomas
+------------------------------+-------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for Business           |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk |
+------------------------------+-------------------------------------------+
