Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Scott Marlowe
On Nov 28, 2007 3:15 PM, cluster <[EMAIL PROTECTED]> wrote:
> > The indexes don't contain visibility information, so Postgres has to look up
> > the row on disk to verify it isn't dead.
>
> I guess this fact drastically decreases the performance. :-(
> The number of rows with a random_number will just grow over time while
> the number of questions with status = 1 will always be somewhat constant
> at about 10,000 or most likely much less.

Have you tried a partial index?

create index xyz on tablename (random) where status = 1

> I could really use any kind of suggestion on how to improve the query in
> order to make it scale better for large data sets. The 6-7000 ms for a
> clean run is really a showstopper. Need to get it below 70 ms somehow.

Also, look into clustering the table on status or random every so often.
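
A minimal sketch of the clustering idea, assuming the partial index "xyz"
above has been created (pre-8.3 CLUSTER syntax, placeholder names as above):

-- rewrites the table in index order; takes an exclusive lock, so run it
-- in a quiet period, and repeat it every so often as new rows arrive
CLUSTER xyz ON tablename;
ANALYZE tablename;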

More importantly, you might need to research a faster way to get your
random results.



Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Tom Lane
cluster <[EMAIL PROTECTED]> writes:
> I could really use any kind of suggestion on how to improve the query in 
> order to make it scale better for large data sets. The 6-7000 ms for a 
> clean run is really a showstopper. Need to get it below 70 ms somehow.

Buy a faster disk?

You're essentially asking for a random sample of data that is not
currently in memory.  You're not going to get that without some I/O.

regards, tom lane



Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Jean-David Beyer
cluster wrote:
>> The indexes don't contain visibility information, so Postgres has to 
>> look up the row on disk to verify it isn't dead.
> 
> I guess this fact drastically decreases the performance. :-( The number
> of rows with a random_number will just grow over time while the number of
> questions with status = 1 will always be somewhat constant at about
> 10,000 or most likely much less.
> 
> I could really use any kind of suggestion on how to improve the query in 
> order to make it scale better for large data sets. The 6-7000 ms for a 
> clean run is really a showstopper. Need to get it below 70 ms somehow.
> 
Here is a suggestion that I have not tried. This might not make sense,
depending on how often you do this.

Make two tables whose DDL is almost the same. In one, put all the rows with
status = 1, and in the other put all the rows whose status != 1.

Now all the other queries you run would probably need to join both tables,
so maybe you make a hash index on the right fields so that would go fast.

Now for the status = 1 queries, you just look at that smaller table. This
would obviously be faster.

For the other queries, you would get stuck with the join. You would have to
weigh the overall performance issue vs. the performance of this special query.
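
A rough sketch of the split, with hypothetical table and column names
(untested, as noted above):

-- same row layout as the original table, split by status
CREATE TABLE questions_status1 (LIKE questions INCLUDING DEFAULTS);
CREATE TABLE questions_other   (LIKE questions INCLUDING DEFAULTS);
-- hash index on the join column to keep joins against the big part fast
CREATE INDEX questions_other_id_hash ON questions_other USING hash (id);

-- the status = 1 queries only touch the small table:
SELECT ... FROM questions_status1 WHERE random_number >= ...;

-- queries that need every row have to look at both tables:
SELECT ... FROM questions_status1
UNION ALL
SELECT ... FROM questions_other;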


-- 
  .~.  Jean-David Beyer  Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 16:55:01 up 2 days, 22:43, 0 users, load average: 4.31, 4.32, 4.20



Re: [PERFORM] Query only slow on first run

2007-11-28 Thread cluster

The indexes don't contain visibility information, so Postgres has to look up
the row on disk to verify it isn't dead.


I guess this fact drastically decreases the performance. :-(
The number of rows with a random_number will just grow over time while 
the number of questions with status = 1 will always be somewhat constant 
at about 10,000 or most likely much less.


I could really use any kind of suggestion on how to improve the query in 
order to make it scale better for large data sets. The 6-7000 ms for a 
clean run is really a showstopper. Need to get it below 70 ms somehow.




Re: [PERFORM] TB-sized databases

2007-11-28 Thread david

On Wed, 28 Nov 2007, Simon Riggs wrote:


On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:


In fact an even more useful option would be to ask the planner to throw
error if the expected cost exceeds a certain threshold...


Well, I've suggested it before:

statement_cost_limit on pgsql-hackers, 1 March 2006

Would people like me to re-write and resubmit this patch for 8.4?

Tom's previous concerns were along the lines of "How would you know what to
set it to?", given that the planner costs are mostly arbitrary numbers.


Arbitrary numbers are fine if they are relatively consistent with each 
other.


Will a plan with an estimated cost of 1,000,000 take approximately 100 times 
as long as one with a cost of 10,000?


Or, more importantly, will a plan with an estimated cost of 2000 reliably 
take longer than one with an estimated cost of 1000?


David Lang


Any bright ideas, or is it we want it and we don't care about the
possible difficulties?






Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Steinar H. Gunderson
On Wed, Nov 28, 2007 at 09:16:08PM +0100, cluster wrote:
> Hmm, actually I still don't understand why it takes 6400 ms to fetch the 
> rows. As far as I can see the index used is "covering" so that real row 
> lookups shouldn't be necessary.

The indexes don't contain visibility information, so Postgres has to look up
the row on disk to verify it isn't dead.

> Also, only the random_numbers induced by questions with status = 1
> should be considered - and this part is a relatively small subset.

Again, you'll need to have a combined index if you want this to help you any.

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Query only slow on first run

2007-11-28 Thread cluster

I'm wondering why --- doesn't seem like it should take 6400msec to fetch
646 rows, unless perhaps the data is just horribly misordered relative
to the index.   Which may in fact be the case ...


Hmm, actually I still don't understand why it takes 6400 ms to fetch the 
rows. As far as I can see the index used is "covering" so that real row 
lookups shouldn't be necessary. Also, only the random_numbers 
induced by questions with status = 1 should be considered - and this 
part is a relatively small subset.


In general, I don't understand why the query is so I/O dependent as it 
apparently is.




Re: [PERFORM] TB-sized databases

2007-11-28 Thread Matthew
On Wed, 28 Nov 2007, Simon Riggs wrote:
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?

Yes please. The more options, the better.

> Tom's previous concerns were along the lines of "How would you know what to
> set it to?", given that the planner costs are mostly arbitrary numbers.
>
> Any bright ideas, or is it we want it and we don't care about the
> possible difficulties?

I think this is something that the average person should just knuckle down
and work out.

At the moment on my work's system, we call EXPLAIN before queries to find
out if they will take too long. This would improve performance by saving
us from having to run each query through the planner twice.
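
Roughly, that kind of pre-flight check looks like this (hypothetical query;
the application reads the estimated cost out of the plan text):

EXPLAIN SELECT ... FROM big_table WHERE ...;
-- the first plan line ends with "(cost=START..TOTAL rows=... width=...)";
-- the application parses TOTAL and only submits the real query if it is
-- below a configured threshold.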

Matthew

-- 
An ant doesn't have a lot of processing power available to it. I'm not trying
to be speciesist - I wouldn't want to detract you from such a wonderful
creature, but, well, there isn't a lot there, is there?
-- Computer Science Lecturer



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Simon Riggs
On Tue, 2007-11-27 at 18:06 -0500, Pablo Alcaraz wrote:
> Simon Riggs wrote:
> > All of those responses have cooked up quite a few topics into one. Large
> > databases might mean text warehouses, XML message stores, relational
> > archives and fact-based business data warehouses.
> >
> > The main thing is that TB-sized databases are performance critical. So
> > it all depends upon your workload really as to how well PostgreSQL, or
> any other RDBMS vendor can handle them.
> >
> >
> > Anyway, my reason for replying to this thread is that I'm planning
> changes for PostgreSQL 8.4+ that will allow us to get bigger and
> > faster databases. If anybody has specific concerns then I'd like to hear
> them so I can consider those things in the planning stages.
> it would be nice to do something with selects so we can recover a rowset 
> on huge tables using criteria with indexes without falling back to running a full 
> scan.
> 
> In my opinion, by definition, a huge database sooner or later will have 
> tables far bigger than RAM available (same for their indexes). I think 
> the queries need to be solved using indexes smart enough to be fast on disk.

OK, I agree with this one. 

I'd thought that index-only plans were only for OLTP, but now I see they
can also make a big difference with DW queries. So I'm very interested
in this area now.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [PERFORM] TB-sized databases

2007-11-28 Thread Simon Riggs
On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:

> In fact an even more useful option would be to ask the planner to throw
> error if the expected cost exceeds a certain threshold...

Well, I've suggested it before: 

statement_cost_limit on pgsql-hackers, 1 March 2006

Would people like me to re-write and resubmit this patch for 8.4?

Tom's previous concerns were along the lines of "How would you know what to
set it to?", given that the planner costs are mostly arbitrary numbers.

Any bright ideas, or is it we want it and we don't care about the
possible difficulties?
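
If the patch were resubmitted and accepted, usage might look something like
this (hypothetical; the GUC name is taken from the proposal above, and the
limit would be in planner cost units, not time):

SET statement_cost_limit = 100;
SELECT ...;  -- would raise an error instead of running if the estimated
             -- plan cost exceeded the limit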

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Tom Lane
"Dave Dutcher" <[EMAIL PROTECTED]> writes:
> ... According to the explain analyze
> there are only 646 rows in posts which match your criteria, so it does seem
> like scanning posts first might be the right thing to do. 

No, that's not right.  What the output actually shows is that only 646
posts rows were needed to produce the first 200 aggregate rows, which was
enough to satisfy the LIMIT.  The planner is evidently doing things this
way in order to exploit the presence of the LIMIT --- if it had to
compute all the aggregate results it would likely have picked a
different plan.

regards, tom lane



Re: [PERFORM] Training Recommendations

2007-11-28 Thread Usama Munir Dar

EnterpriseDB (www.enterprisedb.com), of course

Campbell, Lance wrote:

PostgreSQL: 8.2.4

Does anyone have any companies they would recommend using for
performance tuning training of PostgreSQL for Linux?  Or general DBA
training?

Thanks,

Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu




Re: [PERFORM] Windows XP selects are very slow

2007-11-28 Thread Usama Munir Dar
PG generally comes with very basic default settings; one place to *start* 
may be this page:


http://www.webservices.uiuc.edu/postgresql/

Then obviously you will need to work through your query plans and iterate.

Shadkam Islam wrote:

Hi All,

We have a table whose data we need to bucketize and show. This is
a continuously growing table (archival is a way to trim it to size).
We are facing 2 issues here:

1. When the records in the table are in the range of 10K, it works fine
for some time after starting postgres server. But as time passes, the
entire machine becomes slower and slower - to the extent that we need to
go for a restart. Though taskmgr does not show any process consuming
extra-ordinary amount of CPU / Memory. After a restart of postgres
server, things come back to normal. What may be going wrong here?

2. When the records cross 200K, the queries (even "select count(*) from
_TABLE_") start taking minutes, and sometimes does not return back at
all. We were previously using MySql and at least this query used to work
OK there. [Our queries are of the form "select sum(col1),  sum(col2),
count(col3) ... where  group by ... " ]. Any suggestions ... 


Below are the tuning parameter changes that we made with help from the
internet:

We are starting postgres with the options [-o "-B 4096"], later we added
a "-S 1024" as well - without any visible improvement.
Machine has 1GB RAM.

shadkam



Re: [PERFORM] GiST indexing tuples

2007-11-28 Thread Matthew
On Wed, 28 Nov 2007, Tom Lane wrote:
> Have you looked at contrib/seg/ ?

Yes, I had a pretty good look at that. However, I believe that in order to
use seg's indexes, I would need to put my data into seg's data type, and
reformat my query, as I stated in my original message. What I'm looking
for is a general R-tree (or similar) index that will index multiple
columns of normal data types.

For instance, the normal B-tree index on (a, b) is able to answer queries
like "a = 5 AND b > 1" or "a > 5". An R-tree would be able to index these,
plus queries like "a > 5 AND b < 1".
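
A concrete illustration of the difference, assuming two plain integer columns:

CREATE TABLE t (a integer, b integer);
CREATE INDEX t_a_b_idx ON t (a, b);

-- the btree handles these well: the leading column is pinned (or is the
-- only one constrained), so the matching entries are contiguous in the index
SELECT * FROM t WHERE a = 5 AND b > 1;
SELECT * FROM t WHERE a > 5;

-- here only the "a > 5" bound limits how much of the index is scanned;
-- the "b < 1" condition is merely checked against each entry, which is
-- exactly the case an R-tree over (a, b) would handle better
SELECT * FROM t WHERE a > 5 AND b < 1;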

As far as I can see, it is not possible at the moment to write such an
index system for GiST, which is a shame because the actual R-tree
algorithm is very simple. It's just a matter of communicating both values
from the query to the index code.

Matthew

-- 
I have an inferiority complex. But it's not a very good one.



Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Dave Dutcher
> -Original Message-
> From: tmp
> We have primarily two tables of interest here: questions 
> (~100k rows) and posts (~400k rows). Each post refers to a 
> question, but only the "posts" rows for which the 
> corresponding "question.status = 1" are relevant. This 
> reduces the number of relevant question rows to about 10k. 

Earlier you said only a small subset of questions have a status of 1, so I
assumed you meant like 100 not 10k :)  According to the explain analyze
there are only 646 rows in posts which match your criteria, so it does seem
like scanning posts first might be the right thing to do. 




Re: [PERFORM] GiST indexing tuples

2007-11-28 Thread Tom Lane
Matthew <[EMAIL PROTECTED]> writes:
>> This sounds like something an R-tree can do.

> I *know* that. However, Postgres (as far as I can see) doesn't provide a
> simple R-tree index that will index two integers. This means I have to
> write one myself. I'm asking whether it is possible to get two values into
> a GiST index, which would allow me to implement this.

Have you looked at contrib/seg/ ?

regards, tom lane



[PERFORM] Optimizer regression 8.2.1 -> 8.2.3 on TSEARCH2 queries with ORDER BY and LIMIT

2007-11-28 Thread Brendan McMahon
Hi folks,

An apparent optimizer regression between 8.2.1 & 8.2.3 ? : 

select pk,... from tbl where tsv @@ to_tsquery(...) order by pk limit 
10 

disadvantageously uses PK index scan against a 2.5 million row (vacuum 
analysed) table whenever limit <= 16, leading to an increase in query time from 
sub 100ms to 4 seconds typically.

With an identical, freshly vacuum analyzed table, 8.2.1 does the same only when 
limit <= 3.

Although it's not a difference in principle, the latter behaviour is more 
problematic as it is much more likely to be encountered in practice as part of 
a results paging scheme (with OFFSET N).

Changing the ORDER BY clause to pk ||'' seems to get around the problem without 
any substantial execution overhead.
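
For reference, the workaround described above looks roughly like this
(column lists and tsquery elided as in the original):

select pk, ...
from tbl
where tsv @@ to_tsquery(...)
order by pk || ''   -- expression no longer matches the pk index, so the
                    -- planner stops picking the pk index scan for the LIMIT
limit 10;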

Anyone aware of any alternate workaround, or info on likely behaviour in 8.3?


Brendan




Re: [PERFORM] TB-sized databases

2007-11-28 Thread Alvaro Herrera
Pablo Alcaraz wrote:

> In my opinion there are queries that I think will need to be tuned for 
> "huge databases" (huge databases = a database whose relevant 
> tables/indexes are, or will be, far bigger than all the RAM available):
>
> -- example table
> CREATE TABLE homes (
>id bigserial,
>name text,
>location text,
>bigint money_win,
>int zipcode;
> );
> CREATE INDEX money_win_idx ON homes(money_win);
> CREATE INDEX zipcode_idx ON homes(zipcode);

Your example does not work, so I created my own for your first item.

alvherre=# create table test (a int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for 
table "test"
CREATE TABLE
alvherre=# insert into test select * from generate_series(1, 10);
INSERT 0 10
alvherre=# analyze test;
ANALYZE

> SELECT max( id) from homes;
> I think the information to get the max row quickly could be found using the 
> pk index. Idem min( id).

alvherre=# explain analyze select max(a) from test;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.03..0.04 rows=1 width=0) (actual time=0.054..0.057 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual time=0.041..0.043 rows=1 loops=1)
           ->  Index Scan Backward using test_pkey on test  (cost=0.00..3148.26 rows=10 width=4) (actual time=0.034..0.034 rows=1 loops=1)
                 Filter: (a IS NOT NULL)
 Total runtime: 0.143 ms
(6 rows)


> SELECT max( id) from homes WHERE id > 80;
> Same, but useful to find out the same thing in partitioned tables (using id 
> like partition criteria). It would be nice if Postgres would not need the 
> WHERE clause to realize it does not need to scan every single partition, 
> but only the last. Idem min(id).

Yeah, this could be improved.

> SELECT * from homes WHERE money_win = 13;
> Postgres decides too easily that it must do a sequential scan to solve 
> these kinds of queries when the table (or the index) does not fit in memory 
> and the number of rows returned is not near 1 (example: if the query returns 
> 5000 rows). Same case with filters like 'WHERE money_win >= xx' or 'WHERE 
> money_win BETWEEN xx AND yy'. But I do not know if this behavior is because 
> I got the postgresql configuration wrong or I missed something.

There are thresholds to switch from index scan to seqscans.  It depends
on the selectivity of the clauses.

> SELECT count( *) from homes;
> it would be *cute* if Postgres stored this value and only recalculated it if 
> it thinks the stored value is wrong (example: after an abnormal shutdown).

This is not as easy as you put it for reasons that have been discussed
at length.  I'll only say that there are workarounds to make counting
quick.
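
One of the usual workarounds is a trigger-maintained counter, sketched here
against the "homes" example (hypothetical object names; it assumes plpgsql is
installed, and the single counter row serializes concurrent writers, so it is
not free):

CREATE TABLE homes_rowcount (n bigint NOT NULL);
INSERT INTO homes_rowcount SELECT count(*) FROM homes;

CREATE FUNCTION homes_rowcount_trig() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE homes_rowcount SET n = n + 1;
  ELSE
    UPDATE homes_rowcount SET n = n - 1;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER homes_rowcount_maint
  AFTER INSERT OR DELETE ON homes
  FOR EACH ROW EXECUTE PROCEDURE homes_rowcount_trig();

-- then, instead of SELECT count(*) FROM homes:
SELECT n FROM homes_rowcount;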

> SELECT zipcode, count( zipcode) as n from homes GROUP BY zipcode;
> it would be *very cute* if Postgres could store this value (or is this 
> already there?) in the index or wherever and only recalculate it if it thinks 
> the stored value is wrong (example: after an abnormal shutdown).

Same as above.


> Last but not least, it would be *excellent* if this kind of optimization 
> were possible without weird non-standard SQL statements.

Right.  If you can afford to sponsor development, it could make them a
reality sooner.

-- 
Alvaro Herrera   Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick."  (Andrew Sullivan)



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Pablo Alcaraz

Pablo Alcaraz wrote:

Simon Riggs wrote:

All of those responses have cooked up quite a few topics into one. Large
databases might mean text warehouses, XML message stores, relational
archives and fact-based business data warehouses.

The main thing is that TB-sized databases are performance critical. So
it all depends upon your workload really as to how well PostgreSQL, or
any other RDBMS vendor can handle them.


Anyway, my reason for replying to this thread is that I'm planning
changes for PostgreSQL 8.4+ that will allow us to get bigger and
faster databases. If anybody has specific concerns then I'd like to hear
them so I can consider those things in the planning stages.
it would be nice to do something with selects so we can recover a 
rowset on huge tables using criteria with indexes without falling back to 
running a full scan.


In my opinion, by definition, a huge database sooner or later will 
have tables far bigger than RAM available (same for their indexes). I 
think the queries need to be solved using indexes smart enough to be 
fast on disk.


Pablo


I am dealing with a very huge database. I am not sure whether all these 
things can be solved with the current Postgres version using some 
configuration parameters. I'll be happy to read your suggestions and 
ideas about these queries.


In my opinion there are queries that I think will need to be tuned 
for "huge databases" (huge databases = a database whose relevant 
tables/indexes are, or will be, far bigger than all the RAM available):


-- example table
CREATE TABLE homes (
   id bigserial,
   name text,
   location text,
   bigint money_win,
   int zipcode;
);
CREATE INDEX money_win_idx ON homes(money_win);
CREATE INDEX zipcode_idx ON homes(zipcode);


SELECT max( id) from homes;
I think the information to get the max row quickly could be found using 
the pk index. Idem min( id).


SELECT max( id) from homes WHERE id > 80;
Same, but useful to find out the same thing in partitioned tables (using 
id like partition criteria). It would be nice if Postgres would not need 
the WHERE clause to realize it does not need to scan every single 
partition, but only the last. Idem min(id).


SELECT * from homes WHERE money_win = 13;
Postgres decides too easily that it must do a sequential scan to solve 
these kinds of queries when the table (or the index) does not fit in 
memory and the number of rows returned is not near 1 (example: if the query 
returns 5000 rows). Same case with filters like 'WHERE money_win >= xx' or 
'WHERE money_win BETWEEN xx AND yy'. But I do not know if this behavior 
is because I got the postgresql configuration wrong or I missed something.


SELECT count( *) from homes;
it would be *cute* if Postgres stored this value and only recalculated it 
if it thinks the stored value is wrong (example: after an abnormal 
shutdown).


SELECT zipcode, count( zipcode) as n from homes GROUP BY zipcode;
it would be *very cute* if Postgres could store this value (or is this 
already there?) in the index or wherever and only recalculate it if it thinks 
the stored value is wrong (example: after an abnormal shutdown).


In my opinion, partitioned tables in "huge databases" would be the 
usual, not the exception. It would be important (for me at least) that 
these queries could be solved quickly when they run against partitioned tables.


Maybe one or more of these queries could be solved using some kind of 
optimization, but I have not discovered which ones (I'll be happy to read 
suggestions :D). I am sure a lot of, or all, these queries could be solved 
using some kind of triggers/sequences to store the information needed to 
answer them. But in general the information is already there right now (is 
it?) and the queries only need the server to look in the right place. A 
trigger/function written in one of the supported procedural languages will 
probably consume far more CPU resources to find out the same information 
that already exists, and we would need to do it inside transactions (more 
performance cost) only to be safe if the server has an abnormal shutdown.


Currently I have several 250GB+ tables with billions of rows (little 
rows like the homes table example). I partitioned them and distributed the 
partitions/indexes across different tablespaces, etc. I think I did not need 
as many partitions as I have right now (300+ for some tables and 
growing); I would just need enough partitions to distribute the tables 
across different tablespaces. I created so many partitions because the 
performance with really big tables is not good enough for me when the 
programs run these kinds of queries, and the insert/update speed gets worse 
and worse over time.


I expect that a couple of tables will be 1TB+ in a few months... buying more 
and more RAM is an option but not a solution, because eventually the 
database will be far bigger than the RAM available.


Last but not least, it would be *excellent* if this kind of 
optimization were possible without weird non-standard SQL statements.

Re: [PERFORM] Query only slow on first run

2007-11-28 Thread Craig James

tmp wrote:

what exactly is that
"random_number" column


A random float that is initialized when the row is created and never 
modified afterwards. The physical row ordering will clearly not match 
the random_number ordering. However, other queries uses a row ordering 
by the primary key so I don't think it would make much sense to make the 
index on random_number a clustering index just in order to speed up this 
single query.



 and why are you desirous of ordering by it?


In order to simulate a random pick of K rows. See [1].


A trick that I used is to sample the random column once, and create a much 
smaller table of the first N rows, where N is the sample size you want, and use 
that.

If you need a different N samples each time, you can create a temporary table, 
put your random N rows into that, do an ANALYZE, and then join to this smaller 
table.  The overall process can be MUCH faster, even though you're creating 
and populating a whole table, than the plan that Postgres comes up with. This 
seems wrong-headed (why shouldn't Postgres be able to be as efficient on its 
own?), but it works.
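
A sketch of the temporary-table variant, using the table names from this
thread but otherwise hypothetical column names and sample size:

CREATE TEMPORARY TABLE post_sample AS
  SELECT p.*
  FROM posts p
  JOIN questions q ON q.id = p.question_id
  WHERE q.status = 1
  ORDER BY p.random_number
  LIMIT 1000;          -- N, the sample size you want

ANALYZE post_sample;

-- subsequent statistics queries join against post_sample instead of posts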

Craig




Re: [PERFORM] TB-sized databases

2007-11-28 Thread Matthew
On Wed, 28 Nov 2007, Gregory Stark wrote:
> > Is there something wrong with:
> > set enable_seqscan = off
> > ?
>
> This does kind of the opposite of what you would actually want here. What you
> want is that if you give it a query which would be best satisfied by a
> sequential scan it should throw an error since you've obviously made an error
> in the query.
>
> What this does is it forces such a query to use an even *slower* method such
> as a large index scan. In cases where there isn't any other method it goes
> ahead and does the sequential scan anyways.

The query planner is not always right. I would like an option like
"set enable_seqscan = off" but with the added effect of making Postgres
return an error if there is no alternative to scanning the whole table,
because I have obviously made a mistake setting up my indexes. I would
effectively be telling Postgres "For this table, I *know* that a full
table scan is dumb for all of my queries, even if the statistics say
otherwise."

Of course, it would have to be slightly intelligent, because there are
circumstances where a sequential scan doesn't necessarily mean a full
table scan (for instance if there is a LIMIT), and where an index scan
*does* mean a full table scan (for instance, selecting the whole table and
ordering by an indexed field).

Matthew

-- 
Existence is a convenient concept to designate all of the files that an
executable program can potentially process.   -- Fortran77 standard



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Pablo Alcaraz

Matthew wrote:

On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
  

it would be nice to do something with selects so we can recover a rowset
on huge tables using criteria with indexes without falling back to running a full
scan.



You mean: Be able to tell Postgres "Don't ever do a sequential scan of
this table. It's silly. I would rather the query failed than have to wait
for a sequential scan of the entire table."

Yes, that would be really useful, if you have huge tables in your
database.
  


Thanks. That would be nice too. I want Postgres not to fall back so 
easily to a sequential scan if a field is indexed. If it concludes that 
the index is *huge* and does not fit in RAM, I want PostgreSQL to 
use the index anyway, because the table is *more than huge* and a 
sequential scan will take hours.


I'll put some examples in a later mail.

Regards

Pablo



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Csaba Nagy
On Wed, 2007-11-28 at 08:54 -0500, Bill Moran wrote:
> > Nothing wrong with enable_seqscan = off except it is all or nothing type
> > of thing...
> 
> If that's true, then I have a bug report to file:
[snip]
> It looks to me to be session-alterable.

I didn't mean that it can't be set per session; I meant that it is not
fine-grained enough to select only the affected table, but instead affects _all_
tables in a query... and big tables are rarely alone in a query.

Cheers,
Csaba.





Re: [PERFORM] TB-sized databases

2007-11-28 Thread Bill Moran
In response to Gregory Stark <[EMAIL PROTECTED]>:

> "Bill Moran" <[EMAIL PROTECTED]> writes:
> 
> > In response to Matthew <[EMAIL PROTECTED]>:
> >
> >> On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
> >> > it would be nice to do something with selects so we can recover a rowset
> >> > on huge tables using criteria with indexes without falling back to running a full
> >> > scan.
> >> 
> >> You mean: Be able to tell Postgres "Don't ever do a sequential scan of
> >> this table. It's silly. I would rather the query failed than have to wait
> >> for a sequential scan of the entire table."
> >> 
> >> Yes, that would be really useful, if you have huge tables in your
> >> database.
> >
> > Is there something wrong with:
> > set enable_seqscan = off
> > ?
> 
> This does kind of the opposite of what you would actually want here. What you
> want is that if you give it a query which would be best satisfied by a
> sequential scan it should throw an error since you've obviously made an error
> in the query.
> 
> What this does is it forces such a query to use an even *slower* method such
> as a large index scan. In cases where there isn't any other method it goes
> ahead and does the sequential scan anyways.

Ah.  I misunderstood the intent of the comment.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] TB-sized databases

2007-11-28 Thread david

On Wed, 28 Nov 2007, Csaba Nagy wrote:


On Wed, 2007-11-28 at 08:27 -0500, Bill Moran wrote:

Is there something wrong with:
set enable_seqscan = off
?


Nothing wrong with enable_seqscan = off except it is all or nothing type
of thing... if you want the big table to never use seqscan, but a medium
table which is joined in should use it, then what do you do? And setting
enable_seqscan = off will actually not mean the planner can't use a
sequential scan for the query if no other alternative exists. In any case
it doesn't mean "please throw an error if you can't do this without a
sequential scan".

In fact an even more useful option would be to ask the planner to throw
error if the expected cost exceeds a certain threshold...


And it would be even better if the option could be overridden for a specific transaction 
or connection. That way it can be set relatively low for normal operations, 
but when you need to do an expensive query you can change it for that 
query.


David Lang



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Gregory Stark
"Bill Moran" <[EMAIL PROTECTED]> writes:

> In response to Matthew <[EMAIL PROTECTED]>:
>
>> On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
>> > it would be nice to do something with selects so we can recover a rowset
>> > on huge tables using criteria with indexes without falling back to running a full
>> > scan.
>> 
>> You mean: Be able to tell Postgres "Don't ever do a sequential scan of
>> this table. It's silly. I would rather the query failed than have to wait
>> for a sequential scan of the entire table."
>> 
>> Yes, that would be really useful, if you have huge tables in your
>> database.
>
> Is there something wrong with:
> set enable_seqscan = off
> ?

This does kind of the opposite of what you would actually want here. What you
want is that if you give it a query which would be best satisfied by a
sequential scan it should throw an error since you've obviously made an error
in the query.

What this does is it forces such a query to use an even *slower* method such
as a large index scan. In cases where there isn't any other method it goes
ahead and does the sequential scan anyways.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Bill Moran
In response to Csaba Nagy <[EMAIL PROTECTED]>:

> On Wed, 2007-11-28 at 08:27 -0500, Bill Moran wrote:
> > Is there something wrong with:
> > set enable_seqscan = off
> > ?
> 
> Nothing wrong with enable_seqscan = off except it is all or nothing type
> of thing...

If that's true, then I have a bug report to file:

test=# set enable_seqscan=off;
SET
test=# show enable_seqscan;
 enable_seqscan 
----------------
 off
(1 row)

test=# set enable_seqscan=on;
SET
test=# show enable_seqscan;
 enable_seqscan 
----------------
 on
(1 row)

It looks to me to be session-alterable.

> if you want the big table to never use seqscan, but a medium
> table which is joined in should use it, then what do you do? And setting
> enable_seqscan = off will actually not mean the planner can't use a
> sequential scan for the query if no other alternative exists. In any case
> it doesn't mean "please throw an error if you can't do this without a
> sequential scan".

True.  It would still choose some other plan.

> In fact an even more useful option would be to ask the planner to throw
> error if the expected cost exceeds a certain threshold...

Interesting concept.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Csaba Nagy
On Wed, 2007-11-28 at 08:27 -0500, Bill Moran wrote:
> Is there something wrong with:
> set enable_seqscan = off
> ?

Nothing wrong with enable_seqscan = off except it is all or nothing type
of thing... if you want the big table to never use seqscan, but a medium
table which is joined in should use it, then what do you do? And setting
enable_seqscan = off will actually not mean the planner can't use a
sequential scan for the query if no other alternative exists. In any case
it doesn't mean "please throw an error if you can't do this without a
sequential scan". 

In fact an even more useful option would be to ask the planner to throw
error if the expected cost exceeds a certain threshold...

Cheers,
Csaba.





Re: [PERFORM] GiST indexing tuples

2007-11-28 Thread Gregory Stark
"Matthew" <[EMAIL PROTECTED]> writes:

> On Tue, 27 Nov 2007, Steinar H. Gunderson wrote:
>> On Tue, Nov 27, 2007 at 06:28:23PM +, Matthew wrote:
>> > SELECT * FROM table WHERE a > 1 AND b < 4;
>>
>> This sounds like something an R-tree can do.
>
> I *know* that. However, Postgres (as far as I can see) doesn't provide a
> simple R-tree index that will index two integers. This means I have to
> write one myself. I'm asking whether it is possible to get two values into
> a GiST index, which would allow me to implement this.

The database is capable of determining that a>1 and b<4 are both conditions
which a single index can satisfy.

However, GIST itself treats each column of the index independently, applying the
first column, then the second one, and so on, like a traditional btree index, so
it doesn't really do what you would want.

I did propose a while back that GIST should consider all columns
simultaneously in the same style as rtree. 

However this would require making GIST somewhat less general in another sense.
Currently page splits can be handled arbitrarily but if you have to be able to
combine different datatypes it would mean having to come up with a standard
algorithm which would work everywhere. (I suggested making everything work in
terms of "distance" and then using the n-space vector distance (ie
sqrt((a1-b1)^2+(a2-b2)^2+...).) This means GIST wouldn't be as general as
it is now but it would allow us to handle cases like yours automatically.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Bill Moran
In response to Matthew <[EMAIL PROTECTED]>:

> On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
> > it would be nice to do something with selects so we can recover a rowset
> > on huge tables using criteria with indexes without falling back to running a full
> > scan.
> 
> You mean: Be able to tell Postgres "Don't ever do a sequential scan of
> this table. It's silly. I would rather the query failed than have to wait
> for a sequential scan of the entire table."
> 
> Yes, that would be really useful, if you have huge tables in your
> database.

Is there something wrong with:
set enable_seqscan = off
?

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] GiST indexing tuples

2007-11-28 Thread Matthew
On Tue, 27 Nov 2007, Steinar H. Gunderson wrote:
> On Tue, Nov 27, 2007 at 06:28:23PM +, Matthew wrote:
> > SELECT * FROM table WHERE a > 1 AND b < 4;
>
> This sounds like something an R-tree can do.

I *know* that. However, Postgres (as far as I can see) doesn't provide a
simple R-tree index that will index two integers. This means I have to
write one myself. I'm asking whether it is possible to get two values into
a GiST index, which would allow me to implement this.

Matthew

-- 
It is better to keep your mouth closed and let people think you are a fool
than to open it and remove all doubt.  -- Mark Twain



Re: [PERFORM] TB-sized databases

2007-11-28 Thread Matthew
On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
> it would be nice to do something with selects so we can recover a rowset
> on huge tables using criteria with indexes without falling back to running a full
> scan.

You mean: Be able to tell Postgres "Don't ever do a sequential scan of
this table. It's silly. I would rather the query failed than have to wait
for a sequential scan of the entire table."

Yes, that would be really useful, if you have huge tables in your
database.

Matthew

-- 
Trying to write a program that can't be written is... well, it can be an
enormous amount of fun! -- Computer Science Lecturer



Re: [PERFORM] Windows XP selects are very slow

2007-11-28 Thread Richard Huxton

Shadkam Islam wrote:

Hi All,

We have a table whose data we need to bucketize and show. This is
a continuously growing table (archival is a way to trim it to size).
We are facing 2 issues here:

1. When the records in the table are in the range of 10K, it works fine
for some time after starting postgres server. But as time passes, the
entire machine becomes slower and slower - to the extent that we need to
go for a restart. Though taskmgr does not show any process consuming
extra-ordinary amount of CPU / Memory. After a restart of postgres
server, things come back to normal. What may be going wrong here?


Do you have any connections sat "idle in transaction"?
Are you happy that vacuuming is happening?
Are you happy that the configuration values are sensible for your hardware?


2. When the records cross 200K, the queries (even "select count(*) from
_TABLE_") start taking minutes, and sometimes does not return back at
all. We were previously using MySql and at least this query used to work
OK there. [Our queries are of the form "select sum(col1),  sum(col2),
count(col3) ... where  group by ... " ]. Any suggestions ... 


Well, "SELECT count(*) FROM TABLE" *is* slow in PG, because it needs to 
check visibility of each row and hence scan the table. Shouldn't be 
minutes though, not unless you've turned vacuuming off. A table of 
200,000 rows isn't particularly large.


Can you give an example of a particular query that's too slow and the 
EXPLAIN ANALYSE to go with it? Oh, and the schema and sizes for the 
tables involved if possible.



Below are the tuning parameter changes that we made with help from the
internet:


Just "the internet" in general, or any particular pages?


We are starting postgres with the options [-o "-B 4096"], later we added
a "-S 1024" as well - without any visible improvement.
Machine has 1GB RAM.


Why on earth are you fiddling with PG's command-line options? You can 
set all of this stuff in the postgresql.conf file, and I recommend you 
do so.


So that's 8k*4096 or 32MB of shared buffers and 1MB of sort memory. If 
your queries are doing lots of sorting and sum()ing then that's probably 
not enough.


You might want to try issuing "SET work_mem=..." for various values 
before each query and see if there's a good value for your workload.
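
As a sketch, the postgresql.conf equivalents of those command-line options,
plus a per-session override (8.2-era parameter names; the values are only
placeholders to tune against your workload):

# postgresql.conf
shared_buffers = 32MB    # what "-B 4096" gives you (4096 x 8kB pages)
work_mem = 1MB           # what "-S 1024" gives you

-- per session, before a heavy sorting/aggregating query:
SET work_mem = '32MB';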


--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Query only slow on first run

2007-11-28 Thread tmp

The query's spending nearly all its time in the scan of "posts", and
I'm wondering why --- doesn't seem like it should take 6400msec to fetch
646 rows, unless perhaps the data is just horribly misordered relative
to the index.   Which may in fact be the case ...


Yes, they probably are. I use the random_number column in order to 
retrieve a semi-random sample subset from the large number of rows. The 
technique is described in [1]. This subset is later used for some 
statistical investigation, but that is somewhat irrelevant here. In 
order to retrieve the sample fast, I have made an index on the 
random_number column.
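
For context, the general shape of that kind of sampling query is roughly the
following; this is a paraphrase of the technique, not necessarily the exact
formulation from [1]:

SELECT ...
FROM posts
WHERE random_number >= random()   -- jump to a random point in the index
ORDER BY random_number
LIMIT 200;                        -- K, the sample size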



what exactly is that
"random_number" column


A random float that is initialized when the row is created and never 
modified afterwards. The physical row ordering will clearly not match 
the random_number ordering. However, other queries uses a row ordering 
by the primary key so I don't think it would make much sense to make the 
index on random_number a clustering index just in order to speed up this 
single query.



 and why are you desirous of ordering by it?


In order to simulate a random pick of K rows. See [1].


For that matter, if it is what it sounds like, why is it sane to group
by it?  You'll probably always get groups of one row ...


For each random_number, another table (question_tags) holds zero or more 
rows satisfying a number of constraints. I need to count(*) the number 
of corresponding question_tag rows for each random_number.


We have primarily two tables of interest here: questions (~100k rows) 
and posts (~400k rows). Each post refers to a question, but only the 
"posts" rows for which the corresponding "question.status = 1" are 
relevant. This reduces the number of relevant question rows to about 
10k. Within the post rows corresponding to these 10k questions I would 
like to pick a random sample of size K.


[1] http://archives.postgresql.org/pgsql-general/2007-10/msg01240.php




[PERFORM] Windows XP selects are very slow

2007-11-28 Thread Shadkam Islam
Hi All,

We have a table whose data we need to bucketize and show. This is
a continuously growing table (archival is a way to trim it to size).
We are facing 2 issues here:

1. When the records in the table are in the range of 10K, it works fine
for some time after starting postgres server. But as time passes, the
entire machine becomes slower and slower - to the extent that we need to
go for a restart. Though taskmgr does not show any process consuming
extra-ordinary amount of CPU / Memory. After a restart of postgres
server, things come back to normal. What may be going wrong here?

2. When the records cross 200K, the queries (even "select count(*) from
_TABLE_") start taking minutes, and sometimes does not return back at
all. We were previously using MySql and at least this query used to work
OK there. [Our queries are of the form "select sum(col1),  sum(col2),
count(col3) ... where  group by ... " ]. Any suggestions ... 

Below are the tuning parameter changes that we made with help from the
internet:

We are starting postgres with the options [-o "-B 4096"], later we added
a "-S 1024" as well - without any visible improvement.
Machine has 1GB RAM.

shadkam
