Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-11 Thread Jim C. Nasby
On Wed, May 10, 2006 at 08:31:54PM -0400, Tom Lane wrote:
 Jim C. Nasby [EMAIL PROTECTED] writes:
  On Tue, May 09, 2006 at 03:13:01PM -0400, Tom Lane wrote:
  PFC [EMAIL PROTECTED] writes:
  Fun thing is, the rowcount from a temp table (which is the problem here)  
  should be available without ANALYZE ; as the temp table is not 
  concurrent,  
  it would be simple to inc/decrement a counter on INSERT/DELETE...
  
  No, because MVCC rules still apply.
 
  But can anything ever see more than one version of what's in the table?
 
 Yes, because there can be more than one active snapshot within a single
 transaction (think about volatile functions in particular).

Any documentation on how snapshots work? They're a big mystery to me.
:(

  Speaking of which, if a temp table is defined as ON COMMIT DROP or
  DELETE ROWS, there shouldn't be any need to store xmin/xmax, only
  cmin/cmax, correct?
 
 No; you forgot about subtransactions.

Oh, I thought those were done with cmin and cmax... if that's not what
cmin/cmax are for, then what is?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-11 Thread Scott Marlowe
On Thu, 2006-05-11 at 12:18, Jim C. Nasby wrote:
 On Wed, May 10, 2006 at 08:31:54PM -0400, Tom Lane wrote:
  Jim C. Nasby [EMAIL PROTECTED] writes:
   On Tue, May 09, 2006 at 03:13:01PM -0400, Tom Lane wrote:
   PFC [EMAIL PROTECTED] writes:
   Fun thing is, the rowcount from a temp table (which is the problem 
   here)  
   should be available without ANALYZE ; as the temp table is not 
   concurrent,  
   it would be simple to inc/decrement a counter on INSERT/DELETE...
   
   No, because MVCC rules still apply.
  
   But can anything ever see more than one version of what's in the table?
  
  Yes, because there can be more than one active snapshot within a single
  transaction (think about volatile functions in particular).
 
 Any documentation on how snapshots work? They're a big mystery to me.
 :(

http://www.postgresql.org/docs/8.1/interactive/mvcc.html

Does the concurrency doc not cover this subject well enough? (I'm not
being sarcastic; it's a real question.)



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-11 Thread Martijn van Oosterhout
On Thu, May 11, 2006 at 12:18:06PM -0500, Jim C. Nasby wrote:
  Yes, because there can be more than one active snapshot within a single
  transaction (think about volatile functions in particular).
 
 Any documentation on how snapshots work? They're a big mystery to me.
 :(

A snapshot is a particular view on a database. In particular, you have
to be able to view a version of the database that doesn't have your own
changes, otherwise an UPDATE would keep updating the same tuple. Also,
for example, a cursor might see an older version of the database than
queries being run. I don't know of any particular documentation about it
though. Google wasn't that helpful.

  No; you forgot about subtransactions.
 
 Oh, I thought those were done with cmin and cmax... if that's not what
 cmin/cmax are for, then what is?

cmin/cmax are command counters. So in the sequence:

BEGIN;
SELECT 1;
SELECT 2;

The second query runs as the same transaction ID but a higher command
ID so it can see the result of the previous query. Subtransactions are
(AIUI anyway) done by having transactions depend on other transactions.
When you start a savepoint you start a new transaction ID whose status
is tied to its top-level transaction ID but can also be individually
rolled back.
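
A minimal sketch of that behaviour (hypothetical table t):

BEGIN;
INSERT INTO t VALUES (1);  -- runs under the top-level transaction ID
SAVEPOINT sp;
INSERT INTO t VALUES (2);  -- runs under a new subtransaction ID
ROLLBACK TO sp;            -- undoes only the subtransaction's work
COMMIT;                    -- t keeps row 1 but not row 2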

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-11 Thread Jim C. Nasby
On Thu, May 11, 2006 at 08:03:19PM +0200, Martijn van Oosterhout wrote:
 On Thu, May 11, 2006 at 12:18:06PM -0500, Jim C. Nasby wrote:
   Yes, because there can be more than one active snapshot within a single
   transaction (think about volatile functions in particular).
  
  Any documentation on how snapshots work? They're a big mystery to me.
  :(
 
 A snapshot is a particular view on a database. In particular, you have
 to be able to view a version of the database that doesn't have your own
 changes, otherwise an UPDATE would keep updating the same tuple. Also,
 for example, a cursor might see an older version of the database than
 queries being run. I don't know of any particular documentation about it
 though. Google wasn't that helpful.

Ahh, I'd forgotten that commands sometimes need to see prior data. But
that's done with cmin/cmax, right?

In any case, going back to the original thought/question... my point was
that in a single-session table, it should be possible to maintain a
row counter. Worst case, you might have to keep a separate count for
each CID or XID, but that doesn't seem that unreasonable for a single
backend to do, unless you end up running a heck of a lot of commands.
More importantly, it seems a lot more feasible to at least know how
many rows there are every time you COMMIT, which means you can
potentially avoid having to ANALYZE.

   No; you forgot about subtransactions.
  
  Oh, I thought those were done with cmin and cmax... if that's not what
  cmin/cmax are for, then what is?
 
 cmin/cmax are command counters. So in the sequence:
 
 BEGIN;
 SELECT 1;
 SELECT 2;
 
 The second query runs as the same transaction ID but a higher command
 ID so it can see the result of the previous query. Subtransactions are
 (AIUI anyway) done by having transactions depend on other transactions.
 When you start a savepoint you start a new transaction ID whose status
 is tied to its top-level transaction ID but can also be individually
 rolled back.

Hmmm, interesting. I would have thought it was tied to CID, but I guess
XID has more of that machinery around to support rollback.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Markus Schaber
Hi, PFC,

PFC wrote:

 The problem is that you need a set-returning function to retrieve
 the  values. SRFs don't have rowcount estimates, so the plans suck.

What about adding some way of rowcount estimation to SRFs, in the way of:

CREATE FUNCTION foo (para, meters) RETURNS SETOF bar AS
$$ ... function code ... $$ LANGUAGE plpgsql
ROWCOUNT_ESTIMATOR $$ ... estimation code ... $$ ;

Internally, this could create two functions, foo (para, meters) and
estimate_foo(para, meters) that are the same language and coupled
together (just like a SERIAL column and its sequence). The estimator
functions have an implicit return parameter of int8. Parameters may be
NULL when they are not known at query planning time.

What do you think about this idea?

The same scheme could be used to add a CPUCOST_ESTIMATOR to expensive
functions.

HTH,
Markus
-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread PFC



The problem is that you need a set-returning function to retrieve
the  values. SRFs don't have rowcount estimates, so the plans suck.


What about adding some way of rowcount estimation to SRFs, in the way of:

CREATE FUNCTION foo (para, meters) RETURNS SETOF bar AS
$$ ... function code ... $$ LANGUAGE plpgsql
ROWCOUNT_ESTIMATOR $$ ... estimation code ... $$ ;

Internally, this could create two functions, foo (para, meters) and
estimate_foo(para, meters) that are the same language and coupled
together (just like a SERIAL column and its sequence). The estimator
functions have an implicit return parameter of int8. Parameters may be
NULL when they are not known at query planning time.

What do you think about this idea?


It would be very useful.
A few thoughts...

	You need to do some processing to know how many rows the function would  
return.

Often, this processing will be repeated in the function itself.
	Sometimes it's very simple (i.e. the function will RETURN NEXT each
element in an array, you know the array length...)
	Sometimes, for functions returning few rows, it might be faster to  
compute the entire result set in the cost estimator.


So, it might be a bit hairy to find a good compromise.

Ideas on how to do this (clueless hand-waving mode) :

	1- Add new attributes to set-returning functions ; basically a list of  
functions, each returning an estimation parameter (rowcount, cpu tuple  
cost, etc).

This is just like you said.

	2- Add an estimator to a function, which would just be another
function, returning one row, a record, containing the estimations in
several columns (rowcount, cpu tuple cost, etc.) ; see the sketch after
this list.
	Pros : only one function call to estimate, easier and faster ; the
estimator just leaves the unknown columns NULL.
	The estimator need not be in the same language as the function itself.
It's just another function.


	3- The estimator could be a set-returning function itself which would  
return rows mimicking pg_statistics
	Pros : planner-friendly, the planner would SELECT from the SRF instead of  
looking in pg_statistics, and the estimator could tell the planner that,  
for instance, the function will return unique values.

Cons : complex, maybe slow

4- Add simple flags to a function, like :
- returns unique values
- returns sorted values (no need to sort my results)
	- please execute me and store my results in a temporary storage, count  
the rows returned, and plan the outer query accordingly

- etc.
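
A sketch of option 2 (hypothetical: the names are made up, and the syntax
for attaching the estimator to the SRF does not exist) :

CREATE TYPE srf_estimates AS (rowcount bigint, cpu_tuple_cost float8);

-- estimator for a hypothetical SRF foo(lo integer, hi integer):
CREATE FUNCTION foo_estimator(lo integer, hi integer)
RETURNS srf_estimates AS
$$ SELECT ($2 - $1)::bigint, NULL::float8 $$ LANGUAGE sql;

-- attaching it would need new syntax, e.g. (also hypothetical):
-- ALTER FUNCTION foo(integer, integer) SET ESTIMATOR foo_estimator;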




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Martijn van Oosterhout
On Wed, May 10, 2006 at 04:38:31PM +0200, PFC wrote:
   You need to do some processing to know how many rows the function 
   would  return.
   Often, this processing will be repeated in the function itself.
   Sometimes it's very simple (i.e. the function will RETURN NEXT each
 element in an array, you know the array length...)
   Sometimes, for functions returning few rows, it might be faster to  
 compute the entire result set in the cost estimator.

I think the best would probably be to assign a constant. An SRF will
generally return a rowcount in one of the ranges 1-10, 10-100, 100-1000,
etc. You don't need an exact number, you just need to get within an
order of magnitude, and a constant will work fine for that.

How many functions sometimes return one and sometimes a million rows?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Markus Schaber
Hi, PFC,

PFC wrote:

 You need to do some processing to know how many rows the function
 would  return.
 Often, this processing will be repeated in the function itself.
 Sometimes it's very simple (i.e. the function will RETURN NEXT each
 element in an array, you know the array length...)
 Sometimes, for functions returning few rows, it might be faster to 
 compute the entire result set in the cost estimator.

I know, but we only have to estimate the number of rows to give a hint
to the query planner, so we can use lots of simplifications.

E.g. for generate_series we return ($2-$1)/$3, and for some functions
even constant estimates will be good enough.
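
In the proposed syntax (hypothetical; no such clause exists today), a
wrapper around generate_series might carry exactly that estimate:

CREATE FUNCTION my_series(int, int, int) RETURNS SETOF int AS
$$ SELECT * FROM generate_series($1, $2, $3) $$ LANGUAGE sql
ROWCOUNT_ESTIMATOR $$ SELECT ($2 - $1) / $3 $$ ;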

 - please execute me and store my results in a temporary storage,
 count  the rows returned, and plan the outer query accordingly

That's an interesting idea.

Markus


-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Nis Jorgensen
Martijn van Oosterhout wrote:
 On Wed, May 10, 2006 at 04:38:31PM +0200, PFC wrote:
  You need to do some processing to know how many rows the function 
  would  return.
  Often, this processing will be repeated in the function itself.
  Sometimes it's very simple (i.e. the function will RETURN NEXT each
 element in an array, you know the array length...)
  Sometimes, for functions returning few rows, it might be faster to  
 compute the entire result set in the cost estimator.
 
 I think the best would probably be to assign a constant. An SRF will
 generally return between one of 1-10, 10-100, 100-1000, etc. You don't
 need exact number, you just need to get within an order of magnitude
 and a constant will work fine for that.
 
 How many functions sometimes return one and sometimes a million rows?

It will probably be quite common for the number to depend on the number
of rows in other tables. Even if this is fairly constant within one db
(some assumption), it is likely to be different in others using the same
function definition. Perhaps a better solution would be to cache the
result of the estimator function.

/Nis





Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Markus Schaber
Hi, Nis,

Nis Jorgensen wrote:

 It will probably be quite common for the number to depend on the number
 of rows in other tables. Even if this is fairly constant within one db
 (some assumption), it is likely to be different in others using the same
 function definition. Perhaps a better solution would be to cache the
 result of the estimator function.

Sophisticated estimator functions are free to use the pg_statistic
catalog (or the pg_stats view) for their row count estimation.
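
For instance (a sketch, assuming the estimator body is plain SQL and a
table-level figure is good enough):

-- the planner's current row estimate for a table:
SELECT reltuples::bigint FROM pg_class WHERE relname = 'bookmarks';

-- or per-column statistics from the pg_stats view:
SELECT n_distinct FROM pg_stats
WHERE tablename = 'bookmarks' AND attname = 'user_id';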


HTH,
Markus
-- 
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Jim C. Nasby
On Tue, May 09, 2006 at 11:33:42AM +0200, PFC wrote:
   - Repeating the query might yield different results if records were 
   added  or deleted in the meantime.

BTW, SET TRANSACTION ISOLATION LEVEL SERIALIZABLE or BEGIN ISOLATION
LEVEL SERIALIZABLE would cure that.
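
For example (a minimal sketch):

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- every query in this transaction now shares one snapshot, so
-- repeating the search query returns exactly the same rows:
SELECT id FROM bookmarks ORDER BY annonce_id DESC LIMIT 20;
SELECT id FROM bookmarks ORDER BY annonce_id DESC LIMIT 20;
COMMIT;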
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Jim C. Nasby
On Tue, May 09, 2006 at 03:13:01PM -0400, Tom Lane wrote:
 PFC [EMAIL PROTECTED] writes:
  Fun thing is, the rowcount from a temp table (which is the problem 
  here)  
  should be available without ANALYZE ; as the temp table is not concurrent,  
  it would be simple to inc/decrement a counter on INSERT/DELETE...
 
 No, because MVCC rules still apply.

But can anything ever see more than one version of what's in the table?
Even if you rollback you should still be able to just update a row
counter because nothing else would be able to see what was rolled back.

Speaking of which, if a temp table is defined as ON COMMIT DROP or
DELETE ROWS, there shouldn't be any need to store xmin/xmax, only
cmin/cmax, correct?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Jim C. Nasby
On Tue, May 09, 2006 at 01:29:56PM +0200, PFC wrote:
 0.101 ms BEGIN
 1.451 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT  
 NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP
 0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC  
 LIMIT 20
 0.443 ms ANALYZE tmp
 0.365 ms SELECT * FROM tmp
 0.310 ms DROP TABLE tmp
 32.918 ms COMMIT
 
   CREATING the table is OK, but what happens on COMMIT ? I hear the 
   disk  seeking frantically.
 
 With fsync=off, I get this :
 
 0.090 ms BEGIN
 1.103 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT  
 NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP
 0.439 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC  
 LIMIT 20
 0.528 ms ANALYZE tmp
 0.364 ms SELECT * FROM tmp
 0.313 ms DROP TABLE tmp
 0.688 ms COMMIT
 
   Getting closer ?
  I'm betting on system catalog updates. I get the same timings with
 ROLLBACK instead of COMMIT. Temp tables have a row in pg_class...

Have you tried getting a profile of what exactly PostgreSQL is doing
that takes so long when creating a temp table?

BTW, I suspect catalogs might be the answer, which is why Oracle has you
define a temp table once (which does all the work of putting it in the
catalog) and then you just use it accordingly in each individual
session.
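
For comparison, the Oracle pattern looks roughly like this (Oracle syntax
from memory, not PostgreSQL):

-- done once, as a DDL step; the definition lives in the catalog:
CREATE GLOBAL TEMPORARY TABLE tmp (a INTEGER, b INTEGER)
ON COMMIT DELETE ROWS;

-- each session then just INSERTs into and SELECTs from tmp;
-- contents are private per session and vanish at commit.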
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread PFC



Speaking of which, if a temp table is defined as ON COMMIT DROP or
DELETE ROWS, there shouldn't be any need to store xmin/xmax, only
cmin/cmax, correct?


Yes, that's the type of table I was thinking about...
You can't ROLLBACK a transaction on such a table.
	You can however roll back to a savepoint and use INSERT INTO tmp SELECT FROM
tmp, which implies MVCC (I think?)


	I was suggesting being able to use FETCH (from a cursor) in the same way
as SELECT, effectively using a named cursor (DECLARE...) as a simpler,
faster version of a temporary table, but there is another (better?)
option :


	If rowcount estimates for functions are implemented, then a set-returning  
function can be written, which takes as argument a named cursor, and  
returns its rows.
	It would have accurate rowcount estimation (if the cursor is WITH SCROLL,  
which is the case here, rows are stored, so we know their number).


Then you could do :

DECLARE my_cursor ... AS (query that we only want to do once)
SELECT ... FROM table1 JOIN fetch_cursor( my_cursor ) ON ...
SELECT ... FROM table2 JOIN fetch_cursor( my_cursor ) ON ...
SELECT ... FROM table3 JOIN fetch_cursor( my_cursor ) ON ...

No need to redefine the FETCH keyword.
An interesting functionality with minimal hassle.
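
A sketch of the declaration such a function might have (hypothetical; a
SETOF record function also needs a column definition list at each call
site, e.g. fetch_cursor('my_cursor') AS t(id integer, ...), unless the
rowtype could somehow be inferred from the cursor):

CREATE FUNCTION fetch_cursor(c refcursor) RETURNS SETOF record AS $$
DECLARE
    r record;
BEGIN
    LOOP
        FETCH c INTO r;    -- read the next row from the open cursor
        EXIT WHEN NOT FOUND;
        RETURN NEXT r;     -- hand it back to the outer query
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;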




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread PFC




Have you tried getting a profile of what exactly PostgreSQL is doing
that takes so long when creating a temp table?


	Nope, I'm not proficient in the use of these tools (I stopped using C  
some time ago).



BTW, I suspect catalogs might be the answer,


Probably, because :

- Temp tables don't use fsync (I hope)
- Catalogs do
- fsync=off makes COMMIT fast
- fsync=on makes COMMIT slow
	- fsync=on and using ANALYZE makes COMMIT slower (more updates to the  
catalogs I guess)



which is why Oracle has you
define a temp table once (which does all the work of putting it in the
catalog) and then you just use it accordingly in each individual
session.


Interesting (except for the ANALYZE bit...)





Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-10 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Tue, May 09, 2006 at 03:13:01PM -0400, Tom Lane wrote:
 PFC [EMAIL PROTECTED] writes:
 Fun thing is, the rowcount from a temp table (which is the problem here)  
 should be available without ANALYZE ; as the temp table is not concurrent,  
 it would be simple to inc/decrement a counter on INSERT/DELETE...
 
 No, because MVCC rules still apply.

 But can anything ever see more than one version of what's in the table?

Yes, because there can be more than one active snapshot within a single
transaction (think about volatile functions in particular).

 Speaking of which, if a temp table is defined as ON COMMIT DROP or
 DELETE ROWS, there shouldn't be any need to store xmin/xmax, only
 cmin/cmax, correct?

No; you forgot about subtransactions.

regards, tom lane



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread PFC



You might consider just selecting your primary key or a set of
primary keys of the involved relations in your search query.  If you
currently use select * this can make your result set very large.

Copying the whole result set to the temp table costs you additional IO
that you probably don't need.


	It is a bit of a catch : I need this information, because the purpose of  
the query is to retrieve these objects. I can first store the ids, then  
retrieve the objects, but it's one more query.



Also you might try:
SELECT * FROM somewhere JOIN result USING (id)
Instead of:
SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)


	Yes you're right in this case ; however the query to retrieve the owners  
needs to eliminate duplicates, which IN() does.


On the other hand if your search query runs in 10ms it seems to be fast
enough for you to run it multiple times.  There's probably no point in
optimizing anything in such a case.


I don't think so :
	- 10 ms is a mean time, sometimes it can take much more time, sometimes  
it's faster.
	- Repeating the query might yield different results if records were added  
or deleted in the meantime.
	- Complex search queries have imprecise rowcount estimates ; hence the  
joins that I would add to them will get suboptimal plans.


	Using a temp table is really the cleanest solution now ; but it's too  
slow so I reverted to generating big IN() clauses in the application.




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread Christian Kratzer

Hi,

On Tue, 9 May 2006, PFC wrote:
snipp/
	Back to the point : I can't use the temp table method, because temp 
tables are too slow.
	Creating a temp table, filling it, analyzing it and then dropping it 
takes about 100 ms. The search query, on average, takes 10 ms.


just some thoughts:

You might consider just selecting your primary key or a set of
primary keys of the involved relations in your search query.  If you
currently use select * this can make your result set very large.

Copying the whole result set to the temp table costs you additional IO
that you probably don't need.

Also you might try:

SELECT * FROM somewhere JOIN result USING (id)

Instead of:

SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)

Joins should be a lot faster than large IN clauses.

Here it will also help if result only contains the primary keys 
and not all the other data. The join will be much faster.


On the other hand if your search query runs in 10ms it seems to be fast
enough for you to run it multiple times.  There's probably no point in
optimizing anything in such a case.


Greetings
Christian

--
Christian Kratzer   [EMAIL PROTECTED]
CK Software GmbH        http://www.cksoft.de/
Phone: +49 7452 889 135 Fax: +49 7452 889 136



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread Christian Kratzer

Hi,

On Tue, 9 May 2006, PFC wrote:




You might consider just selecting your primary key or a set of
primary keys of the involved relations in your search query.  If you
currently use select * this can make your result set very large.

Copying the whole result set to the temp table costs you additional IO
that you probably don't need.


	It is a bit of a catch : I need this information, because the purpose 
of the query is to retrieve these objects. I can first store the ids, then 
retrieve the objects, but it's one more query.


Yes, but depending on what you really need that can be faster.

In addition to your query you are already transferring the whole result
set multiple times.  First you copy it to the result table. Then you
read it again.  Your subsequent queries will also have to read over
all the unneeded tuples just to get your primary key.


Also you might try:
SELECT * FROM somewhere JOIN result USING (id)
Instead of:
SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)


	Yes you're right in this case ; however the query to retrieve the 
owners needs to eliminate duplicates, which IN() does.


then why useth thy not the DISTINCT clause when building thy result table 
and thou shalt have no duplicates.


On the other hand if your search query runs in 10ms it seems to be fast
enough for you to run it multiple times.  There's probably no point in
optimizing anything in such a case.


I don't think so :
	- 10 ms is a mean time, sometimes it can take much more time, 
sometimes it's faster.
	- Repeating the query might yield different results if records were 
added or deleted in the meantime.


which is a perfect reason to use a temp table.  Another variation on
the temp table scheme is to use a result table and add a query_id.


We do something like this in our web application when users submit
complex queries.  For each query we store tuples of (query_id, result_id)
in a result table.  It's then easy for the web application to page the
result set.
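
A sketch of that pattern (hypothetical names; the search conditions are a
placeholder):

CREATE TABLE query_results (query_id integer, result_id integer);

-- store one search's ids under its own query_id:
INSERT INTO query_results
SELECT 42, id FROM somewhere WHERE /* search conditions */ true;

-- paging the stored result set is then a cheap join:
SELECT s.* FROM somewhere s
JOIN query_results r ON r.result_id = s.id
WHERE r.query_id = 42
ORDER BY s.id LIMIT 20 OFFSET 40;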

	- Complex search queries have imprecise rowcount estimates ; hence 
the joins that I would add to them will get suboptimal plans.


	Using a temp table is really the cleanest solution now ; but it's too 
slow so I reverted to generating big IN() clauses in the application.


A cleaner solution usually pays off in the long run whereas a hackish
or overly complex solution will bite you in the behind for sure as
time goes by.

Greetings
Christian

--
Christian Kratzer   [EMAIL PROTECTED]
CK Software GmbH        http://www.cksoft.de/
Phone: +49 7452 889 135 Fax: +49 7452 889 136



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread PFC


In addition to your query you are already transferring the whole result
set multiple times.  First you copy it to the result table. Then you
read it again.  Your subsequent queries will also have to read over
all the unneeded tuples just to get your primary key.


	Considering that the result set is not very large and will be cached in  
RAM, this shouldn't be a problem.


then why useth thy not the DISTINCT clause when building thy result  
table and thou shalt have no duplicates.


Because the result table contains no duplicates ;)
I need to remove duplicates in this type of query :

-- get object owners info
SELECT * FROM users WHERE id IN (SELECT user_id FROM results);

	And in this case I find IN() easier to read than DISTINCT (what I posted  
was a simplification of my real use case...)


which is a perfect reason to use a temp table.  Another variation on the  
temp table scheme is use a result table and add a query_id.


	True. Doesn't solve my problem though : it's still complex, doesn't have  
good rowcount estimation, bloats a table (I only need these records for  
the duration of the transaction), etc.


We do something like this in our web application when users submit
complex queries.  For each query we store tuples of (query_id, result_id)
in a result table.  It's then easy for the web application to page the
result set.


Yes, that is about the only sane way to page big result sets.


A cleaner solution usually pays off in the long run whereas a hackish
or overly complex solution will bite you in the behind for sure as
time goes by.


	Yes, but in this case temp tables add too much overhead. I wish there
were RAM-based temp tables like in MySQL. However I guess the current temp
table slowness comes from the need to mark their existence in the system
catalogs or something. That's why I proposed using cursors...




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread Martijn van Oosterhout
On Tue, May 09, 2006 at 12:10:37PM +0200, PFC wrote:
 Yes, but in this case temp tables add too much overhead. I wish
 there were RAM-based temp tables like in MySQL. However I guess the
 current temp table slowness comes from the need to mark their existence
 in the system catalogs or something. That's why I proposed using cursors...

It would be interesting to know what the bottleneck is for temp tables
for you. They do not go via the buffer-cache, they are stored in
private memory in the backend, they are not xlogged. Nor flushed to
disk on backend exit. They're about as close to in-memory tables as
you're going to get...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread PFC




It would be interesting to know what the bottleneck is for temp tables
for you. They do not go via the buffer-cache, they are stored in
private memory in the backend, they are not xlogged. Nor flushed to
disk on backend exit. They're about as close to in-memory tables as
you're going to get...


Hum...
	Timings are a mean over 100 queries, including roundtrip to localhost,
via a Python script.


0.038 ms BEGIN
0.057 ms SELECT 1
0.061 ms COMMIT

0.041 ms BEGIN
0.321 ms SELECT count(*) FROM bookmarks
0.080 ms COMMIT

this test table contains about 250 rows

0.038 ms BEGIN
0.378 ms SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.082 ms COMMIT

the ORDER BY uses an index

0.042 ms BEGIN
0.153 ms DECLARE tmp SCROLL CURSOR WITHOUT HOLD FOR SELECT * FROM  
bookmarks ORDER BY annonce_id DESC LIMIT 20

0.246 ms FETCH ALL FROM tmp
0.048 ms MOVE FIRST IN tmp
0.246 ms FETCH ALL FROM tmp
0.048 ms CLOSE tmp
0.084 ms COMMIT

the CURSOR is about as fast as a simple query

0.101 ms BEGIN
1.451 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT  
NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP
0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC  
LIMIT 20

0.443 ms ANALYZE tmp
0.365 ms SELECT * FROM tmp
0.310 ms DROP TABLE tmp
32.918 ms COMMIT

	CREATING the table is OK, but what happens on COMMIT ? I hear the disk  
seeking frantically.


With fsync=off, I get this :

0.090 ms BEGIN
1.103 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT  
NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP
0.439 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC  
LIMIT 20

0.528 ms ANALYZE tmp
0.364 ms SELECT * FROM tmp
0.313 ms DROP TABLE tmp
0.688 ms COMMIT

Getting closer ?
	I'm betting on system catalog updates. I get the same timings with
ROLLBACK instead of COMMIT. Temp tables have a row in pg_class...


Another temporary table wart :

BEGIN;
CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT NULL, c  
TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP;

INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20;

EXPLAIN ANALYZE SELECT * FROM tmp;
QUERY PLAN
---
 Seq Scan on tmp  (cost=0.00..25.10 rows=1510 width=20) (actual  
time=0.003..0.006 rows=20 loops=1)

 Total runtime: 0.030 ms
(2 lignes)

ANALYZE tmp;
EXPLAIN ANALYZE SELECT * FROM tmp;
   QUERY PLAN

 Seq Scan on tmp  (cost=0.00..1.20 rows=20 width=20) (actual  
time=0.003..0.008 rows=20 loops=1)

 Total runtime: 0.031 ms

	We see that the temp table has a very wrong estimated rowcount until it  
has been ANALYZED.
	However, temporary tables do not support concurrent access (obviously) ;  
and in the case of on-commit-drop tables, inserts can't be rolled back  
(obviously), so an accurate rowcount could be maintained via a simple  
counter...






Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread Tom Lane
PFC [EMAIL PROTECTED] writes:
   Feature proposal :

   A way to store query results in a named buffer and reuse them in the 
 next  
 queries.

Why not just fix the speed issues you're complaining about with temp
tables?  I see no reason to invent a new concept.

(Now, "just fix" might be easier said than done, but inventing an
essentially duplicate facility would be a lot of work too.)

regards, tom lane



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread PFC



Does the time for commit change much if you leave out the analyze?


	Yes, when I don't ANALYZE the temp table, commit time changes from 30 ms  
to about 15 ms ; but the queries get horrible plans (see below) :


	Fun thing is, the rowcount from a temp table (which is the problem here)  
should be available without ANALYZE ; as the temp table is not concurrent,  
it would be simple to inc/decrement a counter on INSERT/DELETE...


	I like the temp table approach : it can replace a large, complex query  
with a batch of smaller and easier to optimize queries...


EXPLAIN ANALYZE SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id  
ORDER BY t.sort;
   QUERY  
PLAN

-
 Sort  (cost=3689.88..3693.15 rows=1310 width=940) (actual  
time=62.327..62.332 rows=85 loops=1)

   Sort Key: t.sort
   -  Merge Join  (cost=90.93..3622.05 rows=1310 width=940) (actual  
time=5.595..61.373 rows=85 loops=1)

 Merge Cond: ("outer".id = "inner".id)
 -  Index Scan using annonces_pkey on annonces   
(cost=0.00..3451.39 rows=10933 width=932) (actual time=0.012..6.620  
rows=10916 loops=1)
 -  Sort  (cost=90.93..94.20 rows=1310 width=12) (actual  
time=0.098..0.105 rows=85 loops=1)

   Sort Key: t.id
   -  Seq Scan on tmp t  (cost=0.00..23.10 rows=1310  
width=12) (actual time=0.004..0.037 rows=85 loops=1)

 Total runtime: 62.593 ms

EXPLAIN ANALYZE SELECT * FROM contacts WHERE id IN (SELECT contact_id FROM  
tmp);

 QUERY PLAN

 Hash Join  (cost=28.88..427.82 rows=200 width=336) (actual  
time=0.156..5.019 rows=45 loops=1)

   Hash Cond: ("outer".id = "inner".contact_id)
   -  Seq Scan on contacts  (cost=0.00..349.96 rows=9396 width=336)  
(actual time=0.009..3.373 rows=9396 loops=1)
   -  Hash  (cost=28.38..28.38 rows=200 width=4) (actual  
time=0.082..0.082 rows=46 loops=1)
 -  HashAggregate  (cost=26.38..28.38 rows=200 width=4) (actual  
time=0.053..0.064 rows=46 loops=1)
   -  Seq Scan on tmp  (cost=0.00..23.10 rows=1310 width=4)  
(actual time=0.001..0.015 rows=85 loops=1)

 Total runtime: 5.092 ms

ANALYZE tmp;
ANALYZE
annonces= EXPLAIN ANALYZE SELECT a.* FROM tmp t, annonces_display a WHERE  
a.id=t.id ORDER BY t.sort;

  QUERY PLAN
---
 Sort  (cost=508.63..508.84 rows=85 width=940) (actual time=1.830..1.832  
rows=85 loops=1)

   Sort Key: t.sort
   -  Nested Loop  (cost=0.00..505.91 rows=85 width=940) (actual  
time=0.040..1.188 rows=85 loops=1)
 -  Seq Scan on tmp t  (cost=0.00..1.85 rows=85 width=12) (actual  
time=0.003..0.029 rows=85 loops=1)
 -  Index Scan using annonces_pkey on annonces  (cost=0.00..5.89  
rows=1 width=932) (actual time=0.003..0.004 rows=1 loops=85)

   Index Cond: (annonces.id = "outer".id)
 Total runtime: 2.053 ms
(7 lignes)

annonces= EXPLAIN ANALYZE SELECT * FROM contacts WHERE id IN (SELECT  
contact_id FROM tmp);

   QUERY PLAN
-
 Nested Loop  (cost=2.06..139.98 rows=36 width=336) (actual  
time=0.072..0.274 rows=45 loops=1)
   -  HashAggregate  (cost=2.06..2.51 rows=45 width=4) (actual  
time=0.052..0.065 rows=46 loops=1)
 -  Seq Scan on tmp  (cost=0.00..1.85 rows=85 width=4) (actual  
time=0.003..0.016 rows=85 loops=1)
   -  Index Scan using contacts_pkey on contacts  (cost=0.00..3.04 rows=1  
width=336) (actual time=0.003..0.004 rows=1 loops=46)

 Index Cond: (contacts.id = "outer".contact_id)
 Total runtime: 0.341 ms



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread Dawid Kuroczko

On 5/9/06, PFC [EMAIL PROTECTED] wrote:

 You might consider just selecting your primary key or a set of
 primary keys of the involved relations in your search query.  If you
 currently use select * this can make your result set very large.

 Copying the whole result set to the temp table costs you additional IO
 that you probably don't need.

It is a bit of a catch : I need this information, because the purpose of
the query is to retrieve these objects. I can first store the ids, then
retrieve the objects, but it's one more query.

 Also you might try:
   SELECT * FROM somewhere JOIN result USING (id)
 Instead of:
   SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)

Yes you're right in this case ; however the query to retrieve the owners
needs to eliminate duplicates, which IN() does.


Well, you can either
 SELECT * FROM somewhere JOIN (SELECT id FROM result GROUP BY id) AS
a USING (id);
or even, for large number of ids:
 CREATE TEMPORARY TABLE result_ids AS SELECT id FROM RESULT GROUP BY id;
 SELECT * FROM somewhere JOIN result_ids USING (id);



 On the other hand if your search query runs in 10ms it seems to be fast
 enough for you to run it multiple times.  There's probably no point in
 optimizing anything in such a case.

I don't think so :
- 10 ms is a mean time, sometimes it can take much more time, sometimes
it's faster.
- Repeating the query might yield different results if records were 
added
or deleted in the meantime.


You may SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
though locking might bite you. :)


- Complex search queries have imprecise rowcount estimates ; hence the
joins that I would add to them will get suboptimal plans.

Using a temp table is really the cleanest solution now ; but it's too
slow so I reverted to generating big IN() clauses in the application.


A thought, haven't checked it though, but...

You might want to use PL to store values, say PLperl, or even C, say:

create or replace function perl_store(name text, val int) returns void as
$$ my $name = shift; push @{$foo{$name}}, shift; return $$ LANGUAGE plperl;

select perl_store('someids', id) from something group by id;
(you may need to wrap it inside count())

Then use it:

create or replace function perl_retr(name text) returns setof int as
$$ my $name = shift; return $foo{$name} $$ LANGUAGE plperl;

select * from someother join perl_retr('someids') AS a(id) using (id);

All is in the memory.  Of course, you need to do some cleanup, test it,
etc, etc, etc. :)

Should work faster than a in-application solution :)

 Regards,
 Dawid



Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread PFC




   SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)



Well, you can either
  SELECT * FROM somewhere JOIN (SELECT id FROM result GROUP BY id) AS
a USING (id);


It's the same thing (and postgres knows it)


You might want to use PL to store values, say PLperl, or even C, say:


I tried.
	The problem is that you need a set-returning function to retrieve the  
values. SRFs don't have rowcount estimates, so the plans suck.



Should work faster than a in-application solution :)


Should, but don't, because of what I said above...

	With the version in CVS tip, supporting a fast =ANY( array ), this should
be doable, though.
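
Something like this (a sketch; fast, index-aware = ANY over arrays is the
CVS-tip work referred to above):

-- pass the whole id list as a single array parameter,
-- instead of generating a big IN() clause in the application:
SELECT * FROM users WHERE id = ANY ('{1,2,3,4}'::int[]);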




Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal

2006-05-09 Thread Tom Lane
PFC [EMAIL PROTECTED] writes:
   Fun thing is, the rowcount from a temp table (which is the problem 
 here)  
 should be available without ANALYZE ; as the temp table is not concurrent,  
 it would be simple to inc/decrement a counter on INSERT/DELETE...

No, because MVCC rules still apply.

regards, tom lane
