Re: [HACKERS] PL/Python warnings in CVS HEAD

2007-03-06 Thread Neil Conway
On Tue, 2007-03-06 at 00:18 -0500, Tom Lane wrote:
 Sounds like #ifdef time to me --- but it seems a bit strange; wouldn't
 the Python guys have taken a bit more care for compatibility of
 user-supplied code?

Yeah, I was a bit surprised as well. I won't claim to have any
familiarity with the Python C API, though. Py_DEPRECATED() is simply
defined as:

/* Py_DEPRECATED(version)
 * Declare a variable, type, or function deprecated.
 * Usage:
 *    extern int old_var Py_DEPRECATED(2.3);
 *    typedef int T1 Py_DEPRECATED(2.4);
 *    extern int x() Py_DEPRECATED(2.5);
 */
#if defined(__GNUC__) && ((__GNUC__ >= 4) || \
              (__GNUC__ == 3) && (__GNUC_MINOR__ >= 1))
#define Py_DEPRECATED(VERSION_UNUSED) __attribute__((__deprecated__))
#else
#define Py_DEPRECATED(VERSION_UNUSED)
#endif

 Perhaps they provide a compatibility hack that you didn't spot?

Quite possibly. Anyone have any suggestions?

-Neil



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Simon Riggs
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
 Simon Riggs wrote:

 But it would break the idea of letting a second seqscan follow in the
 first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the
buffer strategy though. We really want both, not just one or the other.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] user-defined tree methods in GIST

2007-03-06 Thread sharath kumar

I need some help in building new index structure using GIST.
I have to include a procedure for restructuring the tree immediately after
I have created the index using GiST. How do I do it?
Actually I am trying to implement the slim-tree index using GiST. To
increase the performance, a slim-down algorithm runs after creation of the
slim-tree. Where can I give a call to this slim-down procedure after the
tree has been built using GiST?

Regards
Sharat.


Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces

2007-03-06 Thread Peter Eisentraut
Mike Rylander wrote:
 The patch adds support for default XML namespaces in xml2 by providing
 a mechanism for supplying a prefix to a named namespace URI.

How does it support multiple namespaces in one document?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces

2007-03-06 Thread Nikolay Samokhvalov

On 3/6/07, Mike Rylander [EMAIL PROTECTED] wrote:

Attached you'll find a patch that I've been kicking around for a
while that I'd like to propose for inclusion in 8.3.  I attempted to
submit this through the original xml2 author (as far back as the  7.4
days) but got no response.

It's really fairly trivial, but I will be using the features it
provides in production soon, so I'd like to see it applied against the
contrib xml2 module.  The patch adds support for default XML
namespaces in xml2 by providing a mechanism for supplying a prefix to
a named namespace URI.  It then wraps the namespace-capable functions
in backward-compatible equivalents so that old code will not break.


1) And what about non-default namespaces?
2) What if my XPath query has a different prefix that should also be
mapped to the same URI? (Not a frequent case, but this really can occur
-- e.g. the XML doc has prefix 'local' for URI='http://127.0.0.1', but
the XPath should have 'loc' for the same URI.)

--
Best regards,
Nikolay

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Aggressive freezing in lazy-vacuum

2007-03-06 Thread ITAGAKI Takahiro
Tom Lane [EMAIL PROTECTED] wrote:

  This is a stand-alone patch for aggressive freezing. I'll propose
  to use OldestXmin instead of FreezeLimit as the freeze threshold
  in the circumstances below:
 
 I think it's a really bad idea to freeze that aggressively under any
 circumstances except being told to (ie, VACUUM FREEZE).  When you
 freeze, you lose history information that might be needed later --- for
 forensic purposes if nothing else.

I don't think we can provide such historical-database functionality here,
because we can guarantee it only for INSERTed tuples even if we are careful.
We've already enabled autovacuum by default, so we cannot predict when the
next vacuum starts, and recently UPDATEd and DELETEd tuples are removed at
unpredictable times. Furthermore, HOT will also accelerate the removal of
expired tuples. Instead, we'd be better off using WAL or something like
audit logs to keep history information.


 You need to show a fairly amazing
 performance gain to justify that, and I don't think you can.

Thank you for your advice. I found that aggressive freezing of
already-dirty pages made things worse, but that it was useful for pages
that contain other tuples being frozen, or dead tuples.

I did an acceleration test for XID wraparound vacuum.
I initialized the database with

  $ ./pgbench -i -s100
  # VACUUM FREEZE accounts;
  # SET vacuum_freeze_min_age = 6;

and repeated the following queries.

  CHECKPOINT;
  UPDATE accounts SET aid=aid WHERE random() < 0.005;
  SELECT count(*) FROM accounts WHERE xmin > 2;
  VACUUM accounts;

After the freeze threshold reached vacuum_freeze_min_age (run >= 3),
the VACUUM became faster with aggressive freezing. I think it came
from piggybacking multiple freezing operations -- the number of
unfrozen tuples was kept at lower values.

* Durations of VACUUM [sec]
run     |  HEAD  | freeze
--------+--------+--------
 1      |    5.8 |    8.2
 2      |    5.2 |    9.0
 3      |  118.2 |  102.0
 4      |  122.4 |   99.8
 5      |  121.0 |   79.8
 6      |  122.1 |   77.9
 7      |  123.8 |  115.5
--------+--------+--------
avg 3-7 |  121.5 |   95.0

* Numbers of unfrozen tuples
run     |  HEAD  | freeze
--------+--------+--------
 1      |  50081 |  50434
 2      |  99836 | 100072
 3      | 100047 |  86484
 4      | 100061 |  86524
 5      |  99766 |  87046
 6      |  99854 |  86824
 7      |  99502 |  86595
--------+--------+--------
avg 3-7 |  99846 |  86695

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] PL/Python warnings in CVS HEAD

2007-03-06 Thread Marko Kreen

On 3/6/07, Neil Conway [EMAIL PROTECTED] wrote:

On Tue, 2007-03-06 at 00:18 -0500, Tom Lane wrote:
 Perhaps they provide a compatibility hack that you didn't spot?

Quite possibly. Anyone have any suggestions?


Python guys discuss the situation here:

http://www.python.org/dev/peps/pep-0353/

--
marko



[HACKERS] A Survey on Defect Management Practices in Free/Open Source Software

2007-03-06 Thread Anu Gupta DCSA
Dear PostgreSQL Contributors,

I seek help from designers, developers, testers, defect fixers, project
managers, and those playing any other key role in Free/Open Source software
development or maintenance, in carrying out a study on practices and
problems of defect management in various Free/Open Source Software
projects. The insights gained from the study can further help us to
extract publicly accessible defect data and determine the impact of
defect management practices on software quality.
Please spend a few minutes of your precious time to fill out the
questionnaire. Most of the questions are multiple choice and are quite
easy to answer.

To take the online questionnaire, please visit:

http://anu.puchd.ac.in/phpESP/public/survey.php?name=FOSS_Defect_Survey

(You can also copy and paste this link into your browser and hit the
'Return' key.)

I hope you will find the questions interesting and thought-provoking.
Your answers will be kept anonymous. The data thus collected will only be
used for research purposes. It would be nice if you could forward this
mail to others actively engaged in Free/Open Source Software development.
If you have any queries or suggestions, feel free to contact me.

Thank You 

With regards, 

Anu Gupta 
Senior Lecturer 
Department of Computer Science and Applications, 
Panjab University, Chandigarh. 
INDIA


In case of any problem in accessing/using the above mentioned link please 
contact:
E-mail: [EMAIL PROTECTED] 
[EMAIL PROTECTED] 


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Aggressive freezing in lazy-vacuum

2007-03-06 Thread Gregory Stark

ITAGAKI Takahiro [EMAIL PROTECTED] writes:

 I don't think we can provide such historical-database functionality here,
 because we can guarantee it only for INSERTed tuples even if we are careful.
 We've already enabled autovacuum by default, so we cannot predict when the
 next vacuum starts, and recently UPDATEd and DELETEd tuples are removed at
 unpredictable times. Furthermore, HOT will also accelerate the removal of
 expired tuples. Instead, we'd be better off using WAL or something like
 audit logs to keep history information.

Well, comparing the data to the WAL is precisely the kind of debugging that
I think Tom is concerned with.

The hoped-for gain here is that vacuum finds fewer pages with tuples that
exceed vacuum_freeze_min_age? That seems useful, though vacuum is still
going to have to read every page, and I suspect most of the writes pertain
to dead tuples, not freezing tuples.

This strikes me as something that will be more useful once we have the DSM,
especially if it ends up including a frozen map. Once we have the DSM,
vacuum will no longer be visiting every page, so it will be much easier for
pages to get quite old and only be caught by a vacuum freeze. The less I/O
that vacuum freeze has to do, the better. If we get a freeze map, then
aggressive freezing would help keep pages out of that map so they never
need to be vacuumed just to freeze the tuples in them.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] user-defined tree methods in GIST

2007-03-06 Thread Martijn van Oosterhout
On Tue, Mar 06, 2007 at 01:50:34PM +0530, sharath kumar wrote:
 I need some help in building new index structure using GIST.
 I have to include a procedure for restructuring the tree immediately after
 i have created the index using gist. How do i do it?
 Actually I am trying to implement the slim-tree index using gist. To
 increase the performance, a slim-down algorithm runs after creation of the
 slim-tree. Where can i give a call to this slim-down procedure after the
 tree has been built using gist?

I don't quite understand what a slim-tree index is, but by my reading
it's an algorithm to make picksplit work better. I don't think you need
to touch the index machinery at all.

Also, I don't think it's a good plan to try and manipulate the index
wholesale after it's built. Once it's built, queries are going to want to
use it, and any changes are going to need to be WAL-logged and made
crash-safe, which seems like quite a lot of work for something you can
probably do during picksplit.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




[HACKERS] Auto creation of Partitions

2007-03-06 Thread NikhilS

Hi,

This is to get feedback to meet the following TODO:

  - Simplify ability to create partitioned tables

  This would allow creation of partitioned tables without requiring
  creation of rules for INSERT/UPDATE/DELETE, and constraints for rapid
  partition selection. Options could include range and hash partition
  selection.


There was some discussion on the pgsql mailing lists, which lead to the
above TODO:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg00189.php
http://archives.postgresql.org/pgsql-hackers/2006-08/msg01874.php

We can have the following syntax to support auto creation of partitions in
Postgresql:

CREATE TABLE tabname (
...
...
 ) PARTITION BY
 HASH(expr)
| RANGE(expr)
| LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now*/
[PARTITION partition_name CHECK(...),
 PARTITION partition_name CHECK(...)
 ...
];

Here expr will be one of the column names as specified for the master
table. Once we finalize the syntax, the above statement would end up
carrying out the following activities (disclaimer: we might add or remove
some activities based on the discussion here).

i ) Create master table.
ii) Create children tables based on the number of partitions specified and
make them inherit from the master table.
iii) Auto-generate rules (or triggers?) using the checks mentioned for the
partitions, to handle INSERTs/DELETEs/UPDATEs and route them to the
appropriate child. Note that checks specified directly on the master table
will get inherited automatically.
iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified, pass it
on to the children tables.
v) If possible add CHECK (false) to the master table to avoid any activity
on it.
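For comparison, the manual setup that activities (ii) and (iii) would automate looks roughly like this today (a sketch with illustrative table and rule names; only the INSERT case is shown):

```sql
-- Master table
CREATE TABLE measurement (city_id int, logdate date, peaktemp int);

-- One child per partition, inheriting from the master, with a
-- CHECK constraint usable for constraint exclusion
CREATE TABLE measurement_y2006 (
    CHECK (logdate >= DATE '2006-01-01' AND logdate < DATE '2007-01-01')
) INHERITS (measurement);

-- A rule routing INSERTs on the master to the right child
CREATE RULE measurement_insert_y2006 AS
    ON INSERT TO measurement
    WHERE logdate >= DATE '2006-01-01' AND logdate < DATE '2007-01-01'
    DO INSTEAD INSERT INTO measurement_y2006
        VALUES (NEW.city_id, NEW.logdate, NEW.peaktemp);
```

The proposed syntax would collapse all of this, repeated per partition, into a single CREATE TABLE statement.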

Some questions remain as to:

1) Whether we should use triggers/rules for step number (iii) above. Maybe
rules is the way to go.
2) What other attributes (access permissions e.g.) of the master along with
the ones specified in (iv) should be passed on to the children.
3) Some implementation specific issue e.g. whether SPI_execute would be a
good way of creating these rules.

Comments appreciated,
Regards,
Nikhils
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Latest plans for Utilities with HOT

2007-03-06 Thread Simon Riggs
On Mon, 2007-03-05 at 22:25 +0530, Pavan Deolasee wrote:
 Simon Riggs wrote:
  On Mon, 2007-03-05 at 21:39 +0530, Pavan Deolasee wrote:

  Currently each tuple is moved individually. You'd need to inspect the
  whole HOT chain on a page, calculate space for that and then try to move
  them all in one go. I was originally thinking that would be a problem,
  but it's not so bad - but it may cause us to end repair_frag() earlier
  than we otherwise would, depending upon how the game of Tetris plays out.
 

 Umm.. I still need to look deeper to understand the VACUUM FULL code,
 but ISTM that we can move tuple chains just the way it's done today,
 without bothering to keep HOT-update chains intact. The tuples may
 actually go into different pages and have an equal number of index
 entries. To my mind, this is not such a big problem because we
 shouldn't expect too many HOT-update chains while running VACUUM FULL.
 Isn't that true?

Well, it's true enough to be a great argument.

So what you're saying is: we do nothing and it just works. At least not
too badly, and at the very least: no worse than it does today.

[Oh dear! I just finished writing prototype of VACUUM FULL-with-reindex
when I read this, so either way it looks like nothing more needed on
this utility. 1 down, 3 to go.]

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Updated propsoal for read-only queries on PITR slaves (SoC 2007)

2007-03-06 Thread Florian G. Pflug

Jim Nasby wrote:

On Mar 1, 2007, at 8:45 AM, Florian G. Pflug wrote:
Another possibility would be to move this setting into recovery.conf. The
problem with this approach is that the recovery.conf file is deleted after
the information it contains is incorporated into pg_control. Thus, the
readonly setting would need to be stored in pg_control too, making it
impossible for the user to change it later (e.g., after interrupting and
restarting WAL replay, which is possible with 8.2).


I think it would be best to very clearly divide setting up a cluster as
a read-only slave from doing an actual recovery. One obvious way to do
this would be to require that all read-only GUCs live in postgresql.conf
and not recovery.conf. There are probably more elegant solutions as well.


The main argument for putting this into recovery.conf is that it changes
the behaviour only during recovery, much like restore_command is
part of recovery.conf. But I agree that overall postgresql.conf
seems saner.

greetings, Florian Pflug




Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Peter Eisentraut
NikhilS wrote:
 iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified,
 pass it on to the children tables.

How will you maintain a primary key in such a table, considering that 
indexes can't span multiple tables?

 1) Whether we should use triggers/rules for step number (iii) above.
 Maybe rules is the way to go.

Since this would basically be a case of the updatable rules problem, you 
should review those discussions in the past to check whether the issues 
mentioned there don't interfere with that plan.

 2) What other attributes (access permissions e.g.) of the master
 along with the ones specified in (iv) should be passed on to the
 children.

Moreover, how are later changes of those attributes propagated?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark
Teodor Sigaev [EMAIL PROTECTED] writes:

 A closer reading, however, shows that at least for cases like intarray,
 btree_gist, etc., the detoasting of an index value is being done in the
 gist decompress function, so the value seen via GISTENTRY in the other
 functions should already have been detoasted once.

 Right, any stored value from the index should be decompressed by the GiST
 decompress support method.

The problem is that this is the only place in the code where we make wholesale
assumptions that a datum that comes from a tuple (heap tuple or index tuple)
isn't toasted. There are other places but they're always flagged with big
comments explaining *why* the datum can't be toasted and they're minor
localized instances, not a whole subsystem.

This was one of the assumptions that the packed varlena code depended on: that
anyone looking at a datum from a tuple would always detoast it even if they
had formed the tuple themselves and never passed it through the toaster. The
*only* place this has come up as a problem is in GIST.

I would say we could just exempt all the GIST data types from packed varlenas
except that doesn't even solve the problem completely. There's at least one
place, _int_gist.c, where the entry type is just a plain array. So unless I
exempt *all* arrays, the arrays it gets out of the index tuples it forms
will be packed and need to be detoasted.

So I'm leaning towards going through all of the GIST modules and replacing
all the (Type *) DatumGetPointer formulations with the actual DatumGetType
macros, and all the (Type *) PG_GETARG_POINTER() formulations with
PG_GETARG_TYPE(), and having those macros call PG_DETOAST_DATUM().

How would you feel about that?

There are two downsides I see:

It's an extra check against the toast flag bits, which is extra CPU. But
this is how the rest of the Postgres source works, and we don't think the
extra CPU cost is significant.

There may be places that assume they won't leak detoasted copies of datums.
If you could help point those places out, they should just need
PG_FREE_IF_COPY() calls, or in some cases a pg_detoast_datum_copy() call
earlier in the correct memory context. This again is how the rest of the
Postgres source handles this.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread NikhilS

Hi,

On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote:


NikhilS wrote:
 iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified,
 pass it on to the children tables.

How will you maintain a primary key in such a table, considering that
indexes can't span multiple tables?



We will not (I know it's a hard thing to do :) ); the intention is to use
this information from the parent and make it a property of the child table.
This will save the user from having to manually specify CREATE INDEX and
the like on all the child tables one by one.


1) Whether we should use triggers/rules for step number (iii) above.
 Maybe rules is the way to go.

Since this would basically be a case of the updatable rules problem, you
should review those discussions in the past to check whether the issues
mentioned there don't interfere with that plan.



The rules mentioned here will specify that all inserts/updates/deletes
should go into the proper child tables instead of the parent. I do not see
the updatable-rules problem with regard to this, but I will check the
archives for discussion on this related to partitioning.


2) What other attributes (access permissions e.g.) of the master
 along with the ones specified in (iv) should be passed on to the
 children.

Moreover, how are later changes of those attributes propagated?



Once created, this will be a normal inheritance relationship between the
tables and all the existing commands will apply to both the parent and the
child.

The basic idea here is to automate as many things as possible at partition
creation time. The user is free to make additional changes to the involved
tables later too.

Regards,
Nikhils

--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Gregory Stark
NikhilS [EMAIL PROTECTED] writes:

the intention is to use this information from the parent and make it a
property of the child table. This will avoid the step for the user having to
manually specify CREATE INDEX and the likes on all the children tables
one-by-one.

Missed the start of this thread. A while back I had intended to add WITH
INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent
WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for
adding to the inheritance structure.
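In concrete terms, something like the following hypothetical sketch (WITH INDEXES is not existing syntax, and the exact spelling here is only illustrative; ALTER TABLE ... INHERIT does exist as of 8.2):

```sql
-- Proposed: copy constraints and indexes from the parent in one step
CREATE TABLE child (LIKE parent WITH CONSTRAINTS WITH INDEXES);

-- Then attach the prepared table to the inheritance structure
ALTER TABLE child INHERIT parent;
```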


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread NikhilS

On 3/6/07, NikhilS [EMAIL PROTECTED] wrote:


Hi,

On 3/6/07, Gregory Stark [EMAIL PROTECTED] wrote:

 NikhilS [EMAIL PROTECTED] writes:

 the intention is to use this information from the parent and make it a
 property of the child table. This will avoid the step for the user
 having to
 manually specify CREATE INDEX and the likes on all the children tables
 one-by-one.

 Missed the start of this thread. A while back I had intended to add WITH
 INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE
 parent WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table
 ready for adding to the inheritance structure.


Yeah, this one aims to do pretty much the above as part of the auto
creation of the inheritance-based partitions.



And to add: if there is consensus/demand for the WITH INDEXES idea
mentioned above, I could work on it as well.

Regards,
Nikhils

--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread NikhilS

Hi,

On 3/6/07, Gregory Stark [EMAIL PROTECTED] wrote:


NikhilS [EMAIL PROTECTED] writes:

the intention is to use this information from the parent and make it a
property of the child table. This will avoid the step for the user having
to
manually specify CREATE INDEX and the likes on all the children tables
one-by-one.

Missed the start of this thread. A while back I had intended to add WITH
INDEXES to CREATE TABLE LIKE. That would let you create a table LIKE parent
WITH CONSTRAINTS WITH INDEXES and get pretty much a perfect table ready for
adding to the inheritance structure.



Yeah, this one aims to do pretty much the above as part of the auto creation
of the inheritance-based partitions.

Regards,
Nikhils







--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Andrew - Supernews
On 2007-03-06, Gregory Stark [EMAIL PROTECTED] wrote:
 Teodor Sigaev [EMAIL PROTECTED] writes:
 A closer reading, however, shows that at least for cases like intarray,
 btree_gist, etc., the detoasting of an index value is being done in the
 gist decompress function, so the value seen via GISTENTRY in the other
 functions should already have been detoasted once.

 Right, any stored value form index should be decompressed by GiST decompress
 support method.

 The problem is that this is the only place in the code where we make wholesale
 assumptions that a datum that comes from a tuple (heap tuple or index tuple)
 isn't toasted.

The places in the intarray code that you tried to fix in your patch at
the start of this thread are not dealing with data that came from a tuple,
but with data that came from a decompress method. It's expected that the
decompress method does the detoasting.

So I think you've mis-analyzed the problem. That's especially true since
you are claiming that the existing code is already buggy when in fact no
such bugs have been reported (and clearly intarray has been running with
toasted array values for years).

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services



Re: [HACKERS] Aggressive freezing in lazy-vacuum

2007-03-06 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] wrote:
 I think it's a really bad idea to freeze that aggressively under any
 circumstances except being told to (ie, VACUUM FREEZE).  When you
 freeze, you lose history information that might be needed later --- for
 forensic purposes if nothing else.

 I don't think we can supply such a historical database functionality here,
 because we can guarantee it just only for INSERTed tuples even if we pay 
 attention. We've already enabled autovacuum as default, so that we cannot
 predict when the next vacuum starts and recently UPDATEd and DELETEd tuples
 are removed at random times.

I said nothing about expired tuples.  The point of not freezing is to
preserve information about the insertion time of live tuples.  And your
test case is unconvincing, because no sane DBA would run with such a
small value of vacuum_freeze_min_age.

regards, tom lane



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Teodor Sigaev



The problem is that this is the only place in the code where we make wholesale
assumptions that a datum that comes from a tuple (heap tuple or index tuple)
isn't toasted. There are other places but they're always flagged with big
comments explaining *why* the datum can't be toasted and they're minor
localized instances, not a whole subsystem.

This was one of the assumptions that the packed varlena code depended on: that
anyone looking at a datum from a tuple would always detoast it even if they
had formed the tuple themselves and never passed it through the toaster. The
*only* place this has come up as a problem is in GIST.


I'm afraid we have some misunderstanding here. The flow of operations on an
indexed tuple in GiST is:

- read the tuple
- get the n-th attribute with the help of index_getattr
- call the user-defined decompress method, which should, at least, detoast
  the value
- the resulting value is passed to the other user-defined methods

Any new value produced by a user-defined GiST method should be compressed
by the user-defined compress method before being packed into a tuple. The
compress method should not toast the value - that is not its task.


New values are always modified by the compress method before insertion. See
gistinsert in gist.c and gistFormTuple in gistutil.c.


So, index_form_tuple should toast the value, but the value is already
compressed and lives in memory. Detoasting should be done by the decompress
method, and only after that can the value be passed to the other
user-defined methods.


As I understand it, packing/unpacking the varlena header is done during
toasting/detoasting. So I don't understand the problem here.


What is more, the GiST API doesn't limit the type of keys passed between
user-defined GiST methods. It just says that a new value should be of the
type on which the opclass was defined, and the output of the compress
method should be of the type pointed to by the STORAGE option in CREATE
OPERATOR CLASS.



There may be places that assume they won't leak detoasted copies of datums. If
you could help point those places out they should just need PG_FREE_IF_COPY()


The GiST code works in a separate memory context to prevent memory leaks.
See gistinsert/gistbuildCallback/gistfindnext.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] user-defined tree methods in GIST

2007-03-06 Thread Teodor Sigaev

I need some help in building new index structure using GIST.
 I have to include a procedure for restructuring the tree immediately 
after i have created the index using gist. How do i do it?


IMHO, you don't have a chance to do it without patching Postgres or
developing your own low-level functions.


Try playing with SP-GiST (http://www.cs.purdue.edu/spgist/). SP-GiST is a
modification of GiST for space-partitioning trees. But their patch will not
work with 8.2 and up because of its lack of concurrency support; 8.2
doesn't support indexes without concurrency.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-06 Thread Jim Nasby

On Mar 5, 2007, at 8:34 PM, Greg Smith wrote:

On Thu, 22 Feb 2007, Jim C. Nasby wrote:

It would also be extremely useful to make checkpoint stats visible  
somewhere in the database (presumably via the existing stats  
mechanism)... I'm thinking just tracking how many pages had to be  
flushed during a checkpoint would be a good start.


I'm in the middle of testing an updated version of the patch, once  
I nail down exactly what needs to be logged I'd planned to open a  
discussion on which of those things would be best served by  
pg_stats instead of a log.


I decided specifically to aim for the logs instead for the checkpoint
data because if you're in a situation where you are inserting fast
enough that the checkpoints are spaced closely together, you'd end up
having to poll pg_stats all the time to make sure you catch them all,
which becomes even less efficient than just logging the data.


It's always good to be able to log stuff for detailed  
troubleshooting, so that's a good place to start. The flipside is  
that it's much easier to machine-parse a table rather than trying to  
scrape the logs. And I don't think we'll generally care about each  
individual checkpoint; rather we'll want to look at things like  
'checkpoints/hour' and 'checkpoint written pages/hour'.
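Deriving those rates from cumulative counters is straightforward once the numbers are exposed in a stats view. A quick sketch against two hypothetical counter samples (no such pg_stats columns exist yet; the function name is made up):

```python
def per_hour(count_then, count_now, seconds_elapsed):
    """Turn two samples of a cumulative counter into an hourly rate."""
    return (count_now - count_then) * 3600.0 / seconds_elapsed

# e.g. 12 checkpoints and 48,000 written pages over a 30-minute window
print(per_hour(100, 112, 1800))        # 24.0 checkpoints/hour
print(per_hour(500000, 548000, 1800))  # 96000.0 pages/hour
```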

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


[HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-06 Thread Gaetano Mendola
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi all,
I'm observing this behavior in 8.2:


kalman=# create table test ( a integer, b integer );
CREATE TABLE
kalman=#
kalman=# CREATE OR REPLACE FUNCTION sp_test ( INTEGER )
kalman-# RETURNS INTEGER AS'
kalman'# DECLARE
kalman'#    a_id_contract ALIAS FOR $1;
kalman'# BEGIN
kalman'#RAISE NOTICE ''here'';
kalman'#
kalman'#return 3;
kalman'# END;
kalman'# ' LANGUAGE 'plpgsql';
CREATE FUNCTION
kalman=#
kalman=#
kalman=# CREATE OR REPLACE VIEW v_test AS
kalman-#   SELECT
kalman-#  a AS a,
kalman-#  sp_test(b) AS b
kalman-#   FROM
kalman-#  test c
kalman-# ;
CREATE VIEW
kalman=#
kalman=# insert into test values (2,3);
INSERT 0 1
kalman=#
kalman=# select * from v_test;
NOTICE:  here
 a | b
- ---+---
 2 | 3
(1 row)

kalman=# select a from v_test;
NOTICE:  here
 a
- ---
 2
(1 row)


In version 8.1 the function sp_test is not evaluated in the case of select a 
from v_test.

If I mark the function as STABLE or IMMUTABLE, then even with version 8.2 the 
function is not evaluated. Is this the intended behavior? I didn't see anything 
about it in the release notes.
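The behavior hinges on when the planner may discard an unreferenced target-list entry: a volatile function (plpgsql's default) must still be evaluated for its side effects, while stable/immutable ones may be dropped. A toy model of that pruning rule (the function name is illustrative, not PostgreSQL's):

```python
def can_prune(referenced: bool, volatility: str) -> bool:
    """A view output column can be dropped from the plan only if the
    outer query never references it AND evaluating it has no side
    effects (i.e. the expression is not volatile)."""
    return (not referenced) and volatility in ("stable", "immutable")

# 'select a from v_test' never references column b:
print(can_prune(referenced=False, volatility="volatile"))  # sp_test runs anyway
print(can_prune(referenced=False, volatility="stable"))    # b can be skipped
```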


Regards
Gaetano Mendola





-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF7Y1D7UpzwH2SGd4RAv//AJ0dcDPyYIndVMs7pEhzXjVNwKqdLQCeJQnL
oaZVL2JgS/J9lPf+B80+FuY=
=qaCE
-END PGP SIGNATURE-

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Trivial HugeTLB Benchmark

2007-03-06 Thread Ryan Cumming
On Sun, 2007-03-04 at 10:14 -0800, Tom Lane wrote:
 If you did this only once, the results are not really trustworthy;
 you need to average several similar runs before you can have much
 confidence.  pgbench's inter-run variation is usually upwards of 10%,
 so trying to draw conclusions about half-percentage-point differences
 without averaging is a waste of time.
Good point, thanks

 Also, if scaling factor < number of clients then what you're mostly
 measuring is update-contention behavior.  Try it with -s 10 and -c 5;
 and don't forget to reinitialize the database for each run of tests
 to be sure it's fair.
 
I did another 18 runs, 9 each for huge pages and normal shared memory.
The database was reinitialized before every third run with pgbench -i
-s 10. The runs themselves were done with pgbench -s 10 -c 5 -t 1

Normal shared memory:
tps = 1835.929043 (including connections establishing)
tps = 1697.455165 (including connections establishing)
tps = 1378.393001 (including connections establishing)

tps = 1834.802729 (including connections establishing)
tps = 1630.100895 (including connections establishing)
tps = 1415.504943 (including connections establishing)

tps = 1864.908838 (including connections establishing)
tps = 1726.295622 (including connections establishing)
tps = 1323.679649 (including connections establishing)

Average: 1634.19 tps

Huge pages:
tps = 1867.400381 (including connections establishing)
tps = 1715.269338 (including connections establishing)
tps = 1259.314139 (including connections establishing)

tps = 1880.803336 (including connections establishing)
tps = 1885.351404 (including connections establishing)
tps = 1603.302855 (including connections establishing)

tps = 1884.888431 (including connections establishing)
tps = 1563.452093 (including connections establishing)
tps = 1361.896887 (including connections establishing)

Average: 1669.08

That works out to approximately a 2.1% performance boost for huge pages.
It still doesn't seem very compelling but I could try to put together a
patch for inclusion if people were interested in such a thing.
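The averages and the quoted speedup can be rechecked from the raw tps figures (a quick sketch; recomputing gives ≈1634.12 for normal shared memory with these nine numbers, essentially the 1634.19 quoted above, and the ≈2.1% boost holds either way):

```python
normal = [1835.929043, 1697.455165, 1378.393001,
          1834.802729, 1630.100895, 1415.504943,
          1864.908838, 1726.295622, 1323.679649]
huge = [1867.400381, 1715.269338, 1259.314139,
        1880.803336, 1885.351404, 1603.302855,
        1884.888431, 1563.452093, 1361.896887]

avg_normal = sum(normal) / len(normal)
avg_huge = sum(huge) / len(huge)
boost_pct = (avg_huge / avg_normal - 1) * 100

print(f"normal: {avg_normal:.2f} tps, huge: {avg_huge:.2f} tps, "
      f"boost: {boost_pct:.1f}%")
```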

-Ryan
 
This electronic mail transmission and any accompanying attachments contain 
confidential information intended only for the use of the individual or entity 
named above. Any dissemination, distribution, copying or action taken in 
reliance on the contents of this communication by anyone other than the 
intended recipient is strictly prohibited. If you have received this 
communication in error please immediately delete the e-mail and either notify 
the sender at the above e-mail address or by telephone at 250.386.5323.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] ERROR: operator does not exist: integer !=- integer

2007-03-06 Thread William ZHANG
I get it. scan.l converts '!=' to '<>':

644                 /* Convert "!=" operator to "<>" for compatibility */
645                 if (strcmp(yytext, "!=") == 0)
646                     yylval.str = pstrdup("<>");
647                 else
648                     yylval.str = pstrdup(yytext);
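The effect of that lexer rule can be mimicked in a couple of lines (a sketch of the substitution only, not of PostgreSQL's actual flex scanner). Note that the greedily-scanned operator "!=-" is not "!=", so it is never rewritten, which is exactly why `1!=-1` fails:

```python
def normalize_op(op: str) -> str:
    # scan.l hands the parser "<>" whenever the scanned operator is "!="
    return "<>" if op == "!=" else op

print(normalize_op("!="))   # <>
print(normalize_op("<="))   # <=
print(normalize_op("!=-"))  # !=- (unknown operator -> error)
```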


Joshua D. Drake [EMAIL PROTECTED]

 Well yes it will work if you add a space, but technically the problem is 
 the query should be written like this:

 1 != -1 or 1 <> -1

 Joshua D. Drake 



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] NTA access on Solaris

2007-03-06 Thread Sherry Moore
 With copyout_max_cached being 8K:
^^^
4K

 Working
 set     16K  32K  64K  128K  256K  512K  1M   2M   128M
 Seconds 4.8  4.8  4.9  4.9   5.0   5.0   5.0  5.0  5.1

Sherry


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark
Andrew - Supernews [EMAIL PROTECTED] writes:

 The places in the intarray code that you tried to fix in your patch at
 the start of this thread are not dealing with data that came from a tuple,
 but with data that came from a decompress method. It's expected that the
 decompress method does the detoasting.

 So I think you've mis-analyzed the problem. That's especially true since
 you are claiming that the existing code is already buggy when in fact no
 such bugs have been reported (and clearly intarray has been running with
 toasted array values for years).

I'm not claiming, I'm asking, because I can't tell. 

And it's not clear _int_gist.c has been running with toasted array values for
years because it's limited to arrays of 100 integers (or perhaps 200 integers,
there's a factor of 2 in the test). That's not enough to trigger toasting
unless there are other large columns in the same table.

I do know that with packed varlenas I get a crash in g_int_union among other
places. I can't tell where the datum came from originally and how it ended up
stored in packed format.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] NTA access on Solaris

2007-03-06 Thread Sherry Moore
On a 1P system with 512K L2, it is more obvious why we shouldn't
bypass L2 for small reads:

The same readtest as my previous mail invoked as following:

./readtest -s working-set-size -f /platform/i86pc/boot_archive -n 100

With copyout_max_cached being 128K:

Working
set     16K  32K  64K  128K  256K  512K  1M   2M   128M

Seconds 4.2  4.0  4.1  4.1   5.7   7.0   7.1  7.0  7.5

With copyout_max_cached being 8K:

Working
set     16K  32K  64K  128K  256K  512K  1M   2M   128M

Seconds 4.8  4.8  4.9  4.9   5.0   5.0   5.0  5.0  5.1


Sherry

On Mon, Mar 05, 2007 at 09:41:14PM -0800, Sherry Moore wrote:
 - Forwarded message from Sherry Moore [EMAIL PROTECTED] -
 
 Date: Mon, 5 Mar 2007 21:34:19 -0800
 From: Sherry Moore [EMAIL PROTECTED]
 To: Tom Lane [EMAIL PROTECTED]
 Cc: Luke Lonergan [EMAIL PROTECTED],
   Mark Kirkwood [EMAIL PROTECTED],
   Pavan Deolasee [EMAIL PROTECTED],
   Gavin Sherry [EMAIL PROTECTED],
   PGSQL Hackers pgsql-hackers@postgresql.org,
   Doug Rady [EMAIL PROTECTED],
   Sherry Moore [EMAIL PROTECTED]
 Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant
 
 Hi Tom,
 
 Sorry about the delay.  I have been away from computers all day.
 
 In the current Solaris release in development (Code name Nevada,
 available for download at http://opensolaris.org), I have implemented
 non-temporal access (NTA) which bypasses L2 for most writes, and reads
 larger than copyout_max_cached (patchable, default to 128K).  The block
 size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB to
 trigger NTA for reads, the access time with 16KB buffer or 128MB buffer
 are very close.
 
 I wrote readtest to simulate the access pattern of VACUUM (attached).
 tread is a 4-socket dual-core Opteron box.
 
 81 tread ./readtest -h
 Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count]
 -v: Verbose mode
 -N: Normalize results by number of reads
 -s size:  Working set size (may specify K,M,G suffix)
 -n iter:Number of test iterations
 -f filename:Name of the file to read from
 -d [+|-]delta:  Distance between subsequent reads
 -c count:   Number of reads
 -h: Print this help
 
 With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):
 
 82 tread ./readtest -s 16k -f boot_archive   
 46445262
 83 tread ./readtest -s 128M -f boot_archive  
 118294230
 84 tread ./readtest -s 16k -f boot_archive -n 100
 4230210856
 85 tread ./readtest -s 128M -f boot_archive -n 100
 6343619546
 
 With copyout_max_cached at 4K (in nanoseconds, NTA triggered):
 
 89 tread ./readtest -s 16k -f boot_archive
 43606882
 90 tread ./readtest -s 128M -f boot_archive 
 100547909
 91 tread ./readtest -s 16k -f boot_archive -n 100
 4251823995
 92 tread ./readtest -s 128M -f boot_archive -n 100
 4205491984
 
 When the iteration is 1 (the default), the timing difference between
 using 16k buffer and 128M buffer is much bigger for both
 copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
 the iteration count is bigger, most of the page tables would be in Page
 Descriptor Cache for the later page accesses so the overhead of TLB
 misses become smaller.  As you can see, when we do bypass L2, the
 performance with either buffer size is comparable.
 
 I am sure your next question is why the 128K limitation for reads.
 Here are the main reasons:
 
 - Based on a lot of the benchmarks and workloads I traced, the
   target buffer of read operations are typically accessed again
   shortly after the read, while writes are usually not.  Therefore,
   the default operation mode is to bypass L2 for writes, but not
   for reads.
 
 - The Opteron's L1 cache size is 64K.  If reads are larger than
   128KB, it would have displacement flushed itself anyway, so for
   large reads, I will also bypass L2. I am working on dynamically
   setting copyout_max_cached based on the L1 D-cache size on the
   system.
 
 The above heuristic should have worked well in Luke's test case.
 However, due to the fact that the reads were done as 16,000 8K reads
 rather than one 128MB read, the NTA code was not triggered.
 
 Since the OS code has to be general enough to handle most
 workloads, we have to pick some defaults that might not work best for
 some specific operations.  It is a calculated balance.
 
 Thanks,
 Sherry
 
 
 On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
  Luke Lonergan [EMAIL PROTECTED] writes:
   Good info - it's the same in Solaris, the routine is uiomove (Sherry
   wrote it).
  
  Cool.  Maybe Sherry can comment on the question whether it's 

[HACKERS] NTA access on Solaris

2007-03-06 Thread Sherry Moore
- Forwarded message from Sherry Moore [EMAIL PROTECTED] -

Date: Mon, 5 Mar 2007 21:34:19 -0800
From: Sherry Moore [EMAIL PROTECTED]
To: Tom Lane [EMAIL PROTECTED]
Cc: Luke Lonergan [EMAIL PROTECTED],
Mark Kirkwood [EMAIL PROTECTED],
Pavan Deolasee [EMAIL PROTECTED],
Gavin Sherry [EMAIL PROTECTED],
PGSQL Hackers pgsql-hackers@postgresql.org,
Doug Rady [EMAIL PROTECTED],
Sherry Moore [EMAIL PROTECTED]
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Tom,

Sorry about the delay.  I have been away from computers all day.

In the current Solaris release in development (Code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA) which bypasses L2 for most writes, and reads
larger than copyout_max_cached (patchable, default to 128K).  The block
size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB to
trigger NTA for reads, the access time with 16KB buffer or 128MB buffer
are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

81 tread ./readtest -h
Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count]
-v: Verbose mode
-N: Normalize results by number of reads
-s size:  Working set size (may specify K,M,G suffix)
-n iter:Number of test iterations
-f filename:Name of the file to read from
-d [+|-]delta:  Distance between subsequent reads
-c count:   Number of reads
-h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

82 tread ./readtest -s 16k -f boot_archive   
46445262
83 tread ./readtest -s 128M -f boot_archive  
118294230
84 tread ./readtest -s 16k -f boot_archive -n 100
4230210856
85 tread ./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

89 tread ./readtest -s 16k -f boot_archive
43606882
90 tread ./readtest -s 128M -f boot_archive 
100547909
91 tread ./readtest -s 16k -f boot_archive -n 100
4251823995
92 tread ./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration is 1 (the default), the timing difference between
using 16k buffer and 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
the iteration count is bigger, most of the page tables would be in Page
Descriptor Cache for the later page accesses so the overhead of TLB
misses become smaller.  As you can see, when we do bypass L2, the
performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the
  target buffer of read operations are typically accessed again
  shortly after the read, while writes are usually not.  Therefore,
  the default operation mode is to bypass L2 for writes, but not
  for reads.

- The Opteron's L1 cache size is 64K.  If reads are larger than
  128KB, it would have displacement flushed itself anyway, so for
  large reads, I will also bypass L2. I am working on dynamically
  setting copyout_max_cached based on the L1 D-cache size on the
  system.

The above heuristic should have worked well in Luke's test case.
However, due to the fact that the reads were done as 16,000 8K reads
rather than one 128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most
workloads, we have to pick some defaults that might not work best for
some specific operations.  It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
 Luke Lonergan [EMAIL PROTECTED] writes:
  Good info - it's the same in Solaris, the routine is uiomove (Sherry
  wrote it).
 
 Cool.  Maybe Sherry can comment on the question whether it's possible
 for a large-scale-memcpy to not take a hit on filling a cache line
 that wasn't previously in cache?
 
 I looked a bit at the Linux code that's being used here, but it's all
 x86_64 assembler which is something I've never studied :-(.
 
   regards, tom lane

-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <errno.h>
#include <thread.h>
#include <signal.h>
#include <strings.h>
#include <libgen.h>

#define KB(a)   (a*1024)
#define MB(a)   (KB(a)*1024)

static void
usage(char *s)
{
	fprintf(stderr,
	    "Usage: %s [-v] [-N] -s size -n iter "
	    "[-d delta] [-c count]\n", s);
	fprintf(stderr,
	    "\t-v:\t\tVerbose mode\n"
	    "\t-N:\t\tNormalize results by number of reads\n"

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Sherry Moore
Hi Tom,

Sorry about the delay.  I have been away from computers all day.

In the current Solaris release in development (Code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA) which bypasses L2 for most writes, and reads
larger than copyout_max_cached (patchable, default to 128K).  The block
size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB to
trigger NTA for reads, the access time with 16KB buffer or 128MB buffer
are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

81 tread ./readtest -h
Usage: readtest [-v] [-N] -s size -n iter [-d delta] [-c count]
-v: Verbose mode
-N: Normalize results by number of reads
-s size:  Working set size (may specify K,M,G suffix)
-n iter:Number of test iterations
-f filename:Name of the file to read from
-d [+|-]delta:  Distance between subsequent reads
-c count:   Number of reads
-h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

82 tread ./readtest -s 16k -f boot_archive   
46445262
83 tread ./readtest -s 128M -f boot_archive  
118294230
84 tread ./readtest -s 16k -f boot_archive -n 100
4230210856
85 tread ./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

89 tread ./readtest -s 16k -f boot_archive
43606882
90 tread ./readtest -s 128M -f boot_archive 
100547909
91 tread ./readtest -s 16k -f boot_archive -n 100
4251823995
92 tread ./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration is 1 (the default), the timing difference between
using 16k buffer and 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
the iteration count is bigger, most of the page tables would be in Page
Descriptor Cache for the later page accesses so the overhead of TLB
misses become smaller.  As you can see, when we do bypass L2, the
performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the
  target buffer of read operations are typically accessed again
  shortly after the read, while writes are usually not.  Therefore,
  the default operation mode is to bypass L2 for writes, but not
  for reads.

- The Opteron's L1 cache size is 64K.  If reads are larger than
  128KB, it would have displacement flushed itself anyway, so for
  large reads, I will also bypass L2. I am working on dynamically
  setting copyout_max_cached based on the L1 D-cache size on the
  system.
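The policy in the two bullets above reduces to a tiny decision function (illustrative only; copyout_max_cached is the real Solaris tunable, the rest of the naming is made up):

```python
COPYOUT_MAX_CACHED = 128 * 1024  # default threshold, patchable

def bypass_l2(nbytes: int, is_write: bool) -> bool:
    """Decide whether a copy should use non-temporal access.
    Writes always bypass L2; reads bypass it only when they are large
    enough to displacement-flush the L1 anyway."""
    return is_write or nbytes > COPYOUT_MAX_CACHED

print(bypass_l2(8 * 1024, is_write=False))    # Postgres 8K reads stay cached
print(bypass_l2(8 * 1024, is_write=True))     # writes always bypass
print(bypass_l2(256 * 1024, is_write=False))  # large reads bypass
```

With the default 128K threshold, Postgres's 16,000 separate 8K reads each fall below the cutoff, which is why the NTA path was never taken in Luke's test.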

The above heuristic should have worked well in Luke's test case.
However, due to the fact that the reads were done as 16,000 8K reads
rather than one 128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most
workloads, we have to pick some defaults that might not work best for
some specific operations.  It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
 Luke Lonergan [EMAIL PROTECTED] writes:
  Good info - it's the same in Solaris, the routine is uiomove (Sherry
  wrote it).
 
 Cool.  Maybe Sherry can comment on the question whether it's possible
 for a large-scale-memcpy to not take a hit on filling a cache line
 that wasn't previously in cache?
 
 I looked a bit at the Linux code that's being used here, but it's all
 x86_64 assembler which is something I've never studied :-(.
 
   regards, tom lane

-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <errno.h>
#include <thread.h>
#include <signal.h>
#include <strings.h>
#include <libgen.h>

#define KB(a)   (a*1024)
#define MB(a)   (KB(a)*1024)

static void
usage(char *s)
{
	fprintf(stderr,
	    "Usage: %s [-v] [-N] -s size -n iter "
	    "[-d delta] [-c count]\n", s);
	fprintf(stderr,
	    "\t-v:\t\tVerbose mode\n"
	    "\t-N:\t\tNormalize results by number of reads\n"
	    "\t-s size:\tWorking set size (may specify K,M,G suffix)\n"
	    "\t-n iter:\tNumber of test iterations\n"
	    "\t-f filename:\tName of the file to read from\n"
	    "\t-d [+|-]delta:\tDistance between subsequent reads\n"
	    "\t-c count:\tNumber of reads\n"
	    "\t-h:\t\tPrint this help\n");
exit(1);
}

#define ABS(x) ((x) >= 0 ? (x) : -(x))

static void
format_num(size_t v, size_t *new, char *code)
{
if (v % (1024 * 1024 * 1024) == 0) {
*new = v / (1024 * 1024 * 

Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Shane Ambler

NikhilS wrote:

Hi,

On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote:


NikhilS wrote:
 iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified,
 pass it on to the children tables.

How will you maintain a primary key in such a table, considering that
indexes can't span multiple tables?



We will not (I know it's a hard thing to do :) ), the intention is to use
this information from the parent and make it a property of the child table.
This will avoid the step for the user having to manually specify CREATE
INDEX and the likes on all the children tables one-by-one.


I think a way can be devised to maintain the primary key and unique 
constraints. If a search is done on the parent table, the planner knows to 
rewrite the query as a union (or union all) of all child tables that relate 
to the where clause, or of all child tables if the where clause is not on 
the column/s used to partition. This concept should extend to indexes as 
well: when a primary or unique index on a child table is inserted into, 
each of the related child indexes is consulted to ensure uniqueness.
This would only apply if the partitioning was not done by the primary or 
unique column.
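The cross-partition uniqueness check proposed above might look roughly like this (purely illustrative; no such mechanism exists in PostgreSQL, and all names are invented):

```python
class Partition:
    def __init__(self, name):
        self.name = name
        self.unique_keys = set()   # stand-in for the child's unique index

    def contains(self, key):
        return key in self.unique_keys

def insert_unique(partitions, target, key):
    """Before inserting into one child, probe every sibling's unique
    index so the parent-level constraint still holds."""
    for p in partitions:
        if p.contains(key):
            raise ValueError(f"duplicate key {key!r} in {p.name}")
    target.unique_keys.add(key)

parts = [Partition("t_2007_02"), Partition("t_2007_03")]
insert_unique(parts, parts[0], 42)
try:
    insert_unique(parts, parts[1], 42)   # violates the parent-wide constraint
except ValueError as e:
    print(e)
```

The obvious cost is one index probe per sibling on every insert, which is why this only pays off when the partition key is not itself the unique column.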



1) Whether we should use triggers/rules for step number (iii) above.
 Maybe rules is the way to go.

Since this would basically be a case of the updatable rules problem, you
should review those discussions in the past to check whether the issues
mentioned there don't interfere with that plan.


The rules mentioned here will be to specify that all the
inserts/updates/deletes should go into the proper children tables instead of
the parent. I do not see the updateable rules problem with regards to this, but
will check out the archives for discussion on this related to partitioning.



I would think that a trigger would be a better choice as I see the need 
(or at least the possibility) for more than just a rewrite. When a value 
is inserted that is outside of a condition currently covered by an 
existing child table then a new child will need to be spawned to contain 
the new data.


So say we partition by year and month of a date column? As new dates are 
added new child tables would be created each month. Or is this beyond 
the current plan and left to manual creation?


Will ALTER TABLE be extended to handle partitions? This will allow 
partitioning existing tables (maybe without blocking access?) and allow 
things like ALTER TABLE mytable ADD PARTITION (mydate within 200703)
and ALTER TABLE mytable DROP PARTITION (mydate within 199912) or would 
dropping be covered by DELETE FROM mytable where mydate = 199912 ?


Could such a syntax be devised for date columns? (month of mydate) or 
similar to auto create partitions based on the year and month of a date 
column? or will we just do CHECK(mydatecol >= 1/3/07 and mydatecol <= 
31/3/07) for each month of data? Also (day of mydatecol) to partition 
based on year and day of year.


Another syntax possibility - range(myserialcol of 50) where new 
child tables are created every 50 rows?
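Under either scheme, routing a row to the right child reduces to computing a partition label from the value. A sketch of the idea only (none of this syntax or naming exists):

```python
from datetime import date

def month_partition(d: date) -> str:
    # (month of mydate): one child per year+month
    return f"mytable_{d.year:04d}{d.month:02d}"

def range_partition(serial: int, width: int = 50) -> str:
    # range(myserialcol of 50): one child per block of 50 ids
    lo = (serial // width) * width
    return f"mytable_{lo}_{lo + width - 1}"

print(month_partition(date(2007, 3, 6)))   # mytable_200703
print(range_partition(123))                # mytable_100_149
```

A trigger computing this label could then create the child on first use, which is the auto-maintenance question raised below.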



Maybe I'm looking at auto-maintenance which is beyond any current planning?



--

Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark

Teodor Sigaev [EMAIL PROTECTED] writes:

 I'm afraid that we have some lack of understanding. Flow of operation with
 indexed tuple in gist is:
 - read tuple
 - get n-th attribute with the help of index_getattr
 - call user-defined decompress method which should, at least, detoast value
 - result value is passed to other user-defined method

So when does index_form_tuple get called?

 So, index_form_tuple should toast value, but value is already compressed and
 live in memory. Detoasting of value should be done by decompress method and
 live in memory, and so, only after that value can be passed to other
 user-defined method.

Does every data type define a compress/decompress method? Even if it's not a
data type that normally gets very large?

 As I understand, packing/unpacking the varlena header is done during
 toasting/detoasting. So, I'm not understanding the problem here.

Well we cheated a bit and had heap/index_form_tuple convert the data to packed
format. This saves having to push small tuples through the toaster. So now
values can magically become toasted as soon as they go into a tuple even if
they never get pushed through the toaster. 

 There may be places that assume they won't leak detoasted copies of datums. 
 If
 you could help point those places out they should just need PG_FREE_IF_COPY()

 GiST code works in separate memory context to prevent memory leaks. See
 gistinsert/gistbuildCallback/gistfindnext.

So it's perfectly safe to just use DatumGetType and PG_GETARG_TYPE instead of
using DatumGetPointer and PG_GETARG_POINTER and having to manually cast
everywhere, no? It seems like there's a lot of extra pain to maintain the code
in the present style with all the manual casts.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread David Fetter
On Mon, Mar 05, 2007 at 09:04:46PM -0600, Jim Nasby wrote:
 FWIW, this sounds like a subset of the Query By Example stuff that  
 someone is working on. I don't have a URL handy since I'm on a plane,  
 but I think google can find it.

It's now called ObelisQ http://pgfoundry.org/projects/qbe

Cheers,
D
 
 On Mar 3, 2007, at 8:12 AM, ranbeer makin wrote:
 
 
 Here is a description of what the SKYLINE operator is:
 ---
 Suppose you wish to purchase books and you are looking for books  
 with high rating and low price. However, both the criteria of  
 selecting books are complementary since books of higher rating are  
 generally more expensive. For finding such books, you'll query the  
 database system of the book store which will return a set of  
 interesting books. The word 'interesting' implies all the books  
 which are as good or better in both the dimensions (rating and  
 price) and better in at least one dimension. This set of  
 interesting points forms the Skyline.
 Skyline operator finds points which are not dominated by other data  
 points. A point dominates another point if it is as good or better  
 in all dimensions and better in at least one dimension.
 
 For specifying the Skyline queries, we extend SQL SELECT statement  
 by an optional SKYLINE OF clause as given below:
 
 SELECT ... FROM ... WHERE...
 
 GROUP BY ... HAVING...
 
 SKYLINE OF [DISTINCT] d1 [MIN | MAX | DIFF],  .., dm [MIN | MAX |  
 DIFF]
 
 ORDER BY...
 
 
 Where d1, d2, …, dm denote the dimensions of the Skyline, and MIN,  
 MAX, DIFF specify whether the value in that dimension should be  
 minimized, maximized, or simply be different. When DIFF is  
 specified, two tuples are compared only if the value of the  
 attribute on which DIFF is applied is different.
 
 When DISTINCT clause is specified and if there are two or more  
 tuples with the same values of skyline attributes, then only one of  
 them is retained in the skyline set. Otherwise, all of them are  
 retained.
 
 Let's consider the above example of purchasing books with high  
 rating and low price.
 
 
 Book Name           Rating (out of 5)   Price (Rs)
 Prodigal Daughter   3                   250
 The city of Joy     5                   400
 Vanishing Acts      2                   250
 The Notebook        4                   300
 Fountain Head       5                   350
 Dear John           5                   500
 
 Table1. Sample of book database
 
 
 Now, in order to get books with high rating and low price, you  
 simply can issue the following query:
 
 SELECT *
 
 FROM Books
 
 SKYLINE OF rating MAX, price MIN;
 
 
 The Skyline set returned will be:
 
 
 Book Name           Rating (out of 5)   Price (Rs)
 Prodigal Daughter   3                   250
 The Notebook        4                   300
 Fountain Head       5                   350
 
 Table2. Skyline set
 
 
 From this set, you can now make your choice of books, by weighing  
 your personal preferences for price and rating of the books.
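The dominance test and the skyline of the sample data can be reproduced in a few lines (a plain nested-loop sketch, not the block-nested-loops algorithm from the Borzsonyi et al. paper):

```python
books = [("Prodigal Daughter", 3, 250), ("The city of Joy", 5, 400),
         ("Vanishing Acts", 2, 250), ("The Notebook", 4, 300),
         ("Fountain Head", 5, 350), ("Dear John", 5, 500)]

def dominates(a, b):
    # a dominates b: rating as good or better (MAX), price as good or
    # better (MIN), and strictly better in at least one dimension
    return (a[1] >= b[1] and a[2] <= b[2] and
            (a[1] > b[1] or a[2] < b[2]))

skyline = [b for b in books
           if not any(dominates(o, b) for o in books)]
print([name for name, _, _ in skyline])
# ['Prodigal Daughter', 'The Notebook', 'Fountain Head']
```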
 
 For more information, you can refer to:
 S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In  
 ICDE, pages 421.430, 2001
 
 ---
 
 Thanks.
 
 
 
 On 3/3/07, Martijn van Oosterhout kleptog@svana.org wrote: On  
 Sat, Mar 03, 2007 at 07:02:41PM +0530, ranbeer makin wrote:
  We at International Institute of Information Technology (IIIT)  
 Hyderabad,
  India, have extended the Postgres database
  system with the skyline operation. For this work, we were guided  
 by our
  Prof. Kamalakar Karlapalem
  (http://www.iiit.ac.in/~kamal/).
 
 snip
 
  Can this piece of work contribute to PostgreSQL? If yes, then  
 we'll send out
  a detailed report of this project including changes
  made, issues involved/need to be solved, limitations, future  
 work, and the
  source code etc.
 
 Well, that kind of depends. I have no idea what Skyline means so
 telling us what it is would be a good start
 
 Have a nice day,
 --
 Martijn van Oosterhout   kleptog@svana.org   http://svana.org/ 
 kleptog/
  From each according to his ability. To each according to his  
 ability to litigate.
 
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.1 (GNU/Linux)
 
 iD8DBQFF6XrkIB7bNG8LQkwRAqw8AJ0UKAy41OMxdgLUdY1G+e7R6/jGPwCZAQY4
 9uCKFUW65UBIx7fpogR75Yo=
 =6Yc0
 -END PGP SIGNATURE-
 
 
 
 --
 Jim Nasby[EMAIL PROTECTED]
 EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
 
 
 
 ---(end of broadcast)---
 TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if 

Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark

Gregory Stark [EMAIL PROTECTED] writes:

 Andrew - Supernews [EMAIL PROTECTED] writes:

 So I think you've mis-analyzed the problem. That's especially true since
 you are claiming that the existing code is already buggy when in fact no
 such bugs have been reported (and clearly intarray has been running with
 toasted array values for years).

 I'm not claiming, I'm asking, because I can't tell. 

 And it's not clear _int_gist.c has been running with toasted array values for
 years because it's limited to arrays of 100 integers (or perhaps 200 integers,
 there's a factor of 2 in the test). That's not enough to trigger toasting
 unless there are other large columns in the same table.

Actually, I just realized that other large columns in the table would be
irrelevant. It's not whether the value is toasted in the table that matters,
only whether it gets compressed by index_form_tuple. And it can't be, since
400 bytes isn't large enough to trigger compression -- unless someone's using
multi-column intarray GiST indexes with very large arrays, which I'm not
convinced anyone is.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Synchronized Scan update

2007-03-06 Thread Josh Berkus
JD,

 Don't get me wrong, I want things to be easily understandable as well
 but the reason you cite above pretty much
 makes us need to remove most of the postgresql.conf, including all
 bgwriter, vacuum cost delay, and autovac settings.
 Not to mention commit delay and others ;).

Wouldn't that be nice!

The explosion of GUC settings is primarily a result of not enough information.  
The reason there are 7 bgwriter settings, for example, is that we have no 
idea what those settings should be and are hoping that people will tinker 
with them and tell us.  Someday when I can fully profile bgwriter, we'll just 
have one setting: bgwriter_aggressive, set to a number between 0 and 9.  

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Teodor Sigaev

So when does index_form_tuple get called?


The single call of index_form_tuple in the GiST core is in gistFormTuple, 
which first compresses any indexed value with the help of its compress method.


Only tuples formed by gistFormTuple can be inserted into the index.


Does every data type define a compress/decompress method? Even if it's not a
data type that normally gets very large?


Yes, every GiST opclass must have such methods. In the trivial case the 
compress method just returns the input value. As I remember, only the R-Tree 
emulation over boxes, contrib/seg and contrib/cube have such a trivial 
compress method.




Well we cheated a bit and had heap/index_form_tuple convert the data to packed
format. This saves having to push small tuples through the toaster. So now
tuples can magically become toasted as soon as they go into a tuple even if
they never get pushed through the toaster. 
OK, it should be safe for GiST except for a possible memory management issue: 
index_form_tuple in GiST works in GiST's memory context, which is short-lived. 
Is that a possible issue for your patch? BTW, this applies to GIN too.




So it's perfectly safe to just use DatumGetType and PG_GETARG_TYPE instead of
using DatumGetPointer and PG_GETARG_POINTER and having to manually cast
everywhere, no? It seems like there's a lot of extra pain to maintain the code
in the present style with all the manual casts.

Of course, I agree. PG_FREE_IF_COPY is just an extra call in the support methods.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces

2007-03-06 Thread Mike Rylander

On 3/6/07, Peter Eisentraut [EMAIL PROTECTED] wrote:

Mike Rylander wrote:
 The patch adds support for default XML namespaces in xml2 by providing
 a mechanism for supplying a prefix to a named namespace URI.

How does it support multiple namespaces in one document?


It supports one default (unprefixed) namespace URI per document, which
ISTM is the overwhelmingly common case (and the itch that I must
scratch).

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: [HACKERS] [PATCHES] xml2 contrib patch supporting default XML namespaces

2007-03-06 Thread Mike Rylander

On 3/6/07, Nikolay Samokhvalov [EMAIL PROTECTED] wrote:

On 3/6/07, Mike Rylander [EMAIL PROTECTED] wrote:
 Attached you'll find a patch that I've been kicking around for a
 while that I'd like to propose for inclusion in 8.3.  I attempted to
 submit this through the original xml2 author (as far back as the  7.4
 days) but got no response.

 It's really fairly trivial, but I will be using the features it
 provides in production soon, so I'd like to see it applied against the
 contrib xml2 module.  The patch adds support for default XML
 namespaces in xml2 by providing a mechanism for supplying a prefix to
 a named namespace URI.  It then wraps the namespace-capable functions
 in backward-compatible equivalents so that old code will not break.

1) And what about non-default namespaces?


I'm not sure I understand.  If the namespace already has a prefix then
it works fine.  This patch simply gives a known non-prefixed namespace
URI a prefix so one can write XPath that looks like

 //marc:datafield[@tag='245']/marc:subfield[@code='a']

instead of

 //*[local-name()='datafield' and
@tag='245']/*[local-name()='subfield' and @code='a']

A little two-node example is painful enough; now imagine a non-trivial
example with backtracking conditionals... :P


2) What if my XPath query has different prefix, that also should be
mapped to the same URI? (Not frequent case, but this really can occur
-- e.g. XML doc has prefix 'local' for URI='http://127.0.0.1', but
XPath should have 'loc' for the same URI.)



Both prefixes work fine as multiple prefixes can map to the same URI.


--
Best regards,
Nikolay




--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Teodor Sigaev

And it's not clear _int_gist.c has been running with toasted array values for
years because it's limited to arrays of 100 integers (or perhaps 200 integers,
there's a factor of 2 in the test). That's not enough to trigger toasting
unless there are other large columns in the same table.
That was an intended limitation, to prevent indexing of huge arrays. The 
gist__int_ops compression method is oriented toward small arrays and isn't 
effective on big ones.




I do know that with packed varlenas I get a crash in g_int_union among other
places. I can't tell where the datum came from originally and how it ended up
stored in packed format.

Can you provide your patch (in current state) and test suite? Or backtrace at 
least.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark


 Does every data type define a compress/decompress method? Even if it's not a
 data type that normally gets very large?

 Yes, any GiST opclass should have such methods. In trivial case it just 
 returns
 input value. As I remember, only R-Tree emulation over boxes, contrib/seg and
 contrib/cube have simple compress method.

Hm, if they just return the original datum without detoasting it then it could
be an issue. I'll check.

 Well we cheated a bit and had heap/index_form_tuple convert the data to 
 packed
 format. This saves having to push small tuples through the toaster. So now
 tuples can magically become toasted as soon as they go into a tuple even if
 they never get pushed through the toaster. 

 Ok, it should be safe for GiST except some possible memory management issue.
 index_form_tuple in GiST works in GiST's memory context which is short-live. 
 Is
 it possible issue for your patch? BTW, that's connected to GIN too.

index_form_tuple doesn't leak memory. The packed varlena format just has a
shorter header, so it can store the header and then copy the data to the new
location. It doesn't have to create a copy of the data (except in the tuple,
obviously).

But it means index_getattr can return a toasted tuple. I see several calls to
index_getattr that immediately put the datum into a GISTENTRY and call support
functions like the union function. For example gistMakeUnionItVec does this.

 So it's perfectly safe to just use DatumGetType and PG_GETARG_TYPE instead of
 using DatumGetPointer and PG_GETARG_POINTER and having to manually cast
 everywhere, no? It seems like there's a lot of extra pain to maintain the 
 code
 in the present style with all the manual casts.

 Of course, I agree. Just PG_FREE_IF_COPY is extra call in support methods.

Well if you're doing everything in short-lived memory contexts then we don't
even need this. btree procedures do it because the btree code expects to be
able to do comparisons without having to set up short-lived memory contexts.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Chris Browne
[EMAIL PROTECTED] (Jim Nasby) writes:
 FWIW, this sounds like a subset of the Query By Example stuff that
 someone is working on. I don't have a URL handy since I'm on a plane,
 but I think google can find it.

The pgFoundry project is here...
   http://pgfoundry.org/projects/qbe

And yes, indeed, this sounds quite a lot like what Meredith Patterson
presented at the Toronto conference.
-- 
(reverse (concatenate 'string ofni.secnanifxunil @ enworbbc))
http://linuxfinances.info/info/finances.html
Rules of  the Evil Overlord #189. I  will never tell the  hero Yes I
was the one who  did it, but you'll never be able  to prove it to that
incompetent  old fool.  Chances  are, that  incompetent  old fool  is
standing behind the curtain.  http://www.eviloverlord.com/



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Chris Browne
[EMAIL PROTECTED] (Peter Eisentraut) writes:
 NikhilS wrote:
 iv) Based on the PRIMARY, UNIQUE, REFERENCES information specified,
 pass it on to the children tables.

 How will you maintain a primary key in such a table, considering that 
 indexes can't span multiple tables?

On the one hand, I seem to recall seeing multiple-table-spanning
indices being on the TODO list.

On the other, it may be that this sort of partitioning is only usable
for scenarios where it is only needful to maintain uniqueness on a
partition-by-partition basis.

 1) Whether we should use triggers/rules for step number (iii) above.
 Maybe rules is the way to go.

 Since this would basically be a case of the updatable rules problem, you 
 should review those discussions in the past to check whether the issues 
 mentioned there don't interfere with that plan.

 2) What other attributes (access permissions e.g.) of the master
 along with the ones specified in (iv) should be passed on to the
 children.

 Moreover, how are later changes of those attributes propagated?

I hear rumour of there being a more comprehensive proposal on this in
the works...
-- 
(reverse (concatenate 'string ofni.secnanifxunil @ enworbbc))
http://linuxfinances.info/info/finances.html
Rules of  the Evil Overlord #189. I  will never tell the  hero Yes I
was the one who  did it, but you'll never be able  to prove it to that
incompetent  old fool.  Chances  are, that  incompetent  old fool  is
standing behind the curtain.  http://www.eviloverlord.com/



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Teodor Sigaev

input value. As I remember, only R-Tree emulation over boxes, contrib/seg and
contrib/cube have simple compress method.

Hm, if they just return the original datum without detoasting it then it could
be an issue. I'll check.

seg and box aren't varlena types, but cube is, and it seems broken :(.
g_cube_decompress and g_cube_compress don't detoast values. I'll fix that.


index_form_tuple doesn't leak memory. packed varlena format just has a shorter
header so it can store the header and then copy the data to the new location.
It doesn't have to create a copy of the data (except in the tuple, obviously).

Nice, now that's clear.


But it means index_getattr can return a toasted tuple. I see several calls to
index_getattr that immediately put the datum into a GISTENTRY and call support
functions like the union function. For example gistMakeUnionItVec does this.

From gistMakeUnionItVec:

datum = index_getattr(itvec[j], i + 1, giststate->tupdesc, &IsNull);
if (IsNull)
    continue;
gistdentryinit(giststate, i, evec->vector + evec->n, datum, ...);

gistdentryinit calls the user-defined decompress method.

The reason for the confusion is that there are three similar functions/macros:
gistentryinit, gistcentryinit and gistdentryinit :) Those names were chosen by 
the authors who initially developed GiST in pgsql.




Well if you're doing everything in short-lived memory contexts then we don't
even need this. 

Sure

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



[HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread Tom Lane
[ cc'd to pgsql-jdbc which seems the group most likely to be affected
  by any protocol change ]

So I've been working on a plan cache module per my earlier proposal,
and I've run up against a problem with getting exec_parse_message
to use it.  The problem is that the current rather hackish handling
of unnamed prepared statements doesn't fit in.  Per the documentation,
unnamed statements are supposed to be optimized for the case of
executing a query only once and then discarding it.  In the current
code this largely just means that we avoid copying the parse/plan trees
into the normal PreparedStatement cache, preferring to let them sit
in the context where they were generated (which means that any detritus
generated by the parser/planner can't be recovered until we discard the
unnamed statement, but that seems a good tradeoff in this situation).

To use the plan cache for unnamed statements, there's going to have to
be more overhead (more tree-copying) in this code path; moreover having
the unnamed statement's plan in the cache will result in distributed
overhead for checking it to see if it's still valid.  This overhead is
largely going to be wasted if the statement is always discarded
immediately after use.

I can think of several options for dealing with this:

A. Just accept the extra overhead, thereby preserving the current
behavior of unnamed statements, and gaining the benefit that plan
invalidation will work correctly in the few cases where an unnamed
statement's plan lasts long enough to need replanning.

B. Don't store the unnamed statement in the plan cache.  To make sure
it's not used anymore when the plan might be stale, forcibly discard
the unnamed statement after execution.  This would get rid of a lot
of overhead but would mean a significant change in the protocol-level
behavior.  It's hard to guess how many clients might be broken by it
--- conceivably not any, but that seems too optimistic :-(

C. Don't store the unnamed statement in the plan cache.  To make sure
it's not used anymore when the plan might be stale, don't analyze or
plan at Parse-message time, but postpone *all* that work until Bind;
and always discard the plan after Execute.  We could still do raw
parsing at Parse time, since that's independent of database state,
but all but the most trivial syntactic errors would now occur at Bind
not Parse time, as well as the majority of the time expenditure.  This
still amounts to a change in the protocol semantics, although it's a
lot more subtle than plan B.  Also there's a problem if the client
does Describe Statement before Bind: we still have to run parse analysis
before we can answer, and if we then throw that away, we have no very
good way to guarantee that the statement still has the same description
when it's subsequently executed; plus we end up doing parse analysis
twice.

D. Don't store the unnamed statement in the plan cache, and just ignore
the possibility that its plan might become stale before use.  That's
exactly what happens now, but considering that the whole point of the
plan inval work is to seal off such pitfalls, I can't say that I care
for this alternative.

Comments?  I'm leaning to plan A but wanted to see if anyone would
support plan B or sees a way to fix plan C.

regards, tom lane



Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Josh Berkus

 And yes, indeed, this sounds quite a lot like what Meredith Patterson
 presented at the Toronto conference.

This would be good to have, though, since Meredith's work has some problematic 
IP encumbrances.

Question, though: is the SKYLINE syntax part of a standard anywhere?

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Martijn van Oosterhout
On Wed, Mar 07, 2007 at 02:43:39AM +1030, Shane Ambler wrote:
 I think a way can be devised to maintain the primary key and unique 
 constraints.
 If a search is done on the parent table, the planner knows to rewrite 
 the query as a union (or union all) of all child tables that relate to 
 the where clause, or all child tables if the where clause is not on the 
 column/s used to partition, then this concept should be able to be 
 converted to indexes as well, so that when a primary or unique index 
 from a child table is inserted to, then each of the related child 
 indexes is consulted to ensure uniqueness.

But that's where it breaks down: you not only need to check that the
row you're inserting is unique, you need to make sure that other people
trying to insert the same value see it.

After all, a unique index can contain multiple instances of any
particular value; it's just that the DB has verified that none of these
are visible to the same transaction. If there is a collision, a lock is
taken that waits until the colliding insert commits or aborts.

With b-tree it's easy, every value can only appear in one place, so
it's a very simple lock. The reason why GiST doesn't support unique
indexes is due to the fact that any particular value could appear in
many places, so to stop concurrent inserts you need to lock *every*
page in the tree that another backend might want to insert the record.

I fear this will also be a problem for indexes over multiple tables: you
will need to lock the page in every index into which some other backend
might want to insert the value. All while avoiding deadlocks.

I think the problems are related. If you can solve it for multiple
tables, you can solve it for GiST indexes also.

 This would only apply if the partitioning was not done by the primary or 
 unique column.

That's the easy case, no problem there.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-06 Thread Tom Lane
Gaetano Mendola [EMAIL PROTECTED] writes:
 [ 8.2 evaluates volatile functions in the targetlist of a view ]
 If I mark the function as STABLE or IMMUTABLE then even with version
 8.2 the function is not evaluated. Is this the intended behavior?

Yes; people complained that we needed to be more careful about the
number of times volatile functions get evaluated.

 I didn't see something about it in the release note.

  <listitem>
   <para>
    Do not flatten subqueries that contain <literal>volatile</>
    functions in their target lists (Jaime Casanova)
   </para>

   <para>
    This prevents surprising behavior due to multiple evaluation
    of a <literal>volatile</> function (such as <function>random()</>
    or <function>nextval()</>).  It might cause performance
    degradation in the presence of functions that are unnecessarily
    marked as <literal>volatile</>.
   </para>
  </listitem>

regards, tom lane



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark

Teodor Sigaev [EMAIL PROTECTED] writes:

 And it's not clear _int_gist.c has been running with toasted array values for
 years because it's limited to arrays of 100 integers (or perhaps 200 
 integers,
 there's a factor of 2 in the test). That's not enough to trigger toasting
 unless there are other large columns in the same table.

 That's was intended limitation to prevent indexing of huge arrays.
 gist__int_ops compression method is orientated for small and isn't effective 
 on
 big ones.

Right, so it's possible nobody sees any toasted arrays with _int_gist.c since
they never get very large. It looks like index_form_tuple will never compress
anything under 512 bytes, so I guess it's safe currently.

 I do know that with packed varlenas I get a crash in g_int_union among other
 places. I can't tell where the datum came from originally and how it ended up
 stored in packed format.
 Can you provide your patch (in current state) and test suite? Or backtrace at 
 least.

It doesn't actually crash, it just fails CHECKARRVALID. I added an assertion
in there to cause it to generate a core dump.


You can download the core dump and binary from 

 http://community.enterprisedb.com/varlena/core._int
 http://community.enterprisedb.com/varlena/postgres._int

The last patch (without the assertion) is at:

 http://community.enterprisedb.com/varlena/patch-varvarlena-14.patch.gz


What I'm seeing is this:

(gdb) f 3
#3  0xb7fd924b in inner_int_union (a=0x84e41f4, b=0xb64220d0)
at _int_tool.c:81
81  CHECKARRVALID(b);

The array is actually garbage:

(gdb) p *b
$2 = {vl_len_ = 141, ndim = 0, dataoffset = 5888, elemtype = 0}

What's going on is that the va_1byte header is 141 which is 0x80 | 13. So it's
actually only 13 bytes with a 1 byte header or a 12 byte array:

(gdb) p *(varattrib*)b
$3 = {va_1byte = {va_header = 141 '\215', va_data = }, va_external = {
va_header = 141 '\215', va_padding = \000\000, va_rawsize = 0, 
va_extsize = 5888, va_valueid = 0, va_toastrelid = 0}, va_compressed = {
va_header = 141, va_rawsize = 0, va_data = }, va_4byte = {
va_header = 141, va_data = }}

(gdb) bt
#0  0xb7e6a947 in raise () from /lib/tls/libc.so.6
#1  0xb7e6c0c9 in abort () from /lib/tls/libc.so.6
#2  0x082fec97 in ExceptionalCondition (
    conditionName=0xb7fdd3b9 "!(!((b)->dataoffset != 0))", 
    errorType=0xb7fdd371 "FailedAssertion", 
    fileName=0xb7fdd347 "_int_tool.c", lineNumber=81) at assert.c:51
#3  0xb7fd924b in inner_int_union (a=0x84e41f4, b=0xb64220d0)
at _int_tool.c:81
#4  0xb7fd547d in g_int_picksplit (fcinfo=0xbf9e43f0) at _int_gist.c:403
#5  0x08304d9c in FunctionCall2 (flinfo=0xbf9e5a94, arg1=139342312, 
arg2=3214821160) at fmgr.c:1154
#6  0x08094fd3 in gistUserPicksplit (r=0xb6078d4c, entryvec=0x84e31e8, 
attno=0, v=0xbf9e4728, itup=0x84e2ddc, len=142, giststate=0xbf9e4b94)
at gistsplit.c:306
#7  0x08095deb in gistSplitByKey (r=0xb6078d4c, page=0xb6420220 , 
itup=0x84e2ddc, len=142, giststate=0xbf9e4b94, v=0xbf9e4728, 
entryvec=0x84e31e8, attno=0) at gistsplit.c:548
#8  0x080874bd in gistSplit (r=0xb6078d4c, page=0xb6420220 , 
itup=0x84e2ddc, len=142, giststate=0xbf9e4b94) at gist.c:943
#9  0x080850fa in gistplacetopage (state=0xbf9e49e0, giststate=0xbf9e4b94)
at gist.c:329
#10 0x080871eb in gistmakedeal (state=0xbf9e49e0, giststate=0xbf9e4b94)
at gist.c:873
#11 0x08084f21 in gistdoinsert (r=0xb6078d4c, itup=0x84e2ce4, freespace=819, 
giststate=0xbf9e4b94) at gist.c:278
#12 0x08084cf5 in gistbuildCallback (index=0xb6078d4c, htup=0x84c8c30, 
values=0xbf9e4a98, isnull=0xbf9e4a78 , tupleIsAlive=1 '\001', 
state=0xbf9e4b94) at gist.c:201
#13 0x080fc81f in IndexBuildHeapScan (heapRelation=0xb60d6860, 
indexRelation=0xb6078d4c, indexInfo=0x84cd620, 
callback=0x8084c27 gistbuildCallback, callback_state=0xbf9e4b94)
at index.c:1548
#14 0x08084bdd in gistbuild (fcinfo=0xbf9e60e8) at gist.c:150
#15 0x08305630 in OidFunctionCall3 (functionId=782, arg1=3054332000, 
arg2=3053948236, arg3=139253280) at fmgr.c:1460
#16 0x080fc363 in index_build (heapRelation=0xb60d6860, 
indexRelation=0xb6078d4c, indexInfo=0x84cd620, isprimary=0 '\0')
at index.c:1296
#17 0x080fb86a in index_create (heapRelationId=21361, 
indexRelationName=0x84a531c text_idx, indexRelationId=21366, 
indexInfo=0x84cd620, accessMethodObjectId=783, tableSpaceId=0, 
classObjectId=0x84cd60c, coloptions=0x84cd6ac, reloptions=0, 
isprimary=0 '\0', isconstraint=0 '\0', allow_system_table_mods=0 '\0', 
skip_build=0 '\0', concurrent=0 '\0') at index.c:794
#18 0x0815f3e4 in DefineIndex (heapRelation=0x84a5354, 
indexRelationName=0x84a531c text_idx, indexRelationId=0, 
accessMethodName=0x84a5380 gist, tableSpaceName=0x0, 
attributeList=0x84a5448, predicate=0x0, rangetable=0x0, options=0x0, 
unique=0 '\0', primary=0 '\0', isconstraint=0 '\0', 
is_alter_table=0 '\0', check_rights=1 '\001', skip_build=0 '\0', 
quiet=0 '\0', concurrent=0 

Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
 On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
  Another approach I proposed back in December is to not have a  
  variable like that at all, but scan the buffer cache for pages  
  belonging to the table you're scanning to initialize the scan.  
  Scanning all the BufferDescs is a fairly CPU and lock heavy  
  operation, but it might be ok given that we're talking about large  
  I/O bound sequential scans. It would require no DBA tuning and  
  would work more robustly in varying conditions. I'm not sure where  
  you would continue after scanning the in-cache pages. At the  
  highest in-cache block number, perhaps.
 
 If there was some way to do that, it'd be what I'd vote for.
 

I still don't know how to make this take advantage of the OS buffer
cache. 

However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?

 Given the partitioning of the buffer lock that Tom did it might not  
 be that horrible for many cases, either, since you'd only need to  
 scan through one partition.
 
 We also don't need an exact count, either. Perhaps there's some way  
 we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
Jeff Davis




Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 If I were to implement this idea, I think Heikki's bitmap of pages
 already read is the way to go.

I think that's a good way to guarantee that you'll not finish in time
for 8.3.  Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win.  (Pages in
cache will be picked up by a seqscan already.)

regards, tom lane



Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Question, though: is the SKYLINE syntax part of a standard anywhere?

There's certainly not anything like that in SQL2003.

I'm also kind of wondering if the main use-cases couldn't be met with
suitable multi-input custom aggregates, which is something we already
have as of 8.2.

regards, tom lane



Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 B. Don't store the unnamed statement in the plan cache.  To make sure
 it's not used anymore when the plan might be stale, forcibly discard
 the unnamed statement after execution.  This would get rid of a lot
 of overhead but would mean a significant change in the protocol-level
 behavior.  It's hard to guess how many clients might be broken by it
 --- conceivably not any, but that seems too optimistic :-(

Can we forcibly discard it if *any* messages are received that might
invalidate a plan? So basically it would work fine unless anyone in the system
does any DDL at all? I guess that has the downside of introducing random
unpredictable failures.

Or stash the query string and replan it (possibly in the query cache this
time) if someone executes it a second time?

Can't say I like either of those options much, just trying to brainstorm.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:
 Jeff Davis [EMAIL PROTECTED] writes:
  If I were to implement this idea, I think Heikki's bitmap of pages
  already read is the way to go.
 
 I think that's a good way to guarantee that you'll not finish in time
 for 8.3.  Heikki's idea is just at the handwaving stage at this point,
 and I'm not even convinced that it will offer any win.  (Pages in
 cache will be picked up by a seqscan already.)
 

I agree that it's a good idea to stick with the current implementation,
which is, as far as I can see, meeting all of my performance goals.

Regards,
Jeff Davis






Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Can we forcibly discard it if *any* messages are received that might
 invalidate a plan? So basically it would work fine unless anyone in the system
 does any DDL at all? I guess that has the downside of introducing random
 unpredictable failures.

Ugh :-(

 Or stash the query string and replan it (possibly in the query cache this
 time) if someone executes it a second time?

I think that's either my plan A or C.

The main problem with uncontrolled replanning is that there's no way to
detect a change in the query properties.  For example suppose the query
is SELECT * FROM foo and we've already told the client (via Describe
Statement) that that returns two integer columns.  If an inval now
arrives because of ALTER TABLE foo ADD COLUMN (or perhaps worse, ALTER
COLUMN TYPE), we've got a problem.  If we just blindly replan then we'll
return tuples that do not match the previously given row description,
which will certainly break most clients.

The plan caching module has enough infrastructure to detect and complain
about these sorts of situations, and it also knows how to manage lock
acquisition so that once we've decided a plan is still good, the tables
won't change underneath us while we use the plan.  I don't see any way
to make comparable guarantees without the overhead that goes with the
cache manager.

regards, tom lane



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Florian G. Pflug

Martijn van Oosterhout wrote:

On Wed, Mar 07, 2007 at 02:43:39AM +1030, Shane Ambler wrote:
I think a way can be devised to maintain the primary key and unique 
constraints.
If a search is done on the parent table, the planner knows to rewrite 
the query as a union (or union all) of all child tables that relate to 
the where clause, or all child tables if the where clause is not on the 
column/s used to partition, then this concept should be able to be 
converted to indexes as well, so that when a primary or unique index 
from a child table is inserted to, then each of the related child 
indexes is consulted to ensure uniqueness.


But that's where it breaks down: you not only need to check that the
row you're inserting is unique, you need to make sure that other people
trying to insert the same value see it.


This sounds like what is really needed is a way to lock a certain
condition, namely the existance or non-existance of a record with
certain values in certain fields. This would not only help this case,
it would also help RI triggers, because those wouldn't have to acquire
a share lock on the referenced rows anymore.

As you pointed out, this would also make unique GiST indices possible

No real idea how to do this, though :-(
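
At most a simple-minded sketch comes to mind (purely hypothetical, all
names invented, not PostgreSQL code): hash the (relation, column, value)
triple of the condition into a fixed table of lock slots, so two backends
asserting the same condition serialize on the same slot while unrelated
conditions rarely collide.

```c
/* Purely hypothetical sketch: approximate "condition locks" by hashing
 * the (relation, column, value) triple into a fixed table of lock slots.
 * All names are invented for illustration. */
#include <stdint.h>
#include <string.h>

#define CONDITION_LOCK_SLOTS 1024

/* FNV-1a over arbitrary bytes */
static uint32_t hash_bytes(const void *p, size_t len)
{
    const unsigned char *s = p;
    uint32_t h = 2166136261u;

    while (len--)
    {
        h ^= *s++;
        h *= 16777619u;
    }
    return h;
}

/* Which slot guards "a row with this value in this column of this table"? */
uint32_t condition_lock_slot(uint32_t reloid, uint16_t attnum, const char *value)
{
    uint32_t h = hash_bytes(&reloid, sizeof reloid);

    h = h * 31 + hash_bytes(&attnum, sizeof attnum);
    h = h * 31 + hash_bytes(value, strlen(value));
    return h % CONDITION_LOCK_SLOTS;
}
```

False sharing of a slot only costs extra waiting, never a missed
conflict, so correctness would not depend on the table size.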

greetings, Florian Pfluge




Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Heikki Linnakangas

Jeff Davis wrote:

On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:

On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
Another approach I proposed back in December is to not have a  
variable like that at all, but scan the buffer cache for pages  
belonging to the table you're scanning to initialize the scan.  
Scanning all the BufferDescs is a fairly CPU and lock heavy  
operation, but it might be ok given that we're talking about large  
I/O bound sequential scans. It would require no DBA tuning and  
would work more robustly in varying conditions. I'm not sure where  
you would continue after scanning the in-cache pages. At the  
highest in-cache block number, perhaps.

If there was some way to do that, it'd be what I'd vote for.



I still don't know how to make this take advantage of the OS buffer
cache. 


Yep, I don't see any way to do that. I think we could live with that, 
though. If we went with the sync_scan_offset approach, you'd have to 
leave a lot of safety margin in that as well.



However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?


You could take a look at BufferSync, for example. It walks through the 
buffer cache, syncing all dirty buffers.


FWIW, I've attached a function I wrote some time ago when I was playing 
with the same idea for vacuums. A call to the new function loops through 
the buffer cache and returns the next buffer that belongs to a certain 
relation. I'm not sure that it's correct and safe, and there's not much 
comments, but should work if you want to play with it...
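
The "bitmap of pages already read" side of the idea could be kept per
scan with one bit per block. A hypothetical sketch (invented names, not
PostgreSQL code):

```c
/* Hypothetical sketch: per-scan bookkeeping for the "bitmap of pages
 * already read" idea -- one bit per block of the relation, set once the
 * scan has consumed that block. */
#include <stdlib.h>

typedef struct ScanBitmap
{
    unsigned nblocks;        /* number of blocks in the relation */
    unsigned char *bits;     /* one bit per block */
} ScanBitmap;

ScanBitmap *scan_bitmap_create(unsigned nblocks)
{
    ScanBitmap *b = malloc(sizeof(ScanBitmap));

    b->nblocks = nblocks;
    b->bits = calloc((nblocks + 7) / 8, 1);
    return b;
}

void scan_bitmap_mark(ScanBitmap *b, unsigned blk)
{
    b->bits[blk / 8] |= (unsigned char) (1u << (blk % 8));
}

int scan_bitmap_seen(const ScanBitmap *b, unsigned blk)
{
    return (b->bits[blk / 8] >> (blk % 8)) & 1;
}
```

A scan would mark and process whatever in-cache pages the buffer walk
returns, then sweep the remaining zero bits from disk; at 8 KB blocks
this costs only 16 KB of state per GB of table.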


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.214
diff -c -r1.214 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	5 Jan 2007 22:19:37 -	1.214
--- src/backend/storage/buffer/bufmgr.c	22 Jan 2007 16:38:37 -
***
*** 97,102 
--- 97,134 
  static void AtProcExit_Buffers(int code, Datum arg);
  
  
+ Buffer
+ ReadAnyBufferForRelation(Relation reln)
+ {
+ 	static int last_buf_id = 0;
+ 	int new_buf_id;
+ 	volatile BufferDesc *bufHdr;
+ 
+ 	/* Make sure we will have room to remember the buffer pin */
+ 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+ 
+ 	new_buf_id = last_buf_id;
+ 	do
+ 	{
+ 		if (++new_buf_id >= NBuffers)
+ 			new_buf_id = 0;
+ 
+ 		bufHdr = &BufferDescriptors[new_buf_id];
+ 		LockBufHdr(bufHdr);
+ 
+ 		if ((bufHdr->flags & BM_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, reln->rd_node))
+ 		{
+ 			PinBuffer_Locked(bufHdr);
+ 			last_buf_id = new_buf_id;
+ 			return BufferDescriptorGetBuffer(bufHdr);
+ 		}
+ 		UnlockBufHdr(bufHdr);
+ 	} while(new_buf_id != last_buf_id);
+ 	last_buf_id = new_buf_id;
+ 	return InvalidBuffer;
+ }
+ 
+ 
  /*
   * ReadBuffer -- returns a buffer containing the requested
   *		block of the requested relation.  If the blknum



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Peter Eisentraut
NikhilS wrote:
 We will not (I know it's a hard thing to do :) ); the intention is to
 use this information from the parent and make it a property of the
 child table. This will save the user the step of having to manually
 specify CREATE INDEX and the like on all the child tables
 one by one.

But when I say

CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ...

then I expect that the primary key will be enforced across all 
partitions.  We currently sidestep that issue by not offering seemingly 
transparent partitioning.  But if you are planning to offer that, the 
unique index issue needs to be solved, and I see nothing in your plan 
about that.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Josh Berkus
Florian,

 This sounds like what is really needed is a way to lock a certain
 condition, namely the existance or non-existance of a record with
 certain values in certain fields. This would not only help this case,
 it would also help RI triggers, because those wouldn't have to acquire
 a share lock on the referenced rows anymore.

That's called predicate locking and it's very, very hard to do.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Josh Berkus
Tom,

 I'm also kind of wondering if the main use-cases couldn't be met with
 suitable multi-input custom aggregates, which is something we already
 have as of 8.2.

Actually, given that skyline of is *only* for aggregate sorting (as far as I 
can tell) it doesn't present the complications which QBE did for using a 
function interface.   

Ranbeer, would it be possible to use an aggregate function syntax instead of 
the SKYLINE OF syntax extension?   This allows us to sidestep the issue of 
non-standard additions to the reserved word list.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 But when I say
 CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ...
 then I expect that the primary key will be enforced across all 
 partitions.  We currently sidestep that issue by not offering seemingly 
 transparent partitioning.  But if you are planning to offer that, the 
 unique index issue needs to be solved, and I see nothing in your plan 
 about that.

Agreed, it needs to Just Work.  I think it'd still be useful though
if we only support auto-partitioning on the primary key, and that
restriction avoids the indexing problem.

regards, tom lane



Re: [JDBC] [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread Dave Cramer
I think option C matches how the JDBC driver is written: we name the
statements if they have been used more than prepareThreshold times.

So we have a mechanism by which to allow statements to be cached, or
not.


Dave

On 6-Mar-07, at 1:14 PM, Tom Lane wrote:


Gregory Stark [EMAIL PROTECTED] writes:

Can we forcibly discard it if *any* messages are received that might
invalidate a plan? So basically it would work fine unless anyone  
in the system
does any DDL at all? I guess that has the downside of introducing  
random

unpredictable failures.


Ugh :-(

Or stash the query string and replan it (possibly in the query  
cache this

time) if someone executes it a second time?


I think that's either my plan A or C.

The main problem with uncontrolled replanning is that there's no  
way to
detect a change in the query properties.  For example suppose the  
query

is SELECT * FROM foo and we've already told the client (via Describe
Statement) that that returns two integer columns.  If an inval now
arrives because of ALTER TABLE foo ADD COLUMN (or perhaps worse,  
ALTER
COLUMN TYPE), we've got a problem.  If we just blindly replan then  
we'll

return tuples that do not match the previously given row description,
which will certainly break most clients.

The plan caching module has enough infrastructure to detect and  
complain

about these sorts of situations, and it also knows how to manage lock
acquisition so that once we've decided a plan is still good, the  
tables

won't change underneath us while we use the plan.  I don't see any way
to make comparable guarantees without the overhead that goes with the
cache manager.

regards, tom lane







Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Teodor Sigaev

It doesn't actually crash, it just fails CHECKARRVALID. I added an assertion
in there to cause it to generate a core dump.


Wow, good catch; see the attached patch.

g_int_decompress doesn't return a detoasted array when the array is 
empty. Previously this was safe because an empty array had never been 
toasted.


Should I commit it, or will you include it in your patch?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
*** ./contrib/intarray.orig/./_int_gist.c   Tue Mar  6 20:59:23 2007
--- ./contrib/intarray/./_int_gist.cTue Mar  6 21:41:54 2007
***
*** 232,238 
--- 232,247 
  
CHECKARRVALID(in);
if (ARRISVOID(in))
+   {
+   if (in != (ArrayType *) DatumGetPointer(entry->key)) {
+   retval = palloc(sizeof(GISTENTRY));
+   gistentryinit(*retval, PointerGetDatum(in),
+   entry->rel, entry->page, entry->offset, FALSE);
+   PG_RETURN_POINTER(retval);
+   }
+ 
PG_RETURN_POINTER(entry);
+   }
  
lenin = ARRNELEMS(in);
  



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Heikki Linnakangas

Tom Lane wrote:

Jeff Davis [EMAIL PROTECTED] writes:

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go.


I think that's a good way to guarantee that you'll not finish in time
for 8.3.  Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win.  (Pages in
cache will be picked up by a seqscan already.)


The scenario that I'm worried about is that you have a table that's 
slightly larger than RAM. If you issue many seqscans on that table, one 
at a time, every seqscan will have to read the whole table from disk, 
even though say 90% of it is in cache when the scan starts.


This can be alleviated by using a large enough sync_scan_offset, but a 
single setting like that is tricky to tune, especially if your workload 
is not completely constant. Tune it too low, and you don't get much 
benefit, tune it too high and your scans diverge and you lose all benefit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Josh Berkus
 Agreed, it needs to Just Work.  I think it'd still be useful though
 if we only support auto-partitioning on the primary key, and that
 restriction avoids the indexing problem.

+1

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread andyk

Tom Lane wrote:

Gregory Stark [EMAIL PROTECTED] writes:
  

Can we forcibly discard it if *any* messages are received that might
invalidate a plan? So basically it would work fine unless anyone in the system
does any DDL at all? I guess that has the downside of introducing random
unpredictable failures.



Ugh :-(

  

Or stash the query string and replan it (possibly in the query cache this
time) if someone executes it a second time?



I think that's either my plan A or C.

The main problem with uncontrolled replanning is that there's no way to
detect a change in the query properties.  For example suppose the query
is SELECT * FROM foo and we've already told the client (via Describe
Statement) that that returns two integer columns.  If an inval now
arrives because of ALTER TABLE foo ADD COLUMN (or perhaps worse, ALTER
COLUMN TYPE), we've got a problem.  If we just blindly replan then we'll
return tuples that do not match the previously given row description,
which will certainly break most clients.
  
It will always be an open question what the user expects as the result 
of 'SELECT * FROM...'. For example, a client may use ODBC or some other 
interface for DB communication. In the first step it retrieves 
information about the table and its datatypes; in the second it tries 
to fetch rows (using interface functions). The client application won't 
even guess that the table could have changed between these two steps. 
It's impossible to avoid such situations, because we can't know how the 
user retrieves information about the results he will expect.

The plan caching module has enough infrastructure to detect and complain
about these sorts of situations, and it also knows how to manage lock
acquisition so that once we've decided a plan is still good, the tables
won't change underneath us while we use the plan.  I don't see any way
to make comparable guarantees without the overhead that goes with the
cache manager.
  
It's a required overhead. The result should be valid at execution time, 
not at prepare time. The cache manager is the best place for this.

regards, tom lane






Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Alvaro Herrera
Josh Berkus wrote:
 Tom,
 
  I'm also kind of wondering if the main use-cases couldn't be met with
  suitable multi-input custom aggregates, which is something we already
  have as of 8.2.
 
 Actually, given that skyline of is *only* for aggregate sorting (as far as 
 I 
 can tell) it doesn't present the complications which QBE did for using a 
 function interface.   

There are people at a Venezuelan university working on SKYLINE OF and
other operators on Postgres.  I had some looks at their work because
they asked for help on the Spanish list.

Not only did they add the SKYLINE OF clause, but they also had some mods
to the ORDER BY clause, and a couple of other grammar changes as well.
While SKYLINE OF itself could probably be folded into aggregates, the
other stuff is not likely to be amenable to such treatment.

Also, keep in mind that there were plenty of changes in the executor.
This stuff is not likely to be very easy to implement efficiently using
our extant executor machinery; note that Ranbeer mentioned
implementation of block nested loop and other algorithms.  Not sure
how easy it would be to fold that stuff into the optimizer for multi-input
aggregates, instead of hardwiring it to the SKYLINE OF syntax.

There's a certain group in the Venezuelan Uni that was about to finish
their thesis.  They promised me a look into their report; maybe I can
give further input from that and maybe merge Ranbeer's stuff with it.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Andrew Dunstan

Tom Lane wrote:

Peter Eisentraut [EMAIL PROTECTED] writes:
  

But when I say
CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ...
then I expect that the primary key will be enforced across all 
partitions.  We currently sidestep that issue by not offering seemingly 
transparent partitioning.  But if you are planning to offer that, the 
unique index issue needs to be solved, and I see nothing in your plan 
about that.



Agreed, it needs to Just Work.  I think it'd still be useful though
if we only support auto-partitioning on the primary key, and that
restriction avoids the indexing problem.

  


Maybe. The most obvious use for automatic partitioning that I can think 
of would be based in the value of a timestamptz field rather than any 
PK. Of course I tend to work more in the OLTP field than in DW type 
apps, where other considerations might apply.


cheers

andrew



Re: [HACKERS] Arrays of Complex Types

2007-03-06 Thread David Fetter
On Fri, Mar 02, 2007 at 06:59:50PM -0500, Tom Lane wrote:
 David Fetter [EMAIL PROTECTED] writes:
  1.  In src/backend/commands/tablecmds.c, change DefineRelation as
  follows:
  * After the first call to heap_create_with_catalog, construct and
do another call to for the array type.
 
 I'm still not happy about the idea of doing this for every relation
 (and doing it for sequences and indexes would be the height of
 wastefulness).  How about we only do it for composite types?

How about doing it for user-defined tables, views and composite types,
and skipping ?

  * Add an appropriate pg_depend entry.
  2.  Change RemoveRelation to reflect the above.
 
 You only need one of those two: either you drop by hand or you let the
 dependency machinery deal with it.  Not both.

pg_depend it is, then :)

  Does the above make sense?  Have I missed anything critical?
 
 Ummm ... making it actually work?  Possibly that just falls out, but
 I'm not sure.
 
 If it turns out that it does Just Work, you might take a stab at
 arrays of domains too.

OK.

I noticed something in src/backend/commands/tablecmds.c which worries
me, namely that it ignores functions and views.  It should at least be
checking that the typeoid isn't in pg_proc.prorettype or
pg_proc.proargtypes, and if possible, the DECLARE section of pl/pgsql
functions.

Is there a way to do SQL at that place in the back-end, or is there
some different kind of Magick(TM) needed to access these kinds of
things at that level?

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate



Re: [HACKERS] Arrays of Complex Types

2007-03-06 Thread Tom Lane
David Fetter [EMAIL PROTECTED] writes:
 I noticed something in src/backend/commands/tablecmds.c which worries
 me, namely that it ignores functions and views.

What?

regards, tom lane



Re: [HACKERS] Arrays of Complex Types

2007-03-06 Thread David Fetter
On Tue, Mar 06, 2007 at 04:14:07PM -0500, Tom Lane wrote:
 David Fetter [EMAIL PROTECTED] writes:
  I noticed something in src/backend/commands/tablecmds.c which
  worries me, namely that it ignores functions and views.
 
 What?

The "it" in question is find_composite_type_dependencies().

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate



Re: [HACKERS] Arrays of Complex Types

2007-03-06 Thread Tom Lane
David Fetter [EMAIL PROTECTED] writes:
 On Tue, Mar 06, 2007 at 04:14:07PM -0500, Tom Lane wrote:
 David Fetter [EMAIL PROTECTED] writes:
 I noticed something in src/backend/commands/tablecmds.c which
 worries me, namely that it ignores functions and views.
 
 What?

 The "it" in question is find_composite_type_dependencies().

All that that's interested in is whether there are stored values of the
datatype somewhere.  Views don't have any storage, and a function
definition doesn't either.

regards, tom lane



Re: [HACKERS] Arrays of Complex Types

2007-03-06 Thread David Fetter
On Tue, Mar 06, 2007 at 04:24:36PM -0500, Tom Lane wrote:
 David Fetter [EMAIL PROTECTED] writes:
  On Tue, Mar 06, 2007 at 04:14:07PM -0500, Tom Lane wrote:
  David Fetter [EMAIL PROTECTED] writes:
  I noticed something in src/backend/commands/tablecmds.c which
  worries me, namely that it ignores functions and views.
  
  What?
 
  The "it" in question is find_composite_type_dependencies().
 
 All that that's interested in is whether there are stored values of the
 datatype somewhere.  Views don't have any storage, and a function
 definition doesn't either.

I see.  Perhaps I've misunderstood what this thing was for, then.
What is it that checks whether it's OK to change a composite type,
then?

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!
Consider donating to PostgreSQL: http://www.postgresql.org/about/donate



Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Joris Dobbelsteen
-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED] On Behalf Of Josh Berkus
Sent: dinsdag 6 maart 2007 19:45
To: pgsql-hackers@postgresql.org
Cc: Florian G. Pflug; Martijn van Oosterhout; Shane Ambler; 
NikhilS; Peter Eisentraut
Subject: Re: [HACKERS] Auto creation of Partitions

Florian,

 This sounds like what is really needed is a way to lock a certain 
 condition, namely the existance or non-existance of a record with 
 certain values in certain fields. This would not only help 
this case, 
 it would also help RI triggers, because those wouldn't have 
to acquire 
 a share lock on the referenced rows anymore.

That's called predicate locking and it's very, very hard to do.

That's definitely not needed.

Rather, we need something that lets us finally enforce RI ourselves in
the general case. This is currently not possible to do easily, except in
C code. It means we need to look at all the rows that exist, but are
normally invisible to our view of the database. Still, I'm not sure
about all cases, as the MVCC model is quite tricky and I'm not sure
whether my ideas about it are valid.

The basic idea is that you need to guarantee the constraint for the
'single underlying model' (with everything visible) and for your view
(under your visibility rules). I believe, but am not certain, that
under these conditions any (valid) snapshot will obey the desired
constraints.

- Joris Dobbelsteen




Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Simon Riggs
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote:

 - Based on a lot of the benchmarks and workloads I traced, the
   target buffer of read operations are typically accessed again
   shortly after the read, while writes are usually not.  Therefore,
   the default operation mode is to bypass L2 for writes, but not
   for reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from
here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in
the OS kernel is correct and you have the numbers to prove it

- currently Solaris only does NTA for 128K reads, which we don't
currently do. If we were to request 16 blocks at time, we would get this
benefit on Solaris, at least. The copyout_max_cached parameter can be
patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very
soon afterwards when reading sequentially (not writes). Reducing the
working set size is an effective technique in improving performance if
we don't have a kernel that does NTA or we don't read in big enough
chunks (we need both to get NTA to kick in).

and what you haven't said:

- all of this is orthogonal to the issue of buffer cache spoiling in
PostgreSQL itself. That issue does still exist as a non-OS issue, but
we've been discussing in detail the specific case of L2 cache effects
with specific kernel calls. All of the test results have been
stand-alone, so we've not done any measurements in that area. I say this
because you make the point that reducing the working set size of write
workloads has no effect on the L2 cache issue, but ISTM it's still
potentially a cache spoiling issue.
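
As a toy illustration of the working-set point (plain C, nothing
PostgreSQL- or Solaris-specific): reusing one small buffer for a large
sequential read keeps the cache footprint bounded at the buffer size, no
matter how large the file is, instead of streaming every byte through a
large destination array.

```c
/* Toy illustration: stream a large file through one small reusable
 * buffer.  The working set of the loop stays at CHUNK bytes --
 * comfortably inside a typical L2 cache -- regardless of file size. */
#include <stdio.h>

enum { CHUNK = 64 * 1024 };   /* well under a typical L2 cache */

static char chunk[CHUNK];     /* the only data the loop ever touches */

unsigned long checksum_stream(FILE *f)
{
    unsigned long sum = 0;
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, f)) > 0)
        for (size_t i = 0; i < n; i++)
            sum += (unsigned char) chunk[i];
    return sum;
}
```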

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread Simon Riggs
On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote:

 A. Just accept the extra overhead, thereby preserving the current
 behavior of unnamed statements, and gaining the benefit that plan
 invalidation will work correctly in the few cases where an unnamed
 statement's plan lasts long enough to need replanning.

With connection pooling, multiple sessions will execute each statement.
If we check the cache each time, this does seem more expensive for each
individual session, but we should gain synergy from other similar
sessions. Taken across multiple sessions, A will be a win because it
will reduce planning overhead by ~99%. 

 C. Don't store the unnamed statement in the plan cache.  To make sure
 it's not used anymore when the plan might be stale, don't analyze or
 plan at Parse-message time, but postpone *all* that work until Bind;
 and always discard the plan after Execute.  We could still do raw
 parsing at Parse time, since that's independent of database state,
 but all but the most trivial syntactic errors would now occur at Bind
 not Parse time, as well as the majority of the time expenditure. 

ISTM there will be some cases where the current behaviour will not be
maintained if we implement A exactly. One thing I've not seen mentioned
is the effect of constants on various plans. 

The current system plans at Bind time so it can make longer term
decisions based upon the values of initial parameters. So I'd say we
need to check the cache at Parse time, but if we do need to plan,
continue to do this at Bind time (and so don't write to plan cache until
that point). That might mean we end up giving some of our benefit away
if multiple sessions all concurrently plan a previously unplanned query.
That does seem less likely and in any case much better than taking a
step backwards in query planning of parameterised queries.

Also, some of those plans are only currently possible with actual
constants, specifically predicate proving for partial indexes and
constraint exclusion. Parameter to constant folding may change the plan
completely and make it non-reusable anyhow. How would we cope with that
type of prepared query with plan inval?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

2007-03-06 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote:
 A. Just accept the extra overhead, thereby preserving the current
 behavior of unnamed statements, and gaining the benefit that plan
 invalidation will work correctly in the few cases where an unnamed
 statement's plan lasts long enough to need replanning.

 With connection pooling, multiple sessions will execute each statement.
 If we check the cache each time this does seem more expensive for each
 individual session, but we should gain synergy from other similar
 sessions.

It seems fairly unlikely to me that client code would try to share an
unnamed statement across multiple application threads; the entire point
is that it's for one-off queries.

Or did you miss the point that the plan cache is local per-backend?

 ISTM there will be some cases where the current behaviour will not be
 maintained if we implement A exactly. One thing I've not seen mentioned
 is the effect of constants on various plans. 

There is none.

regards, tom lane



Re: [HACKERS] GIST and TOAST

2007-03-06 Thread Gregory Stark
Teodor Sigaev [EMAIL PROTECTED] writes:

 It doesn't actually crash, it just fails CHECKARRVALID. I added an assertion
 in there to cause it to generate a core dump.

 Wow, good catch; see the attached patch.

 g_int_decompress doesn't return a detoasted array when the array is empty.
 Previously this was safe because an empty array had never been toasted.

Ah, thanks a bunch.

 Should I commit it, or will you include it in your patch?

I'll include it in the patch I guess since it's fine the way it is until the
patch hits.

Now I'll try running the regressions again with the gist datatypes like hstore
etc all packed as well.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 18:47 +, Heikki Linnakangas wrote:
 Tom Lane wrote:
  Jeff Davis [EMAIL PROTECTED] writes:
  If I were to implement this idea, I think Heikki's bitmap of pages
  already read is the way to go.
  
  I think that's a good way to guarantee that you'll not finish in time
  for 8.3.  Heikki's idea is just at the handwaving stage at this point,
  and I'm not even convinced that it will offer any win.  (Pages in
  cache will be picked up by a seqscan already.)
 
 The scenario that I'm worried about is that you have a table that's 
 slightly larger than RAM. If you issue many seqscans on that table, one 
 at a time, every seqscan will have to read the whole table from disk, 
 even though say 90% of it is in cache when the scan starts.
 

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than current behavior, where
you have exactly the same problem.

 This can be alleviated by using a large enough sync_scan_offset, but a 
 single setting like that is tricky to tune, especially if your workload 
 is not completely constant. Tune it too low, and you don't get much 
 benefit, tune it too high and your scans diverge and you lose all benefit.
 

I see why you don't want to manually tune this setting, however it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] PostgreSQL - 'SKYLINE OF' clause added!

2007-03-06 Thread Gavin Sherry
On Tue, 6 Mar 2007, Alvaro Herrera wrote:

 Also, keep in mind that there were plenty of changes in the executor.
 This stuff is not likely to be very easy to implement efficiently using
 our extant executor machinery; note that Ranbeer mentioned
 implementation of block nested loop and other algorithms.  Not sure
 how easy would be to fold that stuff into the optimizer for multi-input
 aggregates, instead of hardwiring it to the SKYLINE OF syntax.


Yes, there's been a lot of working on calculating skyline efficiently,
with different sorting techniques and so on. This is the most interesting
part of the idea. You could calculate the query Ranbeer gave using pure
SQL and, perhaps, use of some covariance aggregates or something already.
Of course, it gets harder when you want to calculate across many
dimensions.

Personally, I'd love to see some of these newer data analysis
capabilities added to PostgreSQL -- or at least put out there as
interesting patches.

Thanks,

Gavin



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jim Nasby

On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:

Jim Nasby [EMAIL PROTECTED] writes:

An idea I've been thinking about would be to have the bgwriter or
some other background process actually try and keep the free list
populated,


The bgwriter already tries to keep pages just in front of the clock
sweep pointer clean.


True, but that still means that each backend has to run the clock
sweep. AFAICT that's something that backends will serialize on (due
to BufFreelistLock), so it would be best to make StrategyGetBuffer as
fast as possible. It certainly seems like grabbing a buffer off the
free list is going to be a lot faster than running the clock sweep.
That's why I think it'd be better to have the bgwriter run the clock
sweep and put enough buffers on the free list to try and keep up with
demand.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jim Nasby

On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:

We also don't need an exact count, either. Perhaps there's some way
we could keep a counter or something...


Exact count of what? The pages already in cache?


Yes. The idea being if you see there's 10k pages in cache, you can  
likely start 9k pages behind the current scan point and still pick  
everything up.


But this is nowhere near as useful as the bitmap idea, so I'd only
look at it if it's impossible to make the bitmaps work. And like
others have said, that should wait until there's at least a
first-generation patch that's going to make it into 8.3.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Aggressive freezing in lazy-vacuum

2007-03-06 Thread ITAGAKI Takahiro
Gregory Stark [EMAIL PROTECTED] wrote:

 The hoped for gain here is that vacuum finds fewer pages with tuples that
 exceed vacuum_freeze_min_age? That seems useful though vacuum is still going
 to have to read every page and I suspect most of the writes pertain to dead
 tuples, not freezing tuples.

Yes. VACUUM dirties pages only to freeze tuples that exceed
vacuum_freeze_min_age in particular cases, and I think we can reduce
those writes by keeping the number of unfrozen tuples low.

There are three additional costs in FREEZE:
  1. CPU cost of changing the xids of the target tuples.
  2. Write cost of the WAL entries for FREEZE (log_heap_freeze).
  3. Write cost of the newly created dirty pages.

I do additional freezing in the following two cases. We will already
have created dirty buffers and WAL entries for the required operations
then, so I think the additional costs of 2 and 3 are negligible, though
1 still affects us.

| - There is another tuple to be frozen in the same page.
| - There are other dead tuples in the same page.
|   Freezing is delayed until the heap vacuum phase.


 This strikes me as something that will be more useful once we have the DSM
 especially if it ends up including a frozen map. Once we have the DSM vacuum
 will no longer be visiting every page, so it will be much easier for pages to
 get quite old and only be caught by a vacuum freeze. The less i/o that vacuum
 freeze has to do the better. If we get a freeze map then agressive freezing
 would help keep pages out of that map so they never need to be vacuumed just
 to freeze the tuples in them.

Yeah, I was planning a 2-bits-per-page DSM exactly for that purpose. One
of the bits means to-be-vacuumed and the other means to-be-frozen. It
helps us avoid full scans of the pages for XID wraparound vacuums, but
DSM would need to be more reliable and not lose any information. I made
an attempt to accomplish this in DSM, but I understand the need to
demonstrate to you that it works as designed.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
 On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
  We also don't need an exact count, either. Perhaps there's some way
  we could keep a counter or something...
 
  Exact count of what? The pages already in cache?
 
 Yes. The idea being if you see there's 10k pages in cache, you can
 likely start 9k pages behind the current scan point and still pick
 everything up.
 
 But this is nowhere near as useful as the bitmap idea, so I'd only
 look at it if it's impossible to make the bitmaps work. And like
 others have said, that should wait until there's at least a
 first-generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is nowhere near as useful as the bitmap idea?

My current implementation is making use of OS buffers + shared memory;
the bitmap idea can only make use of shared memory, and is likely
throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
devation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern -- would
negate any benefit from the bitmap idea. I feel much more confident that
there will exist a trail of pages that are cached in *either* the PG
shared buffers *or* the OS buffer cache. There may be holes/gaps in
either one, but it's much more likely that they combine into a
contiguous series of cached pages. Do you have an idea how I might test
this claim?

Regards,
Jeff Davis





Re: [HACKERS] Aggressive freezing in lazy-vacuum

2007-03-06 Thread ITAGAKI Takahiro

Tom Lane [EMAIL PROTECTED] wrote:

 I said nothing about expired tuples.  The point of not freezing is to
 preserve information about the insertion time of live tuples.

I don't know what good it will do -- for debugging?
Why don't you use CURRENT_TIMESTAMP?


 And your
 test case is unconvincing, because no sane DBA would run with such a
 small value of vacuum_freeze_min_age.

I intended that value for an accelerated test. The penalties of
freezing are spread over the long term in normal use, but we surely
suffer from them bit by bit.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Jeff Davis
On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote:
 Jeff Davis wrote:
  On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
  On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
  Another approach I proposed back in December is to not have a  
  variable like that at all, but scan the buffer cache for pages  
  belonging to the table you're scanning to initialize the scan.  
  Scanning all the BufferDescs is a fairly CPU and lock heavy  
  operation, but it might be ok given that we're talking about large  
  I/O bound sequential scans. It would require no DBA tuning and  
  would work more robustly in varying conditions. I'm not sure where  
  you would continue after scanning the in-cache pages. At the  
  highest in-cache block number, perhaps.
  If there was some way to do that, it'd be what I'd vote for.
 
  
  I still don't know how to make this take advantage of the OS buffer
  cache. 
 
 Yep, I don't see any way to do that. I think we could live with that, 
 though. If we went with the sync_scan_offset approach, you'd have to 
 leave a lot of safety margin in that as well.
 

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% are being used for
hot pages in other parts of the database. That means you have access to
only 20% of physical memory to optimize for this sequential scan, and
20% of the physical memory is basically unavailable (being used for
other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of the
physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
Jeff Davis




Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Luke Lonergan
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in 
various patterns and did not see the substantial gain as with reducing the 
working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue.  We see 
a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this 
effect.

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Sherry Moore [mailto:[EMAIL PROTECTED]
Sent:   Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; 
Gavin Sherry; PGSQL Hackers; Doug Rady
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

 and what you haven't said
 
 - all of this is orthogonal to the issue of buffer cache spoiling in
 PostgreSQL itself. That issue does still exist as a non-OS issue, but
 we've been discussing in detail the specific case of L2 cache effects
 with specific kernel calls. All of the test results have been
 stand-alone, so we've not done any measurements in that area. I say this
 because you make the point that reducing the working set size of write
 workloads has no effect on the L2 cache issue, but ISTM its still
 potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the
  user buffer caused L2 thrashing.

- In general, application should reduce the size of the working set
  to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Developmenthttp://blogs.sun.com/sherrym



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-06 Thread Greg Smith

On Tue, 6 Mar 2007, Jim Nasby wrote:

The flipside is that it's much easier to machine-parse a table rather 
than trying to scrape the logs.


Now you might realize why I've been so vocal on the SQL log export 
implementation details.


And I don't think we'll generally care about each individual checkpoint; 
rather we'll want to look at things like 'checkpoints/hour' and 
'checkpoint written pages/hour'.


After a few months of staring at this data, I've found averages like that 
misleading.  The real problem areas correlate with the peak pages written 
at any one checkpoint.  Lowering that value is really the end-game for 
optimizing the background writer, and the peaks are what will nail you 
with a nasty fsync pause at checkpoint time.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Aggressive freezing in lazy-vacuum

2007-03-06 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] wrote:
 I said nothing about expired tuples.  The point of not freezing is to
 preserve information about the insertion time of live tuples.

 I don't know what good it will do -- for debugging?

Exactly.  As an example, I've been chasing offline a report from Merlin
Moncure about duplicate entries in a unique index; I still don't know
what exactly is going on there, but the availability of knowledge about
which transactions inserted which entries has been really helpful.  If
we had a system designed to freeze tuples as soon as possible, that info
would have been gone forever pretty soon after the problem happened.

I don't say that this behavior can never be acceptable, but you need
much more than a marginal performance improvement to convince me that
it's worth the loss of forensic information.

regards, tom lane



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-06 Thread ITAGAKI Takahiro
Greg Smith [EMAIL PROTECTED] wrote:

 After a few months of staring at this data, I've found averages like that 
 misleading.  The real problem areas correlate with the peak pages written 
 at any one checkpoint.  Lowering that value is really the end-game for 
 optimizing the background writer, and the peaks are what will nail you 
 with a nasty fsync pause at checkpoint time.

If you already have the technical knowledge to derive the best settings
from the activity logs, could you write it down in the code? I'd hope
for some kind of automatic control in bgwriter if its best configuration
varies by usage or activity.


BTW, I'm planning two changes in bgwriter.

  [Load distributed checkpoint]
http://archives.postgresql.org/pgsql-patches/2007-02/msg00522.php
  [Automatic adjustment of bgwriter_lru_maxpages]
http://archives.postgresql.org/pgsql-patches/2007-03/msg00092.php

I have some results showing that if we have plenty of time for
checkpoints, bgwriter_all_maxpages is not such an important parameter,
because it is effectively adjusted to shared_buffers / checkpoint
duration. Also, my recommended bgwriter_lru_maxpages is the average
number of recycled buffers per cycle, which is hard to tune manually.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread NikhilS

Hi Shane,

Maybe I'm looking at auto-maintenance which is beyond any current planning?


Many of your suggestions are useful, but auto-maintenance will be beyond the
current plan.

Regards,
Nikhils

EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread NikhilS

Hi,

On 3/7/07, Tom Lane [EMAIL PROTECTED] wrote:


Peter Eisentraut [EMAIL PROTECTED] writes:
 But when I say
 CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ...
 then I expect that the primary key will be enforced across all
 partitions.  We currently sidestep that issue by not offering seemingly
 transparent partitioning.  But if you are planning to offer that, the
 unique index issue needs to be solved, and I see nothing in your plan
 about that.

Agreed, it needs to Just Work.  I think it'd still be useful though
if we only support auto-partitioning on the primary key, and that
restriction avoids the indexing problem.

regards, tom lane



Sure, but as Chris mentioned earlier, wouldn't it be useful to maintain
uniqueness on a
partition-by-partition basis too?

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Stefan Kaltenbrunner

Andrew Dunstan wrote:

Tom Lane wrote:

Peter Eisentraut [EMAIL PROTECTED] writes:
 

But when I say
CREATE TABLE ( a int PRIMARY KEY, ... ) PARTITION blah ...
then I expect that the primary key will be enforced across all 
partitions.  We currently sidestep that issue by not offering 
seemingly transparent partitioning.  But if you are planning to offer 
that, the unique index issue needs to be solved, and I see nothing in 
your plan about that.



Agreed, it needs to Just Work.  I think it'd still be useful though
if we only support auto-partitioning on the primary key, and that
restriction avoids the indexing problem.

  


Maybe. The most obvious use for automatic partitioning that I can think
of would be based on the value of a timestamptz field rather than any
PK. Of course I tend to work more in the OLTP field than in DW type
apps, where other considerations might apply.


I second that - partitioning on some kind of timestamp field is a
common use case here too ...



Stefan



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-06 Thread Hannu Krosing
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
 On Tue, 2007-03-06 at 18:29 +, Heikki Linnakangas wrote:
  Jeff Davis wrote:
   On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
   On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
   Another approach I proposed back in December is to not have a  
   variable like that at all, but scan the buffer cache for pages  
   belonging to the table you're scanning to initialize the scan.  
   Scanning all the BufferDescs is a fairly CPU and lock heavy  
   operation, but it might be ok given that we're talking about large  
   I/O bound sequential scans. It would require no DBA tuning and  
   would work more robustly in varying conditions. I'm not sure where  
   you would continue after scanning the in-cache pages. At the  
   highest in-cache block number, perhaps.
   If there was some way to do that, it'd be what I'd vote for.
  
   
   I still don't know how to make this take advantage of the OS buffer
   cache. 

Maybe it shouldn't?

Mostly there can be use of the OS cache only if it is much bigger than
the shared buffer cache. It may make sense to forget about the OS cache
and just tell those who can make use of sync scans to set most of their
memory aside for shared buffers.

Then we can make better predictions/lookups of how much of a table is
actually in memory.

Dual caching is usually not very beneficial anyway, not to mention the
difficulty of predicting any dual-caching effects.

  Yep, I don't see any way to do that. I think we could live with that, 
  though. If we went with the sync_scan_offset approach, you'd have to 
  leave a lot of safety margin in that as well.
  
 
 Right, there would certainly have to be a safety margin with
 sync_scan_offset. However, your plan only works when the shared buffers
 are dominated by this sequential scan. Let's say you have 40% of
 physical memory for shared buffers, and say that 50% are being used for
 hot pages in other parts of the database. That means you have access to
 only 20% of physical memory to optimize for this sequential scan, and
 20% of the physical memory is basically unavailable (being used for
 other parts of the database).

The simplest thing, in case the table is much bigger than the buffer
cache usable for it, is to start the second scan at the point the first
scan is traversing *now*, and hope that the scans will stay together.
Or start at some fixed lag, which makes the first scan always the one
issuing reads while the second just freerides on buffers already in
cache. It may even be a good idea to throttle the second scan to stay
N pages behind if the OS readahead gets confused when the same file is
read from multiple processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the point
where pages are still in the shared buffer cache, seems good mostly for
the case where the table is as big as or just a little larger than the
cache.

 In my current implementation, you could set sync_scan_offset to 1.0
 (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
 would be used for starting this sequential scan. In this case, that
 should be a good margin of error, considering that as much as 80% of the
 physical memory might actually be in cache (OS or PG cache).
 
 This all needs to be backed up by testing, of course. I'm just
 extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start the 2nd
scan at the point where the 1st is now, and move in both directions
simultaneously, like this:

First scan is at page N.

Second scan:

M = N - 1
WHILE NOT ALL PAGES ARE READ:
    IF PAGE N IS IN CACHE:                  -- FOLLOW FIRST READER
        READ PAGE N
        N++
    ELSE IF M >= 0 AND PAGE M IS IN CACHE:  -- READ OLDER CACHED PAGES
        READ PAGE M
        M--
    ELSE IF FIRST READER STILL GOING:       -- NO OLDER PAGES, WAIT FOR 1st
        WAIT FOR PAGE N TO BECOME AVAILABLE
        READ PAGE N
        N++
    ELSE:                                   -- BECOME 1st READER
        READ PAGE N
        N++
    PROCESS PAGE
    -- wrap around at the ends of the table
    IF N > PAGES_IN_TABLE: N = 0
    IF M < 0: M = PAGES_IN_TABLE


This should work reasonably well for LRU caches and it may be made to
work with clock sweep scheme if the sweep arranges pages to purge in
file order.

If we could make the IF PAGE x IS IN CACHE part also know about the OS
cache, this could make use of the OS cache as well.


Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems ?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] Auto creation of Partitions

2007-03-06 Thread Joshua D. Drake




Maybe. The most obvious use for automatic partitioning that I can
think of would be based on the value of a timestamptz field rather
than any PK. Of course I tend to work more in the OLTP field than in
DW type apps, where other considerations might apply.


I second that - partitioning on some kind of timestamp field is a
common use case here too ...
Partitioning, period, needs to work. It doesn't matter what the user
chooses as their partition key. Timestamp is an obvious choice, but
there are others, such as a serial column where you just partition
every million rows (for example) to keep things manageable.


Joshua D. Drake




Stefan




