Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008-10-28 Thread Hitoshi Harada
2008/10/28 Hitoshi Harada [EMAIL PROTECTED]:
 Thanks for all your testing!

 2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:
 Hitoshi Harada [EMAIL PROTECTED] wrote:


 select relid,AVG(seq_Scan) OVER (ORDER BY relid)
 FROM pg_stat_user_tables
 GROUP BY relid,seq_scan;

 crashes with segfault.

 I reproduced it on Linux in my environment, building without
 debug/cassert. It could be a problem around there.


And I fixed this problem, confirming with/without debug/cassert/gcc
-O, and pushed it to git. If you want the delta patch, please see

http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=fbf19bfd0c8d2ac083b775f4cc724ec66e74fa8f

Thanks for all of your feedback.
Regards,


-- 
Hitoshi Harada



[HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Koichi Suzuki
Hi,

This is my first proposal of a PITR performance improvement for
PostgreSQL 8.4 development.  This proposal includes a readahead
mechanism for data pages which will be read by redo() routines during
recovery.  This is especially effective in recovery without full
page writes.  Readahead is done by posix_fadvise(), as proposed in
the index scan improvement work.

Details of the implementation can be found in the README file in the attached material.

-- 
Koichi Suzuki


pg_readahead_20081028.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008-10-28 Thread ITAGAKI Takahiro

Hitoshi Harada [EMAIL PROTECTED] wrote:

 And I fixed this problem, confirming with/without debug/cassert/gcc
 -O, and pushed it to git. If you want the delta patch, please see
 http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=fbf19bfd0c8d2ac083b775f4cc724ec66e74fa8f

Now it passed all regression tests!


There is still one trivial warning:
parse_func.c: In function `ParseFuncOrColumn':
parse_func.c:77: warning: 'retval' might be used uninitialized in this function

It looks completely safe, but I suggest initializing
the variable just to keep the compiler quiet.

[src/backend/parser/parse_func.c] ParseFuncOrColumn()
else
+   {
elog(ERROR, "unknown function status");
+   retval = NULL;  /* keep compiler quiet */
+   }

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





[HACKERS] VACUUMs and WAL

2008-10-28 Thread Simon Riggs
Looking at a VACUUM's WAL records makes me think twice about the way we
issue a VACUUM.

1. First we scan the heap, issuing a HEAP2 clean record for every block
that needs cleaning.

2. Then we scan the index, issuing WAL records as appropriate.

3. Then we rescan the heap, issuing a HEAP2 clean record for every
block.

I don't see a reason why we would issue 2 WAL records per block for a
VACUUM, nor why we would prune and remove in two steps, dirtying the
block each time. Seems like we could write approximately half the amount
of data that we do.

Surely we can come up with a better plan than that one?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008-10-28 Thread Hitoshi Harada
2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:

 Hitoshi Harada [EMAIL PROTECTED] wrote:

 And I fixed this problem, confirming with/without debug/cassert/gcc
 -O, and pushed it to git. If you want the delta patch, please see
 http://git.postgresql.org/?p=~davidfetter/window_functions/.git;a=commitdiff;h=fbf19bfd0c8d2ac083b775f4cc724ec66e74fa8f

 Now it passed all regression tests!


 There is still one trivial warning:
parse_func.c: In function `ParseFuncOrColumn':
parse_func.c:77: warning: 'retval' might be used uninitialized in this function

 It looks completely safe, but I suggest initializing
 the variable just to keep the compiler quiet.

 [src/backend/parser/parse_func.c] ParseFuncOrColumn()
else
 +   {
elog(ERROR, "unknown function status");
 +   retval = NULL;  /* keep compiler quiet */
 +   }


Agreed. I've just noticed it as well, and I think it would be closer to
the original code if the last "else if" clause simply became "else".
Anyway, thanks.


Regards,


-- 
Hitoshi Harada



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Hannu Krosing
On Tue, 2008-10-28 at 08:49 +, Simon Riggs wrote:
 Looking at a VACUUM's WAL records makes me think twice about the way we
 issue a VACUUM.
 
 1. First we scan the heap, issuing a HEAP2 clean record for every block
 that needs cleaning.

IIRC the first heap pass just collects info and does nothing else. 
Is this just an empty/do-nothing WAL record ?

 2. Then we scan the index, issuing WAL records as appropriate.
 
 3. Then we rescan the heap, issuing a HEAP2 clean record for every
 block.
 
 I don't see a reason why we would issue 2 WAL records per block for a
 VACUUM, nor why we would prune and remove in two steps, dirtying the
 block each time. 

The first pass should just be collecting info and not dirtying anything.
Could it be a side effect of setting some transaction visibility bits on
the first visit?

In that case it would be good if we could disable doing that for
vacuum.

 Seems like we could write approximately half the amount
 of data that we do.
 
 Surely we can come up with a better plan than that one?

---
Hannu Krosing 

http://www.2ndQuadrant.com
PostgreSQL Scalability Training, Services and Support





Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote:
 On Tue, 2008-10-28 at 08:49 +, Simon Riggs wrote:
  Looking at a VACUUM's WAL records makes me think twice about the way we
  issue a VACUUM.
  
  1. First we scan the heap, issuing a HEAP2 clean record for every block
  that needs cleaning.
 
 IIRC the first heap pass just collects info and does nothing else. 
 Is this just an empty/do-nothing WAL record ?

8.3 changed that; it used to work that way. I guess I never looked at
the amount of WAL being generated.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 Robert Haas [EMAIL PROTECTED] writes:
 I'm pretty excited by that example.  LIMIT/OFFSET is really useful as
 a way of paginating query results for display on a web page (show
 results 1-100, 101-200, etc), and I use it on potentially expensive
 SRFs just as I do on tables and views.

 I suspect it doesn't help you as much as you think.  It's always been
 the case that SRFs in FROM-items are fed through a tuplestore, and so
 are plpgsql SRF results.  

I always thought we considered that a bug though. It sure would be nice if we
could generate results as needed instead of having to generate them in advance
and store all of them.

In particular I fear there are a lot of places that use functions where we
might expect them to use views. They're never going to get really good plans
but it would be nice if we could at least avoid the extra materialize steps.

Now your patch isn't affecting that one way or the other but does it rule it
out forever?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Simon Riggs

On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
 So I'm concluding that we can easily afford to switch to
 tuplestore-always operation, especially if we are willing to put any
 effort into tuplestore optimization.  (I note that the current
 tuplestore code writes 24 bytes per row for this example, which is a
 shade on the high side for only 4 bytes payload.  It looks like it
 would be pretty easy to knock 10 bytes off that for a 40% savings in
 I/O volume.)

That seems like an important, possibly more important, change.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Gregory Stark
Koichi Suzuki [EMAIL PROTECTED] writes:

 This is my first proposal of a PITR performance improvement for
 PostgreSQL 8.4 development.  This proposal includes a readahead
 mechanism for data pages which will be read by redo() routines during
 recovery.  This is especially effective in recovery without full
 page writes.  Readahead is done by posix_fadvise(), as proposed in
 the index scan improvement work.

Incidentally: a bit of background for anyone who wasn't around when this last
came up: prefetching is especially important for our recovery code because
it's single-threaded. If you have a raid array you're effectively limited to
using a single drive at a time. This is a major problem because the logs could
have been written by many processes hammering the raid array concurrently. In
other words your warm standby database might not be able to keep up with the
logs from the master database even on identical (or even better) hardware.

Simon (I think?) proposed allowing our recovery code to be multi-threaded.
Heikki suggested using prefetching.

 Details of the implementation can be found in the README file in the attached material.

I've read through this and I think I disagree with the idea of using a
separate program. It's a lot of extra code -- and duplicated code from the
normal recovery too.

I recognise that it's awkward to handle during recovery since you would have
to rework the WAL reading logic quite a bit.

But it might be worth doing anyways for non-raid situations. Even if you don't
have a raid array it would be worthwhile to do the equivalent of a bitmap heap
scan by fetching blocks in order.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL 
training!



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 I don't see a reason why we would issue 2 WAL records per block for a
 VACUUM, nor why we would prune and remove in two steps, dirtying the
 block each time. Seems like we could write approximately half the amount
 of data that we do.

 Surely we can come up with a better plan than that one?

This sounds like the same issue Pavan identified and proposed solving by
rotating the three passes so that we do step 3 at the beginning of the next
vacuum run, so it can be done in the same pass as step 1.

To do that he proposed we do:

1. scan heap doing two things: a) remove any marked tuples if they were marked
   by a previous vacuum which committed and b) prune and mark any tuples we
   find are deletable for a future vacuum to remove.

2. scan indexes and remove the tuples we marked in 1b.

The main blocking issue IIRC was:

How to mark the tuples in a way which is safe if vacuum aborts. He suggested
putting a vacuum xid in pg_class. Other suggestions were to mark the pages in
the page header (which I thought made the most sense) or to put the xid in the
line pointer (since nothing else needs to go there, the tuples are already
cleaned).

There might also have been a question of how to deal with the statistics.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 10:59 +, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  I don't see a reason why we would issue 2 WAL records per block for a
  VACUUM, nor why we would prune and remove in two steps, dirtying the
  block each time. Seems like we could write approximately half the amount
  of data that we do.
 
  Surely we can come up with a better plan than that one?
 
 This sounds like the same issue Pavan identified and proposed solving by
 rotating the three passes so that we do step 3 at the beginning of the next
 vacuum run, so it can be done in the same pass as step 1.
 
 To do that he proposed we do:
 
 1. scan heap doing two things: a) remove any marked tuples if they were marked
by a previous vacuum which committed and b) prune and mark any tuples we
find are deletable for a future vacuum to remove.
 
 2. scan indexes and remove the tuples we marked in 1b.

It's fairly hard to remove the second heap pass completely. 

I think what I am suggesting is two heap passes, but writing WAL and
dirtying blocks on only one of the passes.

The biggest I/O cost comes from the writes, not the reads, ISTM.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Simon Riggs

On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
 One option would be to just ignore that problem for now, and not 
 WAL-log.

Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby? 

But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say page is now all visible, without too much work.

Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update the VM. Separating
the two concepts should make the visibility check speed gain more
generally available.
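
A minimal sketch of that idea (not taken from the patch): all_visible is
assumed to have been computed while heapgetpage() held the share lock, and
PageSetAllVisible() is assumed to be the patch's flag-setting macro. A real
version would have to recheck the tuples after the lock upgrade, since the
share lock is dropped for an instant:

LockBuffer(buffer, BUFFER_LOCK_UNLOCK);	/* release the share lock */

if (all_visible && ConditionalLockBuffer(buffer))	/* exclusive, no wait */
{
	PageSetAllVisible(page);	/* unlogged, hint-style change... */
	SetBufferCommitInfoNeedsSave(buffer);	/* ...so dirty it like a hint bit */
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}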

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 16:33 +0900, Koichi Suzuki wrote:

 This is my first proposal of a PITR performance improvement for
 PostgreSQL 8.4 development.  This proposal includes a readahead
 mechanism for data pages which will be read by redo() routines during
 recovery.  This is especially effective in recovery without full
 page writes.  Readahead is done by posix_fadvise(), as proposed in
 the index scan improvement work.
 
 Details of the implementation can be found in the README file in the attached material.

I'm happy with the idea of a readahead process. I thought we were
implementing a BackgroundReader process for other uses. Is that dead
now?

I see you're doing this as a standalone command. Shame we can't include
this in core for now, but I can see the reasons why.

How does the whole toolchain look with this?
Can you give a simple example of using pg_lesslog, pg_readahead and
pg_standby together, if that is appropriate?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 On Tue, 2008-10-28 at 10:59 +, Gregory Stark wrote:
 To do that he proposed we do:
 
 1. scan heap doing two things: a) remove any marked tuples if they were marked
    by a previous vacuum which committed and b) prune and mark any tuples we
    find are deletable for a future vacuum to remove.
 
 2. scan indexes and remove the tuples we marked in 1b.

 It's fairly hard to remove the second heap pass completely. 

 I think what I am suggesting is two heap passes, but writing WAL and
 dirtying blocks on only one of the passes.

How small a patch would it be? I guess you just need to disable all pruning in
the first pass and do it in the second pass?

I would still rather see Pavan's optimization if we can do it cleanly, but if
it's not going to happen and this is trivial then sure, we may as well.


 The biggest I/O cost comes from the writes, not the reads, ISTM.

It's counter-intuitive but actually it's usually the other way around. Writes
can be buffered, re-ordered, and scheduled during otherwise idle time. Reads
however are always blocking.

However in this situation I think you may be right. Vacuum is doing a
sequential scan through the table so if the OS has to interrupt that scan to
go do some writes it'll end up having to go back and forth. That would be a
*lot* slower than just doing a sequential scan.



-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Hannu Krosing
On Tue, 2008-10-28 at 10:10 +, Simon Riggs wrote:
 On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote:
  On Tue, 2008-10-28 at 08:49 +, Simon Riggs wrote:
   Looking at a VACUUM's WAL records makes me think twice about the way we
   issue a VACUUM.
   
   1. First we scan the heap, issuing a HEAP2 clean record for every block
   that needs cleaning.
  
  IIRC the first heap pass just collects info and does nothing else. 
  Is this just an empty/do-nothing WAL record ?
 
 8.3 changed that; it used to work that way. I guess I never looked at
 the amount of WAL being generated.


I can't see how it is safe to do anything more than just lookups on
the first pass.

There will be dangling index pointers if the system crashes/is rebooted
or the vacuum is just interrupted after cleaning some heap pages but
before cleaning the corresponding index pages.

---
Hannu





Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Heikki Linnakangas

Gregory Stark wrote:

Koichi Suzuki [EMAIL PROTECTED] writes:


This is my first proposal of a PITR performance improvement for
PostgreSQL 8.4 development.  This proposal includes a readahead
mechanism for data pages which will be read by redo() routines during
recovery.  This is especially effective in recovery without full
page writes.  Readahead is done by posix_fadvise(), as proposed in
the index scan improvement work.


Incidentally: a bit of background for anyone who wasn't around when this last
came up: prefetching is especially important for our recovery code because
it's single-threaded. If you have a raid array you're effectively limited to
using a single drive at a time. This is a major problem because the logs could
have been written by many processes hammering the raid array concurrently. In
other words your warm standby database might not be able to keep up with the
logs from the master database even on identical (or even better) hardware.

Simon (I think?) proposed allowing our recovery code to be multi-threaded.
Heikki suggested using prefetching.


I actually played around with the prefetching, and even wrote a quick 
prototype of it, about a year ago. It read ahead a fixed number of the 
WAL records in xlog.c, calling posix_fadvise() for all pages that were 
referenced in them. I never got around to finishing it, as I wanted to see 
Greg's posix_fadvise() patch get done first and rely on the same 
infrastructure, but here are some lessons I learned:


1. You should avoid useless posix_fadvise() calls. In the naive 
implementation, where you simply call posix_fadvise() for every page 
referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls 
per WAL record, and that's a lot of overhead. We face the same design 
question as with Greg's patch to use posix_fadvise() to prefetch index 
and bitmap scans: what should the interface to the buffer manager look 
like? The simplest approach would be a new function call like 
AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the 
page if it's not in the buffer cache, but is a no-op otherwise. But that 
means more overhead, since for every page access, we need to find the 
page twice in the buffer cache: once for the AdviseBuffer() call, and a 
2nd time for the actual ReadBuffer(). It would be more efficient to pin 
the buffer in the AdviseBuffer() call already, but that requires many 
more changes to the callers.
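
To make the interface question concrete, here is a hand-wavy sketch of the
simpler variant; AdviseBuffer() is the hypothetical call named above, and
BufferIsInCache() and RelationGetFd() are invented placeholder names, not
existing functions:

void
AdviseBuffer(Relation reln, BlockNumber blockNum)
{
	/* If the page is already in shared buffers, there's nothing to do. */
	if (BufferIsInCache(reln, blockNum))	/* hypothetical cache probe */
		return;

	/* Otherwise, ask the kernel to start reading it in the background. */
	{
		int		fd = RelationGetFd(reln);	/* hypothetical fd lookup */
		off_t	offset = (off_t) blockNum * BLCKSZ;

		(void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
	}
}

The overhead is visible right there: the cache probe is repeated later by
ReadBuffer(), and pinning the buffer already in AdviseBuffer() would save
the second lookup at the price of touching every caller.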


2. The format of each WAL record is different, so you need a readahead 
handler for every resource manager, for every record type. It would be 
a lot simpler if there was a standardized way to store that information 
in the WAL records.


3. IIRC I tried to handle just a few of the most important WAL records at 
first, but it turned out that you really need to handle all WAL records 
(that are used at all) before you see any benefit. Otherwise, every time 
you hit a WAL record that you haven't done posix_fadvise() on, the 
recovery stalls, and you don't need many of those to diminish the gains.


Not sure how these apply to your approach; it's very different. You seem 
to handle 1. by collecting all the page references for the WAL file, and 
sorting and removing the duplicates. I wonder how much CPU time is spent 
on that?
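
For reference, the collect-sort-deduplicate step could be as simple as this
sketch (BlockRef, refs[] and prefetch_block() are illustrative names, not
taken from the patch):

typedef struct
{
	RelFileNode node;
	BlockNumber blkno;
} BlockRef;

static int
blockref_cmp(const void *a, const void *b)
{
	const BlockRef *x = (const BlockRef *) a;
	const BlockRef *y = (const BlockRef *) b;
	int			c = memcmp(&x->node, &y->node, sizeof(RelFileNode));

	if (c != 0)
		return c;
	if (x->blkno != y->blkno)
		return (x->blkno < y->blkno) ? -1 : 1;
	return 0;
}

/* Sort all references gathered from one WAL file, then advise each
 * distinct block exactly once. */
qsort(refs, nrefs, sizeof(BlockRef), blockref_cmp);
for (i = 0; i < nrefs; i++)
	if (i == 0 || blockref_cmp(&refs[i - 1], &refs[i]) != 0)
		prefetch_block(&refs[i]);	/* posix_fadvise() wrapper */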



Details of the implementation can be found in the README file in the attached material.


I've read through this and I think I disagree with the idea of using a
separate program. It's a lot of extra code -- and duplicated code from the
normal recovery too.


Agreed, this belongs into core. The nice thing about a separate process 
is that you could hook it into recovery_command, with no changes to the 
server, but as you note in the README, we'd want to use this in crash 
recovery as well, and the interaction between the external command and 
the startup process seems overly complex for that. Besides, we want to 
use the posix_fadvise() stuff in the backend anyway, so we should use 
the same infrastructure during WAL replay as well.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008-10-28 Thread Tom Lane
Hitoshi Harada [EMAIL PROTECTED] writes:
 2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:
 I tested the patch on mingw (Windows) and
 got the following warning and error:
 
 A. gram.y: conflicts: 3 shift/reduce
 B. include/nodes/plannodes.h:650: error: syntax error before uint
 
 I have no idea about A.

 I had noticed it but didn't think it was a problem; it doesn't
 occur in production, does it?

We have a zero-tolerance policy for bison warnings.  Patches that
introduce shift/reduce conflicts *will* be rejected.  (And no, %expect
isn't an acceptable fix.  The problem with it is you can't be sure
which warnings it ignored.  In a grammar that gets hacked on as often
as PG's does, we couldn't rely on the conflicts to not move around,
possibly resulting in unforeseen misbehavior.)

regards, tom lane



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Heikki Linnakangas

Hannu Krosing wrote:

On Tue, 2008-10-28 at 10:10 +, Simon Riggs wrote:

On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote:

On Tue, 2008-10-28 at 08:49 +, Simon Riggs wrote:

Looking at a VACUUM's WAL records makes me think twice about the way we
issue a VACUUM.

1. First we scan the heap, issuing a HEAP2 clean record for every block
that needs cleaning.
IIRC the first heap pass just collects info and does nothing else. 
Is this just an empty/do-nothing WAL record ?

8.3 changed that; it used to work that way. I guess I never looked at
the amount of WAL being generated.


I can't see how it is safe to do anything more than just lookups on
first pass. 


What's done in the first pass is the same HOT pruning that is done 
opportunistically on other page accesses as well. IIRC it's required for 
correctness, though I can't remember what exactly the issue was.


I don't think the extra WAL volume is a problem; VACUUM doesn't generate 
much WAL, anyway. As for the extra data page writes it causes; yeah, 
that might cause some I/O that could be avoided, but remember that the 
first pass often dirties buffers anyway to set hint bits.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008-10-28 Thread Hitoshi Harada
2008/10/28 Tom Lane [EMAIL PROTECTED]:
 Hitoshi Harada [EMAIL PROTECTED] writes:
 2008/10/28 ITAGAKI Takahiro [EMAIL PROTECTED]:
 I tested the patch on mingw (Windows) and
 got the following warning and error:

 A. gram.y: conflicts: 3 shift/reduce
 B. include/nodes/plannodes.h:650: error: syntax error before uint

 I have no idea about A.

 I had noticed it but didn't think it was a problem; it doesn't
 occur in production, does it?

 We have a zero-tolerance policy for bison warnings.  Patches that
 introduce shift/reduce conflicts *will* be rejected.  (And no, %expect
 isn't an acceptable fix.  The problem with it is you can't be sure
 which warnings it ignored.  In a grammar that gets hacked on as often
 as PG's does, we couldn't rely on the conflicts to not move around,
 possibly resulting in unforeseen misbehavior.)

regards, tom lane


OK, I'll try to remove it. I'm not used to bison so my first task is
to find where the conflict is...

Regards,

-- 
Hitoshi Harada



Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 I'm happy with the idea of a readahead process. I thought we were
 implementing a BackgroundReader process for other uses. Is that dead
 now?

You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
always hated that in Oracle and thought it was a terrible kludge.

I think the inter-process communication would be way too heavy-weight; by the
time the other process is scheduled, the process which needed the blocks would
probably have done many of them already anyway. Worse, you would need a
large number of reading processes and would start to run into locking
contention on the work-queue as well.

In any case it would be a lot of code to do what posix_fadvise does for us
with a simple syscall anyways.

Am I misjudging this? Is this a popular idea? We could always implement it as
a fall-back implementation for mdprefetch() where posix_fadvise (and libaio
assuming we implement that as well) don't work. It has the advantage of
working with any system at all even if it predates 1003.1.

But it seems like an awful lot of code to implement a solution that I fear
won't work very well. And are people running raid arrays not likely to be
running modern OSes anyways?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I think what I am suggesting is two heap passes, but writing WAL and
 dirtying blocks on only one of the passes.

I think you've all forgotten about hint-bit setting.  The assumption is
that the first VACUUM pass is going to update a lot of hint bits and we
might as well get some other work done with the same write.

Now of course that doesn't necessarily entail a WAL write too, but
it makes this less than a slam-dunk win.

Also, I think that the reason the code ended up this way is that there
were pretty severe difficulties in making the VACUUM code cope correctly
with un-pruned tuples.  Pavan might remember more about that.

regards, tom lane



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Looking at a VACUUM's WAL records makes me think twice about the way we
 issue a VACUUM.

 1. First we scan the heap, issuing a HEAP2 clean record for every block
 that needs cleaning.

 2. Then we scan the index, issuing WAL records as appropriate.

 3. Then we rescan the heap, issuing a HEAP2 clean record for every
 block.

The first pass removes dead HOT tuples.  The second pass removes dead
normal tuples (it does NOT write every block, only those with dead
tuples).  In principle the set of pages written in pass 1 might be
completely disjoint from the set of pages written in pass 2 (though
I admit that's probably not real likely).

 Surely we can come up with a better plan than that one?

Maybe, but it's not as bad as you're painting it.

regards, tom lane



Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Heikki Linnakangas

Gregory Stark wrote:

Simon Riggs [EMAIL PROTECTED] writes:


I'm happy with the idea of a readahead process. I thought we were
implementing a BackgroundReader process for other uses. Is that dead
now?


You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
always hated that in Oracle and thought it was a terrible kludge.

I think the inter-process communication would be way too heavy-weight; by the
time the other process is scheduled, the process which needed the blocks would
probably have done many of them already anyway. Worse, you would need a
large number of reading processes and would start to run into locking
contention on the work-queue as well.

In any case it would be a lot of code to do what posix_fadvise does for us
with a simple syscall anyways.


I agree.


Am I misjudging this? Is this a popular idea? We could always implement it as
a fall-back implementation for mdprefetch() where posix_fadvise (and libaio
assuming we implement that as well) don't work. It has the advantage of
working with any system at all even if it predates 1003.1.


Yeah, if we want to support prefetching on systems that don't have 
posix_fadvise(), that would be the way to do it. posix_fadvise() is a 
good API. We can support multiple platform-specific implementations of 
it if there's interest, using posix_fadvise(), aio to a dummy buffer, 
whatever functions there are on Windows, or background reader processes or 
threads.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] VACUUMs and WAL

2008-10-28 Thread Hannu Krosing
On Tue, 2008-10-28 at 14:28 +0200, Heikki Linnakangas wrote:
 Hannu Krosing wrote:
  On Tue, 2008-10-28 at 10:10 +, Simon Riggs wrote:
  On Tue, 2008-10-28 at 11:45 +0200, Hannu Krosing wrote:
  On Tue, 2008-10-28 at 08:49 +, Simon Riggs wrote:
  Looking at a VACUUM's WAL records makes me think twice about the way we
  issue a VACUUM.
 
  1. First we scan the heap, issuing a HEAP2 clean record for every block
  that needs cleaning.
  IIRC the first heap pass just collects info and does nothing else. 
  Is this just an empty/do-nothing WAL record ?
  8.3 changed that; it used to work that way. I guess I never looked at
  the amount of WAL being generated.
  
  I can't see how it is safe to do anything more than just lookups on
  first pass. 
 
 What's done in the first pass is the same HOT pruning that is done 
 opportunistically on other page accesses as well. IIRC it's required for 
 correctness, though I can't remember what exactly the issue was.

Are you sure it is a correctness thing ? Maybe HOT pruning just happened
to be in a path used by vacuum to read pages.

 I don't think the extra WAL volume is a problem; 

Probably not (unless you need to ship your WAL records via a very
expensive network connection).

If it is a simple performance problem, then it can probably be fixed by
just running VACUUM slower.

 VACUUM doesn't generate 
 much WAL, anyway. As for the extra data page writes it causes; yeah, 
 that might cause some I/O that could be avoided, but remember that the 
 first pass often dirties buffers anyway to set hint bits.

Still, can't we special-case HOT pruning and hint-bit change WAL-logging
for the first pass of vacuum? They both seem redundant in the case of
VACUUM.

---
Hannu 





Re: [HACKERS] Window Functions: v07 APIs and buffering strategies

2008-10-28 Thread Tom Lane
Hitoshi Harada [EMAIL PROTECTED] writes:
 OK, I'll try to remove it. I'm not used to bison so my first task is
 to find where the conflict is...

Use bison -v to get details of where the conflict is.  I find that
the most common way to fix things is to postpone where the parser
has to make a decision, which usually ends up meaning a slightly
more verbose set of productions --- for instance instead of

foo: bar opt_baz

opt_baz: baz | /*EMPTY*/

you might have to do

foo: bar baz | bar

(which is actually shorter in this pseudo-example, but wouldn't be
if bar were complicated or foo had a lot of alternatives already).
The reason this can fix it is that the parser doesn't have to commit
to which case applies until after it's read the tokens for baz.
In the first form, it has to decide whether baz is there or not
without looking any further ahead than the first token of baz.

regards, tom lane



Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Heikki Linnakangas

Simon Riggs wrote:

On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
One option would be to just ignore that problem for now, and not 
WAL-log.


Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby? 


But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say page is now all visible, without too much work.


Hmm. Even if a tuple is visible to everyone on the master, it's not 
necessarily yet visible to all the read-only transactions in the slave.



Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update VM. Separating the
two concepts should allow the visibility check speed gain to more
generally available. 


Yes, that should be possible in theory. There's no version of 
ConditionalLockBuffer() for conditionally upgrading a shared lock to 
exclusive, but one could be written. I'm not sure if it would be safe to 
set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, though. If 
it is, then we could do just that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 ... I'm not sure if it would 
 be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, 
 though. If it is, then we could do just that.

Seems like it must be safe.  If you have shared lock on a page then no
one else could be modifying the page in a way that would falsify
PD_ALL_VISIBLE.  You might have several processes concurrently try to
set the bit but that is safe (same situation as for hint bits).

The harder part is propagating the bit to the visibility map, but I
gather you intend to only allow VACUUM to do that?

regards, tom lane



Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 I suspect it doesn't help you as much as you think.  It's always been
 the case that SRFs in FROM-items are fed through a tuplestore, and so
 are plpgsql SRF results.  

 I always thought we considered that a bug though. It sure would be nice if we
 could generate results as needed instead of having to generate them in advance
 and store all of them.

I suppose, but short of a fundamental rethink of how PL functions work
that's not going to happen.  There's also the whole issue of when do
side-effects happen (such as before/after statement triggers).

 In particular I fear there are a lot of places that use functions where we
 might expect them to use views. They're never going to get really good plans
 but it would be nice if we could at least avoid the extra materialize steps.

Agreed, but I think the fundamental solution there, for simple-select
functions, is inlining.

 Now your patch isn't affecting that one way or the other but does it rule it
 out forever?

I think the PL side of the problem is the hard part --- if we knew how
to solve these issues for plpgsql then SQL functions would surely be
easy.

regards, tom lane



Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
 So I'm concluding that we can easily afford to switch to
 tuplestore-always operation, especially if we are willing to put any
 effort into tuplestore optimization.  (I note that the current
 tuplestore code writes 24 bytes per row for this example, which is a
 shade on the high side for only 4 bytes payload.  It looks like it
 would be pretty easy to knock 10 bytes off that for a 40% savings in
 I/O volume.)

 That seems like an important, possibly more important, change.

Yeah, seeing that both WITH and window functions will be stressing
tuplestore performance, anything we can save there is probably worth the
trouble.

regards, tom lane



Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Kenneth Marshall
On Tue, Oct 28, 2008 at 09:28:38AM -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
  So I'm concluding that we can easily afford to switch to
  tuplestore-always operation, especially if we are willing to put any
  effort into tuplestore optimization.  (I note that the current
  tuplestore code writes 24 bytes per row for this example, which is a
  shade on the high side for only 4 bytes payload.  It looks like it
  would be pretty easy to knock 10 bytes off that for a 40% savings in
  I/O volume.)
 
  That seems like an important, possibly more important, change.
 
 Yeah, seeing that both WITH and window functions will be stressing
 tuplestore performance, anything we can save there is probably worth the
 trouble.
 
   regards, tom lane
 
The pre-sort for index builds would also benefit from this change.

Ken



Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Heikki Linnakangas

Tom Lane wrote:

The harder part is propagating the bit to the visibility map, but I
gather you intend to only allow VACUUM to do that?


Yep.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

To modify a page:
If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared 
first. The heap page is kept pinned, but not locked, while the 
visibility map is updated. We want to avoid holding a lock across I/O, 
even though the visibility map is likely to stay in cache. After the 
visibility map has been updated, the page is exclusively locked and 
modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing 
the lock.


So after having determined that you will modify a page, you release the
ex lock on the buffer and then try to regain it later?  Seems like a
really bad idea from here.  What if it's no longer possible to do the
modification you intended?


In case of insert/update, you have to find a new target page. I put the 
logic in RelationGetBufferForTuple(). In case of delete and update (old 
page), the flag is checked and the bit cleared just after pinning the 
buffer, before doing anything else. (I note that that's not actually 
what the patch is doing for heap_update; will fix.)


If we give up on the strict requirement that the bit in the visibility 
map has to be cleared if the PD_ALL_VISIBLE flag on the page is not set, 
then we could just update the visibility map after releasing the locks 
on the heap pages. I think I'll do that for now, for simplicity.


To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the 
page, while you observe that all tuples on the page are visible to everyone.


That doesn't sound too good from a concurrency standpoint...


Well, no, but it's only done in VACUUM. And pruning. I implemented it as 
a new loop that calls HeapTupleSatisfiesVacuum on each tuple, checking 
that xmin is old enough for live tuples, but come to think of 
it, we're already calling HeapTupleSatisfiesVacuum for every tuple on 
the page during VACUUM, so it should be possible to piggyback on that by 
restructuring the code.


That's how the patch works right now. However, there's a small 
performance problem with the current approach: setting the 
PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:


I'm more concerned about *clearing* the bit being WAL-logged.  That's
necessary for correctness.


Yes, clearing the PD_ALL_VISIBLE flag always needs to be WAL-logged. 
There's a new boolean field in xl_heap_insert/update/delete records 
indicating if the operation cleared the flag. On replay, if the flag was 
cleared, the bit in the visibility map is also cleared.
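
For concreteness, the delete record would then look roughly like this (the
field name is illustrative; see the patch for the real definitions):

typedef struct xl_heap_delete
{
	xl_heaptid	target;					/* deleted tuple id */
	bool		all_visible_cleared;	/* was PD_ALL_VISIBLE cleared? */
} xl_heap_delete;

On replay, heap_xlog_delete() would test all_visible_cleared and, if it's
set, clear the page's bit in the visibility map before applying the change.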



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] Optimizing tuplestore usage for SRFs

2008-10-28 Thread Tom Lane
I wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 On Sun, 2008-10-26 at 21:49 -0400, Tom Lane wrote:
 ... effort into tuplestore optimization.  (I note that the current
 tuplestore code writes 24 bytes per row for this example, which is a
 shade on the high side for only 4 bytes payload.  It looks like it
 would be pretty easy to knock 10 bytes off that for a 40% savings in
 I/O volume.)

 That seems like an important, possibly more important, change.

 Yeah, seeing that both WITH and window functions will be stressing
 tuplestore performance, anything we can save there is probably worth the
 trouble.

Six of the ten bytes I was speaking of are alignment padding, which
can be removed with some relatively simple hacking in tuplestore.c
(I think the same might apply in tuplesort.c BTW).  The other place I
was looking at is that currently, a SRF's result tuplestore is always
built with randomAccess = true, which causes the tuplestore to write
a trailing length word on each tuple, to support the possibility of
being asked to read backwards.  Of course, 99% of the time it never
will be asked to do so.  So we ought to try to suppress that overhead.

I can see two ways we might do this:

1. Stop treating nodeFunctionscan.c as supporting backwards scan.
This would be an extremely localized change (about two lines ;-))
but the downside is that calling a function scan in a SCROLL CURSOR
would result in an extra Materialize node getting stuck in front
of the Functionscan node to support backwards scanning.

2. Propagate the Functionscan node's EXEC_FLAG_BACKWARD flag bit to
the called function.  I'd be inclined to do this by adding it as an
additional bit in the rsinfo->allowedModes field.

In either case we would need cooperation from called SRFs to get
the optimization to happen --- the tuplestore_begin_heap calls are
not consolidated into one place (which maybe was a mistake) and
they're all passing constant TRUE for randomAccess.  That coding
is safe but would need to change to reflect whichever policy we
adopt.  (Note: one advantage of approach #1 is that if anyone is
mistakenly passing randomAccess = FALSE, it would become safe,
which it isn't now.)
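
Under approach #2, each SRF's setup code would change along these lines
(SFRM_Materialize_Random is a placeholder name for the proposed
allowedModes bit):

bool	randomAccess;

/* only pay for backward-scan support if the executor may ask for it */
randomAccess = (rsinfo->allowedModes & SFRM_Materialize_Random) != 0;

tupstore = tuplestore_begin_heap(randomAccess,
								 false,		/* interXact */
								 work_mem);	/* max memory, in KB */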

I'm of mixed mind about which way to go.  I'm not sure that optimizing
scroll cursors on functions is something worth worrying about.  However,
if we do #1 now then we are probably locked into that approach and
could not change to #2 in the future --- we'd have to worry about
breaking third-party SRFs that are passing randomAccess = FALSE.

Comments?

regards, tom lane



[HACKERS] Updating FSM on recovery

2008-10-28 Thread Heikki Linnakangas
The one remaining issue I'd like to address in the new FSM 
implementation is the fact that the FSM is currently not updated at all 
in WAL recovery. The old FSM wasn't updated on WAL recovery either, and 
was in fact completely thrown away if the system wasn't shut down 
cleanly. The difference is that after recovery, we used to start with no 
FSM information at all, and all inserts would have to extend the 
relations until the next vacuum, while now the inserts use the old data 
in the FSM. In case of a PITR recovery or warm standby, the FSM
information would come from the last base backup, which could be *very* old.


The first inserter after the recovery might have to visit a lot of pages 
that the FSM claimed had free space, but didn't in reality, before 
finding a suitable target. In the absolutely worst case, where the table 
was almost empty when the base backup was taken, but is now full, it 
might have to visit every single heap page. That's not good.


So we should try to update the FSM during recovery as well. It doesn't 
need to be very accurate, as the FSM information isn't accurate anyway, 
but we should try to avoid the worst case scenarios.


The attached patch is my first attempt at that. Arbitrarily, if after a 
heap insert/update there's less than 20% of free space on the page, the 
FSM is updated. Compared to updating it every time, that saves a lot of 
overhead, while doing a pretty good job at marking full pages as full in 
the FSM. My first thought was to update the FSM if there isn't enough 
room on the page for a new tuple of the same size as the one just 
inserted; that would be pretty close to the logic we have during normal 
operation, where the FSM is updated when the tuple that we're about to 
insert doesn't fit on the page. But because we don't know the fillfactor 
during recovery, I don't think we can do that reliably.


One issue with this patch is that it doesn't update the FSM at all when 
pages are restored from full page images. It would require fetching the 
page and checking the free space on it, or peeking into the size of the 
backup block data, and I'm not sure if it's worth the extra code to do that.


Thoughts?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1d43b0b..a9bc17a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -54,6 +54,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
+#include "storage/freespace.h"
 #include "storage/lmgr.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
@@ -4029,6 +4030,7 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
 	int			nredirected;
 	int			ndead;
 	int			nunused;
+	Size		freespace;
 
	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;
@@ -4068,6 +4070,15 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
+
+	/*
+	 * update the FSM as well
+	 *
+	 * XXX: We don't get here if the page was restored from full page image
+	 */
+	freespace = PageGetHeapFreeSpace(page);
+	XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+
 	UnlockReleaseBuffer(buffer);
 }
 
@@ -4212,6 +4223,7 @@ heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
 	HeapTupleHeader htup;
 	xl_heap_header xlhdr;
 	uint32		newlen;
+	Size		freespace;
 
	if (record->xl_info & XLR_BKP_BLOCK_1)
 		return;
@@ -4271,6 +4283,19 @@ heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
+
+	/*
+	 * If the page is running low on free space, update the FSM as well.
+	 * Pretty arbitrarily, our definition of low is less than 20%. We can't
+	 * do much better than that without knowing the fill-factor for the table.
+	 *
+	 * XXX: We don't get here if the page was restored from full page image
+	 */
+	freespace = PageGetHeapFreeSpace(page);
+	if (freespace < BLCKSZ / 5)
+		XLogRecordPageWithFreeSpace(xlrec->target.node,
+									BufferGetBlockNumber(buffer), freespace);
+
 	UnlockReleaseBuffer(buffer);
 }
 
@@ -4296,6 +4321,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
 	xl_heap_header xlhdr;
 	int			hsize;
 	uint32		newlen;
+	Size		freespace;
 
	if (record->xl_info & XLR_BKP_BLOCK_1)
 	{
@@ -4456,6 +4482,16 @@ newsame:;
 	PageSetLSN(page, lsn);
 	PageSetTLI(page, ThisTimeLineID);
 	MarkBufferDirty(buffer);
+
+	/*
+	 * If the page is running low on free space, update the FSM as well.
+	 * XXX: We don't get here if the page was restored from full page image
+	 */
+	freespace = PageGetHeapFreeSpace(page);
+	if (freespace < BLCKSZ / 5)
+		XLogRecordPageWithFreeSpace(xlrec->target.node,
+									BufferGetBlockNumber(buffer), freespace);
+
 	UnlockReleaseBuffer(buffer);
 }
 
diff --git a/src/backend/storage/freespace/freespace.c 

Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
  One option would be to just ignore that problem for now, and not 
  WAL-log.
  
  Probably worth skipping for now, since it will cause patch conflicts if
  you do. Are there any other interactions with Hot Standby? 
  
  But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
  to say page is now all visible, without too much work.
 
 Hmm. Even if a tuple is visible to everyone on the master, it's not 
 necessarily yet visible to all the read-only transactions in the slave.

Never a problem. No query can ever see the rows removed by a cleanup
record, enforced by the recovery system.

  Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
  the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
  to set the block flag (unlogged), but just not update VM. Separating the
  two concepts should allow the visibility check speed gain to more
  generally available. 
 
 Yes, that should be possible in theory. There's no version of 
 ConditionalLockBuffer() for conditionally upgrading a shared lock to 
 exclusive, but it should be possible in theory. I'm not sure if it would 
 be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, 
 though. If it is, then we could do just that.

To be honest, I'm more excited about your perf results for that than I
am about speeding up some VACUUMs.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Robert Haas
 I always thought we considered that a bug though. It sure would be nice if we
 could generate results as needed instead of having to generate them in
 advance and store all of them.
 I suppose, but short of a fundamental rethink of how PL functions work
 that's not going to happen.  There's also the whole issue of when do
 side-effects happen (such as before/after statement triggers).

For PL/pgsql, I think it might be possible to execute a function to
precisely the point where you have generated a sufficient number of
records.  In other words, when someone asks for a tuple, you start
executing the function until a tuple pops out, and then save the
execution context until someone asks for another.  Conceivably you can
push LIMIT and WHERE clauses down into any RETURN QUERY statements
executed, as well.  Maybe that qualifies as a fundamental rethink,
though, and we can worry about how to suppress the tuplestore in that
case when and if someone is prepared to implement it.  For other
procedural languages, you would need support from the executor for
that PL, which in most cases will probably be lacking.

thinks a little more

In fact, I suspect that you would gain a lot by optimizing
specifically for the case of a PL/pgsql function of the form: (1)
execute 0 or more statements that may or may not have side effects but
do not return any tuples, (2) execute exactly 1 RETURN QUERY
statement, and (3) implicit or explicit RETURN.  I suspect that's a
very common usage pattern, and it wouldn't require being able to save
the entire execution context at an arbitrary point.

(I agree that BEFORE/AFTER statement triggers are a problem here but
I'm not sure that they are an insoluble one, and I'd hate for that to
be the thing that kills this type of optimization.  Even if you
implemented a full-blown partial-execution model, it would be
reasonable to always run any particular INSERT/UPDATE/DELETE to
completion.  It's really SELECT that is the problem.)

 In particular I fear there are a lot of places that use functions where we
 might expect them to use views. They're never going to get really good plans
 but it would be nice if we could at least avoid the extra materialize steps.
 Agreed, but I think the fundamental solution there, for simple-select
 functions, is inlining.

+1.  Upthread passing LIMIT and OFFSET clauses into the SRF as
parameters was suggested, but that's really intractable for real-world
use where you are also applying WHERE clauses to the SRF results.
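
For example (my_srf is hypothetical):

SELECT * FROM my_srf() AS t(v) WHERE v % 7 = 0 LIMIT 10;

Pushing the LIMIT 10 into my_srf() as a parameter would be wrong here,
because the WHERE clause may discard an unknown fraction of the
function's output, so there is no way to know in advance how many rows
the SRF must produce.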

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Joshua D. Drake

Gregory Stark wrote:

Simon Riggs [EMAIL PROTECTED] writes:



And are people running raid arrays not likely to be
running modern OSes anyways?



No, and those that are can upgrade.

Joshua D. Drake


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 So we should try to update the FSM during recovery as well. It doesn't 
 need to be very accurate, as the FSM information isn't accurate anyway, 
 but we should try to avoid the worst case scenarios.

Agreed.

 One issue with this patch is that it doesn't update the FSM at all when 
 pages are restored from full page images. It would require fetching the 
 page and checking the free space on it, or peeking into the size of the 
 backup block data, and I'm not sure if it's worth the extra code to do that.

I'd vote not to bother, at least not in the first cut.  As you say, 100%
accuracy isn't required, and I think that in typical scenarios an
insert/update that causes a page to become full would be relatively less
likely to have a full-page image.

As far as the ugliness in XLogRecordPageWithFreeSpace goes: couldn't you
just call XLogReadBufferWithFork with init = true, and then initialize
the page if PageIsNew?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Dimitri Fontaine
Hi,

In the python language, functions that lazily return collections are called 
generators and use the yield keyword instead of return.
http://www.python.org/doc/2.5.2/tut/node11.html#SECTION0011100

Maybe having such a concept in PostgreSQL would allow the user to choose 
between current behavior (materializing) and lazy computing, with a new 
internal API to get done in the executor maybe.

CREATE FUNCTION mygenerator()
  returns setof integer
  language PLPGSQL
AS $f$
BEGIN
  FOR v_foo IN SELECT foo FROM table LOOP
YIELD my_expensive_function(v_foo);
  END LOOP;
  RETURN;
END;
$f$;

At the plain SQL level, we could expose this with a new function parameter, 
GENERATOR maybe?

CREATE FUNCTION my_generator_example(integer, integer)
  returns setof integer 
  generator
  language SQL
$f$
  SELECT generate_series($1, $2);
$f$;

Maybe we should prefer to add the GENERATOR (or LAZY, or whatever sounds good 
to a native English speaker) parameter to PL functions instead of providing 
YIELD, having RETURN behave as YIELD in this case.

Le mardi 28 octobre 2008, Tom Lane a écrit :
 I suppose, but short of a fundamental rethink of how PL functions work
 that's not going to happen.  There's also the whole issue of when do
 side-effects happen (such as before/after statement triggers).

Would it be possible to forbid generators when using in those cases?

 Agreed, but I think the fundamental solution there, for simple-select
 functions, is inlining.

Would it be possible to maintain current behavior with ROWS estimator for 
functions, even when inlining, as a way to trick the planner when you can't 
feed it good enough stats?

 I think the PL side of the problem is the hard part --- if we knew how
 to solve these issues for plpgsql then SQL functions would surely be
 easy.

What about this python idea of GENERATORS and the YIELD control for lazy 
evaluation of functions?
-- 
dim


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] WIP patch: convert SQL-language functions to return tuplestores

2008-10-28 Thread Pavel Stehule
2008/10/28 Dimitri Fontaine [EMAIL PROTECTED]:
 Hi,

 In the python language, functions that lazily return collections are called
 generators and use the yield keyword instead of return.
 http://www.python.org/doc/2.5.2/tut/node11.html#SECTION0011100

 Maybe having such a concept in PostgreSQL would allow the user to choose
 between current behavior (materializing) and lazy computing, with a new
 internal API to get done in the executor maybe.


Lazy computing is a good idea, but I am afraid it would be really
very hard to implement. You would have to store the current state
somewhere, stop execution, return back from the node, and then be able
to restore the PL state and continue processing. I can't see how to do
it without thread support.

Regards
Pavel Stehule


 CREATE FUNCTION mygenerator()
  returns setof integer
  language PLPGSQL
 AS $f$
 BEGIN
  FOR v_foo IN SELECT foo FROM table LOOP
YIELD my_expensive_function(v_foo);
  END LOOP;
  RETURN;
 END;
 $f$;

 At the plain SQL level, we could expose this with a new function parameter,
 GENERATOR maybe?

 CREATE FUNCTION my_generator_example(integer, integer)
  returns setof integer
  generator
  language SQL
 $f$
  SELECT generate_series($1, $2);
 $f$;

 Maybe we should prefer to add the GENERATOR (or LAZY, or whatever sounds good
 to a native English speaker) parameter to PL functions instead of
 providing YIELD, having RETURN behave as YIELD in this case.

 Le mardi 28 octobre 2008, Tom Lane a écrit :
 I suppose, but short of a fundamental rethink of how PL functions work
 that's not going to happen.  There's also the whole issue of when do
 side-effects happen (such as before/after statement triggers).

 Would it be possible to forbid generators when using in those cases?

 Agreed, but I think the fundamental solution there, for simple-select
 functions, is inlining.

 Would it be possible to maintain current behavior with ROWS estimator for
 functions, even when inlining, as a way to trick the planner when you can't
 feed it good enough stats?

 I think the PL side of the problem is the hard part --- if we knew how
 to solve these issues for plpgsql then SQL functions would surely be
 easy.

 What about this python idea of GENERATORS and the YIELD control for lazy
 evaluation of functions?
 --
 dim


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 15:35 +, Simon Riggs wrote:

 I wonder if there is merit in having an XLogInsertMulti() which inserts
 multiple records in a batch as a way of reducing WALInsertLock traffic.
 It might be possible to piggyback FSM records onto the main heap
 changes.

Or possibly an XLogInsertDeferred() which just queues up some work so
the next time we call XLogInsert() it will insert the deferred work as
well as the main work all in one lock cycle. It would only be usable for
low priority info like FSM stuff that isn't needed for recovery. Maybe
we could do that with hints also.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] FAQ_Solaris 1.28 to spanish

2008-10-28 Thread postgres Emanuel CALVO FRANCO
This file is to be added to pgsql-docs; it is the translation of the FAQ_Solaris.

I have new things to add, especially for the OpenSolaris platform specifics.
Should I write a new FAQ, or just concatenate the other OpenSolaris features
onto this one?

Regards,

-- 
  Emanuel Calvo Franco
Syscope Postgresql DBA
  BaPUG Member


FAQ_Solaris_spanish
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 12:34 +, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  I'm happy with the idea of a readahead process. I thought we were
  implementing a BackgroundReader process for other uses. Is that dead
  now?
 
 You and Bruce seem to keep resurrecting that idea. 

Actually just asking for status, not trying to change design.

 I've never liked it -- I
 always hated that in Oracle and thought it was a terrible kludge.

But now... If you have a better way, great, but that doesn't make
perfectly workable and fairly simple userspace solutions into kludges.
That's just an emotive description of your preference.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 16:22 +0200, Heikki Linnakangas wrote:

 Arbitrarily, if after a heap insert/update there's less than 20% of free
 space on the page, the FSM is updated. Compared to updating it every time,
 that saves a lot of overhead, while doing a pretty good job at marking
 full pages as full in the FSM. My first thought was to update the FSM if
 there isn't enough room on the page for a new tuple of the same size as
 the one just inserted; that would be pretty close to the logic we have
 during normal operation, where the FSM is updated when the tuple that
 we're about to insert doesn't fit on the page. But because we don't know
 the fillfactor during recovery, I don't think we can do that reliably.

With HOT, we tend to hover around the nearly-full state, so this seems
like it will trigger repeatedly.

Is it possible that we could put an extra field onto a heap_clean record
to show remaining space. We would use it only for VACUUMs, not HOT, just
as we do now.

Probably a good idea to make a list of use cases and say what we do in
each case, e.g. COPY, other bulk ops, HOT, etc.

I wonder if there is merit in having an XLogInsertMulti() which inserts
multiple records in a batch as a way of reducing WALInsertLock traffic.
It might be possible to piggyback FSM records onto the main heap
changes.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Heikki Linnakangas

Simon Riggs wrote:

On Tue, 2008-10-28 at 16:22 +0200, Heikki Linnakangas wrote:

Arbitrarily, if after a heap insert/update there's less than 20% of free
space on the page, the FSM is updated. Compared to updating it every time,
that saves a lot of overhead, while doing a pretty good job at marking full
pages as full in the FSM. My first thought was to update the FSM if there
isn't enough room on the page for a new tuple of the same size as the one
just inserted; that would be pretty close to the logic we have during
normal operation, where the FSM is updated when the tuple that we're about
to insert doesn't fit on the page. But because we don't know the fillfactor
during recovery, I don't think we can do that reliably.


With HOT, we tend to hover around the nearly-full state, so this seems
like it will trigger repeatedly.


Hmm, true. Perhaps we should skip updating the FSM on HOT updates. After 
recovery, the new HOT-updated tuples are prunable anyway, so for 
inserting a new tuple, the page is almost as good as it was before the 
HOT update.



Is it possible that we could put an extra field onto a heap_clean record
to show remaining space. We would use it only for VACUUMs, not HOT, just
as we do now.


Sure, we could do that. I'm more worried about killing the pages from 
the FSM that are full, though, than keeping track of pages with plenty 
of free space accurately.



I wonder if there is merit in having an XLogInsertMulti() which inserts
multiple records in a batch as a way of reducing WALInsertLock traffic.
It might be possible to piggyback FSM records onto the main heap
changes.


Umm, in the version that was finally committed, FSM doesn't generate any 
extra WAL records (except for truncation).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 On Tue, 2008-10-28 at 12:34 +, Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
 I've never liked it -- I
 always hated that in Oracle and thought it was a terrible kludge.

 But now... If you have a better way, great, but that doesn't make
 perfectly workable and fairly simple userspace solutions into kludges.
 That's just an emotive description of your preference.

I won't dispute that calling it a kludge is a personal emotional reaction.

I do think it's *far* from simple though. We would have to manage a variable
number of worker processes, which is a fair amount of code right off the bat.

But the real complexity I fear is in managing the work queue. We would have
yet another fixed size shared memory data structure to configure and a lot of
locking and memory contention on it. 

Keep in mind that potentially every backend process will be trying to enqueue
hundreds of blocks and you'll have dozens (or even hundreds) of worker
processes trying to dequeue blocks. Worst of all, you'll (hopefully) have
someone trying to sort the blocks too which would make it basically impossible
to do anything to minimize the contention.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL 
training!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Or possibly an XLogInsertDeferred() which just queues up some work so
 the next time we call XLogInsert() it will insert the deferred work as
 well as the main work all in one lock cycle. It would only be usable for
 low priority info like FSM stuff that isn't needed for recovery. Maybe
 we could do that with hints also.

If it isn't needed for recovery, why would we be logging it at all?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Hitoshi Harada
2008/10/28 Tom Lane [EMAIL PROTECTED]:
 Hitoshi Harada [EMAIL PROTECTED] writes:
 OK, I'll try to remove it. I'm not used to bison so my first task is
 to find where the conflict is...

 Use bison -v to get details of where the conflict is.  I find that
 the most common way to fix things is to postpone where the parser
 has to make a decision, which usually ends up meaning a slightly
 more verbose set of productions --- for instance instead of

foo: bar opt_baz

opt_baz: baz | /*EMPTY*/

 you might have to do

foo: bar baz | bar


Thanks for your advice. And I found that the problem is around the FRAME
clause. Now my question is:

Can ROWS be reserved_keyword?

In window specifications, we have

OVER (ORDER BY expr_list [(ROWS|RANGE) ... ])

and currently ROWS is not reserved so bison is confused with cases
of ROWS included in expr_list and in FRAME clause. Because there is
no delimiter between ORDER BY clause and FRAME (that is (ROWS |
RANGE)) clause, ROWS can be in expr_list as a_expr. I see ROWS has
been an unreserved keyword and that in many places in gram.y such cases
have been avoided by other hacking methods, but isn't something similar
possible in this case? If I missed something, let me know.
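
For example, with a hypothetical table t that has a column named "rows":

SELECT sum(x) OVER (ORDER BY rows ROWS UNBOUNDED PRECEDING) FROM t;

When the parser sees the first ROWS token after ORDER BY, it cannot tell
whether it is a column reference beginning the expression list or the
start of the frame clause.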

Regards,


-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Tom Lane
Hitoshi Harada [EMAIL PROTECTED] writes:
 Can ROWS be reserved_keyword?

 In window specifications, we have

 OVER (ORDER BY expr_list [(ROWS|RANGE) ... ])

 and currently ROWS is not reserved so bison is confused with cases
 of ROWS included in expr_list and in FRAME clause. Because there is
 no delimiter between ORDER BY clause and FRAME (that is (ROWS |
 RANGE)) clause, ROWS can be in expr_list as a_expr.

Right offhand, I don't see any alternative but to make both ROWS and
RANGE reserved.  It's pretty annoying since that might break existing
applications that have been using these as identifiers, but the SQL
committee seems to care little about that :-(
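
For instance, with a hypothetical existing application table:

SELECT rows FROM accounting;   -- "rows" as an ordinary column name

parses fine today, but would become a syntax error once ROWS is fully
reserved.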

BTW, finding this sort of problem is exactly why ignoring shift/reduce
conflicts is a bad idea.  You would've ended up with unexpected
behaviors given the wrong input.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 12:16 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Or possibly an XLogInsertDeferred() which just queues up some work so
  the next time we call XLogInsert() it will insert the deferred work as
  well as the main work all in one lock cycle. It would only be usable for
  low priority info like FSM stuff that isn't needed for recovery. Maybe
  we could do that with hints also.
 
 If it isn't needed for recovery, why would we be logging it at all?

You just agreed that the info didn't need to be very accurate. There are a
few things on the server that aren't needed for recovery, but it might
be useful if they were logged occasionally to give roughly correct
values.

Contention on WALInsertLock seems to be a problem, yet writing WAL to
disk is not a bottleneck. Deferring writing it slightly to allow things
to be batched might be one way of smoothing the impact of that type of
operation. That might be better than a heuristic method of reducing the
number of inserts.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Heikki Linnakangas

Simon Riggs wrote:

On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
One option would be to just ignore that problem for now, and not 
WAL-log.

Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby? 


But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say page is now all visible, without too much work.
Hmm. Even if a tuple is visible to everyone on the master, it's not 
necessarily yet visible to all the read-only transactions in the slave.


Never a problem. No query can ever see the rows removed by a cleanup
record, enforced by the recovery system.


Yes, but there's a problem with recently inserted tuples:

1. A query begins in the slave, taking a snapshot with xmax = 100. So 
the effects of anything more recent should not be seen.

2. Transaction 100 inserts a tuple in the master, and commits
3. A vacuum comes along. There's no other transactions running in the 
master. Vacuum sees that all tuples on the page, including the one just 
inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.

4. The change is replicated to the slave.
5. The query in the slave that began at step 1 looks at the page, sees 
that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
checks, and erroneously returns the inserted tuple.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Simon Riggs

On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:

 Lazy VACUUM only needs to visit pages that are '0' in the visibility 
 map. This allows partial vacuums, where we only need to scan those parts 
 of the table that need vacuuming, plus all indexes.

Just realised that this means we still have to visit each block of a
btree index with a cleanup lock.

That means the earlier idea of saying "I don't need a cleanup lock if the
page is not in memory" makes a lot more sense with a partial vacuum.

1. Scan all blocks in memory for the index (and so, don't do this unless
the index is larger than a certain % of shared buffers), 
2. Start reading in new blocks until you've removed the correct number
of tuples
3. Work through the rest of the blocks checking that they are either in
shared buffers and we can get a cleanup lock, or they aren't in shared
buffers and so nobody has them pinned.

If you step (2) intelligently with regard to index correlation you might
not need to do much I/O at all, if any.

(1) has a good hit ratio because mostly only active tables will be
vacuumed so are fairly likely to be in memory.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Heikki Linnakangas

Tom Lane wrote:

As far as the ugliness in XLogRecordPageWithFreeSpace goes: couldn't you
just call XLogReadBufferWithFork with init = true, and then initialize
the page if PageIsNew?


With init=true, we don't even try to read the page from the disk (since 
8.3), so all FSM pages accessed during recovery would be zeroed out. I 
don't think that's what you intended.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Hitoshi Harada
2008/10/29 Tom Lane [EMAIL PROTECTED]:
 Hitoshi Harada [EMAIL PROTECTED] writes:
 Can ROWS be reserved_keyword?

 In window specifications, we have

 OVER (ORDER BY expr_list [(ROWS|RANGE) ... ])

 and currently ROWS is not reserved so bison is confused with cases
 of ROWS included in expr_list and in FRAME clause. Because there is
 no delimiter between ORDER BY clause and FRAME (that is (ROWS |
 RANGE)) clause, ROWS can be in expr_list as a_expr.

 Right offhand, I don't see any alternative but to make both ROWS and
 RANGE reserved.  It's pretty annoying since that might break existing
 applications that have been using these as identifiers, but the SQL
 committee seems to care little about that :-(

 BTW, finding this sort of problem is exactly why ignoring shift/reduce
 conflicts is a bad idea.  You would've ended up with unexpected
 behaviors given the wrong input.


I see it now. This has been a good study for me, though it took much of my
time. Thanks anyway.

Regards,


-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 19:02 +0200, Heikki Linnakangas wrote:

 Yes, but there's a problem with recently inserted tuples:
 
 1. A query begins in the slave, taking a snapshot with xmax = 100. So 
 the effects of anything more recent should not be seen.
 2. Transaction 100 inserts a tuple in the master, and commits
 3. A vacuum comes along. There's no other transactions running in the 
 master. Vacuum sees that all tuples on the page, including the one just 
 inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
 4. The change is replicated to the slave.
 5. The query in the slave that began at step 1 looks at the page, sees 
 that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
 checks, and erroneously returns the inserted tuple.

Yep. I was thinking about FSM and row removal. So PD_ALL_VISIBLE must be
separately settable on the standby. That's another reason why it should be
possible to set it without a VACUUM, since there will never be one on the
standby.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Martijn van Oosterhout
On Tue, Oct 28, 2008 at 12:38:09PM -0400, Tom Lane wrote:
 Hitoshi Harada [EMAIL PROTECTED] writes:
  In window specifications, we have
 
  OVER (ORDER BY expr_list [(ROWS|RANGE) ... ])
 
  and currently ROWS is not reserved so bison is confused with cases
  of ROWS included in expr_list and in FRAME clause. Because there is
  no delimiter between ORDER BY clause and FRAME (that is (ROWS |
  RANGE)) clause, ROWS can be in expr_list as a_expr.
 
 Right offhand, I don't see any alternative but to make both ROWS and
 RANGE reserved.  It's pretty annoying since that might break existing
 applications that have been using these as identifiers, but the SQL
 committee seems to care little about that :-(

Given that the only problematic case is if expr_list ends with a
postfix operator, wouldn't it be sufficient to simply decree that in
that case you need parentheses? Seems a lot less painful than adding
two reserved words.

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] Updating FSM on recovery

2008-10-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 As far as the ugliness in XLogRecordPageWithFreeSpace goes: couldn't you
 just call XLogReadBufferWithFork with init = true, and then initialize
 the page if PageIsNew?

 With init=true, we don't even try to read the page from the disk (since 
 8.3), so all FSM pages accessed during recovery would be zeroed out. I 
 don't think that's what you intended.

Ah, right.  Maybe the API change you suggested in the comment is the
way to go.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Hannu Krosing
On Mon, 2008-10-27 at 13:42 -0700, Joshua Drake wrote:
 With the recent open sourcing of Replicator, the team has been trying
 to come up with ways to ensure an open development process. In that
 light we have decided to have our first release 1.9 meeting on
 Freenode. All people interested in participating in a discussion about
 the upcoming Replicator 1.9 are welcome to attend.

I missed the meeting at #replicator, but still have some questions

 * Is there a mailing list for Replicator?

  The current topics are:

 * New MCP architecture

What's new?

I have some doubts about the current architecture based on my reading of
the Replicator wiki, but would like to learn about the new architecture
before spending too much time studying the old one.

 * DDL Replication

Is it there already, or is it just a planned feature?

 * Release timeline
 * Questions

 * How hard would it be to extract the DDL replication part and use it as
   a basis for DDL after-trigger support, for use in trigger-based
   replication/auditing like Slony and pgQ/Londiste?

 * Special DDL triggers capturing the DDL statement source could also
   be an alternative to full DDL triggers.

 Replicator is set to be a short cycle release, hopefully landing before
 PostgreSQL 8.4. It will support PostgreSQL 8.3 and PostgreSQL 8.4 (when
 8.4 is available). We will be meeting in the #replicator channel at
 10:00 AM PDT...


--
Hannu



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes:
 On Tue, Oct 28, 2008 at 12:38:09PM -0400, Tom Lane wrote:
 Right offhand, I don't see any alternative but to make both ROWS and
 RANGE reserved.  It's pretty annoying since that might break existing
 applications that have been using these as identifiers, but the SQL
 committee seems to care little about that :-(

 Given that the only problematic case is if expr_list ends with a
 postfix operator, wouldn't it be sufficient to simply decree that in
 that case you need parentheses? Seems a lot less painful than adding
 two reserved words.

Hmm ... actually, it might be possible to fix it with a suitable
precedence declaration?  The trick is to make sure that in
... ORDER BY foo ! ROWS ...
the operator is taken as postfix not infix, which is the exact opposite
of what we did for AS-less column aliases (and, in fact, is the opposite
of what bison will do by default, IIRC).  So it might be possible to fix
by attaching some new precedence level to the ROWS token.

On the other hand, this would still mean that ROWS acts oddly
differently from any other column name, and in a way that would be hard
to explain or document.  So I'm really not sure that this is a better
solution than reserving it.

Still another trick in our toolbox is to use merged tokens to fix the
lack of lookahead.  If I read the spec correctly, the problematic cases
of ROWS and RANGE must be followed by UNBOUNDED or BETWEEN, so folding
these token pairs into four merged-tokens would get rid of the need to
reserve anything.  Merged tokens create their own little oddities too
though, as I was just complaining to Peter.  I wonder whether it would
be possible to improve on that problem by arranging some way for the
grammar to signal the lexer about whether lookahead is needed, and thus
not do the merging in contexts where it couldn't be appropriate.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
 Lazy VACUUM only needs to visit pages that are '0' in the visibility 
 map. This allows partial vacuums, where we only need to scan those parts 
 of the table that need vacuuming, plus all indexes.

 Just realised that this means we still have to visit each block of a
 btree index with a cleanup lock.

Yes, and your proposal cannot fix that.  Read "The Deletion Algorithm"
in nbtree/README, particularly the second paragraph.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Yes, but there's a problem with recently inserted tuples:

 1. A query begins in the slave, taking a snapshot with xmax = 100. So 
 the effects of anything more recent should not be seen.
 2. Transaction 100 inserts a tuple in the master, and commits
 3. A vacuum comes along. There's no other transactions running in the 
 master. Vacuum sees that all tuples on the page, including the one just 
 inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
 4. The change is replicated to the slave.
 5. The query in the slave that began at step 1 looks at the page, sees 
 that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
 checks, and erroneously returns the inserted tuple.

But this is exactly equivalent to the problem with recently deleted
tuples: vacuum on the master might take actions that are premature with
respect to the status on the slave.  Whatever solution we adopt for that
will work for this too.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Joshua Drake
On Tue, 28 Oct 2008 19:46:42 +0200
Hannu Krosing [EMAIL PROTECTED] wrote:

 On Mon, 2008-10-27 at 13:42 -0700, Joshua Drake wrote:
  With the recent open sourcing of Replicator, the team has been
  trying to come up with ways to ensure an open development process.
  In that light we have decided to have our first release 1.9 meeting
  on Freenode. All people interested in participating in a discussion
  about the upcoming Replicator 1.9 are welcome to attend.
 
 I missed the meeting at #replicator, but still have some questions

The meeting is still going on. I will answer your questions and post
results this afternoon.

Joshua D. Drake

-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Visibility map, partial vacuums

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 13:58 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
  Lazy VACUUM only needs to visit pages that are '0' in the visibility 
  map. This allows partial vacuums, where we only need to scan those parts 
  of the table that need vacuuming, plus all indexes.
 
  Just realised that this means we still have to visit each block of a
  btree index with a cleanup lock.
 
 Yes, and your proposal cannot fix that.  Read "The Deletion Algorithm"
 in nbtree/README, particularly the second paragraph.

Yes, understood. Please read the algorithm again. It does guarantee that
each block in the index has been checked to see that nobody is pinning
it; it just avoids performing I/O to prove that.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Martijn van Oosterhout
On Tue, Oct 28, 2008 at 01:50:26PM -0400, Tom Lane wrote:
  Given that the only problematic case is if expr_list ends with a
  postfix operator, wouldn't it be sufficient to simply decree that in
  that case you need parentheses? Seems a lot less painful than adding
  two reserved words.
 
 Hmm ... actually, it might be possible to fix it with a suitable
 precedence declaration?  The trick is to make sure that in
   ... ORDER BY foo ! ROWS ...
 the operator is taken as postfix not infix, which is the exact opposite
 of what we did for AS-less column aliases (and, in fact, is the opposite
 of what bison will do by default, IIRC).  So it might be possible to fix
 by attaching some new precedence level to the ROWS token.

Yes. Bison's default is to shift, which means that if you do nothing it
will treat ROWS as part of the expression if it makes any sense at all.
Given the requirement for a following UNBOUNDED or BETWEEN, the only
problem is that you'll get a syntax error if the expr_list ends in a
postfix operator; I don't see how you get hidden ambiguity.

Operator precedence is exactly the way to do this, since operator
precedence rules exist solely to resolve the shift/reduce conflicts.

You're right about the documentation though. I suppose you could put in
the documentation for the ROWS stuff something along the lines of: "If
the last expression of your ORDER BY ends in a postfix operator, you
must use parentheses." How many postfix operators are in common use in
ORDER BY expressions anyway?
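
For instance, with the built-in postfix factorial operator "!" and a
hypothetical table t:

SELECT sum(x) OVER (ORDER BY n ! ROWS UNBOUNDED PRECEDING) FROM t;   -- syntax error
SELECT sum(x) OVER (ORDER BY (n !) ROWS UNBOUNDED PRECEDING) FROM t; -- frame clause

With bison shifting by default, the first form tries to treat ROWS as the
right operand of an infix "!" and fails; the parentheses force the postfix
reading and let ROWS start the frame clause.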

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 00:35 +, Simon Riggs wrote:

 Comments please.

I'm a little unclear as to what is happening with this.

We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be
an option. It was my patch with the bug, so clearly my responsibility.
After some prototypes, I'm happy with the submitted v3 patch. 

Please could somebody commit this? Or at least say what they think needs
to happen, or what we're waiting for?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes:
 On Tue, Oct 28, 2008 at 01:50:26PM -0400, Tom Lane wrote:
 ... So it might be possible to fix
 by attaching some new precedence level to the ROWS token.

 Yes. Bison's default is to shift, which means that if you do nothing it
 will treat ROWS as part of the expression if it makes any sense at all.
 Given the requirement for a following UNBOUNDED or BETWEEN, the only
 problem is that you'll get a syntax error if the expr_list ends in a
 postfix operator, I don't see how you get hidden ambiguity.

Hmm, now I see what you meant; that's a little different than what I was
envisioning.  I was thinking of trying to force a parse decision that
would support the windowing syntax, whereas you propose forcing a
parse decision that does the opposite, and making the user parenthesize
if he's got a conflict.

What the choice seems to come down to is making ROWS and RANGE reserved
(in some form or other) versus creating a corner case for users of
postfix operators.  Phrased that way it does seem like the second
alternative is better.

Hitoshi: you can probably make this happen by including ROWS and RANGE
in the %nonassoc IDENT precedence declaration, but you'll want to test
to make sure the right things happen.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Feature Request - Table Definition query

2008-10-28 Thread Wilfried Schobeiri

Howdy!

It was suggested to me that this proposed feature is interesting  
enough to be floated around for discussion.

Don't get your hopes up, because the request is extremely simple:

It could be useful to have a command that returns the table definition  
(like pg_dump -st) from within the query interface. This could be  
particularly useful if one doesn't have access to or prefers not to  
manipulate results coming from a shell. If I have an API from which to  
query the database, it seems silly to have to spawn a shell and run a  
command to query the very same database.
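
For instance, the closest in-database approximation I know of is piecing
column information together from the system catalogs (mytable is
hypothetical):

SELECT a.attname,
       pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type,
       a.attnotnull
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = 'mytable'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;

and that only covers columns, not constraints, indexes, or defaults;
hence the request.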


(I'm not exactly familiar with how pg_dump manipulates or extracts  
that data, so this could be a very, very stupid question. If so,  
please forgive me.)


Regards,
Wilfried Schobeiri

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I'm a little unclear as to what is happening with this.
 
  We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be
  an option.
 
 Doing nothing until release is certainly not an option ;-).  But AFAICS
 this is not a showstopper for development so I was intending to wait
 until commitfest starts before doing anything with it.  Is it blocking
 some other progress for you?

No, since I know how to fix it. 

Just uncomfortable leaving known bugs in the tree, that's all.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] current head fails regression tests on mac osx

2008-10-28 Thread Grzegorz Jaskiewicz
I won't send the full logs here; if someone's interested, here's a
tar.bz2 version:

http://zlew.org/pg_regress.tar.bz2


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] current head fails regression tests on mac osx

2008-10-28 Thread Heikki Linnakangas

Grzegorz Jaskiewicz wrote:
I won't send the full logs here; if someone's interested, here's a
tar.bz2 version:

http://zlew.org/pg_regress.tar.bz2


This seems to be the root cause:


+ ERROR:  could not open file "/Users/gj/Projects/postgres-head/pgsql/src/test/regress/results/onek.data" for writing: Permission denied


The buildfarm members running Mac OS X seem to be working fine, so it 
seems there's something special in your environment that's causing that 
to fail. Is the results directory writable?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
 Doing nothing until release is certainly not an option ;-).  But AFAICS
 this is not a showstopper for development so I was intending to wait
 until commitfest starts before doing anything with it.

 Just uncomfortable leaving known bugs in the tree, that's all.

Feel free to stick something about it on the open items list
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
if you want to make doubly sure it's not forgotten about.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] usermap regexp support

2008-10-28 Thread Magnus Hagander
The attached patch tries to implement regexp support in the usermaps
(pg_ident.conf).

The use-case will be to do things like realm-based matches in
kerberos/GSSAPI authentication (will require some further hacks on that
one to expose the realm name). For example you could have:

krb /^(.*)@myrealm.com$ \1
krb /^(.*)@otherrealm.com   guest

and things like that.

It will also likely be quite useful once we get SSL certificate based
authentication.

Reviews particularly appreciated - I really can't claim to *know* our
regexp code :-O

Obviously docs need to be updated as well.

//Magnus

*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
***************
*** 27,32 ****
--- 27,33 ----
  
  #include "libpq/ip.h"
  #include "libpq/libpq.h"
+ #include "regex/regex.h"
  #include "storage/fd.h"
  #include "utils/flatfiles.h"
  #include "utils/guc.h"
*************** parse_ident_usermap(List *line, int line_number, const char *usermap_name,
*** 1325,1344 ****
  	token = lfirst(line_item);
  	file_pgrole = token;
  
  	/* Match? */
! 	if (case_insensitive)
  	{
! 		if (strcmp(file_map, usermap_name) == 0 &&
! 			pg_strcasecmp(file_pgrole, pg_role) == 0 &&
! 			pg_strcasecmp(file_ident_user, ident_user) == 0)
! 			*found_p = true;
  	}
  	else
  	{
! 		if (strcmp(file_map, usermap_name) == 0 &&
! 			strcmp(file_pgrole, pg_role) == 0 &&
! 			strcmp(file_ident_user, ident_user) == 0)
! 			*found_p = true;
  	}
  
  	return;
--- 1326,1453 ----
  	token = lfirst(line_item);
  	file_pgrole = token;
  
+ 	if (strcmp(file_map, usermap_name) != 0)
+ 		/* Line does not match the map name we're looking for, so just abort */
+ 		return;
+ 
  	/* Match? */
! 	if (file_ident_user[0] == '/')
  	{
! 		/*
! 		 * When the system username starts with a slash, treat it as a regular
! 		 * expression.  In this case, we process the system username as a
! 		 * regular expression that returns exactly one match.  This is
! 		 * substituted for \1 in the database username string, if present.
! 		 */
! 		int			r;
! 		regex_t		re;
! 		regmatch_t	matches[2];
! 		pg_wchar   *wstr;
! 		int			wlen;
! 		char	   *ofs;
! 		char	   *regexp_pgrole;
! 
! 		wstr = palloc((strlen(file_ident_user + 1) + 1) * sizeof(pg_wchar));
! 		wlen = pg_mb2wchar_with_len(file_ident_user + 1, wstr,
! 									strlen(file_ident_user + 1));
! 
! 		/*
! 		 * XXX: Major room for optimization: regexps could be compiled when
! 		 * the file is loaded and then re-used in every connection.
! 		 */
! 		r = pg_regcomp(&re, wstr, wlen, REG_ADVANCED);
! 		if (r)
! 		{
! 			char		errstr[100];
! 
! 			pg_regerror(r, &re, errstr, sizeof(errstr));
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 					 errmsg("invalid regular expression \"%s\": %s",
! 							file_ident_user + 1, errstr)));
! 
! 			pfree(wstr);
! 			*error_p = true;
! 			return;
! 		}
! 		pfree(wstr);
! 
! 		wstr = palloc((strlen(ident_user) + 1) * sizeof(pg_wchar));
! 		wlen = pg_mb2wchar_with_len(ident_user, wstr, strlen(ident_user));
! 
! 		r = pg_regexec(&re, wstr, wlen, 0, NULL, 2, matches, 0);
! 		if (r)
! 		{
! 			char		errstr[100];
! 
! 			if (r != REG_NOMATCH)
! 			{
! 				/* REG_NOMATCH is not an error, everything else is */
! 				pg_regerror(r, &re, errstr, sizeof(errstr));
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 						 errmsg("regular expression match for \"%s\" failed: %s",
! 								file_ident_user + 1, errstr)));
! 				*error_p = true;
! 			}
! 
! 			pfree(wstr);
! 			pg_regfree(&re);
! 			return;
! 		}
! 		pfree(wstr);
! 
! 		if ((ofs = strstr(file_pgrole, "\\1")) != NULL)
! 		{
! 			/* substitution of the first argument requested */
! 			if (matches[1].rm_so < 0)
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 						 errmsg("regular expression \"%s\" has no subexpressions as requested by backreference in \"%s\"",
! 								file_ident_user + 1, file_pgrole)));
! 			/*
! 			 * length: original length minus length of \1 plus length of
! 			 * match plus terminating zero byte
! 			 */
! 			regexp_pgrole = palloc0(strlen(file_pgrole) - 2 +
! 									(matches[1].rm_eo - matches[1].rm_so) + 1);
! 			strncpy(regexp_pgrole, file_pgrole, (ofs - file_pgrole));
! 			memcpy(regexp_pgrole + strlen(regexp_pgrole),
! 				   ident_user + matches[1].rm_so,
! 				   matches[1].rm_eo - matches[1].rm_so);
! 			strcat(regexp_pgrole, ofs + 2);
! 		}
! 		else
! 		{
! 			/* no substitution, so copy the match */
! 			regexp_pgrole = pstrdup(file_pgrole);
! 		}
! 
! 		pg_regfree(&re);
! 
! 		/*
! 		 * now check if the username actually matched what the user is trying
! 		 * to connect as
! 		 */
! 		if (case_insensitive)
! 		{
! 			if (pg_strcasecmp(regexp_pgrole, pg_role) == 0)
! 				*found_p = true;
! 		}
! 		else
! 		{
! 			if (strcmp(regexp_pgrole, pg_role) == 0)
! 				*found_p = true;
! 		}
! 		pfree(regexp_pgrole);
! 
! 		return;
  	}
  	else
  	{
! 		/* Not a regular expression, so make a complete match */
! 		if (case_insensitive)
! 		{
! 			if (pg_strcasecmp(file_pgrole, pg_role) == 0 &&
! 				pg_strcasecmp(file_ident_user, ident_user) == 0)
! 				*found_p = true;
! 		}
! 		else
! 		{
! 			if 

Re: [HACKERS] current head fails regression tests on mac osx

2008-10-28 Thread Grzegorz Jaskiewicz

Hmm, how could I miss that? But then:

thickbook:src gj$ ls -lahd /Users/gj/Projects/postgres-head/pgsql/src/test/regress/results/
drwxr-xr-x  119 gj  staff   4,0K 28 paź 19:14 /Users/gj/Projects/postgres-head/pgsql/src/test/regress/results/


thickbook:src gj$ touch /Users/gj/Projects/postgres-head/pgsql/src/test/regress/results/onek.data


thickbook:src gj$ ls -lah /Users/gj/Projects/postgres-head/pgsql/src/test/regress/results/

total 6728
drwxr-xr-x  120 gj  staff   4,0K 28 paź 20:03 .
drwxr-xr-x   29 gj  staff   986B 28 paź 19:14 ..
-rw-r--r--    1 gj  staff   4,9K 28 paź 19:14 abstime.out
-rw-r--r--    1 gj  staff   9,4K 28 paź 19:14 aggregates.out
-rw-r--r--    1 gj  staff    57K 28 paź 19:14 alter_table.out
-rw-r--r--    1 gj  staff    40K 28 paź 19:14 arrays.out
-rw-r--r--    1 gj  staff    16K 28 paź 19:14 bit.out
-rw-r--r--    1 gj  staff   1,4K 28 paź 19:14 bitmapops.out
-rw-r--r--    1 gj  staff   7,3K 28 paź 19:14 boolean.out
-rw-r--r--    1 gj  staff   4,8K 28 paź 19:14 box.out
-rw-r--r--    1 gj  staff   1,6K 28 paź 19:14 btree_index.out
-rw-r--r--    1 gj  staff   6,0K 28 paź 19:14 case.out
-rw-r--r--    1 gj  staff   2,1K 28 paź 19:14 char.out
-rw-r--r--    1 gj  staff   2,6K 28 paź 19:14 circle.out
-rw-r--r--    1 gj  staff    20K 28 paź 19:14 cluster.out
-rw-r--r--    1 gj  staff   3,6K 28 paź 19:14 combocid.out
-rw-r--r--    1 gj  staff   1,3K 28 paź 19:14 comments.out
-rw-r--r--    1 gj  staff    12K 28 paź 19:14 constraints.out
-rw-r--r--    1 gj  staff    14K 28 paź 19:14 conversion.out
-rw-r--r--    1 gj  staff   5,2K 28 paź 19:14 copy.out
-rw-r--r--    1 gj  staff   6,6K 28 paź 19:14 copy2.out
-rw-r--r--    1 gj  staff   2,6K 28 paź 19:14 copyselect.out
-rw-r--r--    1 gj  staff   1,4K 28 paź 19:14 create_aggregate.out
-rw-r--r--    1 gj  staff   3,2K 28 paź 19:14 create_function_1.out
-rw-r--r--    1 gj  staff   1,9K 28 paź 19:14 create_function_2.out
-rw-r--r--    1 gj  staff    46K 28 paź 19:14 create_index.out
-rw-r--r--    1 gj  staff   5,0K 28 paź 19:14 create_misc.out
-rw-r--r--    1 gj  staff   663B 28 paź 19:14 create_operator.out
-rw-r--r--    1 gj  staff   3,5K 28 paź 19:14 create_table.out
-rw-r--r--    1 gj  staff   3,4K 28 paź 19:14 create_type.out
-rw-r--r--    1 gj  staff   9,4K 28 paź 19:14 create_view.out
-rw-r--r--    1 gj  staff    22K 28 paź 19:14 date.out
-rw-r--r--    1 gj  staff   1,0K 28 paź 19:14 delete.out
-rw-r--r--    1 gj  staff   5,1K 28 paź 19:14 dependency.out
-rw-r--r--    1 gj  staff    17K 28 paź 19:14 domain.out
-rw-r--r--    1 gj  staff   3,1K 28 paź 19:14 drop_if_exists.out
-rw-r--r--    1 gj  staff   7,2K 28 paź 19:14 enum.out
-rw-r--r--    1 gj  staff    13K 28 paź 19:14 errors.out
-rw-r--r--    1 gj  staff   6,8K 28 paź 19:14 float4.out
-rw-r--r--    1 gj  staff    13K 28 paź 19:14 float8.out
-rw-r--r--    1 gj  staff    53K 28 paź 19:14 foreign_key.out
-rw-r--r--    1 gj  staff    20K 28 paź 19:14 geometry.out
-rw-r--r--    1 gj  staff    12K 28 paź 19:14 guc.out
-rw-r--r--    1 gj  staff   4,0K 28 paź 19:14 hash_index.out
-rw-r--r--    1 gj  staff   141K 28 paź 19:14 horology.out
-rw-r--r--    1 gj  staff    20K 28 paź 19:14 inet.out
-rw-r--r--    1 gj  staff    28K 28 paź 19:14 inherit.out
-rw-r--r--    1 gj  staff   2,2K 28 paź 19:14 insert.out
-rw-r--r--    1 gj  staff   6,1K 28 paź 19:14 int2.out
-rw-r--r--    1 gj  staff   8,0K 28 paź 19:14 int4.out
-rw-r--r--    1 gj  staff    28K 28 paź 19:14 int8.out
-rw-r--r--    1 gj  staff    15K 28 paź 19:14 interval.out
-rw-r--r--    1 gj  staff    69K 28 paź 19:14 join.out
-rw-r--r--    1 gj  staff   1,1M 28 paź 19:14 largeobject.out
-rw-r--r--    1 gj  staff   2,1K 28 paź 19:14 limit.out
-rw-r--r--    1 gj  staff   1,6K 28 paź 19:14 lseg.out
-rw-r--r--    1 gj  staff   3,1K 28 paź 19:14 macaddr.out
-rw-r--r--    1 gj  staff    15K 28 paź 19:14 misc.out
-rw-r--r--    1 gj  staff   2,8K 28 paź 19:14 money.out
-rw-r--r--    1 gj  staff   4,8K 28 paź 19:14 name.out
-rw-r--r--    1 gj  staff   1,3K 28 paź 19:14 namespace.out
-rw-r--r--    1 gj  staff    57K 28 paź 19:14 numeric.out
-rw-r--r--    1 gj  staff   3,4K 28 paź 19:14 numerology.out
-rw-r--r--    1 gj  staff   3,6K 28 paź 19:14 oid.out
-rw-r--r--    1 gj  staff    24K 28 paź 19:14 oidjoins.out
-rw-r--r--    1 gj  staff     0B 28 paź 20:03 onek.data
-rw-r--r--    1 gj  staff    41K 28 paź 19:14 opr_sanity.out
-rw-r--r--    1 gj  staff   2,2K 28 paź 19:14 path.out
-rw-r--r--    1 gj  staff   6,1K 28 paź 19:14 plancache.out
-rw-r--r--    1 gj  staff   111K 28 paź 19:14 plpgsql.out
-rw-r--r--    1 gj  staff   7,9K 28 paź 19:14 point.out
-rw-r--r--    1 gj  staff   4,5K 28 paź 19:14 polygon.out
-rw-r--r--    1 gj  staff    24K 28 paź 19:14 polymorphism.out
-rw-r--r--    1 gj  staff    91K 28 paź 19:14 portals.out
-rw-r--r--    1 gj  staff   5,3K 28 paź 19:14 portals_p2.out
-rw-r--r--    1 gj  staff    12K 28 paź 19:14 prepare.out
-rw-r--r--    1 gj  staff   4,1K 28 paź 19:14 prepared_xacts.out
-rw-r--r--    1 gj  staff    17K 28 

[HACKERS] don't use MAKE_PTR/OFFSET for shmem pointers

2008-10-28 Thread Kris Jurka


Since we require every process to map the shared memory region to the same 
address, we don't need the MAKE_PTR/OFFSET code that was needed when that
was not the case.  This patch makes shared memory pointers just like 
regular pointers.


http://archives.postgresql.org/pgsql-general/2007-08/msg01510.php

Kris Jurka

*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** typedef struct GlobalTransactionData
*** 122,128 ****
  typedef struct TwoPhaseStateData
  {
  	/* Head of linked list of free GlobalTransactionData structs */
! 	SHMEM_OFFSET freeGXacts;
  
  	/* Number of valid prepXacts entries. */
  	int			numPrepXacts;
--- 122,128 ----
  typedef struct TwoPhaseStateData
  {
  	/* Head of linked list of free GlobalTransactionData structs */
! 	void	   *freeGXacts;
  
  	/* Number of valid prepXacts entries. */
  	int			numPrepXacts;
*************** TwoPhaseShmemInit(void)
*** 184,190 ****
  		int			i;
  
  		Assert(!found);
! 		TwoPhaseState->freeGXacts = INVALID_OFFSET;
  		TwoPhaseState->numPrepXacts = 0;
  
  		/*
--- 184,190 ----
  		int			i;
  
  		Assert(!found);
! 		TwoPhaseState->freeGXacts = NULL;
  		TwoPhaseState->numPrepXacts = 0;
  
  		/*
*************** TwoPhaseShmemInit(void)
*** 197,203 ****
  		for (i = 0; i < max_prepared_xacts; i++)
  		{
  			gxacts[i].proc.links.next = TwoPhaseState->freeGXacts;
! 			TwoPhaseState->freeGXacts = MAKE_OFFSET(&gxacts[i]);
  		}
  	}
  	else
--- 197,203 ----
  		for (i = 0; i < max_prepared_xacts; i++)
  		{
  			gxacts[i].proc.links.next = TwoPhaseState->freeGXacts;
! 			TwoPhaseState->freeGXacts = &gxacts[i];
  		}
  	}
  	else
*************** MarkAsPreparing(TransactionId xid, const char *gid,
*** 243,249 ****
  			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
  			/* and put it back in the freelist */
  			gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 			TwoPhaseState->freeGXacts = MAKE_OFFSET(gxact);
  			/* Back up index count too, so we don't miss scanning one */
  			i--;
  		}
--- 243,249 ----
  			TwoPhaseState->prepXacts[i] = TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts];
  			/* and put it back in the freelist */
  			gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 			TwoPhaseState->freeGXacts = gxact;
  			/* Back up index count too, so we don't miss scanning one */
  			i--;
  		}
*************** MarkAsPreparing(TransactionId xid, const char *gid,
*** 263,275 ****
  	}
  
  	/* Get a free gxact from the freelist */
! 	if (TwoPhaseState->freeGXacts == INVALID_OFFSET)
  		ereport(ERROR,
  				(errcode(ERRCODE_OUT_OF_MEMORY),
  				 errmsg("maximum number of prepared transactions reached"),
  				 errhint("Increase max_prepared_transactions (currently %d).",
  						 max_prepared_xacts)));
! 	gxact = (GlobalTransaction) MAKE_PTR(TwoPhaseState->freeGXacts);
  	TwoPhaseState->freeGXacts = gxact->proc.links.next;
  
  	/* Initialize it */
--- 263,275 ----
  	}
  
  	/* Get a free gxact from the freelist */
! 	if (TwoPhaseState->freeGXacts == NULL)
  		ereport(ERROR,
  				(errcode(ERRCODE_OUT_OF_MEMORY),
  				 errmsg("maximum number of prepared transactions reached"),
  				 errhint("Increase max_prepared_transactions (currently %d).",
  						 max_prepared_xacts)));
! 	gxact = (GlobalTransaction) TwoPhaseState->freeGXacts;
  	TwoPhaseState->freeGXacts = gxact->proc.links.next;
  
  	/* Initialize it */
*************** RemoveGXact(GlobalTransaction gxact)
*** 452,458 ****
  
  	/* and put it back in the freelist */
  	gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 	TwoPhaseState->freeGXacts = MAKE_OFFSET(gxact);
  
  	LWLockRelease(TwoPhaseStateLock);
  
--- 452,458 ----
  
  	/* and put it back in the freelist */
  	gxact->proc.links.next = TwoPhaseState->freeGXacts;
! 	TwoPhaseState->freeGXacts = gxact;
  
  	LWLockRelease(TwoPhaseStateLock);
  

Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
  Doing nothing until release is certainly not an option ;-).  But AFAICS
  this is not a showstopper for development so I was intending to wait
  until commitfest starts before doing anything with it.
 
  Just uncomfortable leaving known bugs in the tree, that's all.
 
 Feel free to stick something about it on the open items list
 http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
 if you want to make doubly sure it's not forgotten about.

You're pulling my chain. 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Alvaro Herrera
Simon Riggs wrote:
 
 On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
   On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
   Doing nothing until release is certainly not an option ;-).  But AFAICS
   this is not a showstopper for development so I was intending to wait
   until commitfest starts before doing anything with it.
  
   Just uncomfortable leaving known bugs in the tree, that's all.
  
  Feel free to stick something about it on the open items list
  http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
  if you want to make doubly sure it's not forgotten about.
 
 You're pulling my chain. 

:-)

The new entry seems too vague to be useful ...

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Any reason to have heap_(de)formtuple?

2008-10-28 Thread Zdenek Kotala

Kris Jurka wrote:



On Thu, 23 Oct 2008, Kris Jurka wrote:

The problem with trying to deprecate it is that the vast majority of 
the backend is still using the old interfaces, so people looking for 
inspiration for their external modules will likely end up using the 
old interface.  Like Alvaro I started this conversion a while ago, got 
bored, and forgot about it. If people do want this conversion done 
while keeping the old interface around, I can track down that patch, 
update it and finish it up for the next CommitFest.




Here's a patch that changes everything over to the the new API and 
implements the old API by calling the new API.


It looks OK to me. I have only one comment: I would prefer to pfree the memory
allocated for the temporary nulls array, since a caller could call the old API
many times without a memory context cleanup.



Zdenek



--
Zdenek Kotala  Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Devrim GÜNDÜZ
On Tue, 2008-10-28 at 19:46 +0200, Hannu Krosing wrote:
 
  * Is there a mailing list for replicator ?

https://lists.commandprompt.com/mailman/listinfo/replicator-general

Regards,
-- 
Devrim GÜNDÜZ, RHCE
devrim~gunduz.org, devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
   http://www.gunduz.org


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 17:36 -0300, Alvaro Herrera wrote:
 Simon Riggs wrote:
  
  On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote:
   Simon Riggs [EMAIL PROTECTED] writes:
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
Doing nothing until release is certainly not an option ;-).  But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it.
   
Just uncomfortable leaving known bugs in the tree, that's all.
   
   Feel free to stick something about it on the open items list
   http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
   if you want to make doubly sure it's not forgotten about.
  
  You're pulling my chain. 
 
 :-)
 
 The new entry seems too vague to be useful ...

I'll correct it before release. ;-)

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] contrib/pg_stat_statements

2008-10-28 Thread Martin Pihlak
ITAGAKI Takahiro wrote:
 It might be possible to add a queryText field into QueryDesc
 and all of the codes using QueryDesc initialize the field with
 their own query texts, but it requires modifications in many
 different modules and copying query text might be needed.
 I don't want to do it only for this contrib module,
 so I'll support only top statements at the moment. Sorry.
 

Indeed, this turned out to be more difficult than I initially thought.
Nevertheless I still think that tracking nested statements is worth the
effort. Attached is a patch that adds sourceText to QueryDesc (should
apply cleanly after auto_explain.patch). This is very WIP, for one thing
it assumes that the source text pointers can be passed around freely.
But is the idea of extending QueryDesc generally acceptable? Is it OK
to make a copy of the query string?
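
To make the question concrete, the idea boils down to something like this
inside CreateQueryDesc (a sketch only; the field and argument names follow
my patch, not any final API):

	/* Let the QueryDesc own a private copy of the text, so the pointer
	 * stays valid for the executor's whole lifetime even if the caller's
	 * copy goes away (e.g. a cached plan getting invalidated). */
	qd->sourceText = sourceText ? pstrdup(sourceText) : NULL;

That trades one palloc/copy per query for not having to reason about the
lifetime of every caller's string.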

I tested with modified pg_stat_statements (removed toplevel checks),
and much to my delight it didn't crash immediately :) I've only done
some limited testing but a lot of interesting stuff seems to turn up --
for instance we get to see the lookups generated by referential integrity
checks (could help to identify missing indexes on FK fields).

Regards,
Martin

*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** DoCopy(const CopyStmt *stmt, const char *queryString)
*** 1056,1062 ****
  		/* Create a QueryDesc requesting no output */
  		cstate->queryDesc = CreateQueryDesc(plan, GetActiveSnapshot(),
  											InvalidSnapshot,
! 											dest, NULL, false);
  
  		/*
  		 * Call ExecutorStart to prepare the plan for execution.
--- 1056,1062 ----
  		/* Create a QueryDesc requesting no output */
  		cstate->queryDesc = CreateQueryDesc(plan, GetActiveSnapshot(),
  											InvalidSnapshot,
! 											dest, NULL, false, queryString);
  
  		/*
  		 * Call ExecutorStart to prepare the plan for execution.
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
*************** ExplainOneQuery(Query *query, ExplainStmt *stmt, const char *queryString,
*** 172,178 ****
  		plan = pg_plan_query(query, 0, params);
  
  		/* run it (if needed) and produce output */
! 		ExplainOnePlan(plan, params, stmt, tstate);
  	}
  }
  
--- 172,178 ----
  		plan = pg_plan_query(query, 0, params);
  
  		/* run it (if needed) and produce output */
! 		ExplainOnePlan(plan, params, stmt, tstate, queryString);
  	}
  }
  
*************** ExplainOneUtility(Node *utilityStmt, ExplainStmt *stmt,
*** 219,225 ****
   */
  void
  ExplainOnePlan(PlannedStmt *plannedstmt, ParamListInfo params,
! 			   ExplainStmt *stmt, TupOutputState *tstate)
  {
  	QueryDesc  *queryDesc;
  	instr_time	starttime;
--- 219,226 ----
   */
  void
  ExplainOnePlan(PlannedStmt *plannedstmt, ParamListInfo params,
! 			   ExplainStmt *stmt, TupOutputState *tstate,
! 			   const char *queryString)
  {
  	QueryDesc  *queryDesc;
  	instr_time	starttime;
*************** ExplainOnePlan(PlannedStmt *plannedstmt, ParamListInfo params,
*** 237,243 ****
  	queryDesc = CreateQueryDesc(plannedstmt,
  								GetActiveSnapshot(), InvalidSnapshot,
  								None_Receiver, params,
! 								stmt->analyze);
  
  	INSTR_TIME_SET_CURRENT(starttime);
  
--- 238,244 ----
  	queryDesc = CreateQueryDesc(plannedstmt,
  								GetActiveSnapshot(), InvalidSnapshot,
  								None_Receiver, params,
! 								stmt->analyze, queryString);
  
  	INSTR_TIME_SET_CURRENT(starttime);
  
*** a/src/backend/commands/prepare.c
--- b/src/backend/commands/prepare.c
*************** ExplainExecuteQuery(ExecuteStmt *execstmt, ExplainStmt *stmt,
*** 697,703 ****
  				pstmt->intoClause = execstmt->into;
  			}
  
! 			ExplainOnePlan(pstmt, paramLI, stmt, tstate);
  		}
  		else
  		{
--- 697,703 ----
  				pstmt->intoClause = execstmt->into;
  			}
  
! 			ExplainOnePlan(pstmt, paramLI, stmt, tstate, queryString);
  		}
  		else
  		{
*** a/src/backend/executor/functions.c
--- b/src/backend/executor/functions.c
*************** postquel_start(execution_state *es, SQLFunctionCachePtr fcache)
*** 309,320 ****
  		es->qd = CreateQueryDesc((PlannedStmt *) es->stmt,
  								 snapshot, InvalidSnapshot,
  								 None_Receiver,
! 								 fcache->paramLI, false);
  	else
  		es->qd = CreateUtilityQueryDesc(es->stmt,
  										snapshot,
  										None_Receiver,
! 										fcache->paramLI);
  
  	/* We assume we don't need to set up ActiveSnapshot for ExecutorStart */
  
--- 309,320 ----
  		es->qd = CreateQueryDesc((PlannedStmt *) es->stmt,
  								 snapshot, InvalidSnapshot,
  								 None_Receiver,
! 								 fcache->paramLI, false, fcache->src);
  	else
  		es->qd = CreateUtilityQueryDesc(es->stmt,
  										snapshot,
  										None_Receiver,
! 										fcache->paramLI, fcache->src);
  
  	/* We assume we don't need to set up ActiveSnapshot for ExecutorStart */
  
*** a/src/backend/executor/spi.c
--- b/src/backend/executor/spi.c
*************** _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
*** 1795,1801 ****
  

Re: [HACKERS] Any reason to have heap_(de)formtuple?

2008-10-28 Thread Kris Jurka



On Tue, 28 Oct 2008, Zdenek Kotala wrote:


Kris Jurka wrote:


Here's a patch that changes everything over to the the new API and 
implements the old API by calling the new API.


It looks OK to me. I have only one comment: I would prefer to pfree the memory
allocated for the temporary nulls array, since a caller could call the old API
many times without a memory context cleanup.




Here's an incremental patch to add the suggested pfreeing.

Kris Jurka

*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
*************** heap_formtuple(TupleDesc tupleDescriptor,
*** 801,806 ****
--- 801,808 ----
  
  	tuple = heap_form_tuple(tupleDescriptor, values, boolNulls);
  
+ 	pfree(boolNulls);
+ 
  	return tuple;
  }
  
*************** heap_modifytuple(HeapTuple tuple,
*** 894,899 ****
--- 896,902 ----
  {
  	int			numberOfAttributes = tupleDesc->natts;
  	int			attnum;
+ 	HeapTuple	result;
  	bool	   *replNulls = (bool *) palloc(numberOfAttributes * sizeof(bool));
  	bool	   *replActions = (bool *) palloc(numberOfAttributes * sizeof(bool));
  
*************** heap_modifytuple(HeapTuple tuple,
*** 903,909 ****
  		replActions[attnum] = replCharActions[attnum] == 'r';
  	}
  
! 	return heap_modify_tuple(tuple, tupleDesc, replValues, replNulls, replActions);
  }
  
  /*
--- 906,917 ----
  		replActions[attnum] = replCharActions[attnum] == 'r';
  	}
  
! 	result = heap_modify_tuple(tuple, tupleDesc, replValues, replNulls, replActions);
! 
! 	pfree(replNulls);
! 	pfree(replActions);
! 
! 	return result;
  }
  
  /*
*************** heap_deformtuple(HeapTuple tuple,
*** 1051,1056 ****
--- 1059,1066 ----
  		charNulls[attnum] = boolNulls[attnum] ? 'n' : ' ';
  	}
  
+ 	pfree(boolNulls);
+ 
  }
  
  /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Hannu Krosing
On Tue, 2008-10-28 at 11:01 -0700, Joshua Drake wrote:
 On Tue, 28 Oct 2008 19:46:42 +0200
 Hannu Krosing [EMAIL PROTECTED] wrote:
 
  On Mon, 2008-10-27 at 13:42 -0700, Joshua Drake wrote:
   With the recent open sourcing of Replicator, the team has been
   trying to come up with ways to ensure an open development process.
   In that light we have decided to have our first release 1.9 meeting
   on Freenode. All people interested in participating in a discussion
   about the upcoming Replicator 1.9 are welcome to attend.
  
  I missed the meeting at #replicator, but still have some questions
 
 Meeting is still going on. 

Heh. Seems I underestimated the time difference :)

 I will answer your questions and post results this afternoon.

Thanks!


Hannu


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Bruce Momjian
Gregory Stark wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 
  I'm happy with the idea of a readahead process. I thought we were
  implementing a BackgroundReader process for other uses. Is that dead
  now?
 
 You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
 always hated that in Oracle and thought it was a terrible kludge.

I didn't think I was promoting the separate reader process after you had
the posix_fadvise() idea.

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Rework subtransaction commit protocol for hot standby.

2008-10-28 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 I'm a little unclear as to what is happening with this.

 We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be
 an option.

Doing nothing until release is certainly not an option ;-).  But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it.  Is it blocking
some other progress for you?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Simon Riggs

On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
 Gregory Stark wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
  
   I'm happy with the idea of a readahead process. I thought we were
   implementing a BackgroundReader process for other uses. Is that dead
   now?
  
  You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
  always hated that in Oracle and thought it was a terrible kludge.
 
 I didn't think I was promoting the separate reader process after you had
 the posix_fadvise() idea.

I think Greg is misinterpreting our occasional lack of exactness as
disagreement. The end solution is the goal, not any of the discussed
mechanisms. It's always good to have a name for it that sums up the
goals rather than the methods, e.g. "frequent update optimisation" rather
than "update-in-place".

It would be good if the solutions for normal running and recovery were
similar. Greg, please could you look into that?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL/MED compatible connection manager

2008-10-28 Thread Martin Pihlak
Chris Browne wrote:

 Slony-I does some vaguely similar stuff in its handling of connection 
 paths; here's the schema:
 
 create table @[EMAIL PROTECTED] (
   pa_server   int4,
   pa_client   int4,
   pa_conninfo text NOT NULL,
   pa_connretryint4,
[snip ...]
 I wouldn't be surprised to find there being some value in using
 something like SQL/MED.
 

Here the pa_conninfo could be replaced with the connection name (actually
SERVER). For the complete connection definition a USER MAPPING (eg. remote
username and password) is also needed. But that can be fetched by the
connection lookup function.

 One detail I'll point out, that I'm noticing from an application I'm
 working on right now.  We might want to have something like a db
 connection data type; here's a prototype I put together:
 

 slonyregress1=# create type dbconn as (port integer, dbname text, username 
 text, password text, ssl boolean);
 CREATE TYPE
[snip]
 slonyregress1=# select * from dbconns;
  id | db 
 ----+---------------------------------------
   1 | (5432,slonyregress1,slony,secret!,t)
  (1 row)
 
 I'm not certain that this is forcibly the right representation, but I
 think it is possible that we'd want a finer-grained representation
 than merely a connection string.

Yes -- the server, user mapping and FDW all take generic options. Some of them
might be part of the connect string, others could specify some hints of how the
connection should be handled (driver options etc). DBD-Excel has a particularly
nasty example of those. A fixed type would not be able to cover all of them.
This is where the SQL/MED stuff can help - all of this complexity can be reduced
to a single name. Though it adds the additional step of doing the lookup.

The dbconns example could be written like this:

test=# create table dbconns (id serial primary key, db regserver);
...
test=# insert into dbconns (db) values ('test');
INSERT 0 1
test=# select * from dbconns;
 id | db
----+-------------
  1 | public.test
(1 row)

And then for the connection details:

test=# select * from pg_get_remote_connection_info('test');
  option  | value
----------+--------
 host | /tmp
 port | 6543
 dbname   | foo
 username | bob
 password | secret
(5 rows)

This assumes that there is a server public.test and a user mapping for
the session user. The option/value pairs are outputted by the dummy FDW
that just dumps the generic options for the server and user mapping. A
smart FDW could turn this into just a connection string. Still, there
probably should be a set of functions for accessing the raw options/value
pairs as well.

regards,
Martin



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Joshua Drake
On Tue, 28 Oct 2008 19:46:42 +0200
Hannu Krosing [EMAIL PROTECTED] wrote:

   The current topics are:
 
  * New MCP architecture
 
 What's new ? 
 
 I have some doubts about the current architecture based on my reading
 of replicator wiki, but would like to learn about the new
 architecture before spending too much time in studying the old one.

The two obvious problems with the existing MCP architecture are:

 1. Single point of failure
 2. Portability

The new architecture is set to remove both of those. The short version
is the MCP will be moved into the backend. Thus:

Master->MCP|Slave ->Slave1
                  ->Slave2
                  ->Slave3

The process being, Master sends data to MCP|Slave, MCP|Slave writes it
to disk (optionally restores it) and then forwards it to 1,2,3 who then
receive the data, write it to disk and then restore it.

If master dies, you can promote to any of the slaves and the left over
slaves will connect to the promoted slave and begin receiving updates.

If the MCP|Slave dies a new Slave can begin the MCP|Slave process.

Alvaro or Alexey can speak more technically about implementation than I
can.

 
  * DDL Replication
 
 Is it there already, or is it just a planned feature?

Planned feature.

 
  * How hard would it be to extract DDL replication part and use it as
basis for DDL after trigger support for use in trigger based
replication/auditing like Slony and pgQ/Londiste?

Hmm I am not sure. We are pretty deep into the core and only use
triggers for GRANT/REVOKE/CREATE ROLE.

Sincerely,

Joshua D. Drake



-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Hannu Krosing
On Tue, 2008-10-28 at 15:18 -0700, Joshua Drake wrote:
 On Tue, 28 Oct 2008 19:46:42 +0200
 Hannu Krosing [EMAIL PROTECTED] wrote:
 
The current topics are:
  
   * New MCP architecture
  
  What's new ? 
  
  I have some doubts about the current architecture based on my reading
  of replicator wiki, but would like to learn about the new
  architecture before spending too much time in studying the old one.
 
 The two obvious problems with the existing MCP architecture are:
 
  1. Single point of failure

For async replication there is always a SPoF; at least the master is a
SPoF until the first slave has acquired the log. Or do you plan for both
Master and MCP|Slave to keep the log and be able to step in for each
other if the other fails?

  2. Portability

Portability to where ? Other DBMS's ? Other PG versions ?


for me there was also two more problems:

3. separate replication log, which at least seems to be able to get
out of sync with main DB. 

Why don't you just use a DB table, WAL-logged and all?

4. Also, again from reading the Replicator FAQ, it seems that there is a
window of corruption/data loss when rotating Replicator's transaction
log. I think that doing it with copy/truncate either needs locking the
logfile (== bad performance during the copy/truncate) or is just a
data-eating failure waiting to happen.

pgQ solves that by rotating queue tables, and PostgreSQL core also
effectively rotates its WAL segments. To be robust _and_ effective,
Replicator should also rotate the logfile itself.

 The new architecture is set to remove both of those. The short version
 is the MCP will be moved into the backend. Thus:
 
 Master->MCP|Slave ->Slave1
                   ->Slave2
                   ->Slave3
 
 The process being, Master sends data to MCP|Slave, MCP|Slave writes it
 to disk (optionally restores it)

Will this first send be sync or async ? Or have you planned to have it
be configurable among several robustness vs. performance levels, similar
to the planned integrated WAL-shipping.

if async, will it also use MVCC for keeping the log on the Master (like
Slony and pgQ do), just to be at least as reliable as PostgreSQL core
itself and not require a full resync at server crash?

 and then forwards it to 1,2,3 who then
 receive the data, write it to disk and then restore it.
 
 If master dies, you can promote to any of the slaves and the left over
 slaves will connect to the promoted slave and begin receiving updates.
 
 If the MCP|Slave dies a new Slave can begin the MCP|Slave process.
 
 Alvaro or Alexey can speak more technically about implementation than I
 can.

Alvaro - I guess you already have discussed most of it, but basically
you need to solve all the same problems that WAL-shipping based Hot
Standby is solving and Slony/pgQ/Londiste have solved.

Hopefully you get it more robust than Slony when making changes under
high load :)

Will there be a helper application for setting up and configuring
changes in replication, or will it all be done using added SQL
commands?

How will DDL be handled (I understood that you don't yet have DDL
replication)?

Will Slave tables be kind-of-read-only like Slony slaves ? Or even
_really_ read only like Simon's Hot Standby ?

  
   * DDL Replication
  
  Is it there already, or is it just a planned feature?
 
 Planned feature.

Did the plans get any clearer during this meeting?
 
   * How hard would it be to extract DDL replication part and use it as
 basis for DDL after trigger support for use in trigger based
 replication/auditing like Slony and pgQ/Londiste?
 
 Hmm I am not sure. We are pretty deep into the core and only use
 triggers for GRANT/REVOKE/CREATE ROLE .

By the way, why did you choose the "pretty deep into the core" approach
instead of triggers?

I mean, you probably end up duplicating (or missing) lots of
PostgreSQL's internal goodness instead of just using what is already
available?

 Sincerely,
 
 Joshua D. Drake

Thanks for the update.

I hope something useful will come out of this too, though I hoped that
it already had some advantages over trigger-based replication, like
ability to replicate DDL .

---
Hannu Krosing






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Gregory Stark
Simon Riggs [EMAIL PROTECTED] writes:

 On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
 Gregory Stark wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
  
   I'm happy with the idea of a readahead process. I thought we were
   implementing a BackgroundReader process for other uses. Is that dead
   now?
  
  You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
  always hated that in Oracle and thought it was a terrible kludge.
 
 I didn't think I was promoting the separate reader process after you had
 the posix_fadvise() idea.

I'm sorry, I thought I remembered you mentioning it again. But perhaps I was
thinking of someone else (perhaps it was Simon again?) or perhaps it was
before you saw the actual patch.

 It would be good if the solutions for normal running and recovery were
 similar. Greg, please could you look into that?

I could do the readahead side of things but what I'm not sure how to arrange
is how to restructure the wal reading logic to read records ahead of the
actual replay.

I think we would have to maintain two pointers one for the prefetch and one
for the actual running. But the logic for recovery is complex enough that
I'm concerned about changing it enough to do that and whether it can be done
without uglifying the code quite a bit.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL/MED compatible connection manager

2008-10-28 Thread Hannu Krosing
On Mon, 2008-10-27 at 16:50 -0400, Chris Browne wrote:
 [EMAIL PROTECTED] (Martin Pihlak) writes:
  Tons of details have been omitted, but should be enough to start discussion.
  What do you think, does this sound usable? Suggestions, objections?
 
 Slony-I does some vaguely similar stuff in its handling of connection 
 paths; here's the schema:

I think the whole issue was initially raised as a security concern, as the
dblink contrib module exposed connection strings to all users, and SQL/MED
was seen as a standard way to hide them.

The simple credentials hiding could of course be achieved by having
something similar to pg_user/pg_shadow and some SECURITY DEFINER
functions for actually opening the connections, but probably it seemed
easier to at least base it on standards, so we can actually start with

pg_remote_server table public
pg_user_mapping_shadow table (restricted)/ pg_user_mapping view(public)

and some functions with proper grants to match the subset that Martin
outlined in http://wiki.postgresql.org/wiki/SqlMedConnectionManager

If we've got that working, then we could move to massaging it into the
parser to provide standard SQL/MED syntax.

So I think that first we should agree on functionality and get the few
system (?) tables and functions done, and worry about parser changes
once the actual functionality is field-tested.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Window Functions: v07 APIs and buffering strateties

2008-10-28 Thread Hitoshi Harada
2008/10/29 Tom Lane [EMAIL PROTECTED]:
 Martijn van Oosterhout [EMAIL PROTECTED] writes:
 On Tue, Oct 28, 2008 at 01:50:26PM -0400, Tom Lane wrote:
 ... So it might be possible to fix
 by attaching some new precedence level to the ROWS token.

 Yes. Bison's default is to shift, which means that if you do nothing it
 will treat ROWS as part of the expression if it makes any sense at all.
 Given the requirement for a following UNBOUNDED or BETWEEN, the only
 problem is that you'll get a syntax error if the expr_list ends in a
 postfix operator, I don't see how you get hidden ambiguity.

 Hmm, now I see what you meant; that's a little different than what I was
 envisioning.  I was thinking of trying to force a parse decision that
 would support the windowing syntax, whereas you propose forcing a
 parse decision that does the opposite, and making the user parenthesize
 if he's got a conflict.

 What the choice seems to come down to is making ROWS and RANGE reserved
 (in some form or other) versus creating a corner case for users of
 postfix operators.  Phrased that way it does seem like the second
 alternative is better.

 Hitoshi: you can probably make this happen by including ROWS and RANGE
 in the %nonassoc IDENT precedence declaration, but you'll want to test
 to make sure the right things happen.


Bison and parsing are quite new to me, so it'll take a little time, but
I will try it. One thing: the words that can follow ROWS/RANGE are not
only UNBOUNDED and BETWEEN but also CURRENT and an unsigned constant.
Given that, does the precedence approach still look workable?

Regards,


-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Koichi Suzuki
Thanks for a lot of inspiring discussions.

Please note that my proposal includes only a few lines of change to
the recovery code itself.   It does not affect buffer management, the
order of WAL record application, etc.   The only change needed is to
invoke the prefetch feature if redo is going to read WAL which has not
yet been handled by the prefetch (the prefetch function returns the
last-handled LSN).

Before writing the readahead code, I ran several experiments on how
posix_fadvise() speeds up random reads, and I found that
POSIX_FADV_WILLNEED can improve total read performance by around five
times if we schedule the posix_fadvise() calls in block-position
order.   Without that ordering, the improvement ratio was around three
times.   This result was achieved with a single process, but on a RAID
configuration.   I'd like to do a similar measurement against a single
disk.
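
The call itself is trivial; all of the win is in when and in what order it
is issued.   A minimal sketch (BLCKSZ and the opened data file descriptor
are assumed to come from the caller; this is an illustration, not the
pg_readahead code itself):

	#include <fcntl.h>

	/* Ask the kernel to start reading one data page, without blocking. */
	static void
	prefetch_block(int fd, unsigned int blockno)
	{
		(void) posix_fadvise(fd,
							 (off_t) blockno * BLCKSZ,	/* page start */
							 BLCKSZ,					/* one page */
							 POSIX_FADV_WILLNEED);
	}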

I'd like to run some benchmarks to clarify the improvement.   I agree
I should show how my proposal is useful.

In terms of influence on the recovery code, pg_readahead just calls
posix_fadvise() to tell the operating system to prefetch the data page
into the kernel's cache, not PG's shared memory, so we don't have to
implement this in PG core code.   Because of this, and because I think
it is more practical to keep platform-specific code outside the core as
much as possible, I wrote most of the prefetch logic in an external
process, which can be made available at contrib or PgFoundry, perhaps
the latter.   Heikki suggested having a separate reader process.   I
think it's a very good idea, but it would change PG's performance
dramatically: better in some cases, but possibly even worse in others.
I'm not clear on this.   So I think the background reader issue should
be a challenge for 8.5 or later, and we should call for research work.
For now, I think it is reasonable to keep improving the specific code
we have.

I'd like to hear some more about these points.   I'm more than happy to
write all the code inside PG core to avoid the overhead of creating
another process.

---
Koichi Suzuki


2008/10/29 Gregory Stark [EMAIL PROTECTED]:
 Simon Riggs [EMAIL PROTECTED] writes:

 On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
 Gregory Stark wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
 
   I'm happy with the idea of a readahead process. I thought we were
   implementing a BackgroundReader process for other uses. Is that dead
   now?
 
  You and Bruce seem to keep resurrecting that idea. I've never liked it -- 
  I
  always hated that in Oracle and thought it was a terrible kludge.

 I didn't think I was promoting the separate reader process after you had
 the posix_fadvise() idea.

 I'm sorry, I thought I remembered you mentioning it again. But perhaps I was
 thinking of someone else (perhaps it was Simon again?) or perhaps it was
 before you saw the actual patch.

 It would be good if the solutions for normal running and recovery were
 similar. Greg, please could you look into that?

 I could do the readahead side of things but what I'm not sure how to arrange
 is how to restructure the wal reading logic to read records ahead of the
 actual replay.

 I think we would have to maintain two pointers one for the prefetch and one
 for the actual running. But the logic for recovery is complex enough that
 I'm concerned about changing it enough to do that and whether it can be done
 without uglifying the code quite a bit.

 --
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Koichi Suzuki
Hi,

 I think we would have to maintain two pointers one for the prefetch and one
 for the actual running. But the logic for recovery is complex enough that
 I'm concerned about changing it enough to do that and whether it can be done
 without uglifying the code quite a bit.

Yes, this is what the code is doing.  The prefetch function returns the
LSN up to which prefetching was performed.   The redo routine has to call
prefetch again if it is going to read a WAL record beyond this.
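
In other words, the hand-off is roughly this (a sketch of the idea only;
XLByteLT is the usual LSN comparison macro, while prefetchedUpto and
PrefetchWalRange are illustrative names, not the actual code):

	static XLogRecPtr prefetchedUpto;	/* how far ahead we have advised */

	/* in the replay loop: top up the prefetcher before passing its
	 * high-water mark */
	if (XLByteLT(prefetchedUpto, ReadRecPtr))
		prefetchedUpto = PrefetchWalRange(ReadRecPtr);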

-- 
--
Koichi Suzuki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal of PITR performance improvement for 8.4.

2008-10-28 Thread Koichi Suzuki
Heikki,

1. In the code, all the referenced pages are extracted from the WAL
records in scope and sorted, both to schedule the calls and to avoid
duplicate posix_fadvise() calls.   If a full-page write is included as
the first WAL record for a page, that page is not prefetched.   There's
still some risk of calling posix_fadvise() for pages already in the
kernel cache, but I don't think the overhead is very big.   I'd like to
measure how much improvement we can achieve with some benchmark.
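
The sort-and-deduplicate step is plain C, along these lines (PageRef and
the helper names are illustrative, not the actual pg_readahead API):

	#include <stdlib.h>

	typedef struct { int fileno; unsigned int blockno; } PageRef;

	static int
	cmp_pageref(const void *a, const void *b)
	{
		const PageRef *x = a, *y = b;

		if (x->fileno != y->fileno)
			return (x->fileno < y->fileno) ? -1 : 1;
		if (x->blockno != y->blockno)
			return (x->blockno < y->blockno) ? -1 : 1;
		return 0;
	}

	static void
	prefetch_unique(PageRef *refs, size_t n)
	{
		size_t	i;

		qsort(refs, n, sizeof(PageRef), cmp_pageref);
		for (i = 0; i < n; i++)
		{
			/* skip duplicates: each page gets at most one advise call */
			if (i > 0 && cmp_pageref(&refs[i - 1], &refs[i]) == 0)
				continue;
			/* posix_fadvise() for refs[i] would be issued here */
		}
	}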

2. Yes, we have to analyze WAL records using rm-specific code, and I
think that is not good.   At present, the major part of this comes from
xlogdump.   Are you proposing to improve the WAL record format to be
less rm-specific?   That's interesting too.

3. Yes, the code handles all the WAL record types except those which do
not need any read operation during recovery, such as records that create
new pages.   Improving the WAL record format as in 2. would help to
simplify this as well.

2008/10/28 Heikki Linnakangas [EMAIL PROTECTED]:
 Gregory Stark wrote:

 Koichi Suzuki [EMAIL PROTECTED] writes:

 This is my first proposal of PITR performance improvement for
 PostgreSQL 8.4 development.   This proposal includes readahead
 mechanism of data pages which will be read by redo() routines in the
 recovery.   This is especially effective in the recovery without full
 page write.   Readahead is done by posix_fadvise() as proposed in
 index scan improvement.

 Incidentally: a bit of background for anyone who wasn't around when last
 this
 came up: prefetching is especially important for our recovery code because it's
 single-threaded. If you have a raid array you're effectively limited to
 using
 a single drive at a time. This is a major problem because the logs could
 have
 been written by many processes hammering the raid array concurrently. In
 other
 words your warm standby database might not be able to keep up with the
 logs
 from the master database even on identical (or even better) hardware.

 Simon (I think?) proposed allowing our recovery code to be multi-threaded.
 Heikki suggested using prefetching.

 I actually played around with the prefetching, and even wrote a quick
 prototype of it, about a year ago. It read ahead a fixed number of the WAL
 records in xlog.c, calling posix_fadvise() for all pages that were
 referenced in them. I never got around to finish it, as I wanted to see
 Greg's posix_fadvise() patch get done first and rely on the same
 infrastructure, but here's some lessons I learned:

 1. You should avoid useless posix_fadvise() calls. In the naive
 implementation, where you simply call posix_fadvise() for every page
 referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls per
 WAL record, and that's a lot of overhead. We face the same design question
 as with Greg's patch to use posix_fadvise() to prefetch index and bitmap
 scans: what should the interface to the buffer manager look like? The
 simplest approach would be a new function call like AdviseBuffer(Relation,
 BlockNumber), that calls posix_fadvise() for the page if it's not in the
 buffer cache, but is a no-op otherwise. But that means more overhead, since
 for every page access, we need to find the page twice in the buffer cache;
 once for the AdviseBuffer() call, and a 2nd time for the actual ReadBuffer().
 It would be more efficient to pin the buffer in the AdviseBuffer() call
 already, but that requires many more changes to the callers.
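
 To make that interface concrete, what I had in mind was along these lines
 (a sketch; BufferIsInCache and smgrprefetch as a posix_fadvise wrapper are
 assumptions here, not existing functions):

 	/* no-op if the page is already in shared buffers, otherwise
 	 * hint the kernel to start reading it in */
 	void
 	AdviseBuffer(Relation reln, BlockNumber blockNum)
 	{
 		if (BufferIsInCache(reln, blockNum))	/* assumed helper */
 			return;
 		smgrprefetch(reln->rd_smgr, blockNum);	/* assumed wrapper */
 	}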

 2. The format of each WAL record is different, so you need a readahead
 handler for every resource manager, for every record type. It would be a
 lot simpler if there was a standardized way to store that information in the
 WAL records.

 3. IIRC I tried to handle just a few most important WAL records at first,
 but it turned out that you really need to handle all WAL records (that are
 used at all) before you see any benefit. Otherwise, every time you hit a WAL
 record that you haven't done posix_fadvise() on, the recovery stalls, and
 you don't need much of those to diminish the gains.

 Not sure how these apply to your approach, it's very different. You seem to
 handle 1. by collecting all the page references for the WAL file, and
 sorting and removing the duplicates. I wonder how much CPU time is spent on
 that?

 Details of the implementation will be found in README file in the
 material.

 I've read through this and I think I disagree with the idea of using a
 separate program. It's a lot of extra code -- and duplicated code from the
 normal recovery too.

 Agreed, this belongs into core. The nice thing about a separate process is
 that you could hook it into recovery_command, with no changes to the server,
 but as you note in the README, we'd want to use this in crash recovery as
 well, and the interaction between the external command and the startup
 process seems overly complex for that. Besides, we want to use the
 posix_fadvise() stuff in the backend anyway, so we should use the same
 infrastructure during WAL replay as well.

Re: [HACKERS] PostgreSQL + Replicator developer meeting 10/28

2008-10-28 Thread Alvaro Herrera
Hannu Krosing wrote:
 On Tue, 2008-10-28 at 15:18 -0700, Joshua Drake wrote:

  The two obvious problems with the existing MCP architecture are:
  
   1. Single point of failure
 
 For async replication there is always SPoF, at least the master until
 first slave has aquired log is a SPoF, or do you plan that both Master
 and MCP|Slave to keep the log and be able to step in for each other if
 the other fails?

Yeah, with the new architecture there is still going to be a bit of a
SPoF in the master-MCP but it's a lot smaller than the current setup,
in which if you lose the MCP you basically lose everything.

   2. Portability
 
 Portability to where ? Other DBMS's ? Other PG versions ?

Other operating systems mainly.  The trouble is we never got around to
porting the MCP to any OS beyond Linux; I think it should work on
Solaris and BSDs, but surely not Windows.  We want to just get rid of
what I consider a (crappy) reimplementation of postmaster; instead we
should just let postmaster do the job.

Additionally we would get rid of the ugly way we import backend code
into the MCP server.


 for me there was also two more problems:
 
 3. separate replication log, which at least seems to be able to get
 out of sync with main DB. 
 
 Why don't you just use a DB table, WAL-logged and all

The whole replication log thing is a topic of dissent in the team ;-)

 4. Also, again from reading Replicator FAQ, it seems that there is a
 window of corruption/data loss when rotating the Replicators transaction
 log. I think that doing it with copy/truncate either needs locking the
 logfile (== bad performance, during copy/truncate) or is just a
 data-eating failure waiting to happen.

Hmm, what Replicator FAQ?  We used to have this copy/truncate problem,
and we rearchitected the log to avoid this (we use a rotating setup
now).

  Master->MCP|Slave ->Slave1
                    ->Slave2
                    ->Slave3
  
  The process being, Master sends data to MCP|Slave, MCP|Slave writes it
  to disk (optionally restores it)
 
 Will this first send be sync or async ? Or have you planned to have it
 be configurable among several robustness vs. performance levels, similar
 to the planned integrated WAL-shipping.

It is async, and we haven't talked about sync.

 if async, will it also use MVCC for keeping the log on the Master (like
 Slony and pgQ do), just to be at least as reliable as PostgreSQL core
 itself and not require a full resync at server crash?

You mean WAL?  We don't currently.


  Alvaro or Alexey can speak more technically about implementation than I
  can.
 
 Alvaro - I guess you already have discussed most of it, but basically
 you need to solve all the same problems that WAL-shipping based Hot
 Standby is solving and Slony/pgQ/Londiste have solved.

If you mean that we're duplicating the effort that's already going
elsewhere, my opinion is yes, we are.

 Hopefully you get it more robust than Slony when making changes under
 high load :)

Hmm, I don't know about lack of robustness in Slony, so I don't know.

 Will there be an helper application for setting up and configuring
 changes in replication. or will it all be done using added SQL
 commands ?

Well, the interface I work on is all SQL commands :-)

 How will DDL be handled ( i understood that you don't yet have DDL
 replication) ?

We don't have it yet.  However, since we can just add any code in any
place we like, and that we have a protocol to transmit changes, it is
relatively easy to add calls to collect the needed information and
replay it on the slave.

 Will Slave tables be kind-of-read-only like Slony slaves ? Or even
 _really_ read only like Simon's Hot Standby ?

Heh -- they are read only, and they turn into read-write when the slave
promotes.  I'm not sure what kind that makes them :-)


* DDL Replication
   
   Is it there already, or is it just a planned feature?
  
  Planned feature.
 
 Did the plans got any clearer during this meeting ?

Not really; we didn't talk about that.


  Hmm I am not sure. We are pretty deep into the core and only use
  triggers for GRANT/REVOKE/CREATE ROLE .
 
 By the way, why did you choose pretty deep into the core approach
 instead of triggers ?

Speed maybe?  I don't know.

 I mean, you probably end up duplicating (or missing) lots of
 postgreSQL-s internal goodness instead of just using what is already
 available ?

Yeah.


 I hope something useful will come out of this too, though I hoped that
 it already had some advantages over trigger-based replication, like
 ability to replicate DDL .

I fear that our approach to replication is so ad-hoc that there's not
much to be gained from elsewhere.  Replicator is pretty much a fork
that's not likely to yield anything useful to upstream.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Checking stack depth

2008-10-28 Thread Suresh
Hello,

Why is the check_stack_depth function enforced in the context of evaluating
expressions in PostgreSQL? What sort of recursion are we trying to safeguard
against?

thanks,
Suresh

Re: [HACKERS] Checking stack depth

2008-10-28 Thread Tom Lane
Suresh [EMAIL PROTECTED] writes:
 Why is the check_stack_depth function enforced in the context of evaluating
 expressions in PostgreSQL? What sort of recursion are we trying to safeguard
 against?

create function foo(int) returns int as $$
select foo($1) $$ language sql;

select foo(1);
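
The guard itself is cheap; it is roughly this (a simplified sketch of the
real check in tcop/postgres.c, with the hint and edge cases trimmed):

	void
	check_stack_depth(void)
	{
		char		stack_top_loc;
		long		depth = stack_base_ptr - &stack_top_loc;

		if (depth < 0)
			depth = -depth;		/* stack may grow in either direction */
		if (depth > max_stack_depth_bytes)
			ereport(ERROR,
					(errcode(ERRCODE_STATEMENT_TOO_COMPLEX),
					 errmsg("stack depth limit exceeded")));
	}

Without it, unbounded recursion like the above would take down the backend
with a stack overflow instead of failing with a clean error.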

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BufferAccessStrategy for bulk insert

2008-10-28 Thread Robert Haas
And here's the patch, which based on comments thus far does the following:

- Replaces the use_wal, use_fsm arguments in various places with a
single options argument.
- Creates a BAS_BULKWRITE buffer access strategy.
- Creates a BulkInsertState object so that COPY and CTAS can use
BAS_BULKWRITE and also keep the most recent page pinned.

Note that the original purpose of this exercise was to implement the
optimization that COPY and CTAS would keep the most recent page pinned
to avoid repeated pin/unpin cycles.  This change shows a small but
measurable performance improvement on short rows.  The remaining items
were added based on reviewer comments.
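
For callers, the intended pattern is roughly this (a sketch; the helper
next_tuple_from_input() is illustrative, and the argument order follows
the new heap_insert signature only as far as the patch shows it):

	BulkInsertState bistate = GetBulkInsertState();
	HeapTuple	tup;

	while ((tup = next_tuple_from_input()) != NULL)
		heap_insert(rel, tup, mycid, 0, bistate);	/* 0 = no HEAP_INSERT_* options */

	FreeBulkInsertState(bistate);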

One concern that I have about this approach is that the situation in
which people are probably most concerned about COPY performance is
restoring a dump.  In that case, the COPY will be the only thing
running, and using a BufferAccessStrategy is an anti-optimization.  I
don't think it's a very big effect (any testing anyone can do on real
hardware rather than what I have would be appreciated) but I'm sort of
unsold on optimizing for what I believe to be the less-common use
case.  If the consensus is to reverse course on this point I'm happy
to rip those changes back out and resubmit; they are a relatively
small proportion of the patch.

...Robert

On Sun, Oct 26, 2008 at 8:37 PM, Robert Haas [EMAIL PROTECTED] wrote:
 Seems sane to me.  I don't see the point of the HEAP_INSERT_BULK flag
 bit --- providing or not providing bistate would cover that, and if
 you have a bit as well then you have to define what the inconsistent
 combinations mean.  I concur with making all-zeroes be the typical
 state of the flag bits, too.

 Thanks for the design review.  I had thought to make the inconsistent
 combinations fail an assertion, but I'm just as happy to leave it out
 altogether.

 FWIW, we generally declare bitmask flag variables as int, unless
 there's some really good reason to do otherwise.

 OK, thanks for the tip.

 ...Robert

Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.266
diff -c -r1.266 heapam.c
*** src/backend/access/heap/heapam.c	27 Oct 2008 21:50:12 -0000	1.266
--- src/backend/access/heap/heapam.c	29 Oct 2008 03:25:41 -0000
***************
*** 1798,1803 ****
--- 1798,1827 ----
  	}
  }
  
+ /*
+  * GetBulkInsertState - set up for a bulk insert
+  */
+ BulkInsertState
+ GetBulkInsertState(void)
+ {
+ 	BulkInsertState bistate;
+ 
+ 	bistate = palloc(sizeof(struct BulkInsertStateData));
+ 	bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
+ 	bistate->last_pin = InvalidBuffer;
+ 	return bistate;
+ }
+ 
+ /*
+  * FreeBulkInsertState - clean up after finishing a bulk insert
+  */
+ void
+ FreeBulkInsertState(BulkInsertState bistate)
+ {
+ 	if (bistate->last_pin != InvalidBuffer)
+ 		ReleaseBuffer(bistate->last_pin);
+ 	FreeAccessStrategy(bistate->strategy);
+ }
  
  /*
   *	heap_insert		- insert tuple into a heap
***************
*** 1805,1821 ****
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   *
!  * If use_wal is false, the new tuple is not logged in WAL, even for a
!  * non-temp relation.  Safe usage of this behavior requires that we arrange
!  * that all new tuples go into new pages not containing any tuples from other
!  * transactions, and that the relation gets fsync'd before commit.
   * (See also heap_sync() comments)
   *
!  * use_fsm is passed directly to RelationGetBufferForTuple, which see for
!  * more info.
   *
!  * Note that use_wal and use_fsm will be applied when inserting into the
!  * heap's TOAST table, too, if the tuple requires any out-of-line data.
   *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
--- 1829,1846 ----
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   *
!  * If the HEAP_INSERT_SKIP_WAL option is supplied, the new tuple is not logged
!  * in WAL, even for a non-temp relation.  Safe usage of this behavior requires
!  * that we arrange that all new tuples go into new pages not containing any
!  * tuples from other transactions, and that the relation gets fsync'd before
!  * commit.
   * (See also heap_sync() comments)
   *
!  * The HEAP_INSERT_SKIP_FSM option is passed directly to
!  * RelationGetBufferForTuple, which see for more info.
   *
!  * Note that options will be applied when inserting into the heap's TOAST
!  * table, too, if the tuple requires any out-of-line data.
   *
   * The return value is the OID assigned to the tuple (either here or by the
   * caller), or InvalidOid if no OID.  The header fields of *tup are updated
***************
*** 1825,1831 ****
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool 

Re: [HACKERS] array_agg and array_accum (patch)

2008-10-28 Thread Robert Haas
It's worth noting that this is the third version of this idea that has
been submitted.  Ian Caulfield submitted a patch to add this, and so
did I.  Someone should probably look at all three of them and compare.

...Robert

On Mon, Oct 27, 2008 at 1:41 PM, Jeff Davis [EMAIL PROTECTED] wrote:
 On Mon, 2008-10-27 at 18:47 +0200, Peter Eisentraut wrote:
 How else will you tell an aggregate function whose result depends on the
 input order which order you want?  The only aggregates defined in the
 standard where this matters are array_agg, array_accum, and xmlagg, but

 I don't see array_accum() in the standard, I wrote it just as an
 alternative to array_agg() because I thought array_agg() ignored NULLs.

 Regards,
Jeff Davis


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

