date:20110317

Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-17 Thread Fujii Masao

On Fri, Mar 18, 2011 at 2:46 AM, Robert Haas  wrote:
> On further review, I've changed my mind.  Making synchronous_commit
> trump synchronous_replication is appealing conceptually, but it's
> going to lead to some weird corner cases.  For example, a transaction
> that drops a non-temporary relation always commits synchronously; and
> 2PC also ignores synchronous_commit.  In the case where
> synchronous_commit=off and synchronous_replication=on, we'd either
> have to decide that these sorts of transactions aren't going to
> replicate synchronously (which would give synchronous_commit a rather
> long reach into areas it doesn't currently touch) or else that it's OK
> for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
> foo requires sync commit AND sync rep.  That's pretty weird.
>
> What makes more sense to me after having thought about this more
> carefully is to simply make a blanket rule that when
> synchronous_replication=on, synchronous_commit has no effect.  That is
> easy to understand and document.  I'm inclined to think it's OK to let
> synchronous_replication have this effect even if max_wal_senders=0 or
> synchronous_standby_names=''; you shouldn't turn
> synchronous_replication on just for kicks, and I don't think we want
> to complicate the test in RecordTransactionCommit() more than
> necessary.  We should, however, adjust the logic so that a transaction
> which has not written WAL can still commit asynchronously, because
> such a transaction has only touched temp or unlogged tables and so
> it's not important for it to make it to the standby, where that data
> doesn't exist anyway.

In the first place, I think that it's complicated to keep those two parameters
separately. What about merging them to one parameter? What I'm thinking
is to remove synchronous_replication and to increase the valid values of
synchronous_commit from on/off to async/local/remote/both. Each value
works as follows.

async   = (synchronous_commit = off && synchronous_replication = off)
"async" makes a transaction do local WAL flush and replication
asynchronously.

local = (synchronous_commit = on && synchronous_replication = off)
"local" makes a transaction wait for only local WAL flush.

remote = (synchronous_commit = off && synchronous_replication = on)
"remote" makes a transaction wait for only replication. Local WAL flush is
performed by walwriter. This is useless in 9.1 because we always must
wait for local WAL flush when we wait for replication. But in the future,
if we'll be able to send WAL before WAL write (i.e., send WAL from
wal_buffers), this might become useful. In 9.1, it seems reasonable to
remove this value.

both = (synchronous_commit = on && synchronous_replication = on)
"both" makes a transaction wait for local WAL flush and replication.

Thought?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-17 Thread Fujii Masao

On Fri, Mar 18, 2011 at 2:52 AM, Robert Haas  wrote:
> On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas  wrote:
>>> -                       /* Let the master know that we received some data. 
>>> */
>>> -                       XLogWalRcvSendReply();
>>> -                       XLogWalRcvSendHSFeedback();
>>>
>>> This change completely eliminates the difference between write_location
>>> and flush_location in pg_stat_replication. If this change is reasoable, we
>>> should get rid of write_location from pg_stat_replication since it's 
>>> useless.
>>> If not, this change should be reverted. I'm not sure whether monitoring
>>> the difference between write and flush locations is useful. But I guess that
>>> someone thought so and that code was added.
>>
>> I could go either way on this but clearly we need to do one or the other.
>
> I'm not really sure why this was part of the synchronous replication
> patch, but after mulling it over I think it's probably right to rip
> out write_location completely.  There shouldn't ordinarily be much of
> a gap between write location and flush location, so it's probably not
> worth the extra network overhead to keep track of it.  We might need
> to re-add some form of this in the future if we have a version of
> synchronous replication that only waits for confirmation of receipt
> rather than for confirmation of flush, but we don't have that in 9.1,
> so why bother?
>
> Barring objections, I'll go do that.

I agree to get rid of write_location.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-17 Thread Fujii Masao

On Fri, Mar 18, 2011 at 1:17 AM, Robert Haas  wrote:
>> Sorry, I've not been able to understand the point well yet. We should
>> just use elog(ERROR) instead? But since ERROR in startup process
>> is treated as FATAL, I'm not sure whether it's worth using ERROR
>> instead. Or you meant another things?
>
> Yeah, I think he's saying that an ERROR in the startup process is
> better than a FATAL, even though the effect is the same.

We've already been using FATAL all over the recovery code. We should
s/FATAL/ERROR/g there (at least readRecoveryCommandFile)?

> On the substantive issue, I don't think we have any consensus that
> forbidding this combination of parameters is the right thing to do
> anyway.  Both Simon and I voted against that, and Tom's point has to
> do only with style.  Similarly, I voted for flipping the default for
> pause_at_recovery_target to off, rather than on, but no one else has
> bought into that suggestion either.  Unless we get some more votes in
> favor of doing one of those things, I think we should focus on the
> actual must-fix issue here, which is properly documenting the way it
> works now (i.e. adding the parameter to recovery.conf.sample with
> appropriate documentation of the current behavior).

I agree to flip the default to false, whether we forbid that combination
of settings.

> One thing I'm not quite clear on is what happens if we reach the
> recovery target before we reach consistency.  i.e. create restore
> point, flush xlog, abnormal shutdown, try to recover to named restore
> point.  Is there any possibility that we can end up paused before Hot
> Standby has actually started up.  Because that would be fairly useless
> and annoying.

Good catch! In that case, the same situation as (3) would happen.
I think that recovery should ignore pause_at_recovery_target until
it reaches consistent point. If we do so, when recovery target is
ahead of consistent point, recovery just ends in inconsistent point
and throws FATAL error.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-17 Thread Jeff Davis

On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:
> 2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
> then cancel the sync rep wait and issue a warning before acknowledging
> the commit.

When I saw this commit, I noticed that the WARNING doesn't have an
errcode(). It seems like it should -- this is the kind of thing that the
client is likely to care about, and may want to handle specially. 

Regards,
Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] FK constraints "NOT VALID" by default?

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 5:29 PM, Andrew Dunstan  wrote:
>> Is this really intended?
>
> I sure hope not.

That's a bug.  Not sure if it's a psql bug or a backend bug, but it's
definitely a bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] FK constraints "NOT VALID" by default?

2011-03-17 Thread Andrew Dunstan




On 03/17/2011 05:22 PM, Alvaro Herrera wrote:

I just ran this quick test in HEAD:



and was very surprised to see that the foreign key is marked as NOT
VALID:




Is this really intended?


I sure hope not.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 4:02 PM, Jesper Krogh  wrote:
> On the 1 bit per page the "best case" would be 341 times better than above
> reducing the size of the visibiility map on a 10GB table to around 152KB
> which
> is extremely small (and thus also awesome) But the consequenses of a single
> update would mean that you loose visibilllity map benefit on 341 tuples in
> one shot.

True, but I'm not really sure that matters very much.  Keep in mind
also that would increase the frequency with which visibility map bits
would need to be flipped, which would carry its own costs.

> Worst case situations are, where we approach the 4 tuples per page, before
> we hit toast where the ratio of space reduction in 1 bit per tuple would be:
> 1:(2048x8) ~ 1:16384 and the 1 bit per page is 4 times better.
> In the 1 bit per tuple a visibillity map of a 10GB relation would be around
> 610KB
> 1 bit per page would then drop it to around 160KB.
>
>
> Can we drag out some average-case numbers on row-size in the heap
> from some real production systems?

Well, unless you went to some significantly more complicated data
structure, you'd need to allow room for the maximum number of tuples
per page on every page, whether the slots were all in use or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] FK constraints "NOT VALID" by default?

2011-03-17 Thread Alvaro Herrera

I just ran this quick test in HEAD:

alvherre=# create table study (id int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY creará el índice implícito «study_pkey» 
para la tabla «study»
CREATE TABLE
alvherre=# insert into study select a from generate_series(1, 100) as a;
INSERT 0 100
alvherre=# create table studyform (id int primary key, study_id int not null 
references study);
NOTICE:  CREATE TABLE / PRIMARY KEY creará el índice implícito «studyform_pkey» 
para la tabla «studyform»
CREATE TABLE
alvherre=# insert into studyform select a, 1 + a * random() from 
generate_series(1, 10) a;
INSERT 0 10

and was very surprised to see that the foreign key is marked as NOT
VALID:

alvherre=# \d studyform
  Tabla «public.studyform»
 Columna  │  Tipo   │ Modificadores 
──┼─┼───
 id   │ integer │ not null
 study_id │ integer │ not null
Índices:
"studyform_pkey" PRIMARY KEY, btree (id)
Restricciones de llave foránea:
"studyform_study_id_fkey" FOREIGN KEY (study_id) REFERENCES study(id) NOT 
VALID


Is this really intended?

-- 
Álvaro Herrera 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] 2nd Level Buffer Cache

2011-03-17 Thread Kevin Grittner

Rados*aw Smogura wrote:

> I have implemented initial concept of 2nd level cache. Idea is to
> keep some segments of shared memory for special buffers (e.g.
> indices) to prevent overwrite those by other operations. I added
> those functionality to nbtree index scan.
> 
> I tested this with doing index scan, seq read, drop system
> buffers, do index scan and in few places I saw performance
> improvements, but actually, I'm not sure if this was just "random"
> or intended improvement.

I've often wondered about this.  In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache.  A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.

I can't really look at this for a couple weeks, but I'm definitely
interested.  I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.

https://commitfest.postgresql.org/action/commitfest_view/open

> There is few places to optimize code as well, and patch need many
> work, but may you see it and give opinions?

For something like this it makes perfect sense to show "proof of
concept" before trying to cover everything.

-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Jesper Krogh

On 2011-03-17 15:02, Robert Haas wrote:

On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh wrote:

Is it obvious that the visibillity map bits should track complete
pages and not individual tuples? If the visibillity map tracks at
page-level the benefit would fall on "slim tables" where you squeeze
200 tuples into each page and having an update rate of 1% would
lower the likelyhood even more. (it may be that for slim tables the
index-only-scans are not as benefitial as to wide tables).

I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
say it's 200. If you track visibility info per tuple rather than per
page, then the size of the visibility map is going to expand by a
factor of 200. That might decrease contention, but otherwise it's a
bad thing - the whole point of having the visibility map in the first
place is that it's much, much smaller than the heap. If it were the
same size as the heap, we could just read the heap. What the map
attempts to accomplish is to allow us, by reading a small number of
pages, to check whether the tuples we're thinking of reading are
likely to be all-visible without actually looking at them.

Yes, that was sort of the math I was trying to make. I do allthough
belive that you have a way better feeling about it. But according
to this:
http://wiki.postgresql.org/wiki/FAQ#How_much_database_disk_space_is_required_to_store_data_from_a_typical_text_file.3F

The bulk row-overhead is around 24bytes, which will with 1 bit per row
give a
size reduction of 1:(24x8) ~1:192, worstcase... that gives at best 341
tuples/page

(where each tuple, does not contain any data at all). With that ratio, the
visibillitymap of a relation of 10GB would fill 52MB on disk (still
worst case)

and that by itself would by all means be awesome. (with that small tuples a
10GB relation would have around 42 billion tuples).

On the 1 bit per page the "best case" would be 341 times better than above
reducing the size of the visibiility map on a 10GB table to around 152KB
which

is extremely small (and thus also awesome) But the consequenses of a single
update would mean that you loose visibilllity map benefit on 341 tuples in
one shot.

Worst case situations are, where we approach the 4 tuples per page, before
we hit toast where the ratio of space reduction in 1 bit per tuple would
be:

1:(2048x8) ~ 1:16384 and the 1 bit per page is 4 times better.
In the 1 bit per tuple a visibillity map of a 10GB relation would be
around 610KB

1 bit per page would then drop it to around 160KB.

Can we drag out some average-case numbers on row-size in the heap
from some real production systems?

I may have gotten something hugely wrong in above calculations and/or
have missed some important points.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] 2nd Level Buffer Cache

2011-03-17 Thread Radosław Smogura

Hi,

I have implemented initial concept of 2nd level cache. Idea is to keep some 
segments of shared memory for special buffers (e.g. indices) to prevent 
overwrite those by other operations. I added those functionality to nbtree 
index scan.

I tested this with doing index scan, seq read, drop system buffers, do index 
scan and in few places I saw performance improvements, but actually, I'm not 
sure if this was just "random" or intended improvement.

There is few places to optimize code as well, and patch need many work, but 
may you see it and give opinions?

Regards,
Radek
diff --git a/.gitignore b/.gitignore
index 3f11f2e..6542e35 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ lcov.info
 /GNUmakefile
 /config.log
 /config.status
+/nbproject/private/
\ No newline at end of file
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 2796445..0229f5a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -508,7 +508,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 	if (blkno != P_NEW)
 	{
 		/* Read an existing block of the relation */
-		buf = ReadBuffer(rel, blkno);
+		buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND);
 		LockBuffer(buf, access);
 		_bt_checkpage(rel, buf);
 	}
@@ -548,7 +548,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 			blkno = GetFreeIndexPage(rel);
 			if (blkno == InvalidBlockNumber)
 break;
-			buf = ReadBuffer(rel, blkno);
+			buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND);
 			if (ConditionalLockBuffer(buf))
 			{
 page = BufferGetPage(buf);
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..2922711 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -22,6 +22,7 @@ BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
 int32	   *PrivateRefCount;
 
+BufferLevelDesc *bufferLevels;
 
 /*
  * Data Structures:
@@ -72,8 +73,7 @@ int32	   *PrivateRefCount;
 void
 InitBufferPool(void)
 {
-	bool		foundBufs,
-foundDescs;
+	bool		foundBufs, foundDescs, foundBufferLevels = false;
 
 	BufferDescriptors = (BufferDesc *)
 		ShmemInitStruct("Buffer Descriptors",
@@ -83,19 +83,38 @@ InitBufferPool(void)
 		ShmemInitStruct("Buffer Blocks",
 		NBuffers * (Size) BLCKSZ, &foundBufs);
 
-	if (foundDescs || foundBufs)
+bufferLevels = (BufferLevelDesc*)
+ShmemInitStruct("Buffer Levels Descriptors Table",
+		sizeof(BufferLevelDesc) * BUFFER_LEVEL_SIZE, 
+&foundBufferLevels);
+	if (foundDescs || foundBufs || foundBufferLevels)
 	{
 		/* both should be present or neither */
-		Assert(foundDescs && foundBufs);
+		Assert(foundDescs && foundBufs && foundBufferLevels);
 		/* note: this path is only taken in EXEC_BACKEND case */
 	}
 	else
 	{
 		BufferDesc *buf;
+BufferLevelDesc *bufferLevelDesc;
+
 		int			i;
-
+
 		buf = BufferDescriptors;
 
+/* Initialize buffer levels. */
+//1st Level - Default
+bufferLevelDesc = bufferLevels;
+bufferLevelDesc->index = 0;
+bufferLevelDesc->super = BUFFER_LEVEL_END_OF_LIST;
+bufferLevelDesc->lower = BUFFER_LEVEL_END_OF_LIST;
+
+//2nd Level - For indices
+bufferLevelDesc++;
+bufferLevelDesc->index = 1;
+bufferLevelDesc->super = BUFFER_LEVEL_END_OF_LIST;
+bufferLevelDesc->lower = 0;
+
 		/*
 		 * Initialize all the buffer headers.
 		 */
@@ -117,6 +136,10 @@ InitBufferPool(void)
 			 */
 			buf->freeNext = i + 1;
 
+/* Assign buffer level. */
+//TODO Currently hardcoded - 
+buf->buf_level = ( 0.3 * NBuffers > i ) ? BUFFER_LEVEL_DEFAULT : BUFFER_LEVEL_2ND;
+
 			buf->io_in_progress_lock = LWLockAssign();
 			buf->content_lock = LWLockAssign();
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..867bae0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -47,7 +47,8 @@
 #include "storage/standby.h"
 #include "utils/rel.h"
 #include "utils/resowner.h"
-
+#include "catalog/pg_type.h"
+#include "funcapi.h"
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -85,7 +86,7 @@ static volatile BufferDesc *PinCountWaitBuf = NULL;
 static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
   ForkNumber forkNum, BlockNumber blockNum,
   ReadBufferMode mode, BufferAccessStrategy strategy,
-  bool *hit);
+  bool *hit, BufferLevel bufferLevel);
 static bool PinBuffer(vol

Re: [HACKERS] Allowing multiple concurrent base backups

2011-03-17 Thread Robert Haas

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao  wrote:
> On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
>  wrote:
>> Hmm, good point. It's harmless, but creating the history file in the first
>> place sure seems like a waste of time.
>
> The attached patch changes pg_stop_backup so that it doesn't create
> the backup history file if archiving is not enabled.
>
> When I tested the multiple backups, I found that they can have the same
> checkpoint location and the same history file name.
>
> 
> $ for ((i=0; i<4; i++)); do
> pg_basebackup -D test$i -c fast -x -l test$i &
> done
>
> $ cat test0/backup_label
> START WAL LOCATION: 0/2B0 (file 00010002)
> CHECKPOINT LOCATION: 0/2E8
> START TIME: 2011-02-01 12:12:31 JST
> LABEL: test0
>
> $ cat test1/backup_label
> START WAL LOCATION: 0/2B0 (file 00010002)
> CHECKPOINT LOCATION: 0/2E8
> START TIME: 2011-02-01 12:12:31 JST
> LABEL: test1
>
> $ cat test2/backup_label
> START WAL LOCATION: 0/2B0 (file 00010002)
> CHECKPOINT LOCATION: 0/2E8
> START TIME: 2011-02-01 12:12:31 JST
> LABEL: test2
>
> $ cat test3/backup_label
> START WAL LOCATION: 0/2B0 (file 00010002)
> CHECKPOINT LOCATION: 0/2E8
> START TIME: 2011-02-01 12:12:31 JST
> LABEL: test3
>
> $ ls archive/*.backup
> archive/00010002.00B0.backup
> 
>
> This would cause a serious problem. Because the backup-end record
> which indicates the same "START WAL LOCATION" can be written by the
> first backup before the other finishes. So we might think wrongly that
> we've already reached a consistency state by reading the backup-end
> record (written by the first backup) before reading the last required WAL
> file.
>
>                /*
>                 * Force a CHECKPOINT.  Aside from being necessary to prevent 
> torn
>                 * page problems, this guarantees that two successive backup 
> runs will
>                 * have different checkpoint positions and hence different 
> history
>                 * file names, even if nothing happened in between.
>                 *
>                 * We use CHECKPOINT_IMMEDIATE only if requested by user (via 
> passing
>                 * fast = true).  Otherwise this can take awhile.
>                 */
>                RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
>                                                  (fast ? CHECKPOINT_IMMEDIATE 
> : 0));
>
> This problem happens because the above code (in do_pg_start_backup)
> actually doesn't ensure that the concurrent backups have the different
> checkpoint locations. ISTM that we should change the above or elsewhere
> to ensure that. Or we should include backup label name in the backup-end
> record, to prevent a recovery from reading not-its-own backup-end record.
>
> Thought?

This patch is on the 9.1 open items list, but I don't understand it
well enough to know whether it's correct.  Can someone else pick it
up?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Flex output missing from 9.1a4 tarballs?

2011-03-17 Thread Robert Haas

2011/3/16 Devrim GÜNDÜZ :
> On Tue, 2011-03-15 at 22:00 -0400, Tom Lane wrote:
>>
>> > My only hesitation about this is that it seems like sync rep and
>> > collation support are both still pretty broken.  Should we just not
>> > worry about that for alpha?
>>
>> FWIW, collations are probably still several days away from being
>> noticeably less broken than they were in alpha4.  I have mixed
>> feelings
>> about whether an alpha5 right now is useful.
>
> Fair enough. Looks like we will skip next week, too.

Yeah, probably best.  I think we're making good progress, though, so I
propose we wrap alpha5 on Monday, March 28.  I don't think we'll be
all the way there yet by then, but I think we will have made enough
progress that it makes sense to get another snapshot out the door and
into the hands of testers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] why is max standby delay only 35 minutes?

2011-03-17 Thread Peter Eisentraut

On tis, 2011-03-15 at 20:44 +0200, Peter Eisentraut wrote:
> Consequently, I propose the attached patch.  I didn't find any
> relevant documentation references that would need updating.

Applied.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.

2011-03-17 Thread Thom Brown

On 17 March 2011 17:55, Robert Haas  wrote:
> On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown  wrote:
>> errdetail("The transaction has already been committed locally but
>> might have not been replicated to the standby.")));
>> errdetail("The transaction has committed locally, but may not have
>> replicated to the standby.")));
>>
>> Could we have these saying precisely the same thing?
>
> Yeah.  Which is better?

Personally I prefer the 2nd.  It reads better somehow.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-17 Thread Simon Riggs

On Thu, 2011-03-17 at 13:46 -0400, Robert Haas wrote:
> On Fri, Mar 11, 2011 at 5:46 AM, Fujii Masao  wrote:
> > On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas  wrote:
> >>>if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 
> >>> ||
> >>> SyncRepRequested())
> >>>
> >>> Whenever synchronous_replication is TRUE, we disable synchronous_commit.
> >>> But, before disabling that, we should check also max_wal_senders and
> >>> synchronous_standby_names? Otherwise, synchronous_commit can
> >>> be disabled unexpectedly even in non replication case.
> >>
> >> Yeah, that's bad.  At the risk of repeating myself, I don't think this
> >> code should be checking SyncRepRequested() in the first place.  If the
> >> user has turned off synchronous_commit, then we should just commit
> >> asynchronously, even if sync rep is otherwise in force.  Otherwise,
> >> this if statement is going to get really complicated.   The logic is
> >> already at least mildly wrong here anyway: clearly we do NOT need to
> >> commit synchronously if the transaction has not written xlog, even if
> >> sync rep is enabled.
> >
> > Yeah, not to wait for replication when synchronous_commit is disabled
> > seems to be more reasonable.
> 
> On further review, I've changed my mind.  Making synchronous_commit
> trump synchronous_replication is appealing conceptually, but it's
> going to lead to some weird corner cases.  For example, a transaction
> that drops a non-temporary relation always commits synchronously; and
> 2PC also ignores synchronous_commit.  In the case where
> synchronous_commit=off and synchronous_replication=on, we'd either
> have to decide that these sorts of transactions aren't going to
> replicate synchronously (which would give synchronous_commit a rather
> long reach into areas it doesn't currently touch) or else that it's OK
> for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
> foo requires sync commit AND sync rep.  That's pretty weird.
> 
> What makes more sense to me after having thought about this more
> carefully is to simply make a blanket rule that when
> synchronous_replication=on, synchronous_commit has no effect.  That is
> easy to understand and document.  I'm inclined to think it's OK to let
> synchronous_replication have this effect even if max_wal_senders=0 or
> synchronous_standby_names=''; you shouldn't turn
> synchronous_replication on just for kicks, and I don't think we want
> to complicate the test in RecordTransactionCommit() more than
> necessary.  We should, however, adjust the logic so that a transaction
> which has not written WAL can still commit asynchronously, because
> such a transaction has only touched temp or unlogged tables and so
> it's not important for it to make it to the standby, where that data
> doesn't exist anyway.

Agree to that.

Not read your other stuff yet, will do that later.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown  wrote:
> errmsg("canceling the wait for replication and terminating connection
> due to administrator command")
> errmsg("canceling wait for synchronous replication due to user request")
>
> Should that first one then also say "synchronous replication"?

I could go either way.  Clearly if it's asynchronous replication, we
wouldn't be waiting.  But you're certainly right that we should be
consistent.

> errdetail("The transaction has already been committed locally but
> might have not been replicated to the standby.")));
> errdetail("The transaction has committed locally, but may not have
> replicated to the standby.")));
>
> Could we have these saying precisely the same thing?

Yeah.  Which is better?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-17 Thread Robert Haas

On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas  wrote:
>> -                       /* Let the master know that we received some data. */
>> -                       XLogWalRcvSendReply();
>> -                       XLogWalRcvSendHSFeedback();
>>
>> This change completely eliminates the difference between write_location
>> and flush_location in pg_stat_replication. If this change is reasoable, we
>> should get rid of write_location from pg_stat_replication since it's useless.
>> If not, this change should be reverted. I'm not sure whether monitoring
>> the difference between write and flush locations is useful. But I guess that
>> someone thought so and that code was added.
>
> I could go either way on this but clearly we need to do one or the other.

I'm not really sure why this was part of the synchronous replication
patch, but after mulling it over I think it's probably right to rip
out write_location completely.  There shouldn't ordinarily be much of
a gap between write location and flush location, so it's probably not
worth the extra network overhead to keep track of it.  We might need
to re-add some form of this in the future if we have a version of
synchronous replication that only waits for confirmation of receipt
rather than for confirmation of flush, but we don't have that in 9.1,
so why bother?

Barring objections, I'll go do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-17 Thread Robert Haas

On Fri, Mar 11, 2011 at 5:46 AM, Fujii Masao  wrote:
> On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas  wrote:
>>>        if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 ||
>>> SyncRepRequested())
>>>
>>> Whenever synchronous_replication is TRUE, we disable synchronous_commit.
>>> But, before disabling that, we should check also max_wal_senders and
>>> synchronous_standby_names? Otherwise, synchronous_commit can
>>> be disabled unexpectedly even in non replication case.
>>
>> Yeah, that's bad.  At the risk of repeating myself, I don't think this
>> code should be checking SyncRepRequested() in the first place.  If the
>> user has turned off synchronous_commit, then we should just commit
>> asynchronously, even if sync rep is otherwise in force.  Otherwise,
>> this if statement is going to get really complicated.   The logic is
>> already at least mildly wrong here anyway: clearly we do NOT need to
>> commit synchronously if the transaction has not written xlog, even if
>> sync rep is enabled.
>
> Yeah, not to wait for replication when synchronous_commit is disabled
> seems to be more reasonable.

On further review, I've changed my mind.  Making synchronous_commit
trump synchronous_replication is appealing conceptually, but it's
going to lead to some weird corner cases.  For example, a transaction
that drops a non-temporary relation always commits synchronously; and
2PC also ignores synchronous_commit.  In the case where
synchronous_commit=off and synchronous_replication=on, we'd either
have to decide that these sorts of transactions aren't going to
replicate synchronously (which would give synchronous_commit a rather
long reach into areas it doesn't currently touch) or else that it's OK
for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
foo requires sync commit AND sync rep.  That's pretty weird.

What makes more sense to me after having thought about this more
carefully is to simply make a blanket rule that when
synchronous_replication=on, synchronous_commit has no effect.  That is
easy to understand and document.  I'm inclined to think it's OK to let
synchronous_replication have this effect even if max_wal_senders=0 or
synchronous_standby_names=''; you shouldn't turn
synchronous_replication on just for kicks, and I don't think we want
to complicate the test in RecordTransactionCommit() more than
necessary.  We should, however, adjust the logic so that a transaction
which has not written WAL can still commit asynchronously, because
such a transaction has only touched temp or unlogged tables and so
it's not important for it to make it to the standby, where that data
doesn't exist anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 2:47 AM, Fujii Masao  wrote:
> On Wed, Mar 16, 2011 at 11:27 PM, Tom Lane  wrote:
>> Fujii Masao  writes:
>>> How should recovery work when pause_at_recovery_target is
>>> enabled but hot standby is disabled? We have three choices:
>>
>>> 1. Forbit those settings, i.e., throw FATAL error. Tom dislikes this
>>>     idea.
>>
>> No, I didn't say that.  I said not to write elog(FATAL).
>
> Oh, sorry.
>
>>  If the
>> combination is nonsensical then it's fine to forbid it, but you don't
>> need FATAL for that.  In particular, attempting to change to a
>> disallowed setting after system startup should not result in crashing
>> the postmaster.  And it won't, if you just use the normal error level
>> for complaining about an invalid GUC setting.
>
> Sorry, I've not been able to understand the point well yet. We should
> just use elog(ERROR) instead? But since ERROR in startup process
> is treated as FATAL, I'm not sure whether it's worth using ERROR
> instead. Or you meant another things?

Yeah, I think he's saying that an ERROR in the startup process is
better than a FATAL, even though the effect is the same.

On the substantive issue, I don't think we have any consensus that
forbidding this combination of parameters is the right thing to do
anyway.  Both Simon and I voted against that, and Tom's point has to
do only with style.  Similarly, I voted for flipping the default for
pause_at_recovery_target to off, rather than on, but no one else has
bought into that suggestion either.  Unless we get some more votes in
favor of doing one of those things, I think we should focus on the
actual must-fix issue here, which is properly documenting the way it
works now (i.e. adding the parameter to recovery.conf.sample with
appropriate documentation of the current behavior).

One thing I'm not quite clear on is what happens if we reach the
recovery target before we reach consistency.  i.e. create restore
point, flush xlog, abnormal shutdown, try to recover to named restore
point.  Is there any possibility that we can end up paused before Hot
Standby has actually started up.  Because that would be fairly useless
and annoying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-17 Thread Kevin Grittner

hom  wrote:

> I try to known how a database is implemented and I have been
> reading PG source codes for a month.

That's ambitious.

find -name '*.h' -or -name '*.c' \
  | egrep -v '^\./src/test/.+/tmp_check/' \
  | xargs cat | wc -l
1059144

Depending on how you do the math, that's about 50,000 lines of code
per day to get through it in the time you mention.

> Is there any article or some way could help understand the source
> code ?

Your best bet would be to follow links from the Developers tab on
the main PostgreSQL web site:

http://www.postgresql.org/developer/

In particular the Developer FAQ page:

http://wiki.postgresql.org/wiki/Developer_FAQ

And the "Coding" links:

http://www.postgresql.org/developer/coding

may help.

Before reading code in a directory, be sure to read any README
file(s) in that directory carefully.

It helps to read this list.

In spite of reviewing all of that myself, it was rather intimidating
when I went to work on a major patch 14 months ago.  Robert Haas
offered some good advice which served me well in that effort --
divide the effort in to a series of incremental steps, each of which
deals with a small enough portion of the code to get your head
around.  As you work in any one narrow area, it becomes increasingly
clear; with that as a base you can expand your scope.

When you're working in the code, it is tremendously helpful to use
an editor with ctags support (or similar IDE functionality).

I hope this is helpful.  Good luck.

-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-17 Thread Bruce Momjian

hom wrote:
> Hi,
> 
>   I try to known how a database is implemented and I have been reading
> PG source codes for a month.
> 
> Now, I only know a little about how PG work.  :(
> 
> I just know PG work like this but I don't know why PG work like this.  :(  :(
> 
> even worse, I feel I can better understand the source code. it may be
> that I could't split the large module into small piece which may help
> to understand.
> 
> Is there any article or some way could help understand the source code ?

I assume you have looked at these places:

http://wiki.postgresql.org/wiki/Developer_FAQ
http://www.postgresql.org/developer/coding

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [BUGS] BUG #5842: Memory leak in PL/Python when taking slices of results

2011-03-17 Thread Alvaro Herrera

Excerpts from Marko Kreen's message of jue mar 17 12:01:13 -0300 2011:
> On Thu, Mar 17, 2011 at 4:40 AM, Robert Haas  wrote:
> > On Fri, Mar 11, 2011 at 6:02 AM, Bruce Momjian  wrote:
> >> What has been done with this report/fix?
> >
> > AFAIK, nothing.  Added to 9.1 open items list.
> 
> The patch seems to do the right thing.

Looking into this.  AFAICT this needs to be backported.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [BUGS] BUG #5842: Memory leak in PL/Python when taking slices of results

2011-03-17 Thread Marko Kreen

On Thu, Mar 17, 2011 at 4:40 AM, Robert Haas  wrote:
> On Fri, Mar 11, 2011 at 6:02 AM, Bruce Momjian  wrote:
>> What has been done with this report/fix?
>
> AFAIK, nothing.  Added to 9.1 open items list.

The patch seems to do the right thing.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-17 Thread hom

Hi,

  I try to known how a database is implemented and I have been reading
PG source codes for a month.

Now, I only know a little about how PG work.  :(

I just know PG work like this but I don't know why PG work like this.  :(  :(

even worse, I feel I can better understand the source code. it may be
that I could't split the large module into small piece which may help
to understand.

Is there any article or some way could help understand the source code ?

Thanks for help ~

-- 
Best Wishes!

                                     hom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] volatile markings to silence compilers

2011-03-17 Thread Tom Lane

Martijn van Oosterhout  writes:
> It appears the issue is mostly that the compiler is unable to prove
> that the variables aren't changed.

IME, older versions of gcc will warn about any variable that's assigned
more than once, even if those assignments are before the setjmp call.
Presumably this is stricter than necessary, but I don't know enough
details of gcc's register usage to be sure.

> My point is, are we hopeful this problem will ever go away?

Since we're trying to silence the warning in existing and even obsolete
compilers, whether it gets improved in future compilers is not terribly
relevant.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Rectifying wrong Date outputs

2011-03-17 Thread Kevin Grittner

Tom Lane  wrote:
 
> what we should first do is see what Oracle does in such cases,
> because the main driving factor for these functions is Oracle
> compatibility not what might seem sane in a vacuum.
 
+1
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Rectifying wrong Date outputs

2011-03-17 Thread Alvaro Herrera

Excerpts from Robert Haas's message of jue mar 17 11:09:56 -0300 2011:
> On Thu, Mar 17, 2011 at 9:46 AM, Alvaro Herrera
>  wrote:

> > Keep in mind that the datetime stuff was abandoned by the maintainer
> > some years ago with quite some rough edges.  Some of it has been fixed,
> > but a lot of bugs remain.  Looks like this is one of those places and it
> > seems appropriate to spend some time fixing it.  Since it would involve
> > a behavior change, it should only go to 9.2, of course.
> 
> I wouldn't object to fixing the problem with # of digits > # of Ys in
> 9.1, if the fix is simple and clear-cut.  I think we are still
> accepting patches to make minor tweaks, like the tab-completion patch
> I committed yesterday.  It also doesn't bother me tremendously if we
> push it off, but I don't think that anyone's going to be too sad if
> TO_DATE('01-jan-2010',  'DD-MON-YYY') starts returning something more
> sensible than 3010-01-01.

If it can be delivered quickly and it is simple, sure.  But anything
more involved should respect the release schedule.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Rectifying wrong Date outputs

2011-03-17 Thread Tom Lane

Robert Haas  writes:
> On Thu, Mar 17, 2011 at 9:46 AM, Alvaro Herrera
>  wrote:
>> Keep in mind that the datetime stuff was abandoned by the maintainer
>> some years ago with quite some rough edges.  Some of it has been fixed,
>> but a lot of bugs remain.  Looks like this is one of those places and it
>> seems appropriate to spend some time fixing it.  Since it would involve
>> a behavior change, it should only go to 9.2, of course.

> I wouldn't object to fixing the problem with # of digits > # of Ys in
> 9.1, if the fix is simple and clear-cut.  I think we are still
> accepting patches to make minor tweaks, like the tab-completion patch
> I committed yesterday.  It also doesn't bother me tremendously if we
> push it off, but I don't think that anyone's going to be too sad if
> TO_DATE('01-jan-2010',  'DD-MON-YYY') starts returning something more
> sensible than 3010-01-01.

Agreed, it's certainly not too late for bug fixes in 9.1.  I agree
that this isn't something we would want to tweak in released branches,
but 9.1 isn't there yet.

Having said that, it's not entirely clear to me what sane behavior is
here.  Personally I would expect that an n-Ys format spec would consume
at most n digits from the input.  Otherwise how are you going to use
to_date to pick apart strings that don't have any separators?  So
I think the problem is actually upstream of the behavior complained of
here.  However, what we should first do is see what Oracle does in such
cases, because the main driving factor for these functions is Oracle
compatibility not what might seem sane in a vacuum.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Rectifying wrong Date outputs

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 9:46 AM, Alvaro Herrera
 wrote:
> Excerpts from Piyush Newe's message of jue mar 17 02:30:06 -0300 2011:
>> Sorry for creating the confusion. The table drawn was PostgreSQL vs EDB
>> Advanced Server.
>> Thanks Burce for clarification.
>>
>> For the 1-digit, 2-digit & 3-digit Year inputs, as I said, I didn't see any
>> document in PG which will explain what would be the century considered if it
>> is not given. If I missed out it somewhere please let me know.
>
> Keep in mind that the datetime stuff was abandoned by the maintainer
> some years ago with quite some rough edges.  Some of it has been fixed,
> but a lot of bugs remain.  Looks like this is one of those places and it
> seems appropriate to spend some time fixing it.  Since it would involve
> a behavior change, it should only go to 9.2, of course.

I wouldn't object to fixing the problem with # of digits > # of Ys in
9.1, if the fix is simple and clear-cut.  I think we are still
accepting patches to make minor tweaks, like the tab-completion patch
I committed yesterday.  It also doesn't bother me tremendously if we
push it off, but I don't think that anyone's going to be too sad if
TO_DATE('01-jan-2010',  'DD-MON-YYY') starts returning something more
sensible than 3010-01-01.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] volatile markings to silence compilers

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 12:36 AM, Bruce Momjian  wrote:
> Looking over the release notes, we have added a few 'volatile' storage
> specifications to variables which are near longjump/TRY blocks to
> silence compilers.  I am worried that these specifications don't clearly
> identify their purpose.  Can we rename these to use a macro for
> 'volatile' that will make their purpose clearer and perhaps their
> removal one day easier?

I don't particularly see the point of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh  wrote:
> Is it obvious that the visibillity map bits should track complete
> pages and not individual tuples? If the visibillity map tracks at
> page-level the benefit would fall on "slim tables" where you squeeze
> 200 tuples into each page and having an update rate of 1% would
> lower the likelyhood even more. (it may be that for slim tables the
> index-only-scans are not as benefitial as to wide tables).

I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
say it's 200.  If you track visibility info per tuple rather than per
page, then the size of the visibility map is going to expand by a
factor of 200.  That might decrease contention, but otherwise it's a
bad thing - the whole point of having the visibility map in the first
place is that it's much, much smaller than the heap.  If it were the
same size as the heap, we could just read the heap.  What the map
attempts to accomplish is to allow us, by reading a small number of
pages, to check whether the tuples we're thinking of reading are
likely to be all-visible without actually looking at them.

> In collaboration with a vacuuming discussion, I dont know if it
> is there allready but how about "opportunistic vacuuming". Say
> you have a page what due to changes in one of the tuples are
> being written out, will it, while being written out anyway get the
> other tuples on the page vacuumed?

The really lazy kind of vacuuming I'm talking about could be done this
way.  Regular vacuuming cannot, because you can't actually prune a
dead tuple until you've scanned all the indexes for references to that
CTID.  The obvious place to do this would be the background writer: if
the page is dirty anyway and we're about to evict it, we could decide
to (1) set hint bits, (2) set visibility map bits, (3) freeze tuples
that need freezing, and (4) identify dead tuples and reclaim the space
they use, but not the associated line pointer.

Sadly, I'm not sure this would help much.  If we have, say, a 4MB
relation, you're not even going to notice it when vacuum comes along
and does its thing.  Even a vacuum freeze is chump change.  The
problem is with big relations, like say 1GB+.  Processing the whole
table at once can have a material impact on system performance, so
it'd be nice to do some work incrementally.  But it's likely that
doing it opportunistically as you evict things from shared buffers is
only going to help here and there.  Even if you optimistically assumed
that we could opportunistically do 10% of the vacuuming that way,
that's still not much of a dent.  And I think it'd probably be less
than that in most real-world cases.  A further problem is that the
background writer is already a pretty busy process, and giving it more
things to do isn't very appealing from the standpoint of overall
system performance.

> It actually dont have to hook into the process directly to benefit
> the IO-usage, if it just can get the opportunity to do it before
> the page gets evicted from the OS-cache, then it would save a
> second read on that page, but it seems way harder to do something
> sane around that assumption.

Yeah.

> Really lazy vacuums would "only" benefit "really static tables" ?  where
> vacuuming is not that big a problem in the first place.

I think the benefit would be tables that are either quite large (where
the ability to do this incrementally would be an advantage over
regular VACUUM) or insert-only (where we currently have no way to get
PD_ALL_VISIBLE bits set without user intervention).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Rectifying wrong Date outputs

2011-03-17 Thread Alvaro Herrera

Excerpts from Piyush Newe's message of jue mar 17 02:30:06 -0300 2011:
> Sorry for creating the confusion. The table drawn was PostgreSQL vs EDB
> Advanced Server.
> Thanks Burce for clarification.
> 
> For the 1-digit, 2-digit & 3-digit Year inputs, as I said, I didn't see any
> document in PG which will explain what would be the century considered if it
> is not given. If I missed out it somewhere please let me know.

Keep in mind that the datetime stuff was abandoned by the maintainer
some years ago with quite some rough edges.  Some of it has been fixed,
but a lot of bugs remain.  Looks like this is one of those places and it
seems appropriate to spend some time fixing it.  Since it would involve
a behavior change, it should only go to 9.2, of course.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 8:24 AM, Heikki Linnakangas
 wrote:
> Hmm, so setting synchronous_standby_names to '' takes effect immediately,
> but other changes to it don't apply to already-blocked commits. That seems a
> bit inconsistent. Perhaps walwriter should store the parsed list of
> standby-names in shared memory, not just a boolean.

I don't think this is necessary.  In general, the current or potential
WAL sender processes are responsible for working out among themselves
whose job it is to release waiters, and doing it.  As long as
synchronous_standby_names is non-empty, then either (1) there are one
or more standbys connected that can take on the role of synchronous
standby, and whoever does will release waiters or (2) there are no
standbys connected that can take on the role of synchronous standbys,
in which case no waiters should be released until one connects.  But
when synchronous_standby_names becomes completely empty, that doesn't
mean "wait until a standby connects whose application name is in the
empty set and make him the synchronous standby" but rather
"synchronous replication is administratively disabled, don't wait in
the first place".  So we just need a Boolean flag.

> +1 otherwise.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-17 Thread Robert Haas

On Thu, Mar 17, 2011 at 2:08 AM, Fujii Masao  wrote:
> This occurs to me; we should ensure that, in shutdown case, walwriter
> should exit after all the backends have gone out? I'm not sure if it's worth
> thinking of the case, but what if synchronous_standby_names is unset
> and config file is reloaded after smart shutdown is requested? In this
> case, the reload cannot wake up the waiting backends since walwriter
> has already exited. This behavior looks a bit inconsistent.

I agree we need to fix smart shutdown.  I think that's going to have
to be a separate patch, however; it's got more problems than this
patch can fix without expanding into a monster.

> In the patch, in order to read the latest value, you take a light-weight lock.
> But I wonder why taking a lock can ensure that the value is up-to-date.

A lock acquisition acts as a memory sequence point.

> + * WAL writer calls this as needed to update the shared sync_standbys_needed
>
> Typo: s/sync_standbys_needed/sync_standbys_defined

Fixed.

> + * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.
>
> Typo: the reference to SYNC_REP_MUST_DISCONNECT is not required.

Fixed.

> +                * So in this case we issue a NOTICE (which some clients may
>
> Typo: s/NOTICE/WARNING

Fixed.

> +               if (ProcDiePending)
> +               {
> +                       ereport(WARNING,
> +                                       (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                                        errmsg("canceling the wait for 
> replication and terminating
> connection due to administrator command"),
> +                                        errdetail("The transaction has 
> already been committed locally
> but might have not been replicated to the standby.")));
> +                       whereToSendOutput = DestNone;
> +                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> +                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
> +                       SHMQueueDelete(&(MyProc->syncRepLinks));
>
> SHMQueueIsDetached() should be checked before calling SHMQueueDelete().
> Walsender can already delete the backend from the queue before reaching here.

Fixed.  But come to think of it, doesn't this mean
SyncRepCleanupAtProcExit() needs to repeat the test after acquiring
the lock?

> +               if (QueryCancelPending)
> +               {
> +                       QueryCancelPending = false;
> +                       ereport(WARNING,
> +                                       (errmsg("canceling wait for 
> synchronous replication due to user request"),
> +                                        errdetail("The transaction has 
> committed locally, but may not
> have replicated to the standby.")));
> +                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> +                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
> +                       SHMQueueDelete(&(MyProc->syncRepLinks));
> +                       LWLockRelease(SyncRepLock);
>
> Same as above.

Fixed.

> +               if (!PostmasterIsAlive(true))
> +               {
> +                       whereToSendOutput = DestNone;
> +                       proc_exit(1);
>
> proc_exit() should not be called at that point because it leads PANIC.

Fixed, although I'm not sure it matters.

> I think that it's better to check ProcDiePending, QueryCancelPending
> and PostmasterIsAlive *before* waiting on the latch, not after. Because
> those events can occur before reaching there, and it's not worth waiting
> for 60 seconds to detect them.

Not necessary.  Inspired by one of your earlier patches, I made die()
and StatementCancelHandler() set the latch.  We could still wait up to
60 s to detect postmaster death, but that's a very rare situation and
it's not worth slowing down the common case for it, especially since
there is a race condition no matter what.

Thanks for the review!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


sync-rep-wait-fixes-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Cédric Villemain

2011/3/17 Robert Haas :
> On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby  wrote:
>> One way to look at this is that any system will have a limit on how quickly 
>> it can vacuum everything. If it's having trouble dedicating enough IO to 
>> vacuum, then autovac is going to have a long list of tables that it wants to 
>> vacuum. When you're in that situation, you want to get to the next table 
>> that needs vacuuming as quickly as possible, so if you've run through the 
>> first heap scan and found only a limited number of dead tuples, it doesn't 
>> make sense to spend a bunch of time scanning indexes and making a second 
>> heap scan (though, IIRC the second scan doesn't hit the entire heap; it only 
>> hits the tuples that were remembered as being dead).
>
> I mostly agree with this, but you also can't postpone vacuuming
> indefinitely just because you're too busy; that's going to blow up in
> your face.
>
>> Of course, going along the lines of an autovac-based tuning mechanism, you 
>> have to question how a table would show up for autovac if there's not 
>> actually a number of dead tuples. One scenario is freezing (though I'm not 
>> sure if your super-lazy vacuum could freeze tuples or not). Another is 
>> inserts. That might become a big win; you might want to aggressively scan a 
>> table that gets data loaded into it in order to set hint/all visible bits.
>
> Right.  Really-lazy vacuum could freeze tuples.  Unlike regular
> vacuum, it can also sensibly be done incrementally.  One thing I was
> thinking about is counting the number of times that we fetched a tuple
> that was older than RecentGlobalXmin and had a committed xmin and an
> invalid xmax, but where the page was not PD_ALL_VISIBLE.  If that's
> happening a lot, it probably means that some vacuuming would speed
> things up, by getting those PD_ALL_VISIBLE bits set.  Perhaps you
> could work out some formula where you do a variable amount of
> super-lazy vacuuming depending on the number of such tuple fetches.
> The trick would be to avoid overdoing it (so that you swamp the I/O
> system) or underdoing it (so that the system never converges).  It
> would be really nice (for this and for other things) if we had some
> way of measuring the I/O saturation of the system, so that we could
> automatically adjust the aggressiveness of background processes
>

Yes. I am thinking of something like that (the IO saturation
measurement) to let the background writer try to work on hint bit when
it does not have so much to do, if IO ressources are ok.

>
> Note also that if and when we get index-only scans, making sure the
> PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
> set is going to be a lot more important.
>
>> From a manual standpoint, ISTM that super-lazy vac would be extremely useful 
>> for dealing with hint bits after a bulk insert to a table that also has some 
>> update activity. Using a regular vacuum in that case would result in a lot 
>> of extra work to deal with the small number of dead tuples.
>
> I can see that.
>
>> Perhaps it would be useful to write a script that analyzed the output of 
>> vacuum verbose looking for tables where a super-lazy vacuum would have made 
>> sense (assuming vacuum verbose provides the needed info). If we had such a 
>> script we could ask folks to run it and see how much super-lazy vacuuming 
>> would help in the real world.
>
> I'm a bit doubtful about this part.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-17 Thread Heikki Linnakangas


On 16.03.2011 19:35, Robert Haas wrote:

3. If synchronous_standby_names is changed to '' by editing
postgresql.conf and issuing pg_ctl reload, then cancel all waits in
progress and wake everybody up.  As I mentioned before, reloading the
config file from within the waiting backend (which can't safely throw
an error) seems risky, so what I did instead is made WAL writer
responsible for handling this.  Nobody's allowed to wait for sync rep
unless a global shared memory flag is set, and the WAL writer process
is responsible for setting and clearing this flag when the config file
is reloaded.  This has basically no performance cost; WAL writer only
ever does any extra work at all with this code when it receives a
SIGHUP, and even then the work is trivial except in the case where
synchronous_standby_names has changed from empty to non-empty or visca
versa.  The advantage of putting this in WAL writer rather than, say,
bgwriter is that WAL writer doesn't have nearly as many jobs to do and
they don't involve nearly as much I/O, so the chances of a long delay
due to the process being busy are much less.


Hmm, so setting synchronous_standby_names to '' takes effect 
immediately, but other changes to it don't apply to already-blocked 
commits. That seems a bit inconsistent. Perhaps walwriter should store 
the parsed list of standby-names in shared memory, not just a boolean.


+1 otherwise.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] really lazy vacuums?

2011-03-17 Thread Jesper Krogh


Robert Haas wrote:

Right.  Really-lazy vacuum could freeze tuples.  Unlike regular
vacuum, it can also sensibly be done incrementally.  One thing I was
thinking about is counting the number of times that we fetched a tuple
that was older than RecentGlobalXmin and had a committed xmin and an
invalid xmax, but where the page was not PD_ALL_VISIBLE.  If that's
happening a lot, it probably means that some vacuuming would speed
things up, by getting those PD_ALL_VISIBLE bits set.  Perhaps you
could work out some formula where you do a variable amount of
super-lazy vacuuming depending on the number of such tuple fetches.
The trick would be to avoid overdoing it (so that you swamp the I/O
system) or underdoing it (so that the system never converges).  It
would be really nice (for this and for other things) if we had some
way of measuring the I/O saturation of the system, so that we could
automatically adjust the aggressiveness of background processes
accordingly.

Note also that if and when we get index-only scans, making sure the
PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
set is going to be a lot more important.
  


Is it obvious that the visibillity map bits should track complete
pages and not individual tuples? If the visibillity map tracks at
page-level the benefit would fall on "slim tables" where you squeeze
200 tuples into each page and having an update rate of 1% would
lower the likelyhood even more. (it may be that for slim tables the
index-only-scans are not as benefitial as to wide tables).

In collaboration with a vacuuming discussion, I dont know if it
is there allready but how about "opportunistic vacuuming". Say
you have a page what due to changes in one of the tuples are
being written out, will it, while being written out anyway get the
other tuples on the page vacuumed?

It actually dont have to hook into the process directly to benefit
the IO-usage, if it just can get the opportunity to do it before
the page gets evicted from the OS-cache, then it would save a
second read on that page, but it seems way harder to do something
sane around that assumption.

Really lazy vacuums would "only" benefit "really static tables" ?  where
vacuuming is not that big a problem in the first place.


--
Jesper - Demonstrating totally lack of insight I would assume.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] volatile markings to silence compilers

2011-03-17 Thread Martijn van Oosterhout

On Thu, Mar 17, 2011 at 12:36:59AM -0400, Bruce Momjian wrote:
> Looking over the release notes, we have added a few 'volatile' storage
> specifications to variables which are near longjump/TRY blocks to
> silence compilers.  I am worried that these specifications don't clearly
> identify their purpose.  Can we rename these to use a macro for
> 'volatile' that will make their purpose clearer and perhaps their
> removal one day easier?

The question is, are they wrong? The longjmp manpage says:

   The values of automatic variables are unspecified after a call
   to longjmp() if they meet all the following criteria:

   Â·  they are local to the function that made the corresponding
  setjmp(3) call;

   Â·  their values are changed between the calls to setjmp(3) and
  longjmp(); and

   Â·  they are not declared as volatile.

It appears the issue is mostly that the compiler is unable to prove
that the variables aren't changed. It's hard because the buffer created
by setjmp() doesn't expire. We know that after PG_END_TRY() the buffer
won't be used, but apparently the compiler doesn't.

My point is, are we hopeful this problem will ever go away?

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first. 
>   - Charles de Gaulle


signature.asc
Description: Digital signature

39 matches

Mail list logo