Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 08:35, Robert Haas  wrote:

>> I expect it would be fine to have a tool that pulls LCRs out of WAL to
>> prepare that to be sent to remote locations.  Is that what you have in
>> mind?
>
> Yes.  I think it should be possible to generate LCRs from WAL, but I
> think that the on-the-wire format for LCRs should be different from
> the WAL format.  Trying to use the same format for both things seems
like an unpleasant straitjacket.

You're confusing things here.

Yes, we can use a different on-the-wire format. No problem.

As I've already said, the information needs to be in WAL first before
we can put it into LCRs, so the "don't use same format" argument is
not relevant as to why the info must be on the WAL record in the first
place. And as said elsewhere, doing that does not cause problems in
any of the areas so far claimed on this thread: WAL bloat, performance,
or long-term restrictions on the number of node ids.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Heikki Linnakangas

On 20.06.2012 01:27, Kevin Grittner wrote:

Andres Freund  wrote:


Yes, that's definitely a valid use-case. But that doesn't preclude
the other - also not uncommon - use-case where you want to have
different masters which all contain up-to-date data.


I agree.  I was just saying that while one requires an origin_id,
the other doesn't.  And those not doing MM replication definitely
don't need it.


I think it would be helpful to list a few concrete examples of
this. The stereotypical multi-master scenario is that you have a single
table that's replicated to two servers, and you can insert/update/delete
on either server. Conflict resolution strategies vary.


The reason we need an origin id in this scenario is that otherwise this 
will happen:


1. A row is updated on node A
2. Node B receives the WAL record from A, and updates the corresponding 
row in B. This generates a new WAL record.
3. Node A receives the WAL record from B, and updates the row again.
This again generates a new WAL record, which is replicated to B, and you
loop indefinitely.


If each WAL record carries an origin id, node A can use it to refrain 
from applying the WAL record it receives from B, which breaks the loop.
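
(As a hedged illustration, that apply-side filter could look roughly like
this; the type and field names are invented for the example and are not
the patch's actual API:)

#include <stdint.h>

/* Hypothetical decoded change record; names are illustrative only. */
typedef struct LogicalChangeRecord
{
    uint16_t    origin_node_id;   /* node where the change first happened */
    /* ... payload describing the insert/update/delete ... */
} LogicalChangeRecord;

static uint16_t local_node_id;    /* this node's id, e.g. A */

static void
apply_record(LogicalChangeRecord *lcr)
{
    /*
     * Break the loop: a change that originated here has already been
     * applied locally, so skip it instead of re-applying and re-logging.
     */
    if (lcr->origin_node_id == local_node_id)
        return;

    /* ... apply the change; preserve origin_node_id in the new WAL ... */
}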


However, note that in this simple scenario, if the logical log replay / 
conflict resolution is smart enough to recognize that the row has 
already been updated, because the old and the new rows are identical, 
the loop is broken at step 3 even without the origin id. That works for 
the newest-update-wins and similar strategies. So the origin id is not 
absolutely necessary in this case.


Another interesting scenario is that you maintain a global counter, like 
in an inventory system, and conflicts are resolved by accumulating the 
updates. For example, if you do "UPDATE SET counter = counter + 1" 
simultaneously on two nodes, the result is that the counter is 
incremented by two. The avoid-update-if-already-identical optimization 
doesn't help in this case; the origin id is necessary.


Now, let's take the inventory system example further. There are actually 
two ways to update a counter. One is when an item is checked in or out 
of the warehouse, i.e. "UPDATE counter = counter + 1". Those updates
should accumulate. But another operation resets the counter to a 
specific value, "UPDATE counter = 10", like when taking an inventory. 
That should not accumulate with other changes, but should be 
newest-update-wins. The origin id is not enough for that, because by
looking at the WAL record and the origin id, you don't know which type
of update it was.
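
(To make that concrete, here is a hedged sketch of an apply-side resolver;
the record layout is entirely hypothetical, and the point is that the
update "kind" must be carried somewhere in addition to the origin id:)

#include <stdint.h>

typedef enum { CH_DELTA, CH_SET } ChangeKind;

/* Hypothetical conflict-resolution record for the counter column. */
typedef struct CounterChange
{
    uint16_t    origin_node_id;
    ChangeKind  kind;    /* not deducible from the WAL record plus origin id */
    int64_t     value;   /* delta for CH_DELTA, new value for CH_SET */
} CounterChange;

static int64_t
resolve(int64_t current, const CounterChange *ch)
{
    if (ch->kind == CH_DELTA)
        return current + ch->value;   /* check-in/out: accumulate */
    return ch->value;                 /* inventory reset: newest wins */
}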


So, I don't like the idea of adding the origin id to the record header. 
It's only required on some occasions, and only for some record types. And I'm
worried it might not even be enough in more complicated scenarios.


Perhaps we need a more generic WAL record annotation system, where a 
plugin can tack arbitrary information to WAL records. The extra 
information could be stored in the WAL record after the rmgr payload, 
similar to how backup blocks are stored. WAL replay could just ignore 
the annotations, but a replication system could use it to store the 
origin id or whatever extra information it needs.
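
(As a hedged illustration only - none of these names exist in PostgreSQL -
such an annotation might be laid out like this:)

#include <stdint.h>

/*
 * Hypothetical annotation appended after the rmgr payload of a WAL record,
 * analogous to how backup blocks are stored.  Replay would skip it; a
 * replication plugin would interpret ann_data.
 */
typedef struct XLogRecordAnnotation
{
    uint16_t    ann_plugin_id;    /* which plugin wrote this annotation */
    uint16_t    ann_len;          /* number of data bytes that follow */
    /* char ann_data[ann_len];       e.g. an origin node id */
} XLogRecordAnnotation;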


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 11:26, Tom Lane  wrote:
> "Kevin Grittner"  writes:
>> Simon Riggs  wrote:
>>> The proposal is to use WAL to generate the logical change stream.
>>> That has been shown in testing to be around x4 faster than having
>>> a separate change stream, which must also be WAL logged (as Jan
>>> noted).
>
>> Sure, that's why I want it.
>
> I think this argument is basically circular.  The reason it's 4x faster
> is that the WAL stream doesn't actually contain all the information
> needed to generate LCRs (thus all the angst about maintaining catalogs
> in sync, what to do about unfriendly datatypes, etc).  By the time the
> dust has settled and you have a workable system, you will have bloated
> WAL and given back a large chunk of that multiple, thereby invalidating
> the design premise.  Or at least that's my prediction.

The tests were conducted with the additional field added, so your
prediction is not verified.

The additional fields do not bloat WAL records - they fit into existing
padding, so records take up exactly the same space as before.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-19 Thread Amit Kapila
>>  I've got a problem with the assumption that, when pg_control is trash,
>>  megabytes or gigabytes of WAL can still be relied on completely.
>>
>>  I'm almost inclined to suggest that we not get next-LSN from WAL, but
>>  by scanning all the pages in the main data store and computing the max
>>  observed LSN.  This is clearly not very attractive from a performance
>>  standpoint, but it would avoid the obvious failure mode where you lost
>>  some recent WAL segments along with pg_control.

> I think it could be useful to have a tool that scans all the blocks
> and computes that value, but I'd want it to just print the value out
> and let me decide what to do about it.  There are cases where you
> don't necessarily want to clobber pg_control, but you do have future
> LSNs in your data file pages.  This can be either because the disk ate
> your WAL, or because you didn't create recovery.conf, or because your
> disk corrupted the LSNs on the data file pages.  I'd want a tool that
> could be either run on an individual file, or recursively on a
> directory.
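
(A minimal sketch of such a scanner, assuming 8 kB blocks and the 9.2-era
two-part LSN stored in the first eight bytes of each page header; most
error handling is elided:)

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* 9.2-era XLogRecPtr layout: two 32-bit halves at the start of the page. */
typedef struct { uint32_t xlogid; uint32_t xrecoff; } PageLSN;

int
main(int argc, char **argv)
{
    FILE   *fp;
    char    page[BLCKSZ];
    PageLSN max = {0, 0};

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
        return 1;

    while (fread(page, 1, BLCKSZ, fp) == BLCKSZ)
    {
        PageLSN lsn;

        memcpy(&lsn, page, sizeof(lsn));
        if (lsn.xlogid > max.xlogid ||
            (lsn.xlogid == max.xlogid && lsn.xrecoff > max.xrecoff))
            max = lsn;
    }
    fclose(fp);

    /* Print the value and let the DBA decide, per the suggestion above. */
    printf("max observed LSN: %X/%X\n", max.xlogid, max.xrecoff);
    return 0;
}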

The whole point is that we need to find a valid next-LSN (the redo replay
location, as I understand it). If we let the user decide it, I think that can
lead to an inconsistent database.

As per my understanding, a Postgres database can reach a consistent point
after a crash only if it has both the data files and the WAL. So if the WAL
is lost, I don't see how we can make the database consistent.

> If these values seem acceptable, use -f to force reset.
> [rhaas pgsql]$ pg_resetxlog -f ~/pgdata
> pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
> Transaction log reset
> [rhaas pgsql]$ postgres
> LOG:  database system was shut down at 2012-06-19 15:25:28 EDT
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started

> So I still don't understand what problem we're solving here.

1. The values (like nextOid, nextXid, etc.) are currently guessed; they can
   be improved by taking them from the last checkpoint record in the WAL
   files.

2. The next-LSN value (ControlFile.checkPointCopy.redo) is also a guessed
   value; if it is used directly for recovery after pg_resetxlog, it will
   lead to an inconsistent database. So I want to improve the logic to
   derive a more appropriate, reliable value for next-LSN.

The documentation mentions that a database started after pg_resetxlog can
contain inconsistent data; the exact wording is quoted below.

My proposal for this TODO item is to improve the values generated for
pg_control, so that it becomes easier for users to recover from database
corruption scenarios.
Even with this feature, users will not be able to recover the database in
every corruption scenario, but it should improve on the current situation.

pg_resetxlog documentation excerpt:
"After running this command, it should be possible to start the server, but
bear in mind that the database might contain inconsistent data due to
partially-committed transactions. You should immediately dump your data, run
initdb, and reload. After reload, check for inconsistencies and repair as
needed."


With Regards,
Amit Kapila.





Re: [HACKERS] return values of backend sub-main functions

2012-06-19 Thread Josh Kupershmidt
On Tue, Jun 19, 2012 at 4:31 AM, Peter Eisentraut  wrote:
> On ons, 2012-01-18 at 21:21 +0200, Peter Eisentraut wrote:
>> On lör, 2012-01-07 at 16:41 -0500, Tom Lane wrote:
>> > Peter Eisentraut  writes:
>> > > I suggest that we change PostgresMain(), PostmasterMain(), BackendRun(),
>> > > WalSenderMain(), and WalSndLoop() to return void as well.
>> >
>> > I agree this code is not very consistent or useful, but one question:
>> > what should the callers do if one of these functions *does* return?
>>
>> I was thinking of a two-pronged approach:  First, add
>> __attribute__((noreturn)) to the functions.  This will cause a suitable
>> compiler to verify on a source-code level that nothing actually returns
>> from the function.  And second, at the call site, put an abort();  /*
>> not reached */.  Together, this will make the code cleaner and more
>> consistent, and will also help the compiler out a bit about the control
>> flow.
>
> Patch for 9.3 attached.

+1. Should this call around line 4114 of postmaster.c now stop bothering
with proc_exit():

/* And run the backend */
proc_exit(BackendRun(&port));
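
(For reference, the two-pronged pattern Peter describes, as a
self-contained sketch with simplified names rather than the committed
patch:)

#include <stdlib.h>

/* Prong 1: the attribute lets the compiler verify that no code path
 * actually returns from the function. */
static void backend_run(void) __attribute__((noreturn));

static void
backend_run(void)
{
    /* ... main processing loop ... */
    exit(0);        /* every exit path leaves via exit()/proc_exit() */
}

int
main(void)
{
    backend_run();
    abort();        /* Prong 2: not reached, but makes control flow explicit */
}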

I was hoping that some of the clang static analyzer complaints would
go away with these changes, though it looks like only one[1] did. I
would be interested to see the similar elog/ereport patch you
mentioned previously; perhaps that will eliminate a bunch of warnings.

Josh

[1] this one goes away with the patch:
http://kupershmidt.org/pg/scan-build-2012-06-19-master/report-E2cUqJ.html#EndPath



Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Greg Smith

On 06/19/2012 08:41 PM, Robert Haas wrote:

On Tue, Jun 19, 2012 at 7:33 PM, Josh Berkus  wrote:

To what version will we backport it?


All supported branches, I would think.


Yeah, I would feel a lot better if this was improved in 8.3 and up.  I 
didn't bother trying to push it before now because I wanted both more 
bad examples and some miles on 9.1 without the new code triggering any 
problems from the field.  Seems we're at that point now.


I fear a lot of people are going to stay on 8.3 for a long time even 
after it's officially unsupported.  It's been hard enough to move some 
people off of 8.1 and 8.2 even with the carrot of "8.3 will be much 
faster at everything".  I fear 8.3 is going to end up like 7.4, where a 
seven-year support window passes and people are still using it years later.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-19 Thread Amit Kapila
>> I believe that if the WAL files are intact, as mentioned in Alvaro's mail,
>> the proposed logic should generate correct values.
>> Do you see any problem with the logic proposed in my original mail?
>> Can I resume my work on this feature?

> Maybe I'm missing your point, but... why don't you just use PITR to
> recover from the corruption of pg_control?

AFAIK PITR can be used in a scenario where there is a base backup plus the
WAL files archived since it was taken; the archived WAL can then be applied
on top of the base backup.

In this scenario we don't know the point from which to start the next
replay, so I believe it will be difficult to use PITR here.


-Original Message-
From: Fujii Masao [mailto:masao.fu...@gmail.com] 
Sent: Tuesday, June 19, 2012 7:44 PM
To: Amit Kapila
Cc: Tom Lane; Alvaro Herrera; Cédric Villemain; Pg Hackers; Robert Haas
Subject: Re: [HACKERS] Allow WAL information to recover corrupted
pg_controldata

On Tue, Jun 19, 2012 at 2:44 AM, Amit Kapila  wrote:
>> AFAIR you can create pg_control from scratch already with pg_resetxlog.
>> The hard part is coming up with values for the counters, such as the
>> next WAL location.  Some of them such as next OID are pretty harmless
>> if you don't guess right, but I'm worried that wrong next WAL could
>> make things worse not better.
>
> I believe that if the WAL files are intact, as mentioned in Alvaro's mail,
> the proposed logic should generate correct values.
> Do you see any problem with the logic proposed in my original mail?
> Can I resume my work on this feature?

Maybe I'm missing your point, but... why don't you just use PITR to
recover from the corruption of pg_control?

Regards,

-- 
Fujii Masao




Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Tom Lane
"Kevin Grittner"  writes:
> Simon Riggs  wrote:
>> The proposal is to use WAL to generate the logical change stream.
>> That has been shown in testing to be around x4 faster than having
>> a separate change stream, which must also be WAL logged (as Jan
>> noted).
 
> Sure, that's why I want it.

I think this argument is basically circular.  The reason it's 4x faster
is that the WAL stream doesn't actually contain all the information
needed to generate LCRs (thus all the angst about maintaining catalogs
in sync, what to do about unfriendly datatypes, etc).  By the time the
dust has settled and you have a workable system, you will have bloated
WAL and given back a large chunk of that multiple, thereby invalidating
the design premise.  Or at least that's my prediction.

regards, tom lane



Re: [HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes  wrote:
> But in the 9.2 branch, the slow phenotype was re-introduced in
> 1575fbcb795fc331f4, although perhaps the details of who is locking
> what differ.  I haven't yet sorted that out.

It very much does.  That commit prevents people from creating a
relation in - or renaming a relation into - a schema that is being
concurrently dropped, which in previous releases would have resulted
in inconsistent catalog contents.  I admit that it harms your test
case, but how likely is it that someone is going to put every single
table into its own schema?  And have shared_buffers low enough for
this to be the dominant cost?  I think in real-world scenarios this
isn't going to be a problem - although, of course, making the lock
manager faster would be nifty if we can do it, and this might be a
good test case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 4:22 PM, Andres Freund  wrote:
>> > 1. dllist.h has double the element overhead by having an inline value
>> > pointer (which is not needed when embedding) and a pointer to the list
>> > (which I have a hard time seeing as being useful)
>> > 2. only a doubly linked list; mine provides singly and doubly linked ones
>> > 3. missing macros to use when embedded in a larger struct (containerof()
>> > wrappers and for(...) support basically)
>> > 4. most things are external function calls...
>> > 5. far more branches/complexity in most of the functions. My
>> > implementation doesn't use any branches for the typical easy
>> > modifications (push, pop, remove element somewhere) and only one for the
>> > typical tests (empty, has-next, ...)
>> >
>> > The performance and memory aspects were crucial for the aforementioned
>> > toy project (slab allocator for postgres). It's not that crucial for the
>> > applycache, where the lists are currently mostly used, although that is
>> > also relatively performance sensitive and obviously does a lot of list
>> > manipulation/iteration.
>> >
>> > If I had to decide I would add the missing API from dllist.h to my
>> > implementation and then remove dllist.h. It's barely used - and only in an
>> > embedded fashion - as far as I can see.
>> > I can understand though if that argument is met with doubt by others ;).
>> > If that's the way it has to go I would add some more convenience support
>> > for embedding data to dllist.h and settle for that.
>>
>> I think it might be simpler to leave the name as Dllist and just
>> overhaul the implementation along the lines you suggest, rather than
>> replacing it with something completely different.  Mostly, I don't
>> want to add a third thing if we can avoid it, given that Dllist as it
>> exists today is used only lightly.
> Well, if it's the name, I have no problem with changing it, but I don't see how
> you can keep the API as it currently is and address my points.
>
> If there is some buy-in I can try to go either way (keep the existing name,
> change the API, and adjust the callers; or just adjust the callers and throw
> away the old implementation). I just don't want to get into that only to find
> that somebody disagrees with the fundamental idea.

My guess is that it wouldn't be too hard to remove some of the extra
pointers.  Anyone who is using Dllist as a non-inline list could be
converted to List * instead.  Also, the performance-critical things
could be reimplemented as macros.  I question, though, whether we
really need both single and doubly linked lists.  That seems like it's
almost certainly micro-optimization that we are better off not doing.
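
(For concreteness, a minimal sketch of the kind of embedded,
containerof-style API under discussion; the names are illustrative, not
Andres's actual patch:)

#include <stddef.h>

/* The link lives inside the caller's struct, so no per-cell allocation. */
typedef struct dlist_node { struct dlist_node *prev, *next; } dlist_node;
typedef struct dlist_head { dlist_node head; } dlist_head;

/* Recover the containing struct from an embedded node pointer. */
#define dlist_container(type, field, ptr) \
    ((type *) ((char *) (ptr) - offsetof(type, field)))

static inline void
dlist_init(dlist_head *h)
{
    h->head.prev = h->head.next = &h->head;
}

/* Branch-free push, one of the properties argued for above. */
static inline void
dlist_push_head(dlist_head *h, dlist_node *n)
{
    n->next = h->head.next;
    n->prev = &h->head;
    n->next->prev = n;
    h->head.next = n;
}

static inline void
dlist_delete(dlist_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

A caller embeds a dlist_node in its own struct and maps back to it with
dlist_container(MyStruct, node, ptr), so no separate list cell is needed.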

> The most contentious point is probably relying on USE_INLINE being available
> everywhere. Which I believe should be safe now that we have gotten rid of
> some platforms.

I would be hesitant to chuck that even though I realize it's unlikely
that we really need !USE_INLINE.  But see sortsupport for an example
of how we've handled this in the recent past.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-19 Thread Jeff Janes
On Tue, Jun 19, 2012 at 2:38 PM, Robert Haas  wrote:
> On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas  wrote:
>> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes  wrote:
>>> There was a regression introduced in 9.2 that effects the creation and
>>> loading of lots of small tables in a single transaction.
>>>
>>> It affects the loading of a pg_dump file which has a large number of
>>> small tables (10,000 schemas, one table per schema, 10 rows per
>>> table).  I did not test other schema configurations, so these
>>> specifics might not be needed to invoke the problem.
>>>
>>> It causes the loading of a dump with "psql -1 -f " to run at half the
>>> previous speed.  Speed of loading without the -1 is not changed.
>>
>> I tried to figure out why this was happening.  I tried it out on the
>> IBM POWER7 box after building from
>> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
>> minutes.  'time' reports that psql used 38 seconds of user time and 11
>> seconds of system time.  The remaining time was presumably spent
>> waiting for the backend and/or the kernel.
>>
...

>> I did a separate run using perf and it had this to say:
>>
>> Events: 1M cycles
>> +  13.18%         postgres  postgres               [.] FlushRelationBuffers
>> +   9.65%         postgres  postgres               [.] comparetup_index_btree
...
>
> I built REL9_1_STABLE from commit
> 1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
> slower than master.  So something is different in my environment
> versus yours.

Yes, I forgot perhaps the most important part.
FlushRelationBuffers is a real pig under "-1 -f -" if you are running
with large shared_buffers, and it might swamp the other issue.  I run with
shared_buffers=8MB to get around that problem, and with fsync=off if
you don't have a BBU or something.  The unlocked test in
FlushRelationBuffers recently added to HEAD is probably what makes
HEAD faster if you are using large shared_buffers.

With small shared_buffers and fsync off I get 1 minute to load 10,000
tables/schemas in the fast phenotype, or 2 minutes in the slow
phenotype.

It looks like the phenotype actually has come and gone more than once
(for different but related reasons), which is why I couldn't easily find
the place in REL9_2_STABLE where it happened with git bisect.

The fundamental problem seems to be that the slow phenotype takes
twice as many locks, which makes the local hash table twice as big,
which in turn makes the reassignment of resource owners twice as slow.

The original regression seems to be because each table had both an
AccessExclusiveLock and a ShareUpdateExclusiveLock taken on it when it
was checked to see if a toast table was needed.  I now see that that
was reverted in both branches.

But in the 9.2 branch, the slow phenotype was re-introduced in
1575fbcb795fc331f4, although perhaps the details of who is locking
what differ.  I haven't yet sorted that out.

Since the fundamental problem will hopefully be fixed in 9.3, can 9.2
live with this regression in what is probably a rather rare case
(loading a huge number of very small tables, with shared_buffers small
enough to avoid the other limitation)?

Cheers,

Jeff



Re: [HACKERS] psql tab completion for GRANT role

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 3:16 AM, Peter Eisentraut  wrote:
> On tor, 2012-06-14 at 13:38 -0400, Robert Haas wrote:
>> On Sun, Jan 8, 2012 at 3:48 PM, Peter Eisentraut  wrote:
>> > psql tab completion currently only supports the form GRANT privilege ON
>> > something TO someone (and the analogous REVOKE), but not the form GRANT
>> > role TO someone.  Here is a patch that attempts to implement the latter.
>>
>> This seems to have fallen through the cracks.
>
> No, it was committed in January.

Oops, I missed that.  Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Libxml2 load error on Windows

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 6:36 AM, Alex Shulgin  wrote:
> Dave Page  writes:
>> On Tue, Jun 19, 2012 at 11:04 AM, Alex Shulgin  
>> wrote:
>>>
>>> In a real bug-tracking system we would create a new bug/ticket and set
>>> its target version to 'candidate for next minor release' or something
>>> like that.  This way, if we don't release unless all targeted bugs are
>>> resolved, this would be taken care of (hopefully.)
>>
>> Well yes, but the point is that that is not how the project works. I'm
>> asking how we do handle this problem at the moment, because I realised
>> I haven't seen this happen in the past (largely because I haven't been
>> paying attention).
>
> It only works as long as there is some mechanism to ensure that the bugs
> which were not submitted to a commitfest are looked at.  If there is none,
> then it doesn't actually work.
>
> So is there a real and reliable mechanism for that?

No.

There probably should be, but in the meantime adding it to the
CommitFest is better than a poke in the eye with a sharp stick.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund  wrote:
>> Well, the words are fuzzy, but I would define logical replication to
>> be something which is independent of the binary format in which stuff
>> gets stored on disk.  If it's not independent of the disk format, then
>> you can't do heterogeneous replication (between versions, or between
>> products).  That precise limitation is the main thing that drives
>> people to use anything other than SR in the first place, IME.
> Not in mine. The main limitation I see is that you cannot write anything on
> the standby, which sucks majorly for many things. It's pretty much impossible
> to "fix" that for SR outside of very limited cases.
> While many scenarios don't need multimaster, *many* need to write outside of
> the standby's replication set.

Well, that's certainly a common problem, even if it's not IME the most
common, but I don't think we need to argue about which one is more
common, because I'm not arguing against it.  The point, though, is
that if the logical format is independent of the on-disk format, the
things we can do are a strict superset of the things we can do if it
isn't.  I don't want to insist that catalogs be the same (or else you
get garbage when you decode tuples).  I want to tolerate the fact that
they may very well be different.  That will in no way preclude writing
outside the standby's replication set, nor will it prevent
multi-master replication.  It will, however, enable heterogeneous
replication, which is a very important use case.  It will also mean
that innocent mistakes (like somehow ending up with a column that is
text on one server and numeric on another server) produce
comprehensible error messages, rather than garbage.

> It's not only the logging side which is a limitation in today's replication
> scenarios. The apply side scales even worse because it's *very* hard to
> distribute it between multiple backends.

I don't think that making LCR format = on-disk format is going to
solve that problem.  To solve that problem, we need to track
dependencies between transactions, so that if tuple A is modified by
T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
committed after both T1 and T2 - touches none of the same data as T1
or T2, then we can apply it in parallel, so long as we don't commit
until T1 and T2 have committed (because allowing T3 to commit early
would produce a serialization anomaly from the point of view of a
concurrent reader).
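
(A hedged sketch of that dependency rule; the bookkeeping structures are
hypothetical and deliberately simplistic:)

#include <stdint.h>
#include <stdbool.h>

#define MAX_TUPLES 1024
#define MAX_DEPS   16

/* Last txn that touched each tuple; xid 0 means "not touched yet". */
static uint32_t last_writer[MAX_TUPLES];

typedef struct ApplyTxn
{
    uint32_t    xid;
    uint32_t    deps[MAX_DEPS];   /* txns that must be applied first */
    int         ndeps;
} ApplyTxn;

/* Record that txn modifies tuple; remember whom we must wait for. */
static void
note_change(ApplyTxn *txn, uint32_t tuple)
{
    uint32_t    prev = last_writer[tuple];

    if (prev != 0 && prev != txn->xid && txn->ndeps < MAX_DEPS)
        txn->deps[txn->ndeps++] = prev;    /* e.g. T2 depends on T1 */
    last_writer[tuple] = txn->xid;
}

/*
 * A txn with no dependencies (T3 above) may be applied in parallel, but
 * to avoid serialization anomalies it still must not commit before the
 * earlier transactions have committed.
 */
static bool
can_apply_in_parallel(const ApplyTxn *txn)
{
    return txn->ndeps == 0;
}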

>> Because the routines that decode tuples don't include enough sanity
>> checks to prevent running off the end of the block, or even the end of
>> memory completely.  Consider a corrupt TOAST pointer that indicates
>> that there is a gigabyte of data stored in an 8kB block.  One of the
>> common symptoms of corruption IME is TOAST requests for -3 bytes of
>> memory.
> Yes, but we need to put safeguards against that sort of thing anyway. So sure,
> we can have bugs but this is not a fundamental limitation.

There's a reason we haven't done that already, though: it's probably
going to stink for performance.  If it turns out that it doesn't stink
for performance, great.  But if it causes a 5% slowdown on common use
cases, I suspect we're not gonna do it, and I bet I can construct a
case where it's worse than that (think: 400 column table with lots of
varlenas, sorting by column 400 to return column 399).  I think it's
treading on dangerous ground to assume we're going to be able to "just
go fix" this.

> Postgis uses one information table in a few more complex functions but not in
> anything low-level - evidenced by the fact that it was totally normal for that
> to go out of sync before 2.0.
>
> But even if such a thing would be needed, it wouldn't be problematic to make
> extension configuration tables be replicated as well.

Ugh.  That's a hack on top of a hack.  Now it all works great if type
X is installed as an extension but if it isn't installed as an
extension then the world blows up.

> I am pretty sure it's not bad-behaved. But how should the code know that? You
> want each type to explicitly say that it's unsafe if it is?

Yes, exactly.  Or maybe there are varying degrees of non-safety,
allowing varying degrees of optimization.  Like: wire format = binary
format is super-safe.  Then having to call an I/O function that
promises not to look at any catalogs is a bit less safe.  And then
there's really unsafe.

> I have played with several ideas:
>
> 1.)
> keep the decoding catalog in sync with command/event triggers, correctly
> replicating oids. If those log into some internal event table it's easy to keep
> the catalog in a correct transactional state because the events from that
> table get decoded in the transaction and replayed at exactly the right spot in
> there *after* it has been reassembled. The locking on the generating side
> takes care of the concurrency aspects.

I am not following this one completely.

> 2.)
> Keep the decoding site up2date by replicating the

Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 7:33 PM, Josh Berkus  wrote:
> To what version will we backport it?

All supported branches, I would think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 6:14 PM, Andres Freund  wrote:
> I definitely agree that low-level apply is possible as a module. Sure change
> extraction needs core support but I was talking about what you need to
> implement it reusing the "plain" logical support...
>
> What I do not understand is how you want to prevent loops in a simple manner
> without in core support:
>
> A generates a HEAP_INSERT record. Gets decoded into the LCR stream as an
> INSERT action.
> B reads the LCR stream from A and applies the changes. A new HEAP_INSERT
> record. Gets decoded into the LCR stream as an INSERT action.
> A reads the lcr stream from B and ???
>
> At this point you need to prevent a loop. If you have the information where a
> change originally happened (xl_origin_id = A in this case) you can have a
> simple filter on A which ignores change records if (lcr_origin_id ==
> local_replication_origin_id).

See my email to Chris Browne, which I think covers this.  It needs a
bit in WAL (per txn, or, heck, if it's one bit, maybe per record) but
not a whole node ID.

>> You need a backend-local hash table inside the wal reader process, and
>> that hash table needs to map XIDs to node IDs.  And you occasionally
>> need to prune it, so that it doesn't eat too much memory.  None of
>> that sounds very hard.
> It's not very hard. It's just more complex than what I propose(d).

True, but not a whole lot more complex, and a moderate amount of
complexity to save bit-space is a good trade.  Especially when Tom has
come down against eating up the bit space.  And I agree with him.  If
we've only got 16 bits of padding to work with, we should surely be judicious
in burning them when that can be avoided at the expense of a few
hundred lines of code.

>> > Btw, what do you mean with "conflating" the stream? I don't really see
>> > that being proposed.
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream.  I think that's a bad design.  Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
> No, I don't want that. I think we will need some different format once we have
> agreed how changeset extraction works.

I think you are saying that you agree with me that the formats should
be different, but that the LCR format is undecided as yet.  If that is
in fact what you are saying, great.  We'll need to decide that, of
course, but I think there is a lot of cool stuff that can be done that
way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 5:59 PM, Christopher Browne  wrote:
> On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas  wrote:
>>> Btw, what do you mean with "conflating" the stream? I don't really see that
>>> being proposed.
>>
>> It seems to me that you are intent on using the WAL stream as the
>> logical change stream.  I think that's a bad design.  Instead, you
>> should extract changes from WAL and then ship them around in a format
>> that is specific to logical replication.
>
> Yeah, that seems worth elaborating on.
>
> What has been said several times is that it's pretty necessary to
> capture the logical changes into WAL.  That seems pretty needful, in
> order that the replication data gets fsync()ed avidly, and so that we
> don't add in the race condition of needing to fsync() something *else*
> almost exactly as avidly as is the case for WAL today.

Check.

> But it's undesirable to pull *all* the bulk of contents of WAL around
> if it's only part of the data that is going to get applied.  On a
> "physical streaming" replica, any logical data that gets captured will
> be useless.  And on a "logical replica," the "physical" bits of WAL
> will be useless.
>
> What I *want* you to mean is that there would be:
> a) WAL readers that pull the "physical bits", and
> b) WAL readers that just pull "logical bits."
>
> I expect it would be fine to have a tool that pulls LCRs out of WAL to
> prepare that to be sent to remote locations.  Is that what you have in
> mind?

Yes.  I think it should be possible to generate LCRs from WAL, but I
think that the on-the-wire format for LCRs should be different from
the WAL format.  Trying to use the same format for both things seems
like an unpleasant straitjacket.  This discussion illustrates why:
we're talking about consuming scarce bit-space in WAL records for a
feature that only a tiny minority of users will use, and it's still
not really enough bit space.  That stinks.  If LCR transmission is a
separate protocol, this problem can be engineered away at a higher
level.

Suppose we have three servers, A, B, and C, that are doing
multi-master replication in a loop.  A sends LCRs to B, B sends them
to C, and C sends them back to A.  Obviously, we need to make sure
that each server applies each set of changes just once, but it
suffices to have enough information in WAL to distinguish between
replication transactions and non-replication transactions - that is,
one bit.  So suppose a change is made on server A.  A generates LCRs
from WAL, and tags each LCR with node_id = A.  It then sends those
LCRs to B.  B applies them, flagging the apply transaction in WAL as a
replication transaction, AND ALSO sends the LCRs to C.  The LCR
generator on B sees the WAL from apply, but because it's flagged as a
replication transaction, it does not generate LCRs.  So C receives
LCRs from B just once, without any need for the node_id to be known
in WAL.  C can now also apply those LCRs (again flagging the apply
transaction as replication) and it can also skip sending them to A,
because it sees that they originated at A.
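
(A hedged sketch of that one-bit scheme; the names are invented for the
example:)

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-transaction state seen by the LCR generator. */
typedef struct DecodedTxn
{
    uint32_t    xid;
    bool        is_replication_apply;   /* the single bit, set in WAL by
                                         * the apply process */
} DecodedTxn;

/*
 * LCR generator on node B: emit LCRs only for locally originated work.
 * Transactions applied from A are flagged, so they are decoded but not
 * re-emitted; B forwards A's LCRs separately, still tagged node_id = A.
 */
static bool
should_emit_lcr(const DecodedTxn *txn)
{
    return !txn->is_replication_apply;
}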

Now suppose we have a more complex topology.  Suppose we have a
cluster of four servers A .. D which, for improved tolerance against
network outages, are all connected pairwise.  Normally all the links
are up, so each server sends all the LCRs it generates directly to all
other servers.  But how do we prevent cycles?  A generates a change
and sends it to B, C, and D.  B then sees that the change came from A
so it sends it to C and D.  C, receiving that change, sees that came
from A via B, so it sends it to D again, whereupon D, which got it
from C and knows that the origin is A, sends it to B, who will then
send it right back over to D.  Obviously, we've got an infinite loop
here, so this topology will not work.  However, there are several
obvious ways to patch it by changing the LCR protocol.  Most
obviously, and somewhat stupidly, you could add a TTL.  A bit smarter,
you could have each LCR carry a LIST of node_ids that it had already
visited, rather than a single node_id, refusing to send it to any node
it had already been to.  Smarter still, you could send
handshaking messages around the cluster so that each node can build up
a spanning tree and prefix each LCR it sends with the list of
additional nodes to which the recipient must deliver it.  So,
normally, A would send a message to each of B, C, and D destined only
for that node; but if the A-C link went down, A would choose either B
or D and send each LCR to that node destined for that node *and C*;
that node would then forward the message on to C.  Or perhaps you think this is too
complex and not worth supporting anyway, and that might be true, but
the point is that if you insist that all of the identifying
information must be carried in WAL, you've pretty much ruled it out,
because we are not going to put TTL fields, or lists of node IDs, or
lists of destinations, in WAL.  But there is no reason they can't be
attached to LCRs.

Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)

2012-06-19 Thread Tom Lane
Dean Rasheed  writes:
> On 12 February 2012 02:06, Vik Reykja  wrote:
>> I decided to take a crack at the todo item created from the following post:
>> http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php

> Here's my review of this patch.

I've marked this patch committed, although in the end there was nothing
left of it ;-).  After teaching the trigger skip logic that old PK nulls
or new FK nulls mean the constraint needn't be checked, there is no case
where ri_KeysEqual is called on data that is not known to be
all-non-null on one side or the other.  So it doesn't matter whether
we use plain equality or is-not-distinct logic.  We could have applied
these changes anyway but I didn't see much value, since as previously
noted there would be some cases where the comparison got microscopically
slower.  I did add a comment about the point in case anybody revisits
the code again in future.

> There's also a separate question about whether the RI trigger
> functions need to be doing these key comparisons, given that they are
> done earlier when the triggers are queued
> (http://archives.postgresql.org/pgsql-performance/2005-10/msg00459.php),
> but the savings to be made there are likely to be smaller than the
> savings this patch makes by not queuing the triggers at all.

I haven't looked into this question.  It would only matter for PK
updates (since the FK-side triggers make no such comparisons), so it's
probably not going to make much difference in typical workloads.
Still, if anybody wants to investigate ...

regards, tom lane



Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Josh Berkus
To what version will we backport it?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com





Re: [HACKERS] Foreign keys in pgbench

2012-06-19 Thread Tom Lane
Jeff Janes  writes:
> I think that pgbench should make it easy to assess the impact of
> foreign key constraints.

> The attached adds a --foreign-keys option to initialization mode which
> creates all the relevant constraints between the default tables.

I had need of this for testing what I'm doing with foreign keys,
so I went ahead and gave it a quick review (just cosmetic changes)
and committed it.

regards, tom lane



Re: [HACKERS] WAL format changes

2012-06-19 Thread Andres Freund
Hi,

On Wednesday, June 20, 2012 12:24:54 AM Heikki Linnakangas wrote:
> On 19.06.2012 18:46, Andres Freund wrote:
> > On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:
> >> Well, that was easier than I thought. Attached is a patch to make
> >> XLogRecPtr a uint64, on top of my other WAL format patches. I think we
> >> should go ahead with this.
> > 
> > Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not
> > sure if having two representations which just have a constant factor
> > in between really makes sense.

> I wasn't planning to, it didn't even occur to me that we might be able
> to get rid of XLogSegNo, to be honest. There are places that deal with whole
> segments, rather than with specific byte positions in the WAL, so I
> think XLogSegNo makes more sense in that context. Take
> XLogArchiveNotifySeg(), for example. It notifies the archiver that a
> given segment is ready for archiving, so we pass an XLogSegNo to
> identify that segment as an argument. I suppose we could pass an
> XLogRecPtr that points to the beginning of the segment instead, but it
> doesn't really feel like an improvement to me.
I am not sure it's a win either; I was just wondering because they are now
that similar.
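
(The constant factor is just the segment size; with a 64-bit XLogRecPtr the
conversion is plain integer arithmetic, as in this sketch using the default
16 MB segment size - the macro names are illustrative:)

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* byte position in the WAL, per the patch */
typedef uint64_t XLogSegNo;     /* whole-segment number */

#define XLOG_SEG_SIZE   (16 * 1024 * 1024)    /* default 16 MB segments */

/* recptr -> containing segment, and segment -> its first byte */
#define RecPtrToSegNo(recptr)   ((XLogSegNo) ((recptr) / XLOG_SEG_SIZE))
#define SegNoToRecPtr(segno)    ((XLogRecPtr) ((segno) * XLOG_SEG_SIZE))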

> >> Should we keep the old representation in the replication protocol
> >> messages? That would make it simpler to write a client that works with
> >> different server versions (like pg_receivexlog). Or, while we're at it,
> >> perhaps we should mandate network-byte order for all the integer and
> >> XLogRecPtr fields in the replication protocol.
> > 
> > The replication protocol uses pq_sendint for integers which should take
> > care of converting to big endian already. I don't think anything but the
> WAL itself is otherwise transported in a binary fashion? So I don't
> > think there is any such architecture dependency in the protocol
> > currently?
> We only use pq_sendint() for the few values exchanged in the handshake
> before we start replicating, but once we begin, we just send structs
> 
> around. For example, in ProcessStandbyReplyMessage():
> > static void
> > ProcessStandbyReplyMessage(void)
> > {
> >     StandbyReplyMessage reply;
> >
> >     pq_copymsgbytes(&reply_message, (char *) &reply,
> >                     sizeof(StandbyReplyMessage));
> > ...
> 
> After that, we just use the fields in the reply struct like in any other
> struct.
Yes, forgot that, true. I guess the best fix would be to actually send normal 
messages instead of CopyData ones? Much more to type though...

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Kevin Grittner
Andres Freund  wrote:
 
> Yes, that's definitely a valid use-case. But that doesn't preclude
> the other - also not uncommon - use-case where you want to have
> different masters which all contain up-to-date data.
 
I agree.  I was just saying that while one requires an origin_id,
the other doesn't.  And those not doing MM replication definitely
don't need it.
 
-Kevin



Re: [HACKERS] WAL format changes

2012-06-19 Thread Heikki Linnakangas

On 19.06.2012 18:46, Andres Freund wrote:

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
if having two representations which just have a constant factor in between
really makes sense.


I wasn't planning to, it didn't even occur to me that we might be able 
to get rid of XLogSegNo, to be honest. There are places that deal with whole
segments, rather than with specific byte positions in the WAL, so I 
think XLogSegNo makes more sense in that context. Take 
XLogArchiveNotifySeg(), for example. It notifies the archiver that a 
given segment is ready for archiving, so we pass an XLogSegNo to 
identify that segment as an argument. I suppose we could pass an 
XLogRecPtr that points to the beginning of the segment instead, but it 
doesn't really feel like an improvement to me.



Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take care
of converting to big endian already. I don't think anything but the WAL itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?


We only use pq_sendint() for the few values exchanged in the handshake
before we start replicating, but once we begin, we just send structs 
around. For example, in ProcessStandbyReplyMessage():



static void
ProcessStandbyReplyMessage(void)
{
    StandbyReplyMessage reply;

    pq_copymsgbytes(&reply_message, (char *) &reply,
                    sizeof(StandbyReplyMessage));
    ...

After that, we just use the fields in the reply struct like in any other struct.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 11:39:46 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith  wrote:
> > In January of 2011 Robert committed
> > 7f242d880b5b5d9642675517466d31373961cf98 to try and compact the fsync
> > queue when clients find it full.  There's no visible behavior change,
> > just a substantial performance boost possible in the rare but extremely
> > bad situations where the background writer stops doing fsync absorption.
> >  I've been running that in production at multiple locations since
> > practically the day it hit this mailing list, with backports all the way
> > to 8.3 being common (and straightforward to construct).  I've never seen
> > a hint of a problem with this new code.
> 
> I've been in favor of back-porting this for a while, so you'll get no
> argument from me.
> 
> Anyone disagree?
Not me; I have seen several sites having problems as well.

+1

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Andres Freund
On Wednesday, June 20, 2012 12:15:03 AM Kevin Grittner wrote:
> Simon Riggs  wrote:
> > If we use WAL in this way, multi-master implies that the data will
> > *always* be in a loop. So in any configuration we must be able to
> > tell difference between changes made by one node and another.
> 
> Only if you assume that multi-master means identical databases all
> replicating the same data to all the others.  If I have 72 masters
> replicating non-conflicting data to one consolidated database, I
> consider that to be multi-master, too.
> ...
> Of course, none of these databases have the same OID for any given
> object, and there are numerous different schemas among the
> replicating databases, so I need to get to table and column names
> before the data is of any use to me.
Yes, that's definitely a valid use-case. But that doesn't preclude the other -
also not uncommon - use-case where you want to have different masters which
all contain up-to-date data.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Kevin Grittner
Simon Riggs  wrote:
 
> The proposal is to use WAL to generate the logical change stream.
> That has been shown in testing to be around x4 faster than having
> a separate change stream, which must also be WAL logged (as Jan
> noted).
 
Sure, that's why I want it.
 
> If we use WAL in this way, multi-master implies that the data will
> *always* be in a loop. So in any configuration we must be able to
> tell difference between changes made by one node and another.
 
Only if you assume that multi-master means identical databases all
replicating the same data to all the others.  If I have 72 masters
replicating non-conflicting data to one consolidated database, I
consider that to be multi-master, too.  Especially if I have other
types of databases replicating disjoint data to the same
consolidated database, and the 72 sources have several databases
replicating disjoint sets of data to them.  We have about 1000
replication paths, none of which create a loop which can send data
back to the originator or cause conflicts.
 
Of course, none of these databases have the same OID for any given
object, and there are numerous different schemas among the
replicating databases, so I need to get to table and column names
before the data is of any use to me.
 
-Kevin



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 11:46:56 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund wrote:
> > More seriously: Even if we don't put MM in core I think putting the basis
> > for it in core so that somebody can build such a solution reusing the
> > existing infrastructure is a sensible idea. Imo the only thing that
> > requires explicit support which is hard to add outside of core is
> > prevention of loops (aka this patch). Everything else should be doable
> > reusing the hopefully modular pieces.
> I don't think prevention of loops is hard to do outside of core
> either, unless you insist on tying "logical" replication so closely to
> WAL that anyone doing MMR is necessarily getting the change stream
> from WAL.  In fact, I'd go so far as to say that the ONLY part of this
> that's hard to do outside of core is change extraction.  Even
> low-level apply can be implemented as a loadable module.
I definitely agree that low-level apply is possible as a module. Sure change 
extraction needs core support but I was talking about what you need to 
implement it reusing the "plain" logical support...

What I do not understand is how you want to prevent loops in a simple manner 
without in core support:

A generates a HEAP_INSERT record. Gets decoded into the LCR stream as an
INSERT action.
B reads the LCR stream from A and applies the changes. A new HEAP_INSERT
record. Gets decoded into the LCR stream as an INSERT action.
A reads the lcr stream from B and ???

At this point you need to prevent a loop. If you have the information where a
change originally happened (xl_origin_id = A in this case) you can have a
simple filter on A which ignores change records if (lcr_origin_id ==
local_replication_origin_id).


> >> Right.  If we decide we need this, and if we did decide to conflate
> >> the WAL stream, both of which I disagree with as noted above, then we
> >> still don't need it on every record.  It would probably be sufficient
> >> for local transactions to do nothing at all (and we can implicitly
> >> assume that they have master node ID = local node ID) and transactions
> >> which are replaying remote changes to emit one record per XID per
> >> checkpoint cycle containing the remote node ID.
> > 
> > You've gone from a pretty trivial 150-line patch without any runtime/space
> > overhead to something *considerably* more complex in that case though.
> > 
> > You suddenly need to have relatively complex logic to remember which
> > information you got for a certain xid (and forget that information
> > afterwards) and whether you already logged that xid, and you need to
> > decide about logging that information at multiple places.
> You need a backend-local hash table inside the wal reader process, and
> that hash table needs to map XIDs to node IDs.  And you occasionally
> need to prune it, so that it doesn't eat too much memory.  None of
> that sounds very hard.
It's not very hard. It's just more complex than what I propose(d).

> > Btw, what do you mean with "conflating" the stream? I don't really see
> > that being proposed.
> It seems to me that you are intent on using the WAL stream as the
> logical change stream.  I think that's a bad design.  Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.
No, I don't want that. I think we will need some different format once we have 
agreed how changeset extraction works.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 00:11, Tom Lane  wrote:
> Andres Freund  writes:
>> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>> ...  (If you are thinking
>>> of something sufficiently high-level that merging could possibly work,
>>> then it's not WAL, and we shouldn't be trying to make the WAL
>>> representation cater for it.)
>
>> Do you really see this as such a big problem?
>
> It looks suspiciously like "I have a hammer, therefore every problem
> must be a nail".  I don't like the design concept of cramming logical
> replication records into WAL in the first place.

The evidence from prototypes shown at PgCon was that using the WAL in
this way was very efficient and that this level of efficiency is
necessary to make it work in a practical manner. Postgres would not be
the first database to follow this broad design.

> However, if we're dead set on doing it that way, let us put information
> that is only relevant to logical replication records into only the
> logical replication records.

Agreed. In this case, the origin node id information needs to be
available on the WAL record so we can generate the logical changes
(LCRs).

> Saving a couple bytes in each such record
> is penny-wise and pound-foolish, I'm afraid; especially when you're
> nailing down hard, unexpansible limits at the very beginning of the
> development process in order to save those bytes.

Restricting the number of node ids is being done so that there is no
impact on anybody not using this feature.

In later implementations, it should be possible to support greater
numbers of node ids by having a variable length header. But that is an
unnecessary complication for a first release.
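
To make the space question concrete, a hypothetical record-header layout
(field names and sizes here are illustrative only, not the committed format):

#include <stdint.h>

typedef struct XLogRecordSketch
{
    uint32_t    xl_tot_len;     /* total length of the record */
    uint32_t    xl_xid;         /* transaction id */
    uint32_t    xl_len;         /* length of rmgr-specific data */
    uint8_t     xl_info;        /* flag bits */
    uint8_t     xl_rmid;        /* resource manager id */
    uint16_t    xl_origin_id;   /* originating node; fits in what was
                                 * alignment padding, hence the 64k limit */
} XLogRecordSketch;

A variable-length header would lift the 64k limit at the cost of more complex
decoding.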

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 05:59, Christopher Browne  wrote:

> But it's undesirable to pull *all* the bulk of contents of WAL around
> if it's only part of the data that is going to get applied.  On a
> "physical streaming" replica, any logical data that gets captured will
> be useless.  And on a "logical replica," the "physical" bits of WAL
> will be useless.

Completely agree.

What we're discussing here is the basic information content of a WAL
record to enable the construction of suitable LCRs.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions over pathological TCP connections

2012-06-19 Thread Leon Smith
I just realized this is essentially an instance of the Two Generals'
Problem, which is something I feel should have been more obvious to me.

On Tue, Jun 19, 2012 at 5:50 PM, Leon Smith  wrote:

> On Tue, Jun 19, 2012 at 11:59 AM, Robert Haas wrote:
>
>> On Tue, Jun 19, 2012 at 1:56 AM, Tom Lane  wrote:
>> > The transaction would be committed before a command success report is
>> > delivered to the client, so I don't think delivered-and-not-marked is
>> > possible.
>>
>> ...unless you have configured synchronous_commit=off, or fsync=off.
>>
>> Or unless your disk melts into a heap of slag and you have to restore
>> from backup.  You can protect against that last case using synchronous
>> replication.
>>
>
>
> But hard disk failure isn't in the failure model I was concerned about.
> =)  To be perfectly honest,  I'm not too concerned with either hard drive
> failure or network failure,   as we are deploying on a RAID 1+0 database
> server talking to the client over a redundant LAN,  and using asynchronous
> (Slony) replication to an identical database server just in case.   No
> single point of failure is a key philosophy of this app from top to bottom.
> Like I said,  this is mostly idle curiosity.
>
> But I'm also accustomed to trying to get work done on shockingly
> unreliable internet connections.   As a result, network failure is
> something I think about quite a lot when writing networked applications.
> So this is not entirely idle curiosity either.
>
> And thinking about this a bit more,   it's clear that the database has to
> commit before the result is sent,  on the off chance that the transaction
> fails and needs to be retried.   And that an explicit transaction block
> isn't really a solution either,  because   a "BEGIN;  SELECT dequeue_row()"
>  would get the row to the client without marking it as taken,   but the
> pathological TCP disconnect could then attack the  following "COMMIT;",
>  leaving the client to think that the row has not been actually taken when
> it in fact has.
>
> It's not clear to me that this is even a solvable problem without
> modifying the schema to include both a "taken" and a "finished processing"
> state,  and then letting elements be re-delivered after a period of time.
>   But this would then allow a pathological demon with the power to cut
> TCP connections to have a single element delivered and processed multiple times.
>
> In any case, thanks for the responses...
>
> Best,
> Leon
>


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 04:31, Kevin Grittner  wrote:

> I've done a lot of MM replication,
> and so far have not had to use a topology which allowed loops.

The proposal is to use WAL to generate the logical change stream. That
has been shown in testing to be around 4x faster than having a
separate change stream, which must also be WAL-logged (as Jan noted).

If we use WAL in this way, multi-master implies that the data will
*always* be in a loop. So in any configuration we must be able to tell
the difference between changes made by one node and another.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use of int4/int32 in C code

2012-06-19 Thread Tom Lane
"Kevin Grittner"  writes:
> Tom Lane  wrote:
>> I have not looked to see how many places do that.  If it's a reasonably
>> small number of places, I'm OK with getting rid of int4 at the C level.
>> (int2/int8 the same of course.)
 
> $ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat \
>   | egrep -c '\bint4\b'
> 570

That one looks like a lot, but a quick eyeball check suggests that many
of them are in comments and/or embedded SQL fragments where they
actually represent references to the SQL datatype, and so should be
left alone.  The impression I get is that the actual typedef uses are
concentrated in a couple of contrib modules and some of the tsearch
stuff, where they'd probably not be that painful to fix.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 10:58:44 PM Marko Kreen wrote:
> On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs  wrote:
> > On 13 June 2012 19:28, Andres Freund  wrote:
> >> This adds a new configuration parameter multimaster_node_id which
> >> determines the id used for wal originating in one cluster.
> > 
> > Looks good and it seems this aspect at least is commitable in this CF.
> > 
> > Design decisions I think we need to review are
> > 
> > * Naming of field. I think origin is the right term, borrowing from
> > Slony.
> 
> I have not read too deeply here, so maybe I am missing
> some important detail here, but the idea that users need
> to coordinate an integer config parameter globally does not
> sound too attractive to me.
> 
> Why not limit integers to local storage only and map
> them to string idents on config, UI and transport?
That should be possible. I don't want to go there till we have the base stuff 
in, but it shouldn't be too hard.
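
A sketch of the mapping Marko suggests, with hypothetical names (strings
appear only in configuration, UI, and transport; WAL keeps the small local
ids):

#include <stdint.h>
#include <string.h>

#define MAX_NODES 64

static const char *node_names[MAX_NODES];   /* filled from configuration */
static int  n_nodes;

/* Map a configured node name to the small integer id stored locally. */
static int16_t
node_id_for_name(const char *name)
{
    int         i;

    for (i = 0; i < n_nodes; i++)
        if (strcmp(node_names[i], name) == 0)
            return (int16_t) i;
    if (n_nodes == MAX_NODES)
        return -1;              /* table full; real code would error out */
    node_names[n_nodes] = name; /* caller keeps the string alive */
    return (int16_t) n_nodes++;
}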

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 05:46, Robert Haas  wrote:

> It seems to me that you are intent on using the WAL stream as the
> logical change stream.  I think that's a bad design.  Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.

The proposal is to read the WAL in order to generate LCRs. The
information needs to be in the original data in order to allow it to
be passed onwards.

In a multi-master config there will be WAL records with many origin
node ids at any time, so this information must be held at WAL record
level not page level.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Christopher Browne
On Tue, Jun 19, 2012 at 5:46 PM, Robert Haas  wrote:
>> Btw, what do you mean with "conflating" the stream? I don't really see that
>> being proposed.
>
> It seems to me that you are intent on using the WAL stream as the
> logical change stream.  I think that's a bad design.  Instead, you
> should extract changes from WAL and then ship them around in a format
> that is specific to logical replication.

Yeah, that seems worth elaborating on.

What has been said several times is that it's pretty necessary to
capture the logical changes into WAL.  That seems pretty needful, in
order that the replication data gets fsync()ed avidly, and so that we
don't add in the race condition of needing to fsync() something *else*
almost exactly as avidly as is the case for WAL today.

But it's undesirable to pull *all* the bulk of contents of WAL around
if it's only part of the data that is going to get applied.  On a
"physical streaming" replica, any logical data that gets captured will
be useless.  And on a "logical replica," the "physical" bits of WAL
will be useless.

What I *want* you to mean is that there would be:
a) WAL readers that pull the "physical bits", and
b) WAL readers that just pull "logical bits."

I expect it would be fine to have a tool that pulls LCRs out of WAL to
prepare that to be sent to remote locations.  Is that what you have in
mind?  Or are you feeling that the "logical bits" shouldn't get
captured in WAL altogether, so we need to fsync() them into a
different stream of files?
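
A conceptual sketch of (a) and (b) as two consumers of one record stream
(all types and names here are hypothetical):

typedef enum RecordKind { RECORD_PHYSICAL, RECORD_LOGICAL } RecordKind;

typedef struct WalRecordSketch
{
    RecordKind  kind;
    /* ... payload ... */
} WalRecordSketch;

static void apply_physical(const WalRecordSketch *r) { (void) r; /* redo blocks */ }
static void emit_lcr(const WalRecordSketch *r) { (void) r; /* ship to LCR stream */ }

/* Each kind of reader keeps only its own share of the stream. */
static void
dispatch(const WalRecordSketch *r, int reader_is_logical)
{
    if (r->kind == RECORD_LOGICAL && reader_is_logical)
        emit_lcr(r);
    else if (r->kind == RECORD_PHYSICAL && !reader_is_logical)
        apply_physical(r);
}
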
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_restore logging inconsistency

2012-06-19 Thread Alvaro Herrera

Excerpts from Josh Kupershmidt's message of Wed May 30 14:55:12 -0400 2012:
> Hi all,
> 
> Bosco Rama recently complained[1] about not seeing a message printed
> by pg_restore for each LO to be restored. The culprit seems to be the
> different level passed to ahlog() for this status message:
> 
> pg_backup_archiver.c: ahlog(AH, 2, "restoring large object with OID %u\n", oid);
> pg_backup_tar.c:      ahlog(AH, 1, "restoring large object OID %u\n", oid);
> 
> depending on whether one is restoring a tar-format or custom-format
> dump. I think these messages should be logged at the same level, to
> avoid this inconsistency. The attached patch logs them both with
> level=1, and makes the message texts identical. Note, as of 9.0 there
> is already a line like this printed for each LO:
> 
> pg_restore: executing BLOB 135004
> 
> so I could see the argument for instead wanting to hide the "restoring
> large object" messages. However, the OP was interested in seeing
> something like a status indicator for the lo_write() calls which may
> take a long time, and the above message isn't really helpful for that
> purpose as it is printed earlier in the restore process. Plus it seems
> reasonable to make verbose mode, well, verbose.

I applied this patch all the way back to 8.3.  Thanks.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use of int4/int32 in C code

2012-06-19 Thread Kevin Grittner
Tom Lane  wrote:
 
> I have not looked to see how many places do that.  If it's a
reasonably
> small number of places, I'm OK with getting rid of int4 at the C
level.
> (int2/int8 the same of course.)
 
$ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat \
    | egrep -c '\bint2\b'
178
$ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat \
    | egrep -c '\bint4\b'
570
$ find -name '*.h' -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat \
    | egrep -c '\bint8\b'
212
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 19 June 2012 14:03, Tom Lane  wrote:

> "Every WAL record"?  Why in heck would you attach it to every record?
> Surely putting it in WAL page headers would be sufficient.  We could
> easily afford to burn a page switch (if not a whole segment switch)
> when changing masters.

This does appear to be a reasonable idea at first glance, since it
seems that each node has just a single node id, but that is not the
case.

As we pass changes around we maintain the same origin id for a change,
so there is a mix of origin node ids at the WAL record level, not the
page level. The concept of originating node id is essentially the same as
that used in Slony.

> I'm against the idea of eating any spare space we have in WAL record
> headers for this purpose, anyway; there are likely to be more pressing
> needs in future.

Not sure what those pressing needs are, but I can't see any. What we
are doing here is fairly important, just not as important as crash
recovery. But then that has worked pretty much unchanged for some time
now.

I raised the possibility of having variable length headers, but there
is no requirement for that yet.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions over pathological TCP connections

2012-06-19 Thread Leon Smith
On Tue, Jun 19, 2012 at 11:59 AM, Robert Haas  wrote:

> On Tue, Jun 19, 2012 at 1:56 AM, Tom Lane  wrote:
> > The transaction would be committed before a command success report is
> > delivered to the client, so I don't think delivered-and-not-marked is
> > possible.
>
> ...unless you have configured synchronous_commit=off, or fsync=off.
>
> Or unless your disk melts into a heap of slag and you have to restore
> from backup.  You can protect against that last case using synchronous
> replication.
>


But hard disk failure isn't in the failure model I was concerned about.
=)  To be perfectly honest,  I'm not too concerned with either hard drive
failure or network failure,   as we are deploying on a RAID 1+0 database
server talking to the client over a redundant LAN,  and using asynchronous
(Slony) replication to an identical database server just in case.   No
single point of failure is a key philosophy of this app from top to bottom.
Like I said,  this is mostly idle curiosity.

But I'm also accustomed to trying to get work done on shockingly unreliable
internet connections.   As a result, network failure is something I think
about quite a lot when writing networked applications.   So this is not
entirely idle curiosity either.

And thinking about this a bit more,   it's clear that the database has to
commit before the result is sent,  on the off chance that the transaction
fails and needs to be retried.   And that an explicit transaction block
isn't really a solution either,  because   a "BEGIN;  SELECT dequeue_row()"
 would get the row to the client without marking it as taken,   but the
pathological TCP disconnect could then attack the  following "COMMIT;",
 leaving the client to think that the row has not been actually taken when
it in fact has.

It's not clear to me that this is even a solvable problem without modifying
the schema to include both a "taken" and a "finished processing" state,
 and then letting elements be re-delivered after a period of time.   But
this would then allow a pathological demon with the power to cut TCP
connections to have a single element delivered and processed multiple times.

In any case, thanks for the responses...

Best,
Leon


Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Alvaro Herrera

Excerpts from Robert Haas's message of Tue Jun 19 17:39:46 -0400 2012:
> On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith  wrote:
> > In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98
> > to try and compact the fsync queue when clients find it full.  There's no
> > visible behavior change, just a substantial performance boost possible in
> > the rare but extremely bad situations where the background writer stops
> > doing fsync absorption.  I've been running that in production at multiple
> > locations since practically the day it hit this mailing list, with backports
> > all the way to 8.3 being common (and straightforward to construct).  I've
> > never seen a hint of a problem with this new code.
> 
> I've been in favor of back-porting this for a while, so you'll get no
> argument from me.

+1.  I even thought we had already backported it and was surprised to
discover we hadn't, when we had this problem at a customer, not long
ago.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Christopher Browne
On Tue, Jun 19, 2012 at 5:39 PM, Robert Haas  wrote:
> On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith  wrote:
>> In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98
>> to try and compact the fsync queue when clients find it full.  There's no
>> visible behavior change, just a substantial performance boost possible in
>> the rare but extremely bad situations where the background writer stops
>> doing fsync absorption.  I've been running that in production at multiple
>> locations since practically the day it hit this mailing list, with backports
>> all the way to 8.3 being common (and straightforward to construct).  I've
>> never seen a hint of a problem with this new code.
>
> I've been in favor of back-porting this for a while, so you'll get no
> argument from me.
>
> Anyone disagree?

I recall reviewing that; it seemed like quite a good change.  Me likes.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 3:18 PM, Andres Freund  wrote:
> More seriously: Even if we don't put MM in core I think putting the basis for
> it in core so that somebody can build such a solution reusing the existing
> infrastructure is a sensible idea. Imo the only thing that requires explicit
> support which is hard to add outside of core is prevention of loops (aka this
> patch). Everything else should be doable reusing the hopefully modular pieces.

I don't think prevention of loops is hard to do outside of core
either, unless you insist on tying "logical" replication so closely to
WAL that anyone doing MMR is necessarily getting the change stream
from WAL.  In fact, I'd go so far as to say that the ONLY part of this
that's hard to do outside of core is change extraction.  Even
low-level apply can be implemented as a loadable module.

>> Right.  If we decide we need this, and if we did decide to conflate
>> the WAL stream, both of which I disagree with as noted above, then we
>> still don't need it on every record.  It would probably be sufficient
>> for local transactions to do nothing at all (and we can implicitly
>> assume that they have master node ID = local node ID) and transactions
>> which are replaying remote changes to emit one record per XID per
>> checkpoint cycle containing the remote node ID.
> You've gone from a pretty trivial 150-line patch without any runtime/space
> overhead to something *considerably* more complex in that case though.
>
> You suddenly need to have relatively complex logic to remember which
> information you got for a certain xid (and forget that information afterwards)
> and whether you already logged that xid, and you need to decide about
> logging that information at multiple places.

You need a backend-local hash table inside the wal reader process, and
that hash table needs to map XIDs to node IDs.  And you occasionally
need to prune it, so that it doesn't eat too much memory.  None of
that sounds very hard.
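
A toy sketch of that map, using a fixed-size open-addressing table purely for
illustration (a real patch would presumably use the backend's existing
dynahash facilities instead):

#include <stdint.h>
#include <string.h>

#define MAP_SIZE 4096           /* power of two */

typedef struct
{
    uint32_t    xid;
    uint16_t    node_id;
    int         used;
} OriginSlot;

static OriginSlot origin_map[MAP_SIZE];

/* Record the origin node for a transaction being replayed. */
static void
remember_origin(uint32_t xid, uint16_t node_id)
{
    uint32_t    i = xid & (MAP_SIZE - 1);

    while (origin_map[i].used && origin_map[i].xid != xid)
        i = (i + 1) & (MAP_SIZE - 1);
    origin_map[i].xid = xid;
    origin_map[i].node_id = node_id;
    origin_map[i].used = 1;
}

/* Look up a transaction's origin; 0 means "local" by convention. */
static uint16_t
origin_of(uint32_t xid)
{
    uint32_t    i = xid & (MAP_SIZE - 1);

    while (origin_map[i].used)
    {
        if (origin_map[i].xid == xid)
            return origin_map[i].node_id;
        i = (i + 1) & (MAP_SIZE - 1);
    }
    return 0;
}

/* Pruning is then just clearing the table at a safe point, e.g. after a
 * checkpoint: memset(origin_map, 0, sizeof(origin_map)); */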

> Btw, what do you mean with "conflating" the stream? I don't really see that
> being proposed.

It seems to me that you are intent on using the WAL stream as the
logical change stream.  I think that's a bad design.  Instead, you
should extract changes from WAL and then ship them around in a format
that is specific to logical replication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] use of int4/int32 in C code

2012-06-19 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 19, 2012 at 9:47 AM, Tom Lane  wrote:
>> I thought the general idea was to use int32 most places, but int4 in
>> catalog declarations.  I don't think it's tremendously important if
>> somebody uses the other though.

> I concur with Peter that TMTOWTDI is not the right way to do this.  I
> think we ought to get rid of int4 in code and use int32 everywhere.

I have not looked to see how many places do that.  If it's a reasonably
small number of places, I'm OK with getting rid of int4 at the C level.
(int2/int8 the same of course.)

If we are going to do that, though, we need to actually remove those
typedefs.  Leaving them around on the grounds that third-party code
might be using them will just allow cases to creep back in.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Backport of fsync queue compaction

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith  wrote:
> In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98
> to try and compact the fsync queue when clients find it full.  There's no
> visible behavior change, just a substantial performance boost possible in
> the rare but extremely bad situations where the background writer stops
> doing fsync absorption.  I've been running that in production at multiple
> locations since practically the day it hit this mailing list, with backports
> all the way to 8.3 being common (and straightforward to construct).  I've
> never seen a hint of a problem with this new code.

I've been in favor of back-porting this for a while, so you'll get no
argument from me.

Anyone disagree?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Simon Riggs
On 20 June 2012 04:58, Marko Kreen  wrote:
> On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs  wrote:
>> On 13 June 2012 19:28, Andres Freund  wrote:
>>> This adds a new configuration parameter multimaster_node_id which determines
>>> the id used for wal originating in one cluster.
>>
>> Looks good and it seems this aspect at least is commitable in this CF.
>>
>> Design decisions I think we need to review are
>>
>> * Naming of field. I think origin is the right term, borrowing from Slony.
>
> I have not read too deeply here, so maybe I am missing
> some important detail here, but the idea that users need
> to coordinate an integer config parameter globally does not
> sound too attractive to me.
>
> Why not limit integers to local storage only and map
> them to string idents on config, UI and transport?

Yes, that can be done. This is simply how the numbers are used
internally, similar to oids.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas  wrote:
> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes  wrote:
>> There was a regression introduced in 9.2 that affects the creation and
>> loading of lots of small tables in a single transaction.
>>
>> It affects the loading of a pg_dump file which has a large number of
>> small tables (10,000 schemas, one table per schema, 10 rows per
>> table).  I did not test other schema configurations, so these
>> specifics might not be needed to invoke the problem.
>>
>> It causes the loading of a dump with "psql -1 -f " to run at half the
>> previous speed.  Speed of loading without the -1 is not changed.
>
> I tried to figure out why this was happening.  I tried it out on the
> IBM POWER7 box after building from
> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
> minutes.  'time' reports that psql used 38 seconds of user time and 11
> seconds of system time.  The remaining time was presumably spent
> waiting for the backend and/or the kernel.
>
> I ran the backend under strace -cf and it had this to say:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  45.31   39.888639          10   3836379           write
>  18.80   16.553878           3   4929697           lseek
>  13.51   11.890826           6   1906179     12819 read
>  10.24    9.011690           7   1372354      5995 recv
>  6.27    5.517929           5   1107766           brk
>  2.07    1.818345          91     20068           fsync
>  1.46    1.281999          15     87260           send
>  1.15    1.015621          24     42527     11334 open
>
> I did a separate run using perf and it had this to say:
>
> Events: 1M cycles
> +  13.18%         postgres  postgres               [.] FlushRelationBuffers
> +   9.65%         postgres  postgres               [.] comparetup_index_btree
> +   9.34%          swapper  [kernel.kallsyms]      [k] .pseries_dedicated_idle_sleep
> +   4.41%         postgres  postgres               [.] CopyReadLine
> +   4.23%         postgres  postgres               [.] tuplesort_heap_siftup
> +   4.17%         postgres  postgres               [.] LockReassignCurrentOwner
> +   3.88%         postgres  postgres               [.] pg_mbstrlen_with_len
> +   2.74%         postgres  postgres               [.] pg_verify_mbstr_len
> +   2.46%         postgres  postgres               [.] NextCopyFromRawFields
> +   2.44%         postgres  postgres               [.] btint4cmp
> +   2.08%         postgres  libc-2.14.90.so        [.] memcpy
> +   2.06%         postgres  postgres               [.] hash_seq_search
> +   1.55%         postgres  [kernel.kallsyms]      [k] .__copy_tofrom_user
> +   1.29%         postgres  libc-2.14.90.so        [.] __GI_strtoll_l_internal
> +   1.16%         postgres  postgres               [.] pg_utf_mblen
>
> There are certainly opportunities for optimization there but it's sure
> not obvious to me what it has to do with either of the commits you
> mention.  I haven't actually verified that there's a regression on
> this box as a result of either of those commits, though: I just profiled
> master.  Maybe I better go check that.

I built REL9_1_STABLE from commit
1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
slower than master.  So something is different in my environment
versus yours.

I ran this build with perf also, and got this:

38.77% postgres  postgres   [.] FlushRelationBuffers
 6.80%  swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
 4.84% postgres  postgres   [.] comparetup_index_btree
 3.23% postgres  postgres   [.] CopyReadLine
 2.24% postgres  postgres   [.] tuplesort_heap_siftup
 2.23% postgres  postgres   [.] pg_mbstrlen_with_len
 2.08% postgres  postgres   [.] LockReassignCurrentOwner
 2.01% postgres  postgres   [.] pg_verify_mbstr_len
 1.90% postgres  postgres   [.] NextCopyFromRawFields
 1.62% postgres  postgres   [.] btint4cmp
 1.41% postgres  libc-2.14.90.so    [.] memcpy
 1.39% postgres  postgres   [.] LWLockAcquire
 1.22% postgres  postgres   [.] LWLockRelease
 1.19% postgres  postgres   [.] pg_utf_mblen
 1.05% postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-19 Thread Kevin Grittner
Andres Freund  wrote:
 
> The problem is just that to support basically arbitrary decoding
> requirements you need to provide at least those pieces of
> information in a transactionally consistent manner:
> * the data
> * table names
> * column names
> * type information
> * replication configuration
 
I'm not sure that the last one needs to be in scope for the WAL
stream, but the others I definitely agree eventually need to be
available to a logical transaction stream consumer.  You lay out the
alternative ways to get all of this pretty clearly, and I don't know
what the best answer is; it seems likely that there is not one best
answer.  In the long run, more than one of those options might need
to be supported, to support different environments.
 
As an initial implementation, I'm leaning toward the position that
requiring a hot standby or a catalog-only proxy is acceptable.  I
think that should allow an application to be written which emits
everything except the replication configuration.  That will allow us
to hook up everything we need at our shop.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing 9.2 in ~production environment

2012-06-19 Thread Peter Eisentraut
On tis, 2012-06-19 at 09:33 -0400, Tom Lane wrote:
> Come to think of it, another possible factor is that LIKE can't use
> ordinary indexes on text if the locale isn't C.

But he reported that the plans are the same.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Near-duplicate RI NO ACTION and RESTRICT triggers

2012-06-19 Thread Dean Rasheed
On 19 June 2012 17:48, Tom Lane  wrote:
> I think that the argument for having the RESTRICT triggers behave
> like this is that the SQL spec envisions the RESTRICT check occurring
> immediately when the individual PK row is updated/deleted, and so there
> would be no opportunity for another PK row to be updated into its place.
> (Or, in plainer English, RESTRICT should mean "you can't modify this
> row's keys at all if it has dependents".)  Because we implement RESTRICT
> through an AFTER trigger that can't run earlier than end-of-statement,
> we can't exactly match the spec's semantics, but we can get fairly
> close so long as you don't think about what would be seen by
> e.g. user-written triggers executing during the statement.
>
> I'm happy with continuing to have this behavioral difference between
> the two sets of triggers, but wanted to throw it up for discussion:
> does anyone think it'd be better to apply ri_Check_Pk_Match in the
> RESTRICT triggers too?
>

In SQL:2008 they've re-worded the descriptions of these actions and
added an explicit note to clarify the intended difference:

"""
— ON UPDATE RESTRICT: any change to a referenced column in the
referenced table is prohibited if there is a matching row.
— ON UPDATE NO ACTION (the default): there is no referential update
action; the referential constraint only specifies a constraint check.

NOTE 38 — Even if constraint checking is not deferred, ON UPDATE
RESTRICT is a stricter condition than ON UPDATE NO ACTION. ON UPDATE
RESTRICT prohibits an update to a particular row if there are any
matching rows; ON UPDATE NO ACTION does not perform its constraint
check until the entire set of rows to be updated has been processed.
"""

and there's a similar note in the DELETE case. So I think the current
behaviour is correct, and there are probably genuine use cases for
both types of check.

Regards,
Dean

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Marko Kreen
On Mon, Jun 18, 2012 at 6:35 PM, Simon Riggs  wrote:
> On 13 June 2012 19:28, Andres Freund  wrote:
>> This adds a new configuration parameter multimaster_node_id which determines
>> the id used for wal originating in one cluster.
>
> Looks good and it seems this aspect at least is commitable in this CF.
>
> Design decisions I think we need to review are
>
> * Naming of field. I think origin is the right term, borrowing from Slony.

I have not read too deeply here, so maybe I am missing
some important detail here, but the idea that users need
to coordinate an integer config parameter globally does not
sound too attractive to me.

Why not limit integers to local storage only and map
them to string idents on config, UI and transport?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] performance regression in 9.2 when loading lots of small tables

2012-06-19 Thread Robert Haas
On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes  wrote:
> There was a regression introduced in 9.2 that affects the creation and
> loading of lots of small tables in a single transaction.
>
> It affects the loading of a pg_dump file which has a large number of
> small tables (10,000 schemas, one table per schema, 10 rows per
> table).  I did not test other schema configurations, so these
> specifics might not be needed to invoke the problem.
>
> It causes the loading of a dump with "psql -1 -f " to run at half the
> previous speed.  Speed of loading without the -1 is not changed.

I tried to figure out why this was happening.  I tried it out on the
IBM POWER7 box after building from
f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
minutes.  'time' reports that psql used 38 seconds of user time and 11
seconds of system time.  The remaining time was presumably spent
waiting for the backend and/or the kernel.

I ran the backend under strace -cf and it had this to say:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.31   39.888639          10   3836379           write
 18.80   16.553878           3   4929697           lseek
 13.51   11.890826           6   1906179     12819 read
 10.24    9.011690           7   1372354      5995 recv
  6.27    5.517929           5   1107766           brk
  2.07    1.818345          91     20068           fsync
  1.46    1.281999          15     87260           send
  1.15    1.015621          24     42527     11334 open

I did a separate run using perf and it had this to say:

Events: 1M cycles
+  13.18% postgres  postgres   [.] FlushRelationBuffers
+   9.65% postgres  postgres   [.] comparetup_index_btree
+   9.34%  swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   4.41% postgres  postgres   [.] CopyReadLine
+   4.23% postgres  postgres   [.] tuplesort_heap_siftup
+   4.17% postgres  postgres   [.] LockReassignCurrentOwner
+   3.88% postgres  postgres   [.] pg_mbstrlen_with_len
+   2.74% postgres  postgres   [.] pg_verify_mbstr_len
+   2.46% postgres  postgres   [.] NextCopyFromRawFields
+   2.44% postgres  postgres   [.] btint4cmp
+   2.08% postgres  libc-2.14.90.so    [.] memcpy
+   2.06% postgres  postgres   [.] hash_seq_search
+   1.55% postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user
+   1.29% postgres  libc-2.14.90.so    [.] __GI_strtoll_l_internal
+   1.16% postgres  postgres   [.] pg_utf_mblen

There are certainly opportunities for optimization there but it's sure
not obvious to me what it has to do with either of the commits you
mention.  I haven't actually verified that there's a regression on
this box as a result of either of those commits, though: I just profiled
master.  Maybe I better go check that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Kevin Grittner
Andres Freund  wrote:
> Robert Haas wrote:
>> Tom Lane  wrote:
 
>>> However, if we're dead set on doing it that way, let us put
>>> information that is only relevant to logical replication records
>>> into only the logical replication records.
>> Right.  If we decide we need this, and if we did decide to
>> conflate the WAL stream, both of which I disagree with as noted
>> above, then we still don't need it on every record.  It would
>> probably be sufficient for local transactions to do nothing at
>> all (and we can implicitly assume that they have master node ID =
>> local node ID) and transactions which are replaying remote
>> changes to emit one record per XID per checkpoint cycle
>> containing the remote node ID.
> You've gone from a pretty trivial 150-line patch without any
> runtime/space overhead to something *considerably* more complex in
> that case though. 
 
I think it might be worth it.  I've done a lot of MM replication,
and so far have not had to use a topology which allowed loops.  Not
only would you be reserving space in the WAL stream which was not
useful for those not using MM replication, you would be reserving it
when even many MM configurations would not need it.  Now you could
argue that the 16 bits you want to use are already there and are not
yet used for anything; but there are two counter-arguments to that:
you lose the opportunity to use them for something else, and you
might want more than 16 bits.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 09:58:43 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 3:48 PM, Andres Freund  
wrote:
> > Why is that code not used more widely? Quite a bit of our list usage
> > should be replaced by embedding list elements in larger structs imo. There
> > are also open-coded inline list manipulations around (check aset.c for
> > example).
> Because we've got a million+ lines of code and nobody knows where all
> the bodies are buried.
Heh, yes ;). I've hit several of them, as you've uncovered at least twice ;)

> > 1. dllist.h has double the element overhead by having an inline value
> > pointer (which is not needed when embedding) and a pointer to the list
> > (which I have a hard time seeing as being useful)
> > 2. only double linked list, mine provided single and double linked ones
> > 3. missing macros to use when embedded in a larger struct (containerof()
> > wrappers and for(...) support basically)
> > 4. most things are external function calls...
> > 5. way much more branches/complexity in most of the functions. My
> > implementation doesn't use any branches for the typical easy
> > modifications (push, pop, remove element somewhere) and only one for the
> > typical tests (empty, has-next, ...)
> > 
> > The performance and memory aspects were crucial for the aforementioned
> > toy project (slab allocator for postgres). It's not that crucial for the
> > applycache where the lists currently are mostly used although it's also
> > relatively performance sensitive and obviously does a lot of list
> > manipulation/iteration.
> > 
> > If I had to decide I would add the missing API in dllist.h to my
> > implementation and then remove it. It's barely used - and only in an
> > embedded fashion - as far as I can see.
> > I can understand though if that argument is met with doubt by others ;).
> > If that's the way it has to go I would add some more convenience support
> > for embedding data to dllist.h and settle for that.
> 
> I think it might be simpler to leave the name as Dllist and just
> overhaul the implementation along the lines you suggest, rather than
> replacing it with something completely different.  Mostly, I don't
> want to add a third thing if we can avoid it, given that Dllist as it
> exists today is used only lightly.
Well, if it's the name, I have no problem with changing it, but I don't see how 
you can keep the API as it currently is and address my points.

If there is some buy-in I can try to go either way (keeping the existing name, 
changing the API, and adjusting the callers; or just adjusting the callers and 
throwing away the old implementation). I just don't want to get into that only 
to find that somebody disagrees with the fundamental idea.

The most contentious point is probably relying on USE_INLINE being available 
everywhere, which I believe is now a safe assumption given that we have gotten 
rid of some platforms.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Marko Kreen
On Tue, Jun 19, 2012 at 11:02 PM, Andres Freund  wrote:
> On Tuesday, June 19, 2012 09:59:48 PM Marko Kreen wrote:
>> On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund 
> wrote:
>> > +/*
>> > + * removes a node from a list
>> > + * Attention: O(n)
>> > + */
>> > +static inline void ilist_s_remove(ilist_s_head *head,
>> > +                                  ilist_s_node *node)
>> > +{
>> > +       ilist_s_node *last = &head->head;
>> > +       ilist_s_node *cur;
>> > +#ifndef NDEBUG
>> > +       bool found = false;
>> > +#endif
>> > +       while ((cur = last->next))
>> > +       {
>> > +               if (cur == node)
>> > +               {
>> > +                       last->next = cur->next;
>> > +#ifndef NDEBUG
>> > +                       found = true;
>> > +#endif
>> > +                       break;
>> > +               }
>> > +               last = cur;
>> > +       }
>> > +       assert(found);
>> > +}
>>
>> This looks weird.
>>
>> In a cyclic list, removal is:
>>
>>   node->prev->next = node->next;
>>   node->next->prev = node->prev;
>>
>> And thats it.
> That's the single linked list, not the double linked one. That's why it has
> an O(n) warning tacked on...

Oh, you have several list implementations there.
Sorry, I was just browsing and it caught my eye.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 09:55:30 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund  
wrote:
> > From: Andres Freund 
> > 
> > The previous coding could miss xlog writeouts at several places. E.g.
> > when wal was written out by the background writer or even after a commit
> > if synchronous_commit=off.
> > This could lead to delays in sending data to the standby of up to 7
> > seconds.
> > 
> > To fix this move the responsibility of notification to the layer where
> > the necessary information is actually present. We take some care not to
> > do the notification while we hold contended locks like WALInsertLock or
> > WalWriteLock.
> 
> I am not convinced that it's a good idea to wake up every walsender
> every time we do XLogInsert().  XLogInsert() is a super-hot code path,
> and adding more overhead there doesn't seem warranted.  We need to
> replicate commit, commit prepared, etc. quickly, but why do we need to
> worry about a short delay in replicating heap_insert/update/delete,
> for example?  They don't really matter until the commit arrives.  7
> seconds might be a bit long, but that could be fixed by decreasing the
> polling interval for walsender to, say, a second.
It's not woken up on every XLogInsert call. It's only woken up if there was an 
actual disk write + fsync in there. That's exactly the point of the patch.
The wakeup rate is actually lower for synchronous_commit=on than before, 
because previously it unconditionally did a wakeup for every commit (and 
similar) and now it only does that if something has been written + fsynced.
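
A sketch of the idea with the locking elided (WalSndWakeup is the real
walsender wakeup primitive; the surrounding function is illustrative only):

#include <stdbool.h>

extern void WalSndWakeup(void); /* declared in walsender.h in the real tree */

static void
XLogFlushSketch(void)
{
    bool        flushed = false;

    /* ... acquire WALWriteLock; write() and fsync() pending WAL,
     * setting flushed = true if anything actually reached disk ... */

    /* ... release WALWriteLock ... */

    /* Notify walsenders only after the contended lock is released, and
     * only when a write + fsync really happened. */
    if (flushed)
        WalSndWakeup();
}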

> Parenthetically, I find it difficult to extract inline patches.  No
> matter whether I try to use it using Gmail + show original or the web
> site, something always seems to get garbled.
Will use git send-email --attach next time... Btw, git am should be able to 
extract the patches for you.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Andres Freund
Hi,

On Tuesday, June 19, 2012 09:59:48 PM Marko Kreen wrote:
> On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund  
wrote:
> > +/*
> > + * removes a node from a list
> > + * Attention: O(n)
> > + */
> > +static inline void ilist_s_remove(ilist_s_head *head,
> > +  ilist_s_node *node)
> > +{
> > +   ilist_s_node *last = &head->head;
> > +   ilist_s_node *cur;
> > +#ifndef NDEBUG
> > +   bool found = false;
> > +#endif
> > +   while ((cur = last->next))
> > +   {
> > +   if (cur == node)
> > +   {
> > +   last->next = cur->next;
> > +#ifndef NDEBUG
> > +   found = true;
> > +#endif
> > +   break;
> > +   }
> > +   last = cur;
> > +   }
> > +   assert(found);
> > +}
> 
> This looks weird.
> 
> In a cyclic list, removal is:
> 
>   node->prev->next = node->next;
>   node->next->prev = node->prev;
> 
> And thats it.
That's the single linked list, not the double linked one. That's why it has 
an O(n) warning tacked on...

The double linked one is just as you said:
/*
 * removes a node from a list
 */
static inline void ilist_d_remove(unused_attr ilist_d_head *head, ilist_d_node *node)
{
ilist_d_check(head);
node->prev->next = node->next;
node->next->prev = node->prev;
ilist_d_check(head);
}

Greetings,

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Marko Kreen
On Wed, Jun 13, 2012 at 2:28 PM, Andres Freund  wrote:
> +/*
> + * removes a node from a list
> + * Attention: O(n)
> + */
> +static inline void ilist_s_remove(ilist_s_head *head,
> +                                  ilist_s_node *node)
> +{
> +       ilist_s_node *last = &head->head;
> +       ilist_s_node *cur;
> +#ifndef NDEBUG
> +       bool found = false;
> +#endif
> +       while ((cur = last->next))
> +       {
> +               if (cur == node)
> +               {
> +                       last->next = cur->next;
> +#ifndef NDEBUG
> +                       found = true;
> +#endif
> +                       break;
> +               }
> +               last = cur;
> +       }
> +       assert(found);
> +}

This looks weird.

In a cyclic list, removal is:

  node->prev->next = node->next;
  node->next->prev = node->prev;

And thats it.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 3:48 PM, Andres Freund  wrote:
> Why is that code not used more widely? Quite a bit of our list usage should be
> replaced by embedding list elements in larger structs imo. There are also open-
> coded inline list manipulations around (check aset.c for example).

Because we've got a million+ lines of code and nobody knows where all
the bodies are buried.

> 1. dllist.h has double the element overhead by having an inline value pointer
> (which is not needed when embedding) and a pointer to the list (which I have a
> hard time seeing as being useful)
> 2. only double linked list, mine provided single and double linked ones
> 3. missing macros to use when embedded in a larger struct (containerof()
> wrappers and for(...) support basically)
> 4. most things are external function calls...
> 5. way much more branches/complexity in most of the functions. My
> implementation doesn't use any branches for the typical easy modifications
> (push, pop, remove element somewhere) and only one for the typical tests
> (empty, has-next, ...)
>
> The performance and memory aspects were crucial for the aforementioned toy
> project (slab allocator for postgres). It's not that crucial for the applycache
> where the lists currently are mostly used although it's also relatively
> performance sensitive and obviously does a lot of list manipulation/iteration.
>
> If I had to decide I would add the missing API in dllist.h to my
> implementation and then remove it. It's barely used - and only in an embedded
> fashion - as far as I can see.
> I can understand though if that argument is met with doubt by others ;). If
> that's the way it has to go I would add some more convenience support for
> embedding data to dllist.h and settle for that.

I think it might be simpler to leave the name as Dllist and just
overhaul the implementation along the lines you suggest, rather than
replacing it with something completely different.  Mostly, I don't
want to add a third thing if we can avoid it, given that Dllist as it
exists today is used only lightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling

2012-06-19 Thread Robert Haas
On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund  wrote:
> From: Andres Freund 
>
> The previous coding could miss xlog writeouts at several places. E.g. when wal
> was written out by the background writer or even after a commit if
> synchronous_commit=off.
> This could lead to delays in sending data to the standby of up to 7 seconds.
>
> To fix this, move the responsibility of notification to the layer where the
> necessary information is actually present. We take some care not to do the
> notification while we hold contended locks like WALInsertLock or
> WalWriteLock.

I am not convinced that it's a good idea to wake up every walsender
every time we do XLogInsert().  XLogInsert() is a super-hot code path,
and adding more overhead there doesn't seem warranted.  We need to
replicate commit, commit prepared, etc. quickly, but why do we need to
worry about a short delay in replicating heap_insert/update/delete,
for example?  They don't really matter until the commit arrives.  7
seconds might be a bit long, but that could be fixed by decreasing the
polling interval for walsender to, say, a second.

When I was doing some testing recently, the case that was sort of
confusing was the replay of AccessExclusiveLocks.  The lock didn't show
up on the standby for many seconds after it had shown up on the
master.  But that's more a feature than a bug when you really think
about it - postponing the lock on the slave for as long as possible
just reduces the user impact of having to take them there at all.

Parenthetically, I find it difficult to extract inline patches.  No
matter whether I try to use it using Gmail + show original or the web
site, something always seems to get garbled.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 09:16:41 PM Robert Haas wrote:
> On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund  
wrote:
> > Adds a single and a double linked list which can easily embedded into
> > other datastructures and can be used without any additional allocations.
> 
> dllist.h advertises that it's embeddable.  Can you use that instead,
> or enhance it slightly to support what you want to do?
Oh, wow. Didn't know that existed. I had $subject lying around from a play-
around project, so I didn't look too hard.
Why is that code not used more widely? Quite a bit of our list usage should be 
replaced by embedding list elements in larger structs imo. There are also open-
coded inline list manipulations around (check aset.c for example).

*looks*

Not really that happy with it though:

1. dllist.h has double the element overhead by having an inline value pointer 
(which is not needed when embedding) and a pointer to the list (which I have a 
hard time seeing as being useful)
2. only double linked list, mine provided single and double linked ones
3. missing macros to use when embedded in a larger struct (containerof() 
wrappers and for(...) support basically)
4. most things are external function calls...
5. way much more branches/complexity in most of the functions. My 
implementation doesn't use any branches for the typical easy modifications 
(push, pop, remove element somewhere) and only one for the typical tests 
(empty, has-next, ...)

The performance and memory aspects were crucial for the aforementioned toy 
project (slab allocator for postgres). It's not that crucial for the applycache 
where the lists currently are mostly used although it's also relatively 
performance sensitive and obviously does a lot of list manipulation/iteration.

If I had to decide I would add the missing API in dllist.h to my 
implementation and then remove it. It's barely used - and only in an embedded 
fashion - as far as I can see.
I can understand though if that argument is met with doubt by others ;). If 
that's the way it has to go I would add some more convenience support for 
embedding data to dllist.h and settle for that.
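
To illustrate the embedding style in question (the node type matches the
patch's ilist_d_node; the entry type and containerof wrapper here are
hypothetical):

#include <stddef.h>

typedef struct ilist_d_node
{
    struct ilist_d_node *prev;
    struct ilist_d_node *next;
} ilist_d_node;

/* An element embeds its list node, so linking it needs no extra allocation. */
typedef struct CacheEntry
{
    int         key;
    ilist_d_node node;
} CacheEntry;

/* Recover the containing struct from a pointer to the embedded node. */
#define ilist_containerof(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

static CacheEntry *
entry_from_node(ilist_d_node *n)
{
    return ilist_containerof(n, CacheEntry, node);
}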

Greetings,

Andres


-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] return values of backend sub-main functions

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 7:31 AM, Peter Eisentraut  wrote:
> On ons, 2012-01-18 at 21:21 +0200, Peter Eisentraut wrote:
>> On lör, 2012-01-07 at 16:41 -0500, Tom Lane wrote:
>> > Peter Eisentraut  writes:
>> > > I suggest that we change PostgresMain(), PostmasterMain(), BackendRun(),
>> > > WalSenderMain(), and WalSndLoop() to return void as well.
>> >
>> > I agree this code is not very consistent or useful, but one question:
>> > what should the callers do if one of these functions *does* return?
>>
>> I was thinking of a two-pronged approach:  First, add
>> __attribute__((noreturn)) to the functions.  This will cause a suitable
>> compiler to verify on a source-code level that nothing actually returns
>> from the function.  And second, at the call site, put an abort();  /*
>> not reached */.  Together, this will make the code cleaner and more
>> consistent, and will also help the compiler out a bit about the control
>> flow.
>
> Patch for 9.3 attached.

Seems reasonable on a quick read-through.
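
For concreteness, the two-pronged pattern quoted above looks like this
(a minimal sketch; the function name is made up):

#include <stdlib.h>

static void BackendLoop(void) __attribute__((noreturn));

/* The attribute lets the compiler verify at the source level that no
 * code path actually returns from the function. */
static void
BackendLoop(void)
{
    for (;;)
    {
        /* service requests; leave only via exit()-style calls */
    }
}

int
main(void)
{
    BackendLoop();
    abort();                    /* not reached */
}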

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 3:03 PM, Alvaro Herrera
 wrote:
> That's what I thought.  I will commit to both branches soon, then.

I think there might be three branches involved.

> Mind you, this should have been an "open item", not a commitfest item.
> (Actually not even an open item.  We should have committed it right
> away.)

True, but it's not a perfect world, and I didn't feel qualified to
commit it myself.  Adding it to the CommitFest at least prevented it from
getting lost forever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Allow WAL information to recover corrupted pg_controldata

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 1:43 AM, Tom Lane  wrote:
> Amit Kapila  writes:
>>> AFAIR you can create pg_control from scratch already with pg_resetxlog.
>>> The hard part is coming up with values for the counters, such as the
>>> next WAL location.  Some of them such as next OID are pretty harmless
>>> if you don't guess right, but I'm worried that wrong next WAL could
>>> make things worse not better.
>
>> I believe if the WAL files are intact, as mentioned in Alvaro's mail, the
>> proposed logic should generate correct values.
>
> I've got a problem with the assumption that, when pg_control is trash,
> megabytes or gigabytes of WAL can still be relied on completely.
>
> I'm almost inclined to suggest that we not get next-LSN from WAL, but
> by scanning all the pages in the main data store and computing the max
> observed LSN.  This is clearly not very attractive from a performance
> standpoint, but it would avoid the obvious failure mode where you lost
> some recent WAL segments along with pg_control.

I think it could be useful to have a tool that scans all the blocks
and computes that value, but I'd want it to just print the value out
and let me decide what to do about it.  There are cases where you
don't necessarily want to clobber pg_control, but you do have future
LSNs in your data file pages.  This can be either because the disk ate
your WAL, or because you didn't create recovery.conf, or because your
disk corrupted the LSNs on the data file pages.  I'd want a tool that
could be either run on an individual file, or recursively on a
directory.
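
A minimal sketch of such a tool, assuming the default 8kB block size,
reading pd_lsn straight out of the first eight bytes of each page
header, and assuming the file matches this machine's endianness:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    char        page[BLCKSZ];
    uint32_t    max_hi = 0,
                max_lo = 0;
    FILE       *f;

    if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
    {
        fprintf(stderr, "usage: maxlsn datafile\n");
        return 1;
    }
    while (fread(page, 1, sizeof(page), f) == sizeof(page))
    {
        uint32_t    hi,
                    lo;

        /* pd_lsn: two 32-bit halves at the very start of the page */
        memcpy(&hi, page, sizeof(hi));
        memcpy(&lo, page + sizeof(hi), sizeof(lo));
        if (hi > max_hi || (hi == max_hi && lo > max_lo))
        {
            max_hi = hi;
            max_lo = lo;
        }
    }
    fclose(f);
    printf("max LSN: %X/%X\n", max_hi, max_lo);
    return 0;
}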

In terms of the TODO item, I haven't yet heard anyone clearly state "I
wanted to use pg_controldata but couldn't because X, so therefore we
need this patch".  Alvaro mentioned the case where pg_control is
missing altogether, but:

[rhaas pgsql]$ rm ~/pgdata/global/pg_control
[rhaas pgsql]$ postgres
postgres: could not find the database system
Expected to find it in the directory "/Users/rhaas/pgdata",
but could not open file "/Users/rhaas/pgdata/global/pg_control": No
such file or directory
[rhaas pgsql]$ pg_resetxlog ~/pgdata
pg_resetxlog: could not open file "global/pg_control" for reading: No
such file or directory
If you are sure the data directory path is correct, execute
  touch global/pg_control
and try again.
[rhaas pgsql]$ touch ~/pgdata/global/pg_control
[rhaas pgsql]$ pg_resetxlog ~/pgdata
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Guessed pg_control values:

First log file ID after reset:        0
First log file segment after reset:   69
pg_control version number:            922
Catalog version number:               201206141
Database system identifier:           5755831325641078488
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          0/3
Latest checkpoint's NextOID:          1
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        3
Latest checkpoint's oldestXID's DB:   0
Latest checkpoint's oldestActiveXID:  0
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

If these values seem acceptable, use -f to force reset.
[rhaas pgsql]$ pg_resetxlog -f ~/pgdata
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Transaction log reset
[rhaas pgsql]$ postgres
LOG:  database system was shut down at 2012-06-19 15:25:28 EDT
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

So I still don't understand what problem we're solving here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] use of int4/int32 in C code

2012-06-19 Thread Peter Geoghegan
On 19 June 2012 20:11, Robert Haas  wrote:
> On Tue, Jun 19, 2012 at 9:47 AM, Tom Lane  wrote:
>> Peter Eisentraut  writes:
>>> What is the latest theory on using int4 vs. int32 in C code?
>>> (equivalently int2, int16)
>>
>> I thought the general idea was to use int32 most places, but int4 in
>> catalog declarations.  I don't think it's tremendously important if
>> somebody uses the other though.
>
> I concur with Peter that TMTOWTDI is not the right way to do this.  I
> think we ought to get rid of int4 in code and use int32 everywhere.
>
>>> While we're at it, how do we feel about using C standard types like
>>> int32_t instead of (or initially in addition to) our own definitions?
>>
>> Can't get very excited about this either.  The most likely outcome of
>> a campaign to substitute the standard types is that back-patching would
>> become a truly painful activity.  IMO, anything that is going to result
>> in tens of thousands of diffs had better have a more-than-cosmetic
>> reason.  (That wouldn't apply if we only used int32_t in new code ...
>> but then, instead of two approved ways to do it, there would be three.
>> Which doesn't seem like it improves matters.)
>
> On this one, I agree with you.

Yeah. I find pgindent changes annoying when doing a git blame myself.
Now, granted, you can mostly take care of that by having the tool
ignore whitespace changes, but that doesn't always work perfectly, and
I haven't been able to figure out a better way of managing that.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 07:24:13 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane  wrote:
> > Andres Freund  writes:
> >> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> >>> ...  (If you are thinking
> >>> of something sufficiently high-level that merging could possibly work,
> >>> then it's not WAL, and we shouldn't be trying to make the WAL
> >>> representation cater for it.)
> >> 
> >> Do you really see this as such a big problem?
> > 
> > It looks suspiciously like "I have a hammer, therefore every problem
> > must be a nail".  I don't like the design concept of cramming logical
> > replication records into WAL in the first place.
> 
> Me, neither.  I think it's necessary to try to find a way of
> generating logical replication records from WAL.  But once generated,
> I think those records should form their own stream, independent of
> WAL.  If you take the contrary position that they should be included
> in WAL, then when you filter the WAL stream down to just the records
> of interest to logical replication, you end up with a WAL stream with
> holes in it, which is one of the things that Andres listed as an
> unresolved design problem in his original email.
Yes.

> Moreover, this isn't necessary at all for single-master replication,
> or even multi-source replication where each table has a single master.
>  It's only necessary for full multi-master replication, which we have
> no consensus to include in core, and even if we did have a consensus
> to include it in core, it certainly shouldn't be the first feature we
> design.
Well, you can't blame a patch/prototype trying to implement what it claims to 
implement ;)

More seriously: Even if we don't put MM in core I think putting the basis for 
it in core so that somebody can build such a solution reusing the existing 
infrastructure is a sensible idea. Imo the only thing that requires explicit 
support which is hard to add outside of core is prevention of loops (aka this 
patch). Everything else should be doable reusing the hopefully modular pieces.

> > However, if we're dead set on doing it that way, let us put information
> > that is only relevant to logical replication records into only the
> > logical replication records.
> Right.  If we decide we need this, and if we did decide to conflate
> the WAL stream, both of which I disagree with as noted above, then we
> still don't need it on every record.  It would probably be sufficient
> for local transactions to do nothing at all (and we can implicitly
> assume that they have master node ID = local node ID) and transactions
> which are replaying remote changes to emit one record per XID per
> checkpoint cycle containing the remote node ID.
You've gone from a pretty trivial 150 line patch without any runtime/space
overhead to something *considerably* more complex in that case though.

You suddenly need relatively complex logic to remember which
information you got for a certain xid (and to forget it afterwards),
whether you already logged that xid, and you have to decide about
logging that information in multiple places.

Btw, what do you mean by "conflating" the stream? I don't really see that
being proposed.

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 04/16] Add embedded list interface (header only)

2012-06-19 Thread Robert Haas
On Wed, Jun 13, 2012 at 7:28 AM, Andres Freund  wrote:
> Adds a singly and a doubly linked list which can easily be embedded into
> other data structures and can be used without any additional allocations.

dllist.h advertises that it's embeddable.  Can you use that instead,
or enhance it slightly to support what you want to do?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] use of int4/int32 in C code

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 9:47 AM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> What is the latest theory on using int4 vs. int32 in C code?
>> (equivalently int2, int16)
>
> I thought the general idea was to use int32 most places, but int4 in
> catalog declarations.  I don't think it's tremendously important if
> somebody uses the other though.

I concur with Peter that TMTOWTDI is not the right way to do this.  I
think we ought to get rid of int4 in code and use int32 everywhere.

>> While we're at it, how do we feel about using C standard types like
>> int32_t instead of (or initially in addition to) our own definitions?
>
> Can't get very excited about this either.  The most likely outcome of
> a campaign to substitute the standard types is that back-patching would
> become a truly painful activity.  IMO, anything that is going to result
> in tens of thousands of diffs had better have a more-than-cosmetic
> reason.  (That wouldn't apply if we only used int32_t in new code ...
> but then, instead of two approved ways to do it, there would be three.
> Which doesn't seem like it improves matters.)

On this one, I agree with you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-06-19 Thread Alvaro Herrera

Excerpts from Robert Haas's message of Tue Jun 19 11:36:41 -0400 2012:
> 
> On Mon, Jun 18, 2012 at 3:30 PM, Alvaro Herrera
>  wrote:
> > Excerpts from Alex Hunsaker's message of Fri Feb 10 16:53:05 -0300 2012:
> >
> >> Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
> >> and SvPVUTF8() when turning a perl string into a cstring.
> >
> > Hmm, this patch belongs into back branches too, right?  Not just the
> > current development tree?
> 
> It seems like a bug fix to me.

That's what I thought.  I will commit to both branches soon, then.
Mind you, this should have been an "open item", not a commitfest item.
(Actually not even an open item.  We should have committed it right
away.)

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] sortsupport for text

2012-06-19 Thread Peter Geoghegan
On 19 June 2012 19:44, Peter Geoghegan  wrote:
> You could do that, and some people do use custom collations for
> various reasons. That's obviously very much of minority interest
> though. Most people will just use citext or something. However, since
> citext is itself a client of varstr_cmp(), this won't help you.

I spoke too soon - that would work fine in the U.S., since the text is
converted to lower case on-the-fly in a locale- and collation-aware
fashion.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] sortsupport for text

2012-06-19 Thread Tom Lane
"Kevin Grittner"  writes:
> I wasn't aware that en_US.UTF-8 doesn't have equivalence without
> equality.  I guess that surprising result in my last post is just
> plain inevitable with that collation then.  Bummer.  Is there
> actually anyone who finds that to be a useful behavior?  For a
> collation which considered upper-case and lower-case to be
> equivalent, would PostgreSQL sort as I wanted, or is it doing some
> tie-break per column within equivalent values?

Well, there are two different questions there.

As far as the overall structure of an ORDER BY or btree index is
concerned, all it knows is what the datatype comparator functions
tell it.  There is no such thing as a second pass to reconsider
values found to be "equal"; that's all the info there is.

As far as the behavior of the comparator function for text is concerned,
we choose to break ties reported by strcoll via strcmp.  But that's not
visible from outside the comparator, so there's no notion of an
"equivalence" behavior different from "equality".

I'm not exactly convinced that we could have a second-pass tiebreak
mechanism without creating serious problems for btree indexes; in
particular it seems like ordering considering only some leading keys
might not match the actual index ordering.

regards, tom lane



Re: [HACKERS] Event Triggers reduced, v1

2012-06-19 Thread Robert Haas
On Fri, Jun 15, 2012 at 4:27 PM, Dimitri Fontaine
 wrote:
> Allow me to open the new season of the DML trigger series, named
> pg_event_trigger. This first episode is all about setting up the drama,
> so that next ones make perfect sense.

Comments:

1. I still think we ought to get rid of the notion of BEFORE or AFTER
(i.e. pg_event_trigger.evttype) and just make that detail part of the
event name (e.g. pg_event_trigger.evtevent).  Many easily foreseeable
event types will be more like "during" rather than "before" or
"after", and for those that do have a notion of before and after, we
can have two different event names and include the word "before" or
"after" there.  I am otherwise satisfied with the schema you've
chosen.

2. I think it's important to be able to add new types of event
triggers without creating excessive parser bloat.  I think it's
important to use some kind of generic syntax here which will be able
to apply to all types of triggers we may want to add, now or in the
future.  The easiest way to do that is to use literal syntax for the
list of command tags, rather than writing them out as key words: e.g.
'ALTER TABLE' rather than ALTER TABLE.  It's not quite as pretty, but
the savings in parser bloat and future code change seem well worth it.
 Or, alternatively, we could use identifier style, e.g. alter_table,
as I previously suggested.

3. The event trigger cache seems to be a few bricks shy of a load.
First, event_trigger_cache_is_stalled is mis-named; I think you mean
"stale", not "stalled".  Second, instead of setting that flag and then
rebuilding the cache when you see the flag set, how about just blowing
away the cache contents whenever you would have set the flag?  That
seems a whole lot simpler and cleaner, and removes the need for a
force_rebuild flag on BuildEventTriggerCache().  Third, ISTM that this
isn't going to work correctly if backend A performs an event after
backend B has built its cache.  To fix this, I think you need to rip
out all the places where you force a rebuild locally and instead use
something like CacheRegisterSyscacheCallback() to blow away the cache
whenever something changes; you might find it helpful to look at
attoptcache.c.

4. The documentation doesn't build.

openjade:reference.sgml:44:4:W: cannot generate system identifier for
general entity "alterEventTrigger"
openjade:reference.sgml:44:4:E: general entity "alterEventTrigger" not
defined and no default entity
openjade:reference.sgml:44:21:E: reference to entity
"alterEventTrigger" for which no system identifier could be generated
openjade:reference.sgml:44:3: entity was defined here
openjade:reference.sgml:86:4:W: cannot generate system identifier for
general entity "createEventTrigger"
openjade:reference.sgml:86:4:E: general entity "createEventTrigger"
not defined and no default entity
openjade:reference.sgml:86:22:E: reference to entity
"createEventTrigger" for which no system identifier could be generated
openjade:reference.sgml:86:3: entity was defined here
openjade:reference.sgml:125:4:W: cannot generate system identifier for
general entity "dropEventTrigger"
openjade:reference.sgml:125:4:E: general entity "dropEventTrigger" not
defined and no default entity
openjade:reference.sgml:125:20:E: reference to entity
"dropEventTrigger" for which no system identifier could be generated
openjade:reference.sgml:125:3: entity was defined here
openjade:catalogs.sgml:1868:35:E: character "_" is not allowed in the
value of attribute "ZONE"
openjade:catalogs.sgml:1868:19:X: reference to non-existent ID
"CATALOG-PG-EVENT_TRIGGER"
openjade:trigger.sgml:43:47:X: reference to non-existent ID
"SQL-CREATECOMMANDTRIGGER"

5. In terms of a psql command, I think that \dev is both not very
mnemonic and, as you mentioned in the comment, possibly confusable
with SQL/MED.  If we're going to have a \d command for this, I suggest
something like \dy, which is not very mnemonic either but at least
seems unlikely to be confused with anything else.  Other things that
are unused include \dh, \dj, \dk, \dq, \dw, and \dz, if any of those
grab you, or a somewhat broader range of things (but still nothing
very memorable) if we include capital letters.  Or we could branch out
into punctuation, like \d& -- or things that don't begin with the
letter d, but that doesn't seem like a particularly good idea.

6. In objectaddress.c, I think that get_object_address_event_trigger
should be eliminated in favor of an additional case in
get_object_address_unqualified.

7. There are many references to command triggers that still need to be
cleaned up.  trigger.sgml still makes reference to the name command
triggers.  plpgsql.sgml also contains vestiges of the command trigger
notation, and references to some TG_* variables that I don't think
exist in this version of the patch.  event_trigger.c is identified
(twice) as cmdtrigger.c in the file header comment.  The header
comment for InsertEventTriggerTuple makes reference to pg_cmdtrigger,
as does a l
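
As to point 3 above, the invalidation-callback pattern from
attoptcache.c looks roughly like this (a sketch: the syscache id and
names here are hypothetical, while CacheRegisterSyscacheCallback() and
the callback signature are the real interface):

#include "postgres.h"
#include "utils/inval.h"

static bool EventTriggerCacheValid = false;

/* Callback: just invalidate; the cache is rebuilt lazily on next use. */
static void
EventTriggerCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
{
    EventTriggerCacheValid = false;
}

static void
InitEventTriggerCache(void)
{
    /* EVENTTRIGGERNAME stands in for whatever syscache covers the
     * event trigger catalog. */
    CacheRegisterSyscacheCallback(EVENTTRIGGERNAME,
                                  EventTriggerCacheCallback,
                                  (Datum) 0);
    /* ... build the cache contents here ... */
    EventTriggerCacheValid = true;
}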

Re: [HACKERS] sortsupport for text

2012-06-19 Thread Peter Geoghegan
On 19 June 2012 18:57, Kevin Grittner  wrote:
> We weren't using en_US.UTF-8 collation (or any other "proper"
> collation) on Sybase -- I'm not sure whether they even supported
> proper collation sequences on the versions we used.  I'm thinking of
> when we were using their "case insensitive" sorting.  I don't know
> the implementation details, but the behavior was consistent with
> including each character-based column twice: once in the requested
> position in the ORDER BY clause but folded to a consistent case, and
> again after all the columns in the ORDER BY clause in original form,
> with C collation.
>
> I wasn't aware that en_US.UTF-8 doesn't have equivalence without
> equality.

Not that many do. The underlying cause of the problem back in 2005 was
the tacit assumption that none do, which presumably we got away with
for a while. I mentioned Hungarian, but it happens a bit in Swedish
too.

PostgreSQL supported Unicode before 2005, when the tie-breaker was
introduced. I know at least one Swede who used Postgres95. I just took
a look at the REL6_4 branch, and it looks much the same in 1999 as it
did in 2005, in that there is no tie-breaker after the strcoll(). Now,
that being the case, and Hungarian in particular having a whole bunch
of these equivalencies, I have to wonder if the original complainant's
problem really was diagnosed correctly. It could have had something to
do with the fact that texteq() was confused about whether it reported
equality or equivalency - it may have taken that long for the (len1 !=
len2) fastpath thing (which only holds for equality, not equivalence,
despite the fact that the 2005-era strcoll() call checks equivalence
within texteq()) to trip someone up, because texteq() would have
thereby given inconsistent answers in a very subtle way, correct
neither according to the Hungarian locale nor according to simple
bitwise equality. That's mostly speculation, but the question
must be asked.
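
For reference, the fastpath in question is essentially this (a sketch;
the real texteq() works on varlena datums):

#include <stdbool.h>
#include <string.h>

/* The length test is only valid for bitwise equality: under a locale
 * such as Hungarian, strings of different lengths can still be
 * equivalent, which is why mixing strcoll()-based equivalence into
 * texteq() gave the inconsistent answers described above. */
static bool
texteq_sketch(const char *a, size_t len1, const char *b, size_t len2)
{
    if (len1 != len2)
        return false;
    return memcmp(a, b, len1) == 0;
}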

> I guess that surprising result in my last post is just
> plain inevitable with that collation then.  Bummer.  Is there
> actually anyone who finds that to be a useful behavior?

Looking at the details again and assuming a US locale, yeah, it is.
The substance of your complaint holds though.

> For a collation which considered upper-case and lower-case to be
> equivalent, would PostgreSQL sort as I wanted, or is it doing some
> tie-break per column within equivalent values?

You could do that, and some people do use custom collations for
various reasons. That's obviously very much of minority interest
though. Most people will just use citext or something. However, since
citext is itself a client of varstr_cmp(), this won't help you.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

2012-06-19 Thread Andres Freund
Hi,

The most important part, even for people not following my discussion with
Robert, is at the bottom, where the possible WAL decoding strategies are
laid out.

On Tuesday, June 19, 2012 03:20:58 AM Robert Haas wrote:
> On Sat, Jun 16, 2012 at 7:43 AM, Andres Freund  
wrote:
> >> > Hm. Yes, you could do that. But I have to say I don't really see a
> >> > point. Maybe the fact that I do envision multimaster systems at some
> >> > point is clouding my judgement though as its far less easy in that
> >> > case.
> >> 
> >> Why?  I don't think that particularly changes anything.
> > 
> > Because it makes conflict detection very hard. I also don't think it's a
> > feature worth supporting. What's the use-case of updating records you
> > cannot properly identify?
> Don't ask me; I just work here.  I think it's something that some
> people want, though.  I mean, if you don't support replicating a table
> without a primary key, then you can't even run pgbench in a
> replication environment.
Well, I have no problem with INSERT-only tables not having a PK. And
pgbench_history is the only pgbench table that doesn't have a pkey? And that's
only truncated...

> Is that an important workload?  Well,
> objectively, no.  But I guarantee you that other people with more
> realistic workloads than that will complain if we don't have it.
> Absolutely required on day one?  Probably not.  Completely useless
> appendage that no one wants?  Not that, either.
Maybe. And I don't really care, so if others see that as important I am happy 
to appease them ;). 

> >> In my view, a logical replication solution is precisely one in which
> >> the catalogs don't need to be in sync.  If the catalogs have to be in
> >> sync, it's not logical replication.  ISTM that what you're talking
> >> about is sort of a hybrid between physical replication (pages) and
> >> logical replication (tuples) - you want to ship around raw binary
> >> tuple data, but not entire pages.
> > Ok, that's a valid point. Simon argued at the cluster summit that
> > everything that's not physical is logical. Which has some appeal because
> > it seems hard to agree what exactly logical rep is. So definition by
> > exclusion makes kind of sense ;)
> Well, the words are fuzzy, but I would define logical replication to
> be something which is independent of the binary format in which stuff
> gets stored on disk.  If it's not independent of the disk format, then
> you can't do heterogenous replication (between versions, or between
> products).  That precise limitation is the main thing that drives
> people to use anything other than SR in the first place, IME.
Not in mine. The main limitation I see is that you cannot write anything on
the standby, which sucks majorly for many things. It's pretty much impossible
to "fix" that for SR outside of very limited cases.
While many scenarios don't need multimaster, *many* need to write outside of
the standby's replication set.

> > I think what you categorized as "hybrid logical/physical" rep solves an
> > important use-case that's very hard to solve at the moment. Before my
> > 2ndquadrant days I had several clients which had huge problems using the
> > trigger-based solutions because their overhead simply was too big a
> > burden on the master. They couldn't use SR either because every
> > consuming database kept loads of local data.
> > I think such scenarios are getting more and more common.

> I think this is to some extent true, but I also think you're
> conflating two different things.  Change extraction via triggers
> introduces overhead that can be eliminated by reconstructing tuples
> from WAL in the background rather than forcing them to be inserted
> into a shadow table (and re-WAL-logged!) in the foreground.  I will
> grant that shipping the tuple as a binary blob rather than as text
> eliminates additional overhead on both ends, but it also closes off a
> lot of important use cases.  As I noted in my previous email, I think
> that ought to be a performance optimization that we do, if at all,
> when it's been proven safe, not a baked-in part of the design.  Even a
> solution that decodes WAL to text tuples and ships those around and
> reinserts them via pure SQL should be significantly faster than the
> replication solutions we have today; if it isn't, something's wrong.
It's not only the logging side which is a limitation in today's replication
scenarios. The apply side scales even worse because it's *very* hard to
distribute it between multiple backends.

> >> The problem with that is it's going to be tough to make robust.  Users
> >> could easily end up with answers that are total nonsense, or probably
> >> even crash the server.
> > Why?
> Because the routines that decode tuples don't include enough sanity
> checks to prevent running off the end of the block, or even the end of
> memory completely.  Consider a corrupt TOAST pointer that indicates
> that there is a gigabyte of data stored in an 8kB block.  One of the
> common s

Re: [HACKERS] sortsupport for text

2012-06-19 Thread Kevin Grittner
Peter Geoghegan  wrote:
> Kevin Grittner  wrote:
 
>> I'm pretty sure that when I was using Sybase ASE the order for
>> non-equal values was always predictable, and it behaved in the
>> manner I describe below.  I'm less sure about any other product.
> 
> Maybe it used a physical row identifier as a tie-breaker? Note
> that we use ItemPointers as a tie-breaker for sorting index
> tuples.
> 
> I imagine that it was at least predictable among columns being
> sorted, if only because en_US.UTF-8 doesn't have any notion of
> equivalence (that is, it just so happens that there are no two
> strings that are equivalent but not bitwise equal). It would
> surely be impractical to do a comparison for the entire row, as
> that could be really expensive.
 
We weren't using en_US.UTF-8 collation (or any other "proper"
collation) on Sybase -- I'm not sure whether they even supported
proper collation sequences on the versions we used.  I'm thinking of
when we were using their "case insensitive" sorting.  I don't know
the implementation details, but the behavior was consistent with
including each character-based column twice: once in the requested
position in the ORDER BY clause but folded to a consistent case, and
again after all the columns in the ORDER BY clause in original form,
with C collation.
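
In comparator form, that behavior would look something like this (a
sketch with made-up names; strcasecmp() stands in for whatever case
folding Sybase actually used):

#include <string.h>
#include <strings.h>

/* Two-pass, whole-row comparison: all keys under a case-folded
 * ordering first, then all keys again in original form (C collation)
 * as the tie-break. */
static int
row_cmp_sketch(const char **a, const char **b, int nkeys)
{
    int         i,
                r;

    for (i = 0; i < nkeys; i++)
        if ((r = strcasecmp(a[i], b[i])) != 0)
            return r;
    for (i = 0; i < nkeys; i++)
        if ((r = strcmp(a[i], b[i])) != 0)
            return r;
    return 0;
}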
 
I wasn't aware that en_US.UTF-8 doesn't have equivalence without
equality.  I guess that surprising result in my last post is just
plain inevitable with that collation then.  Bummer.  Is there
actually anyone who finds that to be a useful behavior?  For a
collation which considered upper-case and lower-case to be
equivalent, would PostgreSQL sort as I wanted, or is it doing some
tie-break per column within equivalent values?
 
-Kevin



Re: [HACKERS] Do we want a xmalloc or similar function in the Backend?

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 07:35:53 PM Robert Haas wrote:
> On Tue, Jun 19, 2012 at 10:17 AM, Andres Freund  
wrote:
> > There are 70+ calls of malloc in the backend in the form of
> > 
> > type* foo = malloc(sizeof(...));
> > if(!foo)
> >   elog(ERROR, "could not allocate memory");
> > 
> > which is a bit annoying to write at times. Would somebody argue against
> > introducing a function that does the above named xmalloc() or
> > malloc_or_die()?
> 
> I can't even find 70 malloc calls in the entire backend, let alone 70
> with that pattern.  Still, I don't think malloc_or_error (not die)
> would be a bad idea.
$ ack '\bmalloc\s*\(' src/backend/|wc -l
70

10-15 or so of those are comments.

The 70+ came from me running on some development branch with commitfest 
patches applied...

> But the error should definitely be written as:
> 
> ereport(ERROR,
> (errcode(ERRCODE_OUT_OF_MEMORY),
>  errmsg("out of memory")));
> 
> ...not elog.
Yes, definitely. Currently some of those locations (e.g. in xlog.c) are only 
protected by Asserts... 

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Do we want a xmalloc or similar function in the Backend?

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 10:17 AM, Andres Freund  wrote:
> There are 70+ calls of malloc in the backend in the form of
>
> type* foo = malloc(sizeof(...));
> if(!foo)
>   elog(ERROR, "could not allocate memory");
>
> which is a bit annoying to write at times. Would somebody argue against
> introducing a function that does the above named xmalloc() or malloc_or_die()?

I can't even find 70 malloc calls in the entire backend, let alone 70
with that pattern.  Still, I don't think malloc_or_error (not die)
would be a bad idea.

But the error should definitely be written as:

ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
 errmsg("out of memory")));

...not elog.
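
Putting those two points together, the proposed helper would be
something like this (a sketch; the name follows the suggestion above):

#include "postgres.h"

static void *
malloc_or_error(size_t size)
{
    void       *ptr = malloc(size);

    if (ptr == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory")));
    return ptr;
}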

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-06-19 Thread Alex Hunsaker
On Mon, Jun 18, 2012 at 1:30 PM, Alvaro Herrera
 wrote:
>
> Excerpts from Alex Hunsaker's message of Fri Feb 10 16:53:05 -0300 2012:
>
>> Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
>> and SvPVUTF8() when turning a perl string into a cstring.
>
> Hmm, this patch belongs into back branches too, right?  Not just the
> current development tree?

(Have been out of town, sorry for the late reply)

Yeah, back to 9.1.



Re: [HACKERS] initdb and fsync

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote:
> On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> > It calls pg_flush_data inside of copy_file which does the
> > posix_fadvise... So maybe just put the sync_file_range in pg_flush_data?
> 
> The functions in fd.c aren't linked to initdb, so it's a challenge to
> share that code (I remember now: that's why I didn't use pg_flush_data
> originally). I don't see a clean way to resolve that, do you?
> 
> Also, it seems that sync_file_range() doesn't help with creating a
> database much (on my machine), so it doesn't seem important to use it
> there.
> 
> Right now I'm inclined to leave the patch as-is.
Fine with that, I wanted to bring it up and see it documented.

I have marked it as ready for committer. That committer needs to decide on
-N in the regression tests or not, but that shouldn't be much of a problem ;)

Cool!

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 12:11 PM, Tom Lane  wrote:
> Andres Freund  writes:
>> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>> ...  (If you are thinking
>>> of something sufficiently high-level that merging could possibly work,
>>> then it's not WAL, and we shouldn't be trying to make the WAL
>>> representation cater for it.)
>
>> Do you really see this as such a big problem?
>
> It looks suspiciously like "I have a hammer, therefore every problem
> must be a nail".  I don't like the design concept of cramming logical
> replication records into WAL in the first place.

Me, neither.  I think it's necessary to try to find a way of
generating logical replication records from WAL.  But once generated,
I think those records should form their own stream, independent of
WAL.  If you take the contrary position that they should be included
in WAL, then when you filter the WAL stream down to just the records
of interest to logical replication, you end up with a WAL stream with
holes in it, which is one of the things that Andres listed as an
unresolved design problem in his original email.

Moreover, this isn't necessary at all for single-master replication,
or even multi-source replication where each table has a single master.
 It's only necessary for full multi-master replication, which we have
no consensus to include in core, and even if we did have a consensus
to include it in core, it certainly shouldn't be the first feature we
design.

> However, if we're dead set on doing it that way, let us put information
> that is only relevant to logical replication records into only the
> logical replication records.

Right.  If we decide we need this, and if we did decide to conflate
the WAL stream, both of which I disagree with as noted above, then we
still don't need it on every record.  It would probably be sufficient
for local transactions to do nothing at all (and we can implicitly
assume that they have master node ID = local node ID) and transactions
which are replaying remote changes to emit one record per XID per
checkpoint cycle containing the remote node ID.

> Saving a couple bytes in each such record
> is penny-wise and pound-foolish, I'm afraid; especially when you're
> nailing down hard, unexpansible limits at the very beginning of the
> development process in order to save those bytes.

I completely agree.  I think that, as Dan said upthread, having a 64
or 128 bit ID so that it can be generated automatically rather than
configured by an administrator who must be careful not to duplicate
node IDs in any pair of systems that could ever end up talking to each
other would be a vast usability improvement.  Perhaps systems A, B,
and C are replicating to each other today, as are systems D and E.
But now suppose that someone decides they want to replicate one table
between A and D.  Suddenly the node IDs have to be distinct where they
didn't before, and now there's potentially a problem to hassle with
that wouldn't have been an issue if the node IDs had been wide enough
to begin with.  It is not unusual for people to decide after-the-fact
to begin replicating between machines where this wasn't originally
anticipated and which may even be under different
administrative control.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] initdb and fsync

2012-06-19 Thread Jeff Davis
On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
> maybe just put the sync_file_range in pg_flush_data?

The functions in fd.c aren't linked to initdb, so it's a challenge to
share that code (I remember now: that's why I didn't use pg_flush_data
originally). I don't see a clean way to resolve that, do you?

Also, it seems that sync_file_range() doesn't help with creating a
database much (on my machine), so it doesn't seem important to use it
there.

Right now I'm inclined to leave the patch as-is.

Regards,
Jeff Davis




Re: [HACKERS] sortsupport for text

2012-06-19 Thread Peter Geoghegan
On 19 June 2012 17:45, Kevin Grittner  wrote:
> Peter Geoghegan  wrote:
>> Are you sure that they actually have a tie-breaker, and don't just
>> make the distinction between equality and equivalence (if only
>> internally)?
>
> I'm pretty sure that when I was using Sybase ASE the order for
> non-equal values was always predictable, and it behaved in the
> manner I describe below.  I'm less sure about any other product.

Maybe it used a physical row identifier as a tie-breaker? Note that we
use ItemPointers as a tie-breaker for sorting index tuples.

I imagine that it was at least predictable among columns being sorted,
if only because en_US.UTF-8 doesn't have any notion of equivalence
(that is, it just so happens that there are no two strings that are
equivalent but not bitwise equal). It would surely be impractical to
do a comparison for the entire row, as that could be really expensive.

>> I don't see why you'd want a tie-breaker across multiple keys. I
>> mean, you could, I just don't see any reason to.
>
> So that you can have entirely repeatable results across multiple
> runs?

I suppose that isn't an unreasonable use-case, but it would be
unreasonable to require that everyone pay for it, particularly if
it implied another strcmp(). You don't need me to tell you that we
generally discourage the idea that the order that tuples are returned
in is defined in any way beyond that asked for by the user.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-19 Thread Alvaro Herrera

Excerpts from Boszormenyi Zoltan's message of Tue Jun 19 12:44:04 -0400 2012:

> OK, all 4 Check* functions are now moved back into proc.c,
> nothing outside of timeout.c touches anything in it. New patches
> are attached.

Yeah, I like this one better, thanks.

It seems to me that the "check" functions are no longer "check" anymore,
right?  I mean, they don't check whether the time has expired.  It can
be argued that CheckDeadLock is well named, because it does check
whether there is a deadlock before doing anything else; but
CheckStandbyTimeout is no longer a check -- it just sends a signal.
Do we need to rename these functions?

Why are you using the deadlock timeout for authentication?  Wouldn't it
make more sense to have a separate TimeoutName, just to keep things
clean?

The "NB:" comment here doesn't seem to be useful anymore:
+ /*
+  * Init, Destroy and Check functions for different timeouts.
+  *
+  * NB: all Check* functions are run inside a signal handler, so be very wary
+  * about what is done in them or in called routines.
+  */


In base_timeouts you don't initialize fin_time for any of the members.
This is probably unimportant but then why initialize start_time?

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



[HACKERS] Near-duplicate RI NO ACTION and RESTRICT triggers

2012-06-19 Thread Tom Lane
Our current interpretation of the difference between foreign keys with
ON UPDATE/DELETE NO ACTION and those with ON UPDATE/DELETE RESTRICT
is that they mean the same thing but RESTRICT checks are not deferrable.
It follows from this that the trigger code ought to be the same for
NO ACTION and RESTRICT cases, and so it occurred to me that we could
get rid of a few hundred lines in ri_triggers.c if we removed the
duplicated code.

Comparing the actual code in the different functions, though, there is
a difference: the NO ACTION triggers call ri_Check_Pk_Match to see if
another PK row has been inserted/modified to provide the same key values
that the trigger subject row no longer does.  (ri_Check_Pk_Match also
makes some checks for NULL-key cases, but these are redundant with tests
made in the main trigger functions, so they ought to go away.)  The
RESTRICT triggers do not do this.  It's fairly easy to construct a case
where it makes a difference:

create temp table pp (f1 int primary key);
create temp table cc (f1 int references pp on update no action);
insert into pp values(12);
insert into pp values(11);
update pp set f1=f1+1; -- now we have 13 and 12
insert into cc values(13); -- ok
update pp set f1=f1+1; -- now we have 14 and 13, FK is still OK
update pp set f1=f1+1; -- would result in 15 and 14, so FK fails

If you change the foreign key to be ON UPDATE RESTRICT, the second
UPDATE fails because the RESTRICT trigger doesn't notice that another
PK row has been modified to provide the key required by the FK row.

I think that the argument for having the RESTRICT triggers behave
like this is that the SQL spec envisions the RESTRICT check occurring
immediately when the individual PK row is updated/deleted, and so there
would be no opportunity for another PK row to be updated into its place.
(Or, in plainer English, RESTRICT should mean "you can't modify this
row's keys at all if it has dependents".)  Because we implement RESTRICT
through an AFTER trigger that can't run earlier than end-of-statement,
we can't exactly match the spec's semantics, but we can get fairly
close so long as you don't think about what would be seen by
e.g. user-written triggers executing during the statement.

I'm happy with continuing to have this behavioral difference between
the two sets of triggers, but wanted to throw it up for discussion:
does anyone think it'd be better to apply ri_Check_Pk_Match in the
RESTRICT triggers too?

regards, tom lane



Re: [HACKERS] sortsupport for text

2012-06-19 Thread Kevin Grittner
Peter Geoghegan  wrote:
> Kevin Grittner  wrote:
 
>> Since we appear to be questioning everything in this area, I'll
>> raise something which has been bugging me for a while: in some
>> other systems I've used, the "tie-breaker" comparison for
>> equivalent values comes after equivalence sorting on *all* sort
>> keys, rather than *each* sort key.
> 
> Are you sure that they actually have a tie-breaker, and don't just
> make the distinction between equality and equivalence (if only
> internally)?
 
I'm pretty sure that when I was using Sybase ASE the order for
non-equal values was always predictable, and it behaved in the
manner I describe below.  I'm less sure about any other product.
 
> I don't see why you'd want a tie-breaker across multiple keys. I
> mean, you could, I just don't see any reason to.
 
So that you can have entirely repeatable results across multiple
runs?
 
>> test=# select * from c order by 2;
>>  last_name | first_name
>> ---+
>>  smith | bob
>>  SMITH | EDWARD
>>  smith | peter
>> (3 rows)
>>
>> This seems completely wrong:
>>
>> test=# select * from c order by 1,2;
>>  last_name | first_name
>> ---+
>>  smith | bob
>>  smith | peter
>>  SMITH | EDWARD
>> (3 rows)
> 
> Agreed. Definitely a POLA violation.
> 
>> I'm sure the latter is harder to do and slower to execute; but
>> the former just doesn't seem defensible as correct.
> 
> This same gripe is held by the author of that sorting document I
> linked to from the Unicode consortium, with a very similar
> example. So it seems like this could be a win from several
> perspectives, as it would enable the strxfrm() optimisation. I'm
> pretty sure that pg_upgrade wouldn't be very happy about this, so
> we'd have to have a legacy compatibility mode.
 
At a minimum, it could require rebuilding indexes with multiple
columns where any indexed value before the last index column can be
equivalent but not equal.
 
-Kevin



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Andres Freund
Hi,

On Tuesday, June 19, 2012 06:11:20 PM Tom Lane wrote:
> Andres Freund  writes:
> > On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
> >>> ...  (If you are thinking
> >>> of something sufficiently high-level that merging could possibly work,
> >>> then it's not WAL, and we shouldn't be trying to make the WAL
> >>> representation cater for it.)
> > 
> > Do you really see this as such a big problem?
> It looks suspiciously like "I have a hammer, therefore every problem
> must be a nail".  I don't like the design concept of cramming logical
> replication records into WAL in the first place.
There are - so far - no specific "logical replication records". It's a
relatively minor amount of additional data under wal_level=logical for
existing records. HEAP_UPDATE gets the old primary key on updates changing the
pkey and HEAP_DELETE always has the pkey. HEAP_INSERT|UPDATE|
DELETE,HEAP2_MULTI_INSERT put their information in an XLogRecData block
separate from the page's, to handle full-page writes. That's it.

I can definitely understand hesitation about that, but I simply see no
realistic way to solve the issues of existing replication solutions otherwise.
Do you have a better idea for solving those than the above, without
significant complications of the backend code and without loads of additional
writes going on?
I *really* would like to hear it if you do.

> However, if we're dead set on doing it that way, let us put information
> that is only relevant to logical replication records into only the
> logical replication records. 
I found, and still find, the idea of having the origin_id in there rather
elegant. If people prefer adding the same block to all of the above xlog
records, I can live with that and will then do so. It makes some things more
complicated, but it's not too bad.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

2012-06-19 Thread Kohei KaiGai
Hi Fujita-san,

Could you rebase this patch against the latest tree?
I could not apply the patch cleanly.

I looked over the patch and noticed a few points.

At ProcessCopyOptions(), defel->arg can be NIL, can't it?
If so, cstate->convert_binary is not a suitable flag to check for a
redundant option. It seems to me cstate->convert_selectively
is a more suitable flag for that check.

+   else if (strcmp(defel->defname, "convert_binary") == 0)
+   {
+   if (cstate->convert_binary)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("conflicting or redundant options")));
+   cstate->convert_selectively = true;
+   if (defel->arg == NULL || IsA(defel->arg, List))
+   cstate->convert_binary = (List *) defel->arg;
+   else
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("argument to option \"%s\" must be a
list of column names",
+   defel->defname)));
+   }


At NextCopyFrom(), the routine computes default values if configured.
When these values are not referenced, it might be possible to
skip the unnecessary calculations.
Would it be possible to add logic to avoid constructing cstate->defmap
for unreferenced columns at BeginCopyFrom()?

Thanks,

2012/5/11 Etsuro Fujita :
>> -Original Message-
>> From: Robert Haas [mailto:robertmh...@gmail.com]
>> Sent: Friday, May 11, 2012 1:36 AM
>> To: Etsuro Fujita
>> Cc: pgsql-hackers@postgresql.org
>> Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
>> foreign tables
>>
>> On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita
> 
>> wrote:
>> > I would like to propose to improve parsing efficiency of
>> > contrib/file_fdw by selective parsing proposed by Alagiannis et
>> > al.[1], which means that for a CSV/TEXT file foreign table, file_fdw
>> > performs binary conversion only for the columns needed for query
>> > processing.  Attached is a WIP patch implementing the feature.
>>
>> Can you add this to the next CommitFest?  Looks interesting.
>
> Done.
>
> Best regards,
> Etsuro Fujita
>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
>> Company
>>



-- 
KaiGai Kohei 



Re: [HACKERS] sortsupport for text

2012-06-19 Thread Peter Geoghegan
On 19 June 2012 16:17, Kevin Grittner  wrote:
> Peter Geoghegan  wrote:
>
>> So, just to give a bit more weight to my argument that we should
>> recognise that equivalent strings ought to be treated identically
>
> Since we appear to be questioning everything in this area, I'll
> raise something which has been bugging me for a while: in some other
> systems I've used, the "tie-breaker" comparison for equivalent
> values comes after equivalence sorting on *all* sort keys, rather
> than *each* sort key.

Are you sure that they actually have a tie-breaker, and don't just
make the distinction between equality and equivalence (if only
internally)? I would have checked that myself already, but I don't
have access to any other RDBMS that I'd expect to care about these
kinds of distinctions. They make sense for ensuring that the text
comparator's notion of equality is consistent with text's general
notion (if that's bitwise equality, which I suspect it is in these
other products too for the same reasons it is for us). I don't see why
you'd want a tie-breaker across multiple keys. I mean, you could, I
just don't see any reason to.

> test=# select * from c order by 2;
>  last_name | first_name
> ---+
>  smith     | bob
>  SMITH     | EDWARD
>  smith     | peter
> (3 rows)
>
> This seems completely wrong:
>
> test=# select * from c order by 1,2;
>  last_name | first_name
> ---+
>  smith     | bob
>  smith     | peter
>  SMITH     | EDWARD
> (3 rows)

Agreed. Definitely a POLA violation.

> I'm sure the latter is harder to do and slower to execute; but the
> former just doesn't seem defensible as correct.

This same gripe is held by the author of that sorting document I
linked to from the Unicode consortium, with a very similar example. So
it seems like this could be a win from several perspectives, as it
would enable the strxfrm() optimisation. I'm pretty sure that
pg_upgrade wouldn't be very happy about this, so we'd have to have a
legacy compatibility mode.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] pgsql_fdw in contrib

2012-06-19 Thread Merlin Moncure
On Tue, Jun 19, 2012 at 10:15 AM, Kohei KaiGai  wrote:
> Let me push pgsql_fdw in core from a different perspective.
>
> Right now, FDW is a feature that will take many enhancements in
> the near future, like join-pushdown, writable APIs, and so on.
> If we did not have a FDW extension in core that communicates
> with an actual RDBMS, instead of flat files, it would make our
> development efforts more complex. Consider a scenario where people
> always have to submit two patches: one for the core, and the other
> for a particular out-of-tree extension.

yeah. plus, it's nice to have a very high quality fdw implementation
in core for others to crib from.  I have no objection -- I just took a
convenient opportunity to hijack the thread for some hand-wavy ideas
about an extension repository. :-).

merlin



Re: [HACKERS] [PATCH 10/16] Introduce the concept that wal has a 'origin' node

2012-06-19 Thread Tom Lane
Andres Freund  writes:
> On Tuesday, June 19, 2012 04:30:59 PM Tom Lane wrote:
>>> ...  (If you are thinking
>>> of something sufficiently high-level that merging could possibly work,
>>> then it's not WAL, and we shouldn't be trying to make the WAL
>>> representation cater for it.)

> Do you really see this as such a big problem?

It looks suspiciously like "I have a hammer, therefore every problem
must be a nail".  I don't like the design concept of cramming logical
replication records into WAL in the first place.

However, if we're dead set on doing it that way, let us put information
that is only relevant to logical replication records into only the
logical replication records.  Saving a couple bytes in each such record
is penny-wise and pound-foolish, I'm afraid; especially when you're
nailing down hard, unexpansible limits at the very beginning of the
development process in order to save those bytes.

regards, tom lane



Re: [HACKERS] Transactions over pathological TCP connections

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 1:56 AM, Tom Lane  wrote:
> The transaction would be committed before a command success report is
> delivered to the client, so I don't think delivered-and-not-marked is
> possible.

...unless you have configured synchronous_commit=off, or fsync=off.

Or unless your disk melts into a heap of slag and you have to restore
from backup.  You can protect against that last case using synchronous
replication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Skip checkpoint on promoting from streaming replication

2012-06-19 Thread Fujii Masao
On Tue, Jun 19, 2012 at 5:30 PM, Kyotaro HORIGUCHI
 wrote:
> Thank you.
>
>> What happens if the server skips an end-of-recovery checkpoint,
>> is promoted to the master, runs some write transactions,
>> crashes and restarts automatically before it completes a
>> checkpoint? In this case, the server needs to do crash recovery
>> from the last checkpoint record with the old timeline ID to the
>> latest WAL record with the new timeline ID. How does crash recovery
>> work across timelines?
>
> Basically the same as archive recovery as far as I saw. It is
> already implemented to work in that way.
>
> After this patch is applied, StartupXLOG() gets its
> recoveryTargetTLI from the new field latestTLI in the control
> file instead of from the latest checkpoint. And the latest checkpoint
> record still carries its TLI and WAL location as before, but its TLI
> no longer has a significant meaning in the recovery sequence.
>
> Consider the following case:
>
>        |--- seg 1 ---|--- seg 2 ---|
> TLI 1  |....C........|....P....    |
> TLI 2                |........X    |
>
> * C - checkpoint, P - promotion, X - crash just after here
>
> This shows the situation where the latest checkpoint (restartpoint)
> was taken at TLI=1/SEG=1/OFF=4, promotion happened at
> TLI=1/SEG=2/OFF=5, and the crash occurred just after TLI=2/SEG=2/OFF=8.
> Promotion itself inserts no WAL records but creates a copy of the
> current segment for the new TLI. The file for TLI=2/SEG=1 should not
> exist. (Who would create it?)
>
> The control file will look as follows:
>
> latest checkpoint : TLI=1/SEG=1/OFF=4
> latest TLI        : 2
>
> So the crash recovery sequence starts from SEG=1/LOC=4.
> expectedTLIs will be (2, 1) so 1 will naturally be selected as
> the TLI for SEG1 and 2 for SEG2 in XLogFileReadAnyTLI().
>
> Looking closer, startup constructs expectedTLIs by reading the
> timeline history file corresponding to the recoveryTargetTLI. It
> then runs the recovery sequence from the redo point of the latest
> checkpoint, using the WAL with the largest TLI - which is
> distinguished by its file name, not its header - within
> expectedTLIs in XLogPageRead(). The only difference from archive
> recovery is that XLogFileReadAnyTLI() reads only the WAL files
> already sitting in the pg_xlog directory, and does not reach into
> the archive. The pages with the new TLI will naturally be picked up
> as mentioned above, and recovery will stop at the last readable
> record.
>
> The latestTLI field in the control file is updated just after the
> TLI is incremented and the new WAL file with the new TLI is
> created. So the crash recovery sequence won't stop before reaching
> the WAL with the new TLI designated in the control file.

Is it guaranteed that all the files (e.g., the latest timeline history file)
required for such crash recovery exist in pg_xlog? If yes, your
approach might work well.
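
For what it's worth, the selection rule described above boils down to
newest-timeline-wins per segment. A toy sketch of just that rule (not the
actual XLogFileReadAnyTLI code), wired to the example scenario:

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for "does pg_xlog have this TLI/segment file?":
 * seg 1 exists only on TLI 1, seg 2 on TLI 1 and TLI 2. */
static bool segment_exists(unsigned tli, unsigned seg)
{
    if (seg == 1)
        return tli == 1;
    if (seg == 2)
        return tli == 1 || tli == 2;
    return false;
}

/* expectedTLIs is ordered newest-first, so the first timeline that
 * actually has the segment on disk wins. */
static int pick_timeline(const unsigned *expected, int n, unsigned seg)
{
    for (int i = 0; i < n; i++)
        if (segment_exists(expected[i], seg))
            return (int) expected[i];
    return -1;                  /* no file for this segment at all */
}

int main(void)
{
    unsigned expected[] = {2, 1};

    printf("seg 1 read with TLI %d\n", pick_timeline(expected, 2, 1)); /* 1 */
    printf("seg 2 read with TLI %d\n", pick_timeline(expected, 2, 2)); /* 2 */
    return 0;
}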

Regards,

-- 
Fujii Masao



Re: [HACKERS] WAL format changes

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
 wrote:
> Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
> a uint64, on top of my other WAL format patches. I think we should go ahead
> with this.

+1.

> The LSNs on pages are still stored in the old format, to avoid changing the
> on-disk format and breaking pg_upgrade. The XLogRecPtrs stored in the
> control file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
> is required.

Seems fine.
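
For illustration, a rough standalone sketch of such a shim (hypothetical
names, not the actual PageGetLSN/PageSetLSN code):

#include <stdint.h>
#include <stdio.h>

/* Old on-page layout: two 32-bit halves, kept for pg_upgrade's sake. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} PageXLogRecPtr;

static uint64_t page_get_lsn(PageXLogRecPtr p)
{
    return ((uint64_t) p.xlogid << 32) | p.xrecoff;
}

static PageXLogRecPtr page_set_lsn(uint64_t lsn)
{
    PageXLogRecPtr p = {(uint32_t) (lsn >> 32), (uint32_t) lsn};

    return p;
}

int main(void)
{
    PageXLogRecPtr p = page_set_lsn(((uint64_t) 1 << 32) | 0x2A3B4C5D);

    /* the familiar %X/%X output still falls out of the two halves */
    printf("%X/%X = %llu\n", (unsigned) p.xlogid, (unsigned) p.xrecoff,
           (unsigned long long) page_get_lsn(p));
    return 0;
}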

> Should we keep the old representation in the replication protocol messages?
> That would make it simpler to write a client that works with different
> server versions (like pg_receivexlog). Or, while we're at it, perhaps we
> should mandate network-byte order for all the integer and XLogRecPtr fields
> in the replication protocol. That would make it easier to write a client
> that works across different architectures, in >= 9.3. The contents of the
> WAL would of course be architecture-dependent, but it would be nice if
> pg_receivexlog and similar tools could nevertheless be
> architecture-independent.

I share Andres' question about how we're doing this already.  I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now.  At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

> I kept the %X/%X representation in error messages etc. I'm quite used to
> that output, so reluctant to change it, although it's a bit silly now that
> it represents just a 64-bit value. Using UINT64_FORMAT would also make the
> messages harder to translate.

I could go either way on this one, but I have no problem with the way
you did it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pg default's verbosity?

2012-06-19 Thread Robert Haas
On Tue, Jun 19, 2012 at 2:15 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> There might be something to the idea of demoting a few of the things
>> we've traditionally had as NOTICEs, though.  IME, the following two
>> messages account for a huge percentage of the chatter:
>
>> NOTICE:  CREATE TABLE will create implicit sequence "foo_a_seq" for
>> serial column "foo.a"
>> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
>> "foo_pkey" for table "foo"
>
> Personally, I'd have no problem with flat-out dropping (not demoting)
> both of those two specific messages.  I seem to recall that Bruce has
> lobbied for them heavily in the past, though.

My vote would be to make 'em DEBUG1, and similarly with the UNIQUE
message that Fabian mentions downthread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL format changes

2012-06-19 Thread Andres Freund
On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:
> On 18.06.2012 21:08, Heikki Linnakangas wrote:
> > On 18.06.2012 21:00, Robert Haas wrote:
> >> On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund
> >> 
> >> wrote:
>  1. Use a 64-bit segment number, instead of the log/seg combination.
>  And don't waste the last segment on each logical 4 GB log file. The
>  concept of a "logical log file" is now completely gone. XLogRecPtr is
>  unchanged,
>  but it should now be understood as a plain 64-bit value, just split
>  into
>  two 32-bit integers for historical reasons. On disk, this means that
>  there will be log files ending in FF, those were skipped before.
> >>> 
> >>> What's the reason for keeping that awkward split now? There aren't
> >>> that many
> >>> users of xlogid/xcrecoff and many of those would be better served by
> >>> using
> >>> helper macros.
> >> 
> >> I wondered that, too. There may be a good reason for keeping it split
> >> up that way, but we at least oughta think about it a bit.
> > 
> > The page header contains an XLogRecPtr (LSN), so if we change it we'll
> > have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
> > around as the on-disk representation, and convert between the 64-bit
> > integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
> > many xlog calculations would admittedly be simpler if it was an uint64.
> 
> Well, that was easier than I thought. Attached is a patch to make
> XLogRecPtr a uint64, on top of my other WAL format patches. I think we
> should go ahead with this.
Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure 
if having two representations which just have a constant factor in between
really makes sense.

> The LSNs on pages are still stored in the old format, to avoid changing
> the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored in
> the control file and WAL are changed, however, so an initdb (or at least
> pg_resetxlog) is required.
Sounds sensible.

> Should we keep the old representation in the replication protocol
> messages? That would make it simpler to write a client that works with
> different server versions (like pg_receivexlog). Or, while we're at it,
> perhaps we should mandate network-byte order for all the integer and
> XLogRecPtr fields in the replication protocol.
The replication protocol uses pq_sendint for integers, which should take care
of converting to big endian already. I don't think anything but the WAL itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?
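
For illustration, a standalone sketch of that idea for a 64-bit position
(hypothetical helper, not the actual protocol code): emit the two 32-bit
halves with htonl() so the wire format is the same on any host.

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write a 64-bit WAL position in network byte order, in the same
 * spirit as pq_sendint's handling of 32-bit protocol integers. */
static void put_uint64_be(unsigned char *buf, uint64_t v)
{
    uint32_t hi = htonl((uint32_t) (v >> 32));
    uint32_t lo = htonl((uint32_t) v);

    memcpy(buf, &hi, 4);
    memcpy(buf + 4, &lo, 4);
}

int main(void)
{
    unsigned char buf[8];

    put_uint64_be(buf, ((uint64_t) 1 << 32) | 0x2A);
    for (int i = 0; i < 8; i++)
        printf("%02X", buf[i]);     /* 000000010000002A on any host */
    printf("\n");
    return 0;
}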

I don't really see a point in keeping around a backward-compatible 
representation just for the sake of running such tools on multiple versions. I 
might not be pragmatic enough, but: Why would you want to do that *at the 
moment*? Many of the other tools are already version specific, so...
When the protocol starts to be used by more tools, maybe, but IMO we're not
there yet.

But then it's not hard to convert to the old representation for that.

> I kept the %X/%X representation in error messages etc. I'm quite used to
> that output, so reluctant to change it, although it's a bit silly now
> that it represents just a 64-bit value. Using UINT64_FORMAT would also
> make the messages harder to translate.
No opinion on that. It's easier for me to see whether two values are exactly
the same or very similar with the 64-bit representation, but it's harder to
gauge bigger differences. So ...

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-06-19 Thread Robert Haas
On Mon, Jun 18, 2012 at 3:30 PM, Alvaro Herrera
 wrote:
> Excerpts from Alex Hunsaker's message of vie feb 10 16:53:05 -0300 2012:
>
>> Seems like we missed the fact that we still did SvUTF8_on() in sv2cstr
>> and SvPVUTF8() when turning a perl string into a cstring.
>
> Hmm, this patch belongs in the back branches too, right?  Not just the
> current development tree?

It seems like a bug fix to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] libpq compression

2012-06-19 Thread Robert Haas
On Mon, Jun 18, 2012 at 1:42 PM, Martijn van Oosterhout
 wrote:
> On Sun, Jun 17, 2012 at 12:29:53PM -0400, Tom Lane wrote:
>> The fly in the ointment with any of these ideas is that the "configure
>> list" is not a list of exact cipher names, as per Magnus' comment that
>> the current default includes tests like "!aNULL".  I am not sure that
>> we know how to evaluate such conditions if we are applying an
>> after-the-fact check on the selected cipher.  Does OpenSSL expose any
>> API for evaluating whether a selected cipher meets such a test?
>
> I'm not sure whether there's an API for it, but you can certainly check
> manually with "openssl ciphers -v", for example:
>
> $ openssl ciphers -v 'ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP'
> NULL-SHA                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=SHA1
> NULL-MD5                SSLv3 Kx=RSA      Au=RSA  Enc=None      Mac=MD5
>
> ...etc...
>
> So unless openssl includes the code twice, there must be a way to
> extract the list from the library.

There doubtless is, but I'd be willing to wager that you won't be
able to figure out the exact method without reading the source code
for 'openssl ciphers' to see how it was done there, and most likely
you'll find that at least one of the functions they use has no man
page.  Documentation isn't their strong point.
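
One plausible after-the-fact check, sketched with only documented calls
(no idea whether this is how 'openssl ciphers' does it): load the spec
into a throwaway SSL_CTX and see whether the negotiated cipher's name
appears in the resulting permitted list.

#include <string.h>
#include <openssl/ssl.h>

/* Does the named cipher satisfy a spec string like "HIGH:!aNULL"?
 * Sketch only: error handling is minimal. */
static int cipher_matches_spec(const char *cipher_name, const char *spec)
{
    int      ok = 0;
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_method());

    if (ctx && SSL_CTX_set_cipher_list(ctx, spec) == 1)
    {
        SSL *ssl = SSL_new(ctx);

        if (ssl)
        {
            STACK_OF(SSL_CIPHER) *allowed = SSL_get_ciphers(ssl);

            for (int i = 0; i < sk_SSL_CIPHER_num(allowed); i++)
            {
                const SSL_CIPHER *c = sk_SSL_CIPHER_value(allowed, i);

                if (strcmp(SSL_CIPHER_get_name(c), cipher_name) == 0)
                {
                    ok = 1;
                    break;
                }
            }
            SSL_free(ssl);
        }
    }
    if (ctx)
        SSL_CTX_free(ctx);
    return ok;
}

int main(void)
{
    SSL_library_init();
    return !cipher_matches_spec("AES256-SHA", "HIGH:!aNULL");
}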

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] initdb and fsync

2012-06-19 Thread David Fetter
On Mon, Jun 18, 2012 at 09:34:30PM +0300, Peter Eisentraut wrote:
> On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > - defaulting to initdb -N in the regression suite is not a good idea imo,
> > because that way the buildfarm won't catch problems in that area...
> > 
> The regression test suite also starts postgres with fsync off.  This is
> good, and I don't want to have more overhead in the tests.  It's not
> like we already have awesome test coverage for OS interaction.

What sorts of coverage would we want?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] sortsupport for text

2012-06-19 Thread Kevin Grittner
Peter Geoghegan  wrote:
 
> So, just to give a bit more weight to my argument that we should
> recognise that equivalent strings ought to be treated identically
 
Since we appear to be questioning everything in this area, I'll
raise something which has been bugging me for a while: in some other
systems I've used, the "tie-breaker" comparison for equivalent
values comes after equivalence sorting on *all* sort keys, rather
than *each* sort key.  For example, this much makes sense with
lc_collate = 'en_US.UTF-8':
 
test=# create table c (last_name text not null, first_name text);
CREATE TABLE
test=# insert into c values ('smith', 'bob'), ('smith', 'peter'),
('SMITH', 'EDWARD');
INSERT 0 3
test=# select * from c order by 2;
 last_name | first_name 
---+
 smith | bob
 SMITH | EDWARD
 smith | peter
(3 rows)
 
This seems completely wrong:
 
test=# select * from c order by 1,2;
 last_name | first_name 
---+
 smith | bob
 smith | peter
 SMITH | EDWARD
(3 rows)
 
I have seen other databases which get it in the order I would expect
-- where the C compare only matters within groups of equivalent
rows.  It seems that PostgreSQL effectively orders by:
 
  last_name using collation 'en_US.UTF-8'
  last_name using collation 'C'
  first_name using collation 'en_US.UTF-8'
  first_name using collation 'C'
 
while some other products order by:
 
  last_name using collation 'en_US.UTF-8'
  first_name using collation 'en_US.UTF-8'
  last_name using collation 'C'
  first_name using collation 'C'
 
I'm sure the latter is harder to do and slower to execute; but the
former just doesn't seem defensible as correct.
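
A toy standalone illustration of the two strategies, with strcasecmp()
standing in (crudely) for collation equivalence and strcmp() as the C
tie-breaker; the case ordering differs from en_US, but the grouping
behaviour is the point:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

typedef struct
{
    const char *last;
    const char *first;
} Row;

/* Tie-break each key before moving on (PostgreSQL's behaviour). */
static int cmp_per_key(const void *pa, const void *pb)
{
    const Row *a = pa, *b = pb;
    int r = strcasecmp(a->last, b->last);

    if (r == 0) r = strcmp(a->last, b->last);
    if (r == 0) r = strcasecmp(a->first, b->first);
    if (r == 0) r = strcmp(a->first, b->first);
    return r;
}

/* Compare all keys for equivalence first; tie-breakers come last. */
static int cmp_all_keys_first(const void *pa, const void *pb)
{
    const Row *a = pa, *b = pb;
    int r = strcasecmp(a->last, b->last);

    if (r == 0) r = strcasecmp(a->first, b->first);
    if (r == 0) r = strcmp(a->last, b->last);
    if (r == 0) r = strcmp(a->first, b->first);
    return r;
}

int main(void)
{
    Row rows[] = {{"smith", "bob"}, {"smith", "peter"}, {"SMITH", "EDWARD"}};

    qsort(rows, 3, sizeof(Row), cmp_per_key);
    for (int i = 0; i < 3; i++)     /* SMITH EDWARD, smith bob, smith peter */
        printf("%s %s\n", rows[i].last, rows[i].first);

    qsort(rows, 3, sizeof(Row), cmp_all_keys_first);
    for (int i = 0; i < 3; i++)     /* smith bob, SMITH EDWARD, smith peter */
        printf("%s %s\n", rows[i].last, rows[i].first);
    return 0;
}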
 
-Kevin



Re: [HACKERS] pgsql_fdw in contrib

2012-06-19 Thread Kohei KaiGai
2012/6/19 Merlin Moncure :
> On Mon, Jun 18, 2012 at 11:01 PM, Robert Haas  wrote:
>> On Mon, Jun 18, 2012 at 12:24 PM, Merlin Moncure  wrote:
>>> I can't help but wonder (having been down the contrib/core/extension
>>> road myself) if it isn't better to improve the facilities to register
>>> and search for qualified extensions (like Perl CPAN) so that people
>>> looking for code to improve their backends can find it.  That way,
>>> you're free to release/do xyz/abandon your project as you see fit
>>> without having to go through -hackers.  This should also remove a lot
>>> of the stigma with not being in core since if stock postgres
>>> installations can access the necessary modules via CREATE EXTENSION, I
>>> think it will make it easier for projects like this to get used with
>>> the additional benefit of decentralizing project management.
>>
>> Well, if you're the type that likes to install everything via RPMs
>> (and I am) then you wouldn't want this behavior, especially not
>> automagically.  It seems to open up a host of security risks, too: I
>> believe Tom has previously stated that Red Hat (and other distro)
>> packaging guidelines forbid packages from installing software in
>> places where they can then turn around and run it.  I suppose CPAN
>> must have some sort of exception to this policy, though, so maybe it's
>> not ironclad.
>
> Huh? CPAN is just one example -- PyPM for python, npm for node etc
> etc.  There is some distinction to that rule that is being missed so
> that it doesn't apply to cases like this -- probably that it is
> transacted by the user and/or requires su.
>
> I think your points are supporting mine: the vast majority of postgres
> administrators deal only with o/s packaged rpms and therefore with the
> status quo are unable to utilize any extension that is not packaged
> with contrib.  This means that there are two possible outcomes if you
> want to write an extension:
>
> 1) get accepted into core
> 2) never get used
>
> Given that running the gauntlet to #1 is not a particularly attractive
> option (or even possible) in many cases for various reasons it tends
> to discourage module development by 3rd parties.  There are several
> very high quality extensions that could get a lot more exposure...for
> example pl/sh.
>
> But anyways, if you're happy with pgsql_fdw (aside: i looked at it,
> and it's great!) in core, then by all means...
>
Let me push pgsql_fdw in core from a different perspective.

Right now, FDW is a feature that will take many enhancements in
the near future, like join-pushdown, writable APIs, and so on.
If we did not have a FDW extension in core that communicates
with an actual RDBMS, instead of flat files, it would make our
development efforts more complex. Consider a scenario where people
always have to submit two patches: one for the core, and the other
for a particular out-of-tree extension.
I believe it is a reasonable choice to have a FDW for PostgreSQL
in core, rather than one for MySQL or Oracle. We can use this
extension to show how a new feature works whenever we try to
extend the FDW APIs.

BTW, Hanada-san,
it seems to me the expected results of the regression test were not
included in this patch. Please submit it again.

Thanks,
-- 
KaiGai Kohei 



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-19 Thread Alvaro Herrera

Excerpts from Boszormenyi Zoltan's message of mar jun 19 04:44:35 -0400 2012:
> 
> On 2012-06-18 19:46, Alvaro Herrera wrote:
> > Excerpts from Boszormenyi Zoltan's message of vie may 11 03:54:13 -0400 
> > 2012:
> >> Hi,
> >>
> >> another rebasing and applied the GIT changes in
> >> ada8fa08fc6cf5f199b6df935b4d0a730aaa4fec to the
> >> Windows implementation of PGSemaphoreTimedLock.
> > Hi,
> >
> > I gave the framework patch a look.  One thing I'm not sure about is the
> > way you've defined the API.  It seems a bit strange to have a nice and
> > clean, generic interface in timeout.h; and then have the internal
> > implementation of the API cluttered with details such as what to do when
> > the deadlock timeout expires.  Wouldn't it be much better if we could
> > keep the details of CheckDeadLock itself within proc.c, for example?
> 
> Do you mean adding a callback function argument for enable_timeout()
> would be better?

Well, that's not exactly what I was thinking, but maybe your idea is
better than what I had in mind.
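
To sketch the shape of that idea (purely hypothetical names, not the
patch's actual API): the generic machinery owns the queue and fin_time
bookkeeping, and each module registers what to do on expiry.

#include <stdbool.h>
#include <stdio.h>

typedef void (*timeout_handler) (void);

typedef struct
{
    bool            active;
    long            fin_time;   /* expiry time, opaque units */
    timeout_handler handler;    /* module-specific expiry action */
} Timeout;

static Timeout timeouts[4];

/* The caller supplies the expiry action, so timeout.c never needs to
 * know about CheckDeadLock and friends. */
static void enable_timeout(int id, long fin_time, timeout_handler handler)
{
    timeouts[id].active = true;
    timeouts[id].fin_time = fin_time;
    timeouts[id].handler = handler;
}

/* Generic SIGALRM-side sweep: fire whatever is due. */
static void check_timeouts(long now)
{
    for (int i = 0; i < 4; i++)
        if (timeouts[i].active && now >= timeouts[i].fin_time)
        {
            timeouts[i].active = false;
            timeouts[i].handler();
        }
}

static void deadlock_expired(void)
{
    puts("proc.c's deadlock check would run here");
}

int main(void)
{
    enable_timeout(0, 100, deadlock_expired);
    check_timeouts(150);
    return 0;
}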

> > Same for the other specific Check*Timeout functions.  It seems to me
> > that the only timeout.c specific requirement is to be able to poke
> > base_timeouts[].indicator and fin_time.  Maybe that can get done in
> > timeout.c and then have the generic checker call the module-specific
> > checker.
> 
> Or instead of static functions, Check* functions can be external
> to timeout.c. It seemed to be a good idea to move all the timeout
> related functions to timeout.c.

Yeah, except that they play with members of the timeout_params struct,
which hopefully does not need to be exported.  So if you can just move
(that is, put back in their original places) the portions that do not
touch that struct, it'd be great.

> > Also, I see you're calling GetCurrentTimestamp() in the checkers; maybe
> > more than once per signal if you have multiple timeouts enabled.
> 
> Actually, GetCurrentTimestamp() is not called multiple times,
> because only the first timeout source in the queue can get triggered.

I see.

> > BTW it doesn't seem that datatype/timestamp.h is really necessary in
> > timeout.h.
> 
> You are wrong. get_timeout_start() returns TimestampTz and it's
> defined in datatype/timestamp.h.

Oh, right.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


