Re: [HACKERS] Get more from indices.
Hello, I've modified the patch to work in such a way. Also, as ISTM the patch is more complicated than what it really does, I've simplified it. I've revised the patch a bit. Please find the patch attached.

Thank you, but it seems to me too simplified. You made two major functional changes. One is, you put the added code for get_relation_info() out of the block for the condition (info->relam == BTREE_AM_OID) (though amcanorder would be preferable..). Anyway, the reason for that placement is to guarantee that a 'full_ordered' index is always orderable. I believe the relation between them is not obvious, so your patch has an opportunity to make a wrong assumption for possible indexes that are unique but not orderable. Going your way, some additional work would be needed to judge whether an index is orderable or not when checking the extensibility of pathkeys. The other is, you changed pathkeys expansion to be an all-or-nothing decision. While this change should simplify the code slightly, it also dismisses the opportunity for partially-extended pathkeys. Could you let me know the reason why you did so? Any thoughts? regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Regression tests failing if not launched on db regression
Hi all, It happens that the following regression tests fail if they are run on a database not named "regression":
- updatable_views
- foreign_data
- sequence
Those tests fail because some relations of information_schema contain information that is database-dependent. Please see the diffs attached. Note that this can easily be reproduced by running pg_regress with a command of this type from src/test/regress:
./pg_regress --inputdir=. --temp-install=./tmp_check \
    --top-builddir=../../.. --dlpath=. \
    --schedule=./parallel_schedule \
    --dbname=foo
IMHO, the regression test suite would gain in consistency and portability if it did not refer to data that is database-dependent. Opinions? A patch fixing that would be trivial to write, and I do not mind writing it. Also, if we consider this a bug, and I think it is, the fix should be back-patched as well. Regards, -- Michael
Attachment: dbname_regressions.diffs
Re: [HACKERS] Regression tests failing if not launched on db regression
On Thu, Dec 5, 2013 at 5:12 PM, Michael Paquier michael.paqu...@gmail.com wrote: Those tests fail because some relations of information_schema contain information that is database-dependent. I forgot to mention that it was our QE team at VMware that reported a portion of this failure to me; I just had to dig a little bit to find the root cause. And for the curious: yes, we run regressions with customized database names. Regards, -- Michael
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/04 16:39), Claudio Freire wrote: On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii is...@postgresql.org wrote: Can we avoid the Linux kernel problem by simply increasing our shared buffer size, say up to 80% of memory? That would make swapping more likely. Is that the case? If the system does not have enough memory, the kernel buffer will be used for other purposes, and the kernel cache will not work very well anyway. In my understanding, the problem is that even if there is enough memory, the kernel's cache does not work as expected. Problem is, Postgres relies on a working kernel cache for checkpoints. Checkpoint logic would have to be heavily reworked to account for an impaired kernel cache. Really, there's no difference between fixing the I/O problems in the kernel(s) vs in postgres. The only difference is, in the kernel(s), everyone profits, and you've got a huge head start. Yes. And using DirectIO efficiently is more difficult than using BufferedIO. If we simply switch PostgreSQL's write() calls to direct IO, it will execute the worst kind of ugly random IO. Communicating more with the kernel (through posix_fadvise, fallocate, aio, iovec, etc...) would probably be good, but it does expose more kernel issues. posix_fadvise, for instance, is a double-edged sword ATM. I do believe, however, that exposing those issues and prompting a fix is far preferable to silently working around them. Agreed. And I believe that well-controlled BufferedIO is faster and easier than perfectly controlled DirectIO. In fact, Oracle's database uses BufferedIO to access small datasets and DirectIO to access big datasets, because that uses the OS file cache more efficiently. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
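To make the posix_fadvise discussion concrete, here is a minimal, hedged sketch of hinting the kernel about a file's access pattern. The helper name hint_access_pattern() is made up for illustration and is not anything from the PostgreSQL tree; only the posix_fadvise() call itself is the real API being discussed:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hint the kernel about caching for a whole file (len = 0 means
 * "to end of file").  POSIX_FADV_WILLNEED asks the kernel to prefetch
 * pages into the cache; POSIX_FADV_DONTNEED asks it to drop them,
 * which is the double-edged behavior the thread complains about.
 * Returns 0 on success, nonzero otherwise.
 */
static int
hint_access_pattern(const char *path, int advice)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = posix_fadvise(fd, 0, 0, advice);
    close(fd);
    return rc;
}
```

Exactly how DONTNEED interacts with pages still referenced by other processes varies across kernel versions, which is part of what makes the call double-edged.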
Re: [HACKERS] Why we are going to have to go DirectIO
On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus j...@agliodbs.com wrote: I also wasn't exaggerating the reception I got when I tried to talk about IO and PostgreSQL at LinuxCon and other events. The majority of Linux hackers I've talked to simply don't want to be bothered with PostgreSQL's performance needs, and I've heard similar things from my colleagues at the MySQL variants. Greg KH was the only real exception. If so, he is a fairly major exception. But there is at least one other major exception: I met Theodore Ts'o at pgConf.EU (he was in town for some Google thing), and he seemed pretty interested in what we had to say, and encouraged us to reach out to the kernel development community. I suspect that we simply haven't gone about it the right way. But you know what? 2.6, overall, still performs better than any kernel in the 3.X series, at least for Postgres. What about the fseek() scalability issue? -- Peter Geoghegan
Re: [HACKERS] Time-Delayed Standbys
On Thu, Dec 5, 2013 at 1:45 AM, Simon Riggs si...@2ndquadrant.com wrote: On 3 December 2013 18:46, Robert Haas robertmh...@gmail.com wrote: On Tue, Dec 3, 2013 at 12:36 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: On Tue, Dec 3, 2013 at 2:33 PM, Christian Kruse christ...@2ndquadrant.com wrote: Hi Fabrizio, looks good to me. I did some testing on 9.2.4, 9.2.5 and HEAD. It applies and compiles w/o errors or warnings. I set up a master and two hot standbys replicating from the master, one with 5 minutes delay and one without delay. After that I created a new database and generated some test data: CREATE TABLE test (val INTEGER); INSERT INTO test (val) (SELECT * FROM generate_series(0, 100)); The non-delayed standby had the data replicated nearly instantly; the delayed standby was replicated after exactly 5 minutes. I did not notice any problems, errors or warnings. Thanks for your review Christian... So, I proposed this patch previously and I still think it's a good idea, but it got voted down on the grounds that it didn't deal with clock drift. I view that as insufficient reason to reject the feature, but others disagreed. Unless some of those people have changed their minds, I don't think this patch has much future here. I had that objection and others. Since then many people have requested this feature and have persuaded me that this is worth having and that my objections are minor points. I now agree with the need for the feature, almost as written. Not recalling the older thread, but it seems this breaks on clock drift; I think we can fairly easily make that situation good enough. Just have IDENTIFY_SYSTEM return the current timestamp on the master, and refuse to start if the time difference is too great. Yes, that doesn't catch the case when the machines are in perfect sync when they start up and drift *later*, but it will catch the most common cases I bet. 
But I think that's good enough that we can accept the feature, given that *most* people will have ntp, and that it's a very useful feature for those people. But we could help people who run into it because of a simple config error.. Or maybe the suggested patch already does this, in which case ignore that part :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
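The sanity check Magnus suggests could be as simple as the sketch below. The function name, the integer-seconds units, and the idea of feeding it a master timestamp hypothetically returned by IDENTIFY_SYSTEM are all illustrative; this is not the actual walreceiver code:

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Compare the master's reported clock against the standby's clock and
 * decide whether apparent drift is small enough to start up.  In a
 * real implementation the threshold would presumably be derived from
 * the configured apply delay (e.g. refuse if drift is a significant
 * fraction of it).
 */
static bool
clock_drift_acceptable(long master_ts_sec, long standby_ts_sec,
                       long max_drift_sec)
{
    long drift = labs(master_ts_sec - standby_ts_sec);

    return drift <= max_drift_sec;
}
```

As noted above, this only catches drift present at connection time, not drift that develops later.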
Re: [HACKERS] Parallel Select query performance and shared buffers
- When we increased NUM_BUFFER_PARTITIONS to 1024, the problem disappeared on 8-core machines but came back on 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism? If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we see tons of contention with the default NUM_BUFFER_PARTITIONS value of 16:
$ tail -f /tmp/logfile | grep lwlock | egrep -v "blk 0"
...
PID 15965 lwlock 0: shacq 0 exacq 33 blk 2
PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192
PID 15965 lwlock 35: shacq 14159 exacq 27397 blk 5426
PID 15965 lwlock 36: shacq 14111 exacq 27322 blk 4959
PID 15965 lwlock 37: shacq 14211 exacq 27507 blk 4370
PID 15965 lwlock 38: shacq 14110 exacq 27294 blk 3980
PID 15965 lwlock 39: shacq 13962 exacq 27027 blk 3719
PID 15965 lwlock 40: shacq 14023 exacq 27156 blk 3273
PID 15965 lwlock 41: shacq 14107 exacq 27309 blk 3201
PID 15965 lwlock 42: shacq 14120 exacq 27304 blk 2904
PID 15965 lwlock 43: shacq 14007 exacq 27129 blk 2740
PID 15965 lwlock 44: shacq 13948 exacq 27027 blk 2616
PID 15965 lwlock 45: shacq 14041 exacq 27198 blk 2431
PID 15965 lwlock 46: shacq 14067 exacq 27277 blk 2345
PID 15965 lwlock 47: shacq 14050 exacq 27203 blk 2106
PID 15965 lwlock 48: shacq 13910 exacq 26910 blk 2155
PID 15965 lwlock 49: shacq 14170 exacq 27360 blk 1989
After we increased NUM_BUFFER_PARTITIONS to 1024, lock contention decreased:
...
PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1
PID 25220 lwlock 1001: shacq 198 exacq 394 blk 1
PID 25220 lwlock 1002: shacq 203 exacq 404 blk 1
PID 25220 lwlock 1003: shacq 226 exacq 452 blk 1
PID 25220 lwlock 1004: shacq 235 exacq 470 blk 1
PID 25220 lwlock 1006: shacq 226 exacq 452 blk 2
PID 25220 lwlock 1007: shacq 214 exacq 428 blk 1
PID 25220 lwlock 1008: shacq 225 exacq 448 blk 1
PID 25220 lwlock 1010: shacq 209 exacq 418 blk 1
PID 25220 lwlock 1015: shacq 199 exacq 398 blk 1
PID 25220 lwlock 1016: shacq 214 exacq 426 blk 1
PID 25220 lwlock 1018: shacq 230 exacq 456 blk 1
PID 25220 lwlock 1019: shacq 222 exacq 444 blk 3
PID 25220 lwlock 1023: shacq 262 exacq 524 blk 1
PID 25220 lwlock 1027: shacq 213 exacq 426 blk 1
PID 25220 lwlock 1028: shacq 246 exacq 491 blk 1
PID 25220 lwlock 1029: shacq 226 exacq 452 blk 1
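For readers following along: NUM_BUFFER_PARTITIONS matters because each buffer-mapping hashcode is reduced to one of that many partition locks, so fewer partitions means more backends colliding on the same lock. A toy sketch of the mapping (in the real tree this is the BufTableHashPartition() macro in buf_internals.h):

```c
#include <stdint.h>

/* The default partition count being discussed in this thread. */
#define NUM_BUFFER_PARTITIONS 16

/*
 * Map a buffer-tag hashcode to a buffer-mapping partition.  With a
 * power-of-two partition count the modulo reduces to a bit mask, so
 * raising the count (e.g. to 1024) costs little per lookup but spreads
 * contention across many more locks.
 */
static uint32_t
buffer_hash_partition(uint32_t hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}
```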
Re: [HACKERS] Parallel Select query performance and shared buffers
On 2013-12-05 11:15:20 +0200, Metin Doslu wrote: - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem disappeared on 8-core machines but came back on 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism? If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we see tons of contention with the default NUM_BUFFER_PARTITIONS value of 16: Is your workload bigger than RAM? I think a good bit of the contention you're seeing in that listing is populating shared_buffers - and might actually vanish once you're halfway cached. From what I've seen so far the bigger problem than contention in the lwlocks itself, is the spinlock protecting the lwlocks... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Parallel Select query performance and shared buffers
Is your workload bigger than RAM? RAM is bigger than the workload (by more than a couple of times). I think a good bit of the contention you're seeing in that listing is populating shared_buffers - and might actually vanish once you're halfway cached. From what I've seen so far the bigger problem than contention in the lwlocks itself, is the spinlock protecting the lwlocks... Could you clarify a bit what you mean by "halfway cached" and "spinlock protecting the lwlocks"?
Re: [HACKERS] Parallel Select query performance and shared buffers
On 2013-12-05 11:33:29 +0200, Metin Doslu wrote: Is your workload bigger than RAM? RAM is bigger than the workload (by more than a couple of times). I think a good bit of the contention you're seeing in that listing is populating shared_buffers - and might actually vanish once you're halfway cached. From what I've seen so far the bigger problem than contention in the lwlocks itself, is the spinlock protecting the lwlocks... Could you clarify a bit what you mean by "halfway cached"? Well, your stats showed a) fairly low lock counts overall b) a high percentage of exclusive locks. a) indicates the system wasn't running long. b) tells me there were lots of changes to the buffer mapping - which basically only happens if a buffer is placed in or removed from shared_buffers. If your shared_buffers is big enough to contain most of the data you shouldn't see many exclusive locks in comparison to the number of shared locks. And "spinlock protecting the lwlocks"? Every LWLock has an internal spinlock to protect its state. So whenever somebody does a LWLockAcquire()/Release(), even if only in shared mode, we currently acquire that spinlock, manipulate the LWLock's state, and release the spinlock again. In lots of workloads that internal spinlock is the contention point, not the length over which the lwlock is held. Especially when they are mostly held in shared mode. Makes sense? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
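The structure Andres describes can be modeled in a few lines. This is a toy, not the actual lwlock.c code: the point is that even a shared acquisition must first take the internal spinlock to touch the state, so that spinlock's cache line becomes the contention point under heavy shared use:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of an LWLock: an internal spinlock guarding its state. */
typedef struct
{
    atomic_flag mutex;      /* the internal spinlock */
    int         shared;     /* number of shared holders */
    bool        exclusive;  /* is an exclusive holder present? */
} ToyLWLock;

/*
 * Acquire in shared mode.  Note that we spin on the internal mutex
 * even though shared holders do not conflict with each other; that is
 * exactly the cost being discussed in the thread.
 */
static bool
toy_lwlock_acquire_shared(ToyLWLock *lock)
{
    bool ok;

    while (atomic_flag_test_and_set(&lock->mutex))
        ;                       /* spin */
    ok = !lock->exclusive;      /* shared conflicts only with exclusive */
    if (ok)
        lock->shared++;
    atomic_flag_clear(&lock->mutex);
    return ok;
}
```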
Re: [HACKERS] same-address mappings vs. relative pointers
Hi Robert, On 2013-12-04 23:32:27 -0500, Robert Haas wrote: But I'm also learning painfully that this kind of thing only goes so far. For example, I spent some time looking at what it would take to provide a dynamic shared memory equivalent of palloc/pfree, a facility that I feel fairly sure would attract a few fans. I have to admit, I don't fully see the point in a real palloc() equivalent for shared memory. If you're allocating shared memory in a granular fashion at somewhat high frequency you're doing something *wrong*. And the result is not going to be scalable. Well, for regular palloc, we store a pointer to the memory context before the beginning of the chunk, so that when you call pfree you can find the memory context and stick the chunk on one of its free lists. So there are two pointers there: the pointer to the context is a pointer, of course, but so is the free list. Heh, heh. This is one of those times where life really, really would be easier if we were using C++. Just have a small smart-pointer wrapper, and we wouldn't need to worry too much. I still have mixed feelings about the idea of same-address mappings. On the one hand, on 64-bit architectures, there's a huge amount of address space available. Assuming the OS does something even vaguely sensible in terms of laying out the text, heap, stack, and shared library mappings, there ought to be many petabytes of address space that never gets touched, and it's hard to see why we couldn't find some place in there to stick our stuff. It's actually quite a bit short of petabytes. Most x86-64 CPUs have 48 bits of virtual address space, with the OS eating up a fair bit of that (so it doesn't need to tear down TLBs in system calls). I think cross-platform you end up with something around 8TB, up to 64TB on other OSs. Now, there are some CPUs coming out with bigger virtual memory sizes, but it's going to be longer than I am willing to wait till we are there. 
Now, on the other hand, as far as dynamic shared memory allocation and freeing is concerned, there aren't really THAT many places where I need a pointer, so using Size or uint64 or something to store an offset instead is annoying, but I have an idea how to do this that only uses pointers in a couple of places, so I think it can be made to work. I am not sure how much complaint that will provoke, though. And even if I do it, the first poor slob that wants to store a linked list or so in dynamic shared memory is going to be unhappy if they can't get a same-address mapping. Maybe that's OK; using linked lists in shared memory might not be a great idea in the first place. I'm sure there will be more than one person who wants to do it, though. Just because people want it, doesn't make it worthwhile to provide it ;) Any thoughts on what the least painful compromise is here? I think a reasonable route is having some kind of smart pointer on the C level that abstracts away the offset math and allows the use of pointers locally. Something like void *sptr_deref(sptr *); where the returned pointer can be used as long as it is purely in memory. And sptr_update(sptr *, void *); which allows an sptr to point elsewhere in the same segment. Plus lots of smarts. I continue to believe that a) using pointers in dynamically allocated segments is going to end up with lots of pain, and b) the pain from not having real pointers is manageable. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
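The sptr idea can be sketched in a few lines. Unlike the signatures Andres proposes, this toy version passes the segment base explicitly rather than deriving it from segment bookkeeping, and all of the "smarts" are omitted; the names are taken from the proposal, not from any existing code:

```c
#include <stddef.h>

/*
 * A relative pointer: an offset from the start of a shared memory
 * segment.  Because it stores no absolute address, it stays valid
 * even when the segment is mapped at different addresses in
 * different processes.
 */
typedef struct
{
    size_t off;
} sptr;

/* Turn a relative pointer back into a usable local pointer. */
static void *
sptr_deref(void *segment_base, const sptr *p)
{
    return (char *) segment_base + p->off;
}

/* Point an sptr at another location within the same segment. */
static void
sptr_update(void *segment_base, sptr *p, void *target)
{
    p->off = (size_t) ((char *) target - (char *) segment_base);
}
```

A linked list built this way stores sptr links instead of raw pointers, which is exactly the inconvenience the thread weighs against same-address mappings.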
Re: [HACKERS] Time-Delayed Standbys
On 5 December 2013 08:51, Magnus Hagander mag...@hagander.net wrote: Not recalling the older thread, but it seems this breaks on clock drift; I think we can fairly easily make that situation good enough. Just have IDENTIFY_SYSTEM return the current timestamp on the master, and refuse to start if the time difference is too great. Yes, that doesn't catch the case when the machines are in perfect sync when they start up and drift *later*, but it will catch the most common cases I bet. But I think that's good enough that we can accept the feature, given that *most* people will have ntp, and that it's a very useful feature for those people. But we could help people who run into it because of a simple config error.. Or maybe the suggested patch already does this, in which case ignore that part :) I think the very nature of *this* feature is that it doesn't *require* the clocks to be exactly in sync, even though that is the basis for measurement. The setting of this parameter for sane usage would be a minimum of 5 minutes, but more likely 30 minutes, 1 hour or more. In that case, a few seconds of drift either way makes no real difference to this feature. So IMHO, without prejudice to other features that may be more critically reliant on time synchronisation, we are OK to proceed with this specific feature. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
Hello 2013/12/5 Peter Eisentraut pete...@gmx.net Can someone in this thread clarify the commit fest situation? I see two entries that appear to be the same: https://commitfest.postgresql.org/action/patch_view?id=1174 https://commitfest.postgresql.org/action/patch_view?id=1175 I think the first one is a duplicate or obsolete. No, both are valid, and each solves a different issue. https://commitfest.postgresql.org/action/patch_view?id=1175 implements fully fault-tolerant DROP IF EXISTS statements. This patch is a prerequisite for the second patch, https://commitfest.postgresql.org/action/patch_view?id=1174, which is an implementation of a new pg_dump option, --if-exists. This option ensures that the dump uses fault-tolerant DROP statements. Regards Pavel
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
2013/12/5 Dean Rasheed dean.a.rash...@gmail.com On 5 December 2013 01:33, Peter Eisentraut pete...@gmx.net wrote: Can someone in this thread clarify the commit fest situation? I see two entries that appear to be the same: https://commitfest.postgresql.org/action/patch_view?id=1174 https://commitfest.postgresql.org/action/patch_view?id=1175 I think the first one is a duplicate or obsolete. #1174 looks to be a separate feature. I don't think it's dependent on #1175 from a code standpoint, but it probably needs it to work properly in all situations. I think #1175 is close to being ready for commit. Pavel, will you produce an updated patch based on our last discussion? I'll set this patch to waiting on author. I expected that your version was final. I have no problem doing further enhancement myself, but I don't fully understand your last proposal. Can you specify it in more detail, please? Regards Pavel Regards, Dean
Re: [HACKERS] Changes in Trigger Firing
I read somewhere that the best editor is the one you master (1) :) 1: http://www.postgresql.org/message-id/m2wrs6giyp@hi-media.com Thanks, I am using Eclipse now. Any comments about the utility of this feature? Or is it just me who thinks this can be useful? I think users/developers of trigger-based replication tools can benefit from this. I am not sure how replication tools like Slony handle a scenario where a replicated table is partitioned and I add a new partition to it. Regards Sameer
Re: [HACKERS] Performance optimization of btree binary search
On Wed, Dec 4, 2013 at 5:28 PM, Peter Geoghegan p...@heroku.com wrote: I'm also curious about the impact on insertion into primary key indexes. Presently, we hold an exclusive buffer lock for the duration of a couple of operations when checkUnique != UNIQUE_CHECK_NO. _bt_binsrch() is one such operation. The other one there, _bt_check_unique(), is likely to be a lot cheaper than _bt_binsrch() on average, I think, so I'm cautiously optimistic that it'll be noticeable. I better go and check it out. I did a relatively short variant 'insert' pgbench-tools benchmark, with a serial primary index created on the pgbench_history table. Results: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/insert/ Note that in the interest of increasing the signal to noise ratio, I used unlogged tables. These are considerably shorter two minute runs, of inserts/writes, but even still it's interesting to compare it to my original 'select' benchmark. Checkpoint settings were very high. This looks to be about a wash. I'm not that surprised, because this is all about memory bandwidth, not lock contention. Even with the lock contention on the rightmost btree page, we top out at about the same TPS level as the 'select' benchmark (at least compared to Postgres master). However, we top out at that level much sooner here (growth is much steeper), because the contention is mostly on the same one or two btree leaf pages, which are well cached by CPU L1 cache. Writes are actually quite a bit faster than uniformly-distributed reads, even with the exclusive buffer locking (and lock contention) necessitated by writing. You might even say that my patch makes the first benchmark look more like this one (a less awkward comparison could probably be made between what I've done for uniform distribution read workloads, and a read workload with far more uneven data access, which will naturally have fewer cache misses and less TLB contention). 
I strongly suspect I wouldn't have seen a significant number of cache misses when binary searching on btree pages if I was to instrument this workload. It might still be worth pursuing the SERIAL hinting mechanism, but it's not going to be nearly as big a win as what I've done for reads, I think. It might also be worth looking at a workload with uniformly distributed btree index tuple insertions, where I'd expect similar advantages to those originally shown for reads. -- Peter Geoghegan
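For readers not familiar with _bt_binsrch(): it is essentially a lower-bound binary search over the sorted items on a btree page, and the random probes into the item array are where the cache misses Peter measured come from. A toy version using plain integers (the real routine compares index tuples through the scankey machinery):

```c
/*
 * Find the offset of the first element >= key in a sorted array.
 * Each halving step probes a new, possibly uncached location in the
 * array, which is why page-sized binary searches are sensitive to
 * cache and TLB behavior.
 */
static int
binsrch_lower_bound(const int *items, int nitems, int key)
{
    int low = 0;
    int high = nitems;

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (items[mid] < key)
            low = mid + 1;
        else
            high = mid;
    }
    return low;
}
```

On a rightmost leaf page fed by a SERIAL key, the search repeatedly lands on the same few cache lines, which matches the observation that insert throughput tops out from lock contention rather than memory bandwidth.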
Re: [HACKERS] Proposal: variant of regclass
On 2013-12-04 20:25:53 -0500, Tom Lane wrote: Tatsuo Ishii is...@sraoss.co.jp writes: I would like to add a variant of regclass, which is exactly the same as the current regclass except it does not raise an error when the target table is not found. Instead it returns InvalidOid (0). I've sometimes thought we should just make all the reg* input converters act that way. It's not terribly consistent that they'll happily take numeric inputs that don't correspond to any existing OID. And more often than not, I've found the throw-an-error behavior to be annoying rather than helpful. I find that to be a bit of a scary change. I have seen applications check for the existence of tables using the error thrown by ::regclass. Now, they could change that to check for IS NULL, which would be better for them performance-wise, but the likelihood they will notice in time seems small. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
2013/12/5 Andres Freund and...@2ndquadrant.com On 2013-12-04 20:25:53 -0500, Tom Lane wrote: Tatsuo Ishii is...@sraoss.co.jp writes: I would like to add a variant of regclass, which is exactly the same as the current regclass except it does not raise an error when the target table is not found. Instead it returns InvalidOid (0). I've sometimes thought we should just make all the reg* input converters act that way. It's not terribly consistent that they'll happily take numeric inputs that don't correspond to any existing OID. And more often than not, I've found the throw-an-error behavior to be annoying rather than helpful. I find that to be a bit of a scary change. I have seen applications check for the existence of tables using the error thrown by ::regclass. Now, they could change that to check for IS NULL, which would be better for them performance-wise, but the likelihood they will notice in time seems small. This change can break some applications, but personally I like it. We can introduce some assert polymorphic function, CREATE OR REPLACE FUNCTION notnull(any, message text) RETURNS any, that can be used for checks inside SQL. Regards Pavel Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
On 2013-12-05 11:54:20 +0100, Pavel Stehule wrote: 2013/12/5 Andres Freund and...@2ndquadrant.com We can introduce some assert polymorphic function CREATE OR REPLACE FUNCTION notnull(any, message text) RETURNS any, that can be used for check inside SQL Uh. How is that going to help applications that upgraded, without having noticed a pretty obscure notice in the release notes? If this were day one, I would agree we should go that way, but today... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 5 December 2013 01:55, Peter Eisentraut pete...@gmx.net wrote: On Thu, 2013-11-14 at 12:11 +0530, Amit Kapila wrote: If an application wants to allow these connection parameters to be used, it would need to do PQenableStartServer() first. If it doesn't, those connection parameters will be rejected. Stupid idea: Would it work that we require an environment variable to be set before we allow the standalone_backend connection parameter? That's easy to do, easy to audit, and doesn't require any extra code in the individual clients. I like the idea... should it be in pg_hba.conf? Or should it be next to listen_addresses in postgresql.conf? hba might be less convenient but seems like the correct place. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
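Peter's environment-variable gate could look roughly like the sketch below. The variable name PG_ENABLE_STANDALONE and the helper are hypothetical, chosen purely for illustration; nothing here is from libpq:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decide whether the (hypothetical) standalone_backend connection
 * parameter may be honored: only if the administrator has explicitly
 * opted in through the environment.  The check is trivially auditable
 * and requires no per-client code, which is the appeal of the idea.
 */
static bool
standalone_param_allowed(void)
{
    const char *v = getenv("PG_ENABLE_STANDALONE");

    return v != NULL && strcmp(v, "") != 0;
}
```

Andres's objection still applies to this sketch: an environment variable set process-wide does not stop a library from forking a server without the application's knowledge.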
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 2013-12-04 20:55:08 -0500, Peter Eisentraut wrote: On Thu, 2013-11-14 at 12:11 +0530, Amit Kapila wrote: If an application wants to allow these connection parameters to be used, it would need to do PQenableStartServer() first. If it doesn't, those connection parameters will be rejected. Stupid idea: Would it work that we require an environment variable to be set before we allow the standalone_backend connection parameter? That's easy to do, easy to audit, and doesn't require any extra code in the individual clients. I still don't think it's ok to start forking in arbitrary applications without their knowledge. So I don't think that buys us enough. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
2013/12/5 Andres Freund and...@2ndquadrant.com On 2013-12-05 11:54:20 +0100, Pavel Stehule wrote: 2013/12/5 Andres Freund and...@2ndquadrant.com We can introduce some assert polymorphic function CREATE OR REPLACE FUNCTION notnull(any, message text) RETURNS any, that can be used for check inside SQL Uh. How is that going to help applications that upgraded, without having noticed a pretty obscure notice in the release notes? This function would not replace an obscure note in the release notes. On the other hand, it is better to throw out an impractically designed feature early than to keep it forever. If there were not so much aversion to a GUC, I would say that a GUC could be a solution for some time. But there is. Regards Pavel If this were day one, I would agree we should go that way, but today... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] better atomics - v0.2
On 2013-11-19 10:37:35 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: The only animal we have that doesn't support quiet inlines today is HP-UX/ac++, and I think - as in patch 1 in the series - we might be able to simply suppress the warning there. Or just not worry about it, if it's only a warning? So, my suggestion on that end is that we remove the requirement for quiet inline now and see if it that has any negative consequences on the buildfarm for a week or so. Imo that's a good idea regardless of us relying on inlining support. Does anyone have anything against that plan? If not, I'll prepare a patch. Or does the warning mean code bloat (lots of useless copies of the inline function)? After thinking on that for a bit, yes that's a possible consequence, but it's quite possible that it happens in cases where we don't get the warning too, so I don't think that argument has too much bearing as I don't recall a complaint about it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
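For context, "quiet inline" refers to whether a compiler stays silent about a static inline function that a translation unit includes but never uses. The conditional pattern at stake looks roughly like this; PG_USE_INLINE here merely stands in for whatever a configure probe would define, and is illustrative only:

```c
/*
 * Pretend a configure test determined the compiler supports quiet
 * static inlines (no unused-function warning per inclusion site).
 */
#define PG_USE_INLINE 1

#if PG_USE_INLINE
/* Header-style definition: every includer gets an inlinable copy. */
static inline int
add_one(int x)
{
    return x + 1;
}
#else
/* Fallback: declare only, and define once in a single .c file. */
extern int add_one(int x);
#endif
```

Dropping the requirement means always taking the first branch and tolerating (or suppressing) the warning on the one compiler that emits it.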
Re: [HACKERS] Get more from indices.
Kyotaro HORIGUCHI wrote: Thank you, but it seems to me too simplified. You made two major functional changes. Thank you for the comments! One is, you put the added code for get_relation_info() out of the block for the condition (info->relam == BTREE_AM_OID) (though amcanorder would be preferable..) Anyway, the reason for that placement is to guarantee that a 'full_ordered' index is always orderable. I believe the relation between them is not obvious, so your patch has an opportunity to make a wrong assumption for possible indexes which are unique but not orderable. Going your way, some additional work would be needed to judge whether an index is orderable or not when checking the extensibility of pathkeys. By checking the following condition in build_index_paths(), the updated version of the patch guarantees that the result of an index scan is ordered: index_is_ordered = (index->sortopfamily != NULL); Another is, you changed pathkeys expansion to be an all-or-nothing decision. While this change should simplify the code slightly, it also dismisses the opportunity for partially-extended pathkeys. Could you let me know the reason why you did so? At first I considered the partially-extended pathkey list made from query_pathkeys, as you proposed in the original versions of the patch. But I've started to doubt whether it's worth doing that, because the partially-extended pathkey list is merely one example: the original pathkey list can be partially extended in different ways, ie, ISTM the partially-extended pathkey list doesn't necessarily have optimality in anything significant. We might be able to partially extend the original pathkey list optimally in something significant, but that seems like useless complexity to me. So, I modified the patch to make the all-or-nothing decision. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Get more from indices.
I wrote: Kyotaro HORIGUCHI wrote: Another is, you changed pathkeys expansion to be an all-or-nothing decision. While this change should simplify the code slightly, it also dismisses the opportunity for partially-extended pathkeys. Could you let me know the reason why you did so? At first I considered the partially-extended pathkey list made from query_pathkeys, as you proposed in the original versions of the patch. But I've started to doubt whether it's worth doing that, because the partially-extended pathkey list is merely one example: the original pathkey list can be partially extended in different ways, ie, ISTM the partially-extended pathkey list doesn't necessarily have optimality in anything significant. We might be able to partially extend the original pathkey list optimally in something significant, but that seems like useless complexity to me. So, I modified the patch to make the all-or-nothing decision. Here I mean the optimality for use in merge joins. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] pg_ctl stop times out when it should respond quickly
From: Tom Lane t...@sss.pgh.pa.us If you're going to do a postmaster_is_alive check, why bother with repeated get_pgpid()? As I said yesterday, I removed get_pgpid() calls. I'll add this patch to 2014-1 commitfest this weekend if it is not committed until then. Regards MauMau pg_stop_fail_v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-04 18:48:44 -0500, Robert Haas wrote: * When a type narrower than Datum is stored in a Datum, we place it in the * low-order bits and are careful that the DatumGetXXX macro for it discards * the unused high-order bits (as opposed to, say, assuming they are zero). * This is needed to support old-style user-defined functions, since depending * on architecture and compiler, the return value of a function returning char * or short may contain garbage when called as if it returned Datum. And record_image_eq does a rather elaborate dance around here, calling the appropriate GET_x_BYTES macro depending on the type-width. If we can really count on the high-order bits to be zero, that's all completely unnecessary tomfoolery. I don't think we can get rid of that dance in record_image_eq - it very well could be used on records originally generated when those bits haven't been guaranteed to be zeroed. Other use cases where the appropriate DatumGetFoo() macros are used don't have a problem with that, since it's cleared again there, but in record_image_eq we can't rely on that. Or am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
Hello, Andres. You wrote: AF On 2013-12-04 20:25:53 -0500, Tom Lane wrote: Tatsuo Ishii is...@sraoss.co.jp writes: I would like to add a variant of regclass, which is exactly the same as the current regclass except it does not raise an error when the target table is not found. Instead it returns InvalidOid (0). I've sometimes thought we should just make all the reg* input converters act that way. It's not terribly consistent that they'll happily take numeric inputs that don't correspond to any existing OID. And more often than not, I've found the throw-an-error behavior to be annoying, not helpful. AF I find that to be a bit of a scary change. I have seen applications check AF for the existence of tables using the error thrown by ::regclass. Now, AF they could change that to check for IS NULL, which would be better for AF them performance-wise, but the likelihood they will notice in time seems AF small. I personally see two approaches: 1. Implement a GUC variable controlling this behaviour per session 2. Introduce new safe reg* variants, e.g. sregclass, sregtype, etc. AF Greetings, AF Andres Freund AF -- AF Andres Freund http://www.2ndQuadrant.com/ AF PostgreSQL Development, 24x7 Support, Training Services -- With best wishes, Pavel mailto:pa...@gf.microolap.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2013-12-02 02:39:55 -0500, Jaime Casanova wrote: === performance === I expected a performance regression with the module turned on because of the new XLOG records and writes of files in pg_committs, but the performance drop is excessive.

Master 437.835674 tps
Patch, guc off 436.4340982 tps
Patch, guc on 0.370524 tps

There clearly is something wrong. The additional amount of xlog records should be nearly unnoticeable because committs piggybacks on commit records. I started the server with the module off, then after some work turned the module on and restarted the server and ran a few benchmarks, then turned it off again, restarted the server and got a message like:

LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3112AE58
LOG: redo is not required
FATAL: cannot make new WAL entries during recovery
LOG: startup process (PID 24876) exited with exit code 1
LOG: aborting startup due to startup process failure

Alvaro: That's because of the location where you call StartupCommitts - a) it's called at the beginning of recovery if HS is enabled, b) it's called before StartupXLOG() does LocalSetXLogInsertAllowed(). So I think you need to split StartupCommitts into StartupCommitts() and TrimCommitts(), where only the latter does the trimming if committs is disabled. I also wonder if track_commit_timestamp should be tracked by the XLOG_PARAMETER_CHANGE stuff, so it's not disabled on the standby when it's enabled on the primary? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund and...@2ndquadrant.com wrote: Hi Robert, On 2013-12-04 23:32:27 -0500, Robert Haas wrote: But I'm also learning painfully that this kind of thing only goes so far. For example, I spent some time looking at what it would take to provide a dynamic shared memory equivalent of palloc/pfree, a facility that I feel fairly sure would attract a few fans. I have to admit, I don't fully see the point in a real palloc() equivalent for shared memory. If you're allocating shared memory in a granular fashion at somewhat high frequency you're doing something *wrong*. And the result is not going to be scalable. Why? Lots of people have written lots of programs that do just that. I agree it's overdone, but saying it should never happen seems like an over-rotation in the opposite direction. In any case, the application that led me to this was actually parallel internal sort. Let's suppose we had an implementation of parallel internal sort. You'd load all the tuples you want to sort into dynamic shared memory (just as we now load them into palloc'd chunks), start a bunch of workers, and the original backend and the newly-started workers would cooperate to sort your data. Win. It's fairly clear that, technically speaking, you do NOT need a real allocator for this. You can just shove the first tuple that you store into shared memory, say right up against the end of the region. Then you can place the next tuple immediately before it and the following one just before that, and so on. This is very simple to implement and also extremely memory-efficient, so superficially it's quite an appealing approach. But now let's suppose the input data was estimated to fit in work_mem but in the end did not, and therefore we need to instead do an external sort. That means we're going to start evicting tuples from memory and to make room for new tuples we read from the input stream. 
Well, now our strategy of tight-packing everything does not look so good, because we have no way of tracking which tuples we no longer need and reusing that space for new tuples. If external sort is not something we know how to do in parallel, we can potentially copy all of the tuples out of the dynamic shared memory segment and back to backend-private memory in palloc'd chunks, and then deallocate the dynamic shared memory segment... but that's potentially expensive if work_mem is large, and it temporarily uses double what we're allowed by work_mem. And suppose we want to do a parallel external sort with, say, the worker backend pushing tuples into a heap stored in the dynamic shared memory segment and other backends writing them out to tapes and removing them from the heap. You can't do that without some kind of space management, i.e. an allocator. This is one of those times where life really, really would be easier if we were using C++. Just have a small smart-pointer wrapper, and we wouldn't need to worry too much. Yes, a template class would be perfect here. I still have mixed feelings about the idea of same-address mappings. On the one hand, on 64-bit architectures, there's a huge amount of address space available. Assuming the OS does something even vaguely sensible in terms of laying out the text, heap, stack, and shared library mappings, there ought to be many petabytes of address space that never gets touched, and it's hard to see why we couldn't find some place in there to stick our stuff. It's actually quite a bit away from petabytes. Most x86-64 CPUs have 48 bits of virtual memory, with the OS eating up a fair bit of that (so it doesn't need to tear down TLBs in system calls). I think cross-platform you end up with something around 8TB, up to 64TB on other OSs. Now, there are some CPUs coming out with bigger virtual memory sizes, but it's going to be longer than I am willing to wait till we are there. Oh. Well, that's a shame. 
Any thoughts on what the least painful compromise is here? I think a reasonable route is having some kind of smart pointer at the C level that abstracts away the offset math and allows pointers to be used locally. Something like void *sptr_deref(sptr *); where the returned pointer can be used as long as it is purely in memory. And sptr_update(sptr *, void *); which allows an sptr to point elsewhere in the same segment. + lots of smarts Can you elaborate on this a little bit? I'm not sure I understand what you're getting at here. I continue to believe that a) using pointers in dynamically allocated segments is going to end up with lots of pain, and b) the pain from not having real pointers is manageable. Fair opinion, but I think we will certainly need to pass around memory offsets in some form for certain things. Even for purely internal parallel sort, the Tuplesort objects need to store the offset of the actual tuple within the segment. My first guess as to how to do that was to assume that sizeof(Size) == sizeof(void *) and just
Re: [HACKERS] [bug fix] pg_ctl fails with config-only directory
From: Amit Kapila amit.kapil...@gmail.com On Wed, Dec 4, 2013 at 7:57 PM, MauMau maumau...@gmail.com wrote: * Approach 1 When postgres starts, it removes Administrator privileges from its own process. But is this possible at all? Windows security API is complex and provides many functions. It seems difficult to understand them. I'm afraid it would take a long time to figure out the solution. Is there any good web page to look at? * Approach 2 Do not call check_root() on Windows when -C, --describe-config, or --single is specified when running postgres. This would be easy, and should not be dangerous in terms of security because attackers cannot get into the server process via network. Approach-2 has been discussed previously to resolve it and it doesn't seem to be a good way to handle it. Please refer link: http://www.postgresql.org/message-id/1339601668-sup-4...@alvh.no-ip.org You can go through that mail chain and see if there can be a better solution than Approach-2. Thanks for the info. I understand your feeling, but we need to be practical. I believe we should not leave a bug and inconvenience by worrying about theory too much. In addition to the config-only directory, the DBA with admin privs will naturally want to run postgres -C and postgres --describe-config, because they are useful and so described in the manual. I don't see any (at least big) risk in allowing postgres -C/--describe-config to run with admin privs. In addition, recent Windows versions help to secure the system by revoking admin privs with UAC, don't they? Disabling UAC is not recommended. I couldn't find a way to let postgres delete its token groups from its own primary access token. There doesn't seem to be a reasonably clean and good way. So I had to choose approach 2. Please find attached the patch. This simple and not-complex change worked well. I'd like to add this to 2014-1 commitfest this weekend unless a better approach is proposed. 
Regards MauMau config_dir_win.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical changeset generation v6.7
Hello, Will send the rebased version as soon as I've addressed your comments. Thank you. = 0001: - You assigned HeapTupleGetOid(tuple) to relid, which is read at several points but never modified. Nevertheless, you left HeapTupleGetOid not replaced there. I think 'relid' just for repeated reading has far smaller merit compared to the demerit of lowered readability. You'd better make them uniform either way. It's primarily to get the line lengths halfway under control. Mm. I'm afraid I couldn't catch your words; do you mean that IsSystemClass() or isTempNamespace() could change the NULL bitmap in the tuple? = 0002: - You are identifying the wal_level with the expression 'wal_level >= WAL_LEVEL_LOGICAL', but it seems somewhat improper. Hm. Why? It actually does no harm and is somewhat trifling, so I don't assert you should fix it. The reason for the comment is that greater values for wal_level are undefined at the moment, so strictly speaking, such values should be handled as invalid ones. Although there is a practice of avoiding loop overruns by comparing counters with an expression like (i < CEILING). For instance, I found a macro whose comment reads as follows, and I feel a bit uneasy with it :-) It's nothing more than that. | /* Do we need to WAL-log information required only for Hot Standby? */ | #define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_HOT_STANDBY) - RelationIsAccessibleInLogicalDecoding and RelationIsLogicallyLogged are identical in code. Together with the name ambiguity, this is quite confusing and a cause of future misuse between these macros, I suppose. Plus, the names seem too long. Hm, don't think they are equivalent, rather the contrary. Note one returns false if IsCatalogRelation(), the other true. Oops, I'm sorry. I understand they are not the same. Then I have other questions. The name of the first one, 'RelationIsAccessibleInLogicalDecoding', doesn't seem to represent what its comment says. 
| /* True if we need to log enough information to have access via | decoding snapshot. */ Making a macro name for this comment directly, I suppose it would be something like 'NeedsAdditionalInfoInLogicalDecoding' or, more directly, 'LogicalDecodingNeedsCids' or so.. - In heap_insert, the information conveyed in rdata[3] seems better placed in rdata[2] because of the scarcity of the additional information. Only XLOG_HEAP_CONTAINS_NEW_TUPLE seems to be needed. The same goes for heap_multi_insert and heap_update. Could you explain a bit more what you mean by that? The reason it's a separate rdata entry is that otherwise a full page write will remove the information. Sorry, I missed the comment 'so that an eventual FPW doesn't remove the tuple's data'. Although given the necessity of removal prevention, rewriting rdata[].buffer, which is required by design (correct?), with InvalidBuffer seems a bit outrageous to me and obfuscates its objective. Another mechanism would be preferable, I suppose. The most straightforward way to do that would be a new flag bit for XLogRecData, say, survive_fpw or something. - In heap_multi_insert, need_cids is referred to only once, so it might be better removed. It's accessed in a loop over potentially quite some items; that's why I moved it into an extra variable. Sorry for bothering you with comments beside the point.. But the scope of need_cids is narrower than it currently is. I think the definition should be moved into the block for 'if (needwal)'. - In heap_delete, at the point of adding the replica identity, as above, rdata[3] could be moved into rdata[2], making a new type like 'xl_heap_replident'. Hm. I don't think that'd be a good idea, because we'd then need special-case decoding code for deletes, because the wal format would be different for inserts/updates and deletes. Hmm. Although one common xl_heap_replident is impractical, splitting a logically single entity into two or more XLogRecDatas also seems not to be a good idea. 
- In heapam_xlog.h, the new type xl_heap_header_len is inadequate both in naming, which is confusing, and in construction, in which the header in xl_heap_header is no longer a header, plus the indecisive member name 't_len'.. The header bit in the name refers to the fact that it contains information about a HeapTuple's header, not that it's a header itself. Do you have a better suggestion than xl_heap_header_len? Sorry, I was confused while writing the comment; the order of members in xl_heap_header_len doesn't matter. I got the reason for the xl_heap_header_len and the whole xlog record image after re-reading the relevant code. The update record came to contain two variable-length data items with this patch, so the length of the tuple body cannot be calculated only from the whole record length and the header lengths. Considering the above, looking
Re: [HACKERS] Performance optimization of btree binary search
On 12/05/2013 07:30 AM, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: On Wed, 2013-12-04 at 20:27 -0500, Tom Lane wrote: Lazy people? I'm not in a hurry to drop it; it's not costing us much to just sit there, other than in this connection which we see how to fix. Actually, I think it probably costs a fair portion of extension authors when their initial code crashes because they forgot to declare all their functions V1. I think it might actually be more of a bother to lazy people than a benefit. Hm. We have heard one or two complaints like that, but not a huge number. It happens to me about 75% of the time when I write a new C function. These days, I usually realize it pretty quickly. I wonder how easy it would be to make the compiler produce a warning about it. Or issue a warning in PostgreSQL when you do CREATE FUNCTION and the C function appears to be a V0 function. I'm worried about breaking code that's been working since god-knows-when; but I will concede there's little evidence that there's very much of that out there either. I wouldn't mind just dropping the V0 support. It's not difficult to modify your function for V1 convention, so if there's still anyone out there using V0, it wouldn't be unreasonable to require them to update. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
Hi, Planned to look at this for a while... Not a detailed review, just some thoughts. I'll let what I read sink in and possibly comment later. On 2013-10-31 12:21:31 -0400, Robert Haas wrote: The attached patches attempt to rectify some of these problems. Well, I wouldn't call it problems. Just natural, incremental development ;) Patch #1, on-dsm-detach-v1.patch, adds the concept of on_dsm_detach hooks [snip] The part of this patch which I suppose will elicit some controversy is that I've had to rearrange on_shmem_exit a bit. It turns out that during shmem_exit, we do user-level cleanup, like aborting the transaction, first. We expect that will probably release all of our shared-memory resources. Hm. The API change of on_shmem_exit() is going to cause a fair bit of pain. There are a fair number of extensions out there using it and all would need to be littered by version dependent #ifdefs. What about providing a version of on_shmem_exit() that allows to specify the exit phase, and make on_shmem_exit() a backward compatible wrapper? FWIW, I find on_dsm_detach_cancel() to be a confusing name. I thought it might be a variant that is called in different circumstances than on_dsm_detach(). Maybe go with cancel_on_shmem_detach(), like with cancel_shmem_exit()? Patch #2, shm-toc-v1.patch, provides a facility for sizing a dynamic shared memory segment before creation, and for dividing it up into chunks after it's been created. It therefore serves a function quite similar to RequestAddinShmemSpace, except of course that there is only one main shared memory segment created at postmaster startup time, whereas new dynamic shared memory segments can come into existence on the fly; and it serves even more conspicuously the function of ShmemIndex, which enables backends to locate particular data structures within the shared memory segment. 
It is however quite a bit simpler than the ShmemIndex mechanism: we don't need the same level of extensibility here that we do for the main shared memory segment, because a new extension need not piggyback on an existing dynamic shared memory segment, but can create a whole segment of its own. Hm. A couple of questions/comments: * How do you imagine keys to be used/built? * Won't the sequential search over toc entries become problematic? * Some high level overview of how it's supposed to be used wouldn't hurt. * the estimator stuff doesn't seem to belong in the public header? Patch #3, shm-mq-v1.patch, is the heart of this series. It creates an infrastructure for sending and receiving messages of arbitrary length using ring buffers stored in shared memory (presumably dynamic shared memory, but hypothetically the main shared memory segment could be used). The API seems to assume it's in dsm tho? Queues are single-reader and single-writer; they use process latches to implement waiting for the queue to fill (in the case of the reader) or drain (in the case of the writer). Hm. Too bad, I'd hoped for single-reader, multiple-writer :P A non-blocking mode is also available for situations where other options might lead to deadlock. That's cool. I wonder if there's a way to discover, on the writing end, when new data can be sent. For reading there's a comment about waiting for the latch... Couple of questions: * Some introductory comments about the concurrency approach would be good. * shm_mq_attach() is talking about BackgroundWorkerHandles - is there a limitation that a bgworker has to be involved? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changes in Trigger Firing
I have finally been able to get the right set of files. I am going with the below approach:

1) Add a new column in pg_trigger called tgiscascaded
2) Change pg_trigger.h for this
3) Make changes in trigger.c to insert these values
4) Make corresponding changes in
   - reltrigger.h
   - parsenodes.h
5) Change gram.y to understand a new keyword, CASCADED, in the CREATE TRIGGER statement
   - The new option will be optional and will apply only to non-constraint triggers
   - If the option is specified, the trigger will CASCADE to child tables
6) I compiled the source code (modified with the above changes) and it
   - Added a new column in pg_trigger
   - The new column is able to derive its value from the CREATE TRIGGER statement based on whether CASCADED was specified or not
   - The value is true if the option was specified
   - The value is false if the option was not specified
7) Now I will work on the trigger firing mechanism with the below approach
   - Before firing triggers, check if it's an inherited table
   - First get CASCADED triggers for parents and fire them
   - Proceed as usual

Any suggestions on improving the approach or comments on the need for this feature are welcome. Regards Sameer
Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing
Andres Freund wrote: On 2013-12-03 19:55:40 -0300, Alvaro Herrera wrote: I added a new isolation spec to test this specific case, and noticed something that seems curious to me when that test is run in REPEATABLE READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a can't serialize due to concurrent update, but when the UPDATE is committed, FOR UPDATE works fine. Shouldn't it happen pretty much exactly the other way around? That's 247c76a989097f1b4ab6fae898f24e75aa27fc1b . Specifically the DidCommit() branch in test_lockmode_for_conflict(). You forgot something akin to /* locker has finished, all good to go */ if (!ISUPDATE_from_mxstatus(status)) return HeapTupleMayBeUpdated; So I did. Here are two patches, one to fix this issue, and the other to fix the issue above. I intend to apply these two to 9.3 and master, and then apply your freeze fix on top (which I'm cleaning up a bit -- will resend later.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 48b6f5eead73a41880dd311b7a55d2b88794ca3d Mon Sep 17 00:00:00 2001 From: Alvaro Herrera alvhe...@alvh.no-ip.org Date: Wed, 4 Dec 2013 17:20:14 -0300 Subject: [PATCH 1/4] Avoid resetting Xmax when it's a multi with an aborted update HeapTupleSatisfiesUpdate can very easily forget tuple locks while checking the contents of a multixact and finding it contains an aborted update, by setting the HEAP_XMAX_INVALID bit. This would lead to concurrent transactions not noticing any previous locks held by transactions that might still be running, and thus being able to acquire subsequent locks they wouldn't be normally able to acquire. This bug was introduced in commit 1ce150b7bb; backpatch this fix to 9.3, like that commit. This change reverts the change to the delete-abort-savept isolation test in 1ce150b7bb, because that behavior change was caused by this bug. Noticed by Andres Freund while investigating a different issue reported by Noah Misch. 
---
 src/backend/utils/time/tqual.c                 | 21
 .../isolation/expected/delete-abort-savept.out | 13 +---
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4d63b1c..f787f2c 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -789,13 +789,26 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
 			if (TransactionIdDidCommit(xmax))
 				return HeapTupleUpdated;
-			/* no member, even just a locker, alive anymore */
+			/*
+			 * By here, the update in the Xmax is either aborted or crashed, but
+			 * what about the other members?
+			 */
+			if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+			{
+				/*
+				 * There's no member, even just a locker, alive anymore, so we can
+				 * mark the Xmax as invalid.
+				 */
 				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
 							InvalidTransactionId);
-
-			/* it must have aborted or crashed */
-			return HeapTupleMayBeUpdated;
+				return HeapTupleMayBeUpdated;
+			}
+			else
+			{
+				/* There are lockers running */
+				return HeapTupleBeingUpdated;
+			}
 		}

 	if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
diff --git a/src/test/isolation/expected/delete-abort-savept.out b/src/test/isolation/expected/delete-abort-savept.out
index 5b8c444..3420cf4 100644
--- a/src/test/isolation/expected/delete-abort-savept.out
+++ b/src/test/isolation/expected/delete-abort-savept.out
@@ -23,11 +23,12 @@ key value
 step s1svp: SAVEPOINT f;
 step s1d: DELETE FROM foo;
 step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+step s1c: COMMIT;
+step s2l: <... completed>
 key value

 1 1
-step s1c: COMMIT;
 step s2c: COMMIT;

 starting permutation: s1l s1svp s1d s1r s2l s2c s1c
@@ -38,12 +39,8 @@ key value
 step s1svp: SAVEPOINT f;
 step s1d: DELETE FROM foo;
 step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
-key value
-
-1 1
-step s2c: COMMIT;
-step s1c: COMMIT;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+invalid permutation detected

 starting permutation: s1l s1svp s1d s2l s1r s1c s2c
 step s1l: SELECT * FROM foo FOR KEY SHARE;
--
1.7.10.4

From fbe641eeb7eb535c1cac422e3f5962928e3b0362 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Wed, 4 Dec 2013 17:26:07 -0300
Subject: [PATCH 2/4] Fix improper abort during update chain locking

In 247c76a98909, I added some code to do fine-grained checking of MultiXact status of locking/updating transactions when traversing an update chain. There was a thinko in that patch which would have the traversing abort, that is return HeapTupleUpdated, when the
Re: [HACKERS] Why we are going to have to go DirectIO
* Peter Geoghegan (p...@heroku.com) wrote: On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus j...@agliodbs.com wrote: But you know what? 2.6, overall, still performs better than any kernel in the 3.X series, at least for Postgres. What about the fseek() scalability issue? Not to mention that the 2.6 which I suspect you're referring to (RHEL) isn't exactly 2.6... Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing
Hi, On 2013-12-05 10:42:35 -0300, Alvaro Herrera wrote: I intend to apply these two to 9.3 and master, and then apply your freeze fix on top (which I'm cleaning up a bit -- will resend later.) I sure hope it gets cleaned up - it's an evening's hack, with a good glass of wine on top. Do you agree with the general direction? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changes in Trigger Firing
One scenario where I can foresee an issue is when someone uses the table name within the trigger function on which the trigger was fired. I am not sure how and what issue might crop up but this will be one of my test cases. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Hacker-RFC-Changes-in-Trigger-Firing-tp5781566p5781861.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Feature request: Logging SSL connections
Hello, we were really missing the information in our log files if (and which of) our users are using SSL during their connections. The attached patch is a very simple solution to this problem - it just tests if the ssl pointer in Port is null. If not, it adds SSL to the logfile, otherwise it adds NOSSL. Or have I been missing an existing way to reach the same? Regards, Andreas -- - __ Dr. Andreas Kunert / __/ / / / __/ HU-Berlin, ZE Rechenzentrum (CMS) / /_ / / / / __\\ www.hu-berlin.de/~kunert /___/ /_/_/_/ /___/ - --- postgresql-9.1.10/src/backend/utils/init/postinit.c 2013-10-08 05:13:47.0 +0200 +++ /usr/postgres/sources/postgresql-9.1.10/src/backend/utils/init/postinit.c 2013-12-05 14:21:53.180148568 +0100 @@ -225,9 +225,22 @@ (errmsg("replication connection authorized: user=%s", port->user_name))); else +#ifdef USE_SSL + if (port->ssl > 0) { +ereport(LOG, + (errmsg("connection authorized: user=%s database=%s SSL", +port->user_name, port->database_name))); + } + else { +ereport(LOG, + (errmsg("connection authorized: user=%s database=%s NOSSL", +port->user_name, port->database_name))); + } +#else ereport(LOG, (errmsg("connection authorized: user=%s database=%s", port->user_name, port->database_name))); +#endif } set_ps_display("startup", false); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 12/05/2013 06:32 AM, Robert Haas wrote: During development of the dynamic shared memory facility, Noah and I spent a lot of time arguing about whether it was practical to ensure that a dynamic shared memory segment got mapped at the same address in every backend that used it. My vote goes for not trying to map at same address. I don't see how you could do that reliably, and I don't see much need for it anyway. That said, it naturally depends on what you're going to use the dynamic shared memory facility for. It's the same problem I have with reviewing the already-committed DSM patch and the message queue patch. The patches look fine as far as they go, but I have the nagging feeling that there are a bunch of big patches coming up later that use the facilities, and I can't tell if the facilities are over-engineered for what's actually needed, or not sufficient. As a side-note, I've been thinking that we don't really need same-address mapping for shared_buffers either. Getting rid of it wouldn't buy us anything right now, but if we wanted e.g to make shared_buffers changeable without a restart, that would be useful. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: On 2013-12-04 18:48:44 -0500, Robert Haas wrote: And record_image_eq does a rather elaborate dance around here, calling the appropriate GET_x_BYTES macro depending on the type-width. If we can really count on the high-order bits to be zero, that's all completely unnecessary tomfoolery. I don't think we can get rid of that dance in record_image_eq - it very well could be used on records originally generated when those bits haven't been guaranteed to be zeroed. No, you're failing to think about the context here. A Datum is not an on-disk concept, only an in-memory one. In the case of record_image_eq, simplifying the comparison of by-value Datums is unconditionally safe as long as heap_deform_tuple is consistent about using the proper GetDatum macros, which it is. (So actually, that elaborate dance is a 100% waste of effort today, regardless of any changes we're discussing here.) The risk we take by simplifying comparisons in a more general context is that some function somewhere might've been sloppy about doing a native-type-to-Datum conversion on its result. In the case of V0 functions that risk is unavoidable except by adding some explicit cleanup code, but I'm a bit worried that somebody, particularly third-party code, might've sloppily written "return foo" in a V1 function when "return Int32GetDatum(foo)" would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. I would certainly regard this as a bug in that function, but nonetheless it's a hazard that we need to set against any performance improvement that we can buy this way. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 15:57:22 +0200, Heikki Linnakangas wrote: As a side-note, I've been thinking that we don't really need same-address mapping for shared_buffers either. Getting rid of it wouldn't buy us anything right now, but if we wanted e.g to make shared_buffers changeable without a restart, that would be useful. I doubt it's that easy to get rid of atm (at least in !EXEC_BACKEND), but if we ever want to properly support ASLR in EXEC_BACKEND environments, we might need to go there. The hacks Windows does around it are already quite ugly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-05 08:58:55 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I don't think we can get rid of that dance in record_image_eq - it very well could be used on records originally generated when those bits haven't been guaranteed to be zeroed. No, you're failing to think about the context here. Ah yes. I was completely forgetting that heap_form_tuple()/heap_fill_tuple() will properly take care to only use meaningful parts of the (to-be-)stored data, not random padding. Thanks. The risk we take by simplifying comparisons in a more general context is that some function somewhere might've been sloppy about doing a native-type-to-Datum conversion on its result. In the case of V0 functions that risk is unavoidable except by adding some explicit cleanup code, but I'm a bit worried that somebody, particularly third-party code, might've sloppily written "return foo" in a V1 function when "return Int32GetDatum(foo)" would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have allowed asserts checking against such cases by setting fcinfo->has_returned = true or such... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Regression tests failing if not launched on db regression
Michael Paquier michael.paqu...@gmail.com writes: It happens that the following regression tests are failing if they are run on a database not named regression: This does not seem like a bug to me, although maybe we'd better update the documentation to specify that you need to use a DB named regression. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [bug fix] pg_ctl always uses the same event source
Hello, I've removed a limitation regarding event log on Windows with the attached patch. I hesitate to admit this is a bug fix and want to regard this as an improvement, but maybe it's a bug fix from users' perspective. Actually, I received problem reports from some users. [Problem] pg_ctl always uses event source PostgreSQL to output messages to the event log. Some example messages are: server starting server started This causes the problems below: 1. Cannot distinguish PostgreSQL instances because they use the same event source. 2. If you use multiple installations of PostgreSQL on the same machine, whether they are the same or different versions, Windows event viewer reports an error "event source not found" in the following sequence. Other system administration software which deals with the event log may show other anomalies. 2-1 Install the first PostgreSQL (e.g. 9.3) into dir_1, register dir_1\lib\pgevent.dll for event source PostgreSQL, then create instance_1. 2-2 Install the second PostgreSQL (e.g. 9.2) into dir_2 as part of some packaged application, register dir_2\lib\pgevent.dll for event source PostgreSQL, then create instance_2. The application uses PostgreSQL as its internal data repository. It sets event_source = 'appname_db' in instance_2's postgresql.conf. 2-3 At this point, there's no problem. 2-4 Uninstall the second PostgreSQL. instance_2 is deleted. The packaged application installer may or may not unregister pgevent.dll with regsvr32.exe /u. Anyway, dir_2\bin\pgevent.dll is deleted. 2-5 instance_1 continues to run and output messages to event log. 2-6 When you view the messages with event viewer, it reports an error because the event source PostgreSQL and/or pgevent.dll was deleted in 2-4. [Fix] Make pg_ctl use event_source setting in postgresql.conf. This eliminates the need for every instance to use the event source PostgreSQL for its pg_ctl. 
Regards MauMau pg_ctl_eventsrc.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Backup throttling
On 12/02/2013 02:23 PM, Boszormenyi Zoltan wrote: Hi, I am reviewing your patch. Thanks. New version attached. * Does it follow the project coding guidelines? Yes. A nitpick: this else branch below might need brackets because there is also a comment in that branch: + /* The 'real data' starts now (header was ignored). */ + throttled_last = GetCurrentIntegerTimestamp(); + } + else + /* Disable throttling. */ + throttling_counter = -1; + Done. * Does it do what it says, correctly? Yes. Although it should be mentioned in the docs that rate limiting applies to walsenders individually, not globally. I tried this on a freshly created database: $ time pg_basebackup -D data2 -r 1M -X stream -h 127.0.0.1 real 0m26.508s user 0m0.054s sys 0m0.360s The source database had 3 WAL files in pg_xlog, one of them was also streamed. The final size of data2 was 43MB or 26MB without pg_xlog, i.e. without the -X stream option. The backup used 2 walsenders in parallel (one for WAL) which is a known feature. Right, if the method is 'stream', a separate WAL sender is used and the transfer is not limited: the client must keep up with the stream unconditionally. I added a note to documentation. But there's no reason not to throttle WAL files if the method is 'fetch'. It's fixed now. Another note. This chunk should be submitted separately as a comment bugfix: diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index c3c71b7..5736fd8 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -1288,7 +1288,7 @@ GetCurrentTimestamp(void) /* * GetCurrentIntegerTimestamp -- get the current operating system time as int64 * - * Result is the number of milliseconds since the Postgres epoch. If compiled + * Result is the number of microseconds since the Postgres epoch. If compiled * with --enable-integer-datetimes, this is identical to GetCurrentTimestamp(), * and is implemented as a macro. */ Will do. 
// Tony diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index c379df5..e878592 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -189,6 +189,26 @@ PostgreSQL documentation </varlistentry> <varlistentry> + <term><option>-r</option></term> + <term><option>--max-rate</option></term> + <listitem> + <para> +The maximum amount of data transferred from server per second. +The purpose is to limit impact of <application>pg_basebackup</application> +on a running master server. + </para> + <para> +This option always affects transfer of the data directory. Transfer of +WAL files is only affected if the collection method is <literal>fetch</literal>. + </para> + <para> +Suffixes <literal>k</literal> (kilobytes) and <literal>M</literal> +(megabytes) are accepted. For example: <literal>10M</literal> + </para> + </listitem> + </varlistentry> + + <varlistentry> <term><option>-R</option></term> <term><option>--write-recovery-conf</option></term> <listitem> diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ba8d173..f194746 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -33,6 +33,7 @@ #include "utils/builtins.h" #include "utils/elog.h" #include "utils/ps_status.h" +#include "utils/timestamp.h" #include "pgtar.h" typedef struct @@ -42,6 +43,7 @@ typedef struct bool fastcheckpoint; bool nowait; bool includewal; + uint32 maxrate; } basebackup_options; @@ -59,6 +61,7 @@ static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir); static void parse_basebackup_options(List *options, basebackup_options *opt); static void SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli); static int compareWalFileNames(const void *a, const void *b); +static void throttle(size_t increment); /* Was the backup currently in-progress initiated in recovery mode? 
*/ static bool backup_started_in_recovery = false; @@ -68,6 +71,41 @@ static bool backup_started_in_recovery = false; */ #define TAR_SEND_SIZE 32768 + +/* + * The maximum amount of data per second - bounds of the user input. + * + * If the maximum should be increased to more than 4 GB, uint64 must + * be introduced for the related variables. However such high values have + * little to do with throttling. + */ +#define MAX_RATE_LOWER 32768 +#define MAX_RATE_UPPER (1024 << 20) + +/* + * Transfer rate is only measured when this number of bytes has been sent. + * (Too frequent checks would impose too high CPU overhead.) + * + * The default value is used unless it'd result in too frequent checks. + */ +#define THROTTLING_SAMPLE_MIN 32768 + +/* The maximum number of checks per
Re: [HACKERS] More legacy code: pg_ctl
On 11/18/13, 8:09 PM, Josh Berkus wrote: a) by default, it returns to the caller without waiting for postgres to actually start/stop/restart. In this mode, it also always returns success regardless of result. The reason for this is that until sometime recently (PQping) we didn't have a reliable way to check whether the server is up. This might be different now, but it would need to be verified. b) by default, it remains attached to the caller's tty for stdout and stderr, even after it has switched to the regular postgres log. Can pg_ctl tell when the postgres log has switched? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More legacy code: pg_ctl
On 11/18/13, 8:20 PM, Josh Berkus wrote: c) that stop defaults to smart mode, instead of fast mode. This has been discussed many times already, so you'd need to check the archives. (I'm not in favor of smart mode either.) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 07:44:27 -0500, Robert Haas wrote: On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund and...@2ndquadrant.com wrote: Hi Robert, On 2013-12-04 23:32:27 -0500, Robert Haas wrote: But I'm also learning painfully that this kind of thing only goes so far. For example, I spent some time looking at what it would take to provide a dynamic shared memory equivalent of palloc/pfree, a facility that I feel fairly sure would attract a few fans. I have to admit, I don't fully see the point in a real palloc() equivalent for shared memory. If you're allocating shared memory in a granular fashion at somewhat high frequency you're doing something *wrong*. And the result is not going to be scalable. Why? Lots of people have written lots of programs that do just that. Well, but we're a database, not a generic programming library ;) I agree it's overdone, but saying it should never happen seems like an over-rotation in the opposite direction. That might be true. But now let's suppose the input data was estimated to fit in work_mem but in the end did not, and therefore we need to instead do an external sort. That means we're going to start evicting tuples from memory and to make room for new tuples we read from the input stream. Well, now our strategy of tight-packing everything does not look so good, because we have no way of tracking which tuples we no longer need and reusing that space for new tuples. If external sort is not something we know how to do in parallel, we can potentially copy all of the tuples out of the dynamic shared memory segment and back to backend-private memory in palloc'd chunks, and then deallocate the dynamic shared memory segment... but that's potentially expensive if work_mem is large, and it temporarily uses double what we're allowed by work_mem. But what's your alternative if you have a shared_palloc() like thingy? 
The overhead is going to be crazy if you allocate so granular that you can easily grow, and if you allocate on a coarse level, this isn't going to buy you much? Essentially, I am not arguing against a facility to dynamically allocate memory from shared memory, but I think the tradeoffs are different enough that I think palloc()/mcxt.c isn't necessarily the thing to model it after. But then, I don't really have an idea how it should look like, not having thought about it much. And suppose we want to do a parallel external sort with, say, the worker backend pushing tuples into a heap stored in the dynamic shared memory segment and other backends writing them out to tapes and removing them from the heap. You can't do that without some kind of space management, i.e. an allocator. Hm. I'd have thought one would implement that by pushing fixed size amounts of tuples to individual workers, let them sort each chunk in memory and then write the result out to disk. Any thoughts on what the least painful compromise is here? I think a reasonable route is having some kind of smart-pointer on C level that abstracts away the offset math and allows to use pointers locally. Something like void *sptr_deref(sptr *); where the returned pointer can be used as long as it is purely in memory. And sptr_update(sptr *, void *); which allows a sptr to point elsewhere in the same segment. + lots of smarts Can you elaborate on this a little bit? I'm not sure I understand what you're getting at here. So, my thought is that we really want something that acts like a pointer, just in shared memory, so we don't have to do too much offset math. And I think for performance and readability we want to use pointers when operating locally in a backend. So, I guess there are situations in which we essentially want pointers: a) pointing somewhere in the same segment. b) pointing at an offset in *another* segment. 
So, for a) what if we, instead of storing plain offsets have something like: typedef struct sptr { Offset sptr_self; Offset sptr_offset; } sptr; and always store that in dynamically allocated shared memory instead of raw offsets. That allows to define a function like: void * sptr_deref(sptr *sp) { return ((char *)sp) - sp->sptr_self + sp->sptr_offset; } to get what we point to. Without having to reference the segment base, at the price of having to store two offsets. To update the pointer we could have: void sptr_set(sptr *sp, void *p) { char *segment_base; segment_base = ((char *)sp) - sp->sptr_self; /* make sure pointer doesn't point below the beginning of the segment */ Assert(segment_base <= p); /* make sure pointer doesn't point beyond the segment */ Assert(p < segment_base + GetSegmentSize(segment_base)); sp->sptr_offset = ((char *)p) - segment_base; } To initialize we'd have: void sptr_init(dsm_segment *segment, sptr *sp, void *p) { sp->sptr_self = ((char *)sp) - (char *)(segment->mapped_address); sptr_set(sp, p); } for b) we could have typedef struct sptr_far {
Re: [HACKERS] Feature request: Logging SSL connections
On 12/5/13, 8:53 AM, Dr. Andreas Kunert wrote: we were really missing the information in our log files if (and which of) our users are using SSL during their connections. The attached patch is a very simple solution to this problem - it just tests if the ssl pointer in Port is null. If no, it adds SSL to the logfile, otherwise it adds NOSSL. That seems useful. Do we need more information, like whether a client certificate was presented, or what ciphers were used? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa kondo.mitsum...@lab.ntt.co.jp wrote: Yes. And using something efficiently DirectIO is more difficult than BufferedIO. If we change write() flag with direct IO in PostgreSQL, it will execute hardest ugly randomIO. Using DirectIO presumes you're using libaio or threads to implement prefetching and asynchronous I/O scheduling. I think in the long term there are only two ways to go here. Either a) we use DirectIO and implement an I/O scheduler in Postgres or b) We use mmap and use new system calls to give the kernel all the information Postgres has available to it to control the I/O scheduler. (a) is by far the lower risk option as it's well trodden and doesn't depend on other projects to do anything. The most that would be valuable is if the kernel provided an interface to learn about the hardware properties such as the raid geometry and queue depth for different parts of the devices. (b) is the way more interesting research project though. I don't think anyone's tried it and the kernel interface to provide the kinds of information Postgres needs requires a lot of thought. If it's done right then Postgres wouldn't need a buffer cache manager at all. It would just mmap the entire database and tell the kernel when it's safe to flush buffers and let the kernel decide when based on when it's convenient for the hardware. I don't think it's tenable in the long run to have Postgres manage buffers that are then copied to another buffer in memory which are then flushed to disk based on another scheduler. That it works at all is a testament to the quality of the code in Postgres and Linux but it's implausibly inefficient. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
Pavel Golub pa...@microolap.com writes: I personally see two approaches: 1. Implement GUC variable controling this behaviour per session 2. Introduce new safe reg* variables, e.g. sregclass, sregtype etc. I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 15:44:34 +0100, Andres Freund wrote: On 2013-12-05 07:44:27 -0500, Robert Haas wrote: And then I thought, boy, it sucks not to be able to declare what kind of a thing we're pointing *at* here, but apart from using C++ I see no solution to that problem. I guess we could do something like this: #define relptr(type) Size So the compiler wouldn't enforce anything, but at least notationally we'd know what sort of object we were supposedly referencing. There might be some ugly compiler dependent magic we could do. Depending on how we decide to declare offsets. Like (very, very roughly) #define relptr(type, struct_name, varname) union struct_name##_##varname { \ type relptr_type; \ Offset relptr_off; } And then, for accessing have: #define relptr_access(seg, off) \ typeof(off.relptr_type)* (((char *)seg->base_address) + off.relptr_off) But boy, that's ugly. On second thought - there's probably no reason to name the union, making it somewhat less ugly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
On Thu, Dec 5, 2013 at 11:42 AM, Greg Stark st...@mit.edu wrote: (b) is the way more interesting research project though. I don't think anyone's tried it and the kernel interface to provide the kinds of information Postgres needs requires a lot of thought. If it's done right then Postgres wouldn't need a buffer cache manager at all. It would just mmap the entire database and tell the kernel when it's safe to flush buffers and let the kernel decide when based on when it's convenient for the hardware. That's a bad idea in the current state of affairs. MM files haven't been designed for that usage, and getting stable performance out of that will be way too difficult. systemd's journal is finding that out the hard way. It uses mmap too. Having the buffer manager mmap buffers into its shared address space, however, might be an interesting idea to pursue. However, one must not forget that the kernel has similar scalability issues when the number of memory mappings increase arbitrarily. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: variant of regclass
On 12/5/13, 9:41 AM, Tom Lane wrote: Pavel Golub pa...@microolap.com writes: I personally see two approaches: 1. Implement GUC variable controling this behaviour per session 2. Introduce new safe reg* variables, e.g. sregclass, sregtype etc. I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. We could invent some sneaky syntax variants, like 'pg_klass'::regclass errors, but '?pg_klass'::regclass does not. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] same-address mappings vs. relative pointers
On Thu, Dec 5, 2013 at 8:57 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 12/05/2013 06:32 AM, Robert Haas wrote: During development of the dynamic shared memory facility, Noah and I spent a lot of time arguing about whether it was practical to ensure that a dynamic shared memory segment got mapped at the same address in every backend that used it. My vote goes for not trying to map at same address. I don't see how you could do that reliably, and I don't see much need for it anyway. That said, it naturally depends on what you're going to use the dynamic shared memory facility for. It's the same problem I have with reviewing the already-committed DSM patch and the message queue patch. The patches look fine as far as they go, but I have the nagging feeling that there are a bunch of big patches coming up later that use the facilities, and I can't tell if the facilities are over-engineered for what's actually needed, or not sufficient. Sure, well, that's one of the challenges of doing any sort of large software development project. One rarely knows at the outset all of the problems that one will encounter before finishing said project. For small projects, you can usually predict these things pretty well, but as the scope of the project grows, it becomes more and more difficult to know whether you've made the right initial steps. That having been said, I'm pretty confident in the steps taken thus far - but if you're imagining that I have all the answers completely worked out and am choosing to reveal them only bit by bit, it's not like that. If you want to see the overall vision for this project, see https://wiki.postgresql.org/wiki/Parallel_Sort . Here, I expect to use dynamic shared memory for three purposes. First, I expect to use a shared memory message queue to propagate any error or warning messages generated in the workers back to the user backend. 
That's the point of introducing that infrastructure now, though I hope it will eventually also be suitable for streaming tuples between backends, so that you can run one part of the query tree in one backend and stream the output to a different backend that picks it up and processes it further. We could surely contrive a simpler solution just for error messages, but I viewed that as short-sighted. Second, I expect to store the SortTuple array, or some analogue of it, in dynamic shared memory. Third, I expect to store the tuples themselves in dynamic shared memory. Down the road, I imagine wanting to put hash tables in shared memory, so we can parallelize things like hash joins and hash aggregates. As a side-note, I've been thinking that we don't really need same-address mapping for shared_buffers either. Getting rid of it wouldn't buy us anything right now, but if we wanted e.g to make shared_buffers changeable without a restart, that would be useful. Very true. One major obstacle to that is that changing the size of shared_buffers also means resizing the LWLock array and the buffer descriptor array. If we got rid of the idea of having lwlocks in their own data structure and moved the buffer lwlocks into the buffer descriptors, that would get us down to two segments, but that still feels like one too many. There's also the problem of the buffer mapping hash table, which would need to grow somehow as well. Technically, the size of the fsync absorb queue also depends on shared_buffers, but we could decide not to care about that, I think. The other problem here is that once you do implement all this, a reference to a buffer beyond what your backend has mapped will necessitate an unmap and remap of the shared-buffers segment. If the remap fails, and you hold any buffer pins, you will have to PANIC. 
There could be some performance overhead from inserting bounds checks in all the right places too, although there might not be enough places to matter, since a too-high buffer number can only come from a limited number of places - either a lookup in the buffer mapping table, or a buffer allocation event. I don't mention any of these things to discourage you from working on the problem, but rather because I've thought about it too - and the aforementioned problems are the things that have stumped me so far. If you've got ideas about how to solve them, or even better yet want to implement something, great! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: On 2013-12-05 08:58:55 -0500, Tom Lane wrote: I'm a bit worried that somebody, particularly third-party code, might've sloppily written return foo in a V1 function when return Int32GetDatum(foo) would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have allowed asserts checking against such cases by setting fcinfo->has_returned = true or such... [ shrug... ] PG_RETURN_DATUM has no practical way to verify that the given Datum was constructed safely, so I think we'd just be adding overhead with not much real safety gain. In practice, if we were to change Datum to be a signed type (intptr_t not uintptr_t), the most common cases would probably do the right thing anyway, ie an int or short return value would get promoted to Datum with sign-extension. regards, tom lane
Re: [HACKERS] Proposal: variant of regclass
Peter Eisentraut pete...@gmx.net writes: We could invent some sneaky syntax variants, like 'pg_klass'::regclass errors, but '?pg_klass'::regclass does not. Hmm ... cute idea, but shoehorning it into regoperator might be problematic. You'd have to pick a flag character that wasn't a valid operator character, which lets out '?' as well as a lot of the other mnemonically-reasonable choices. regards, tom lane
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-05 10:02:56 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-12-05 08:58:55 -0500, Tom Lane wrote: I'm a bit worried that somebody, particularly third-party code, might've sloppily written return foo in a V1 function when return Int32GetDatum(foo) would be correct. In that case, the resultant Datum might have not-per-spec high-order bits, and if it reaches the fast comparator without ever having been squeezed into a physical tuple, we've got a problem. Too bad V1 hasn't insisted on using PG_RETURN_* macros. That would have allowed asserts checking against such cases by setting fcinfo->has_returned = true or such... [ shrug... ] PG_RETURN_DATUM has no practical way to verify that the given Datum was constructed safely, so I think we'd just be adding overhead with not much real safety gain. I was thinking of doing it for assert only anyway. In practice, if we were to change Datum to be a signed type (intptr_t not uintptr_t), the most common cases would probably do the right thing anyway, ie an int or short return value would get promoted to Datum with sign-extension. I was actually thinking about making Datum (and some other types we have) structs or unions. Currently it's far, far too easy to mix them. We throw away pretty much all of the little typesafety C has by typedef'ing them to integral types with lots of autocasting behaviour. Making Datum a union with all the base-types inside would also get rid of us violating the standard's aliasing rules... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] same-address mappings vs. relative pointers
On Thu, Dec 5, 2013 at 9:44 AM, Andres Freund and...@2ndquadrant.com wrote: Why? Lots of people have written lots of programs that do just that. Well, but we're a database, not a generic programming library ;) I think we're arguably both. But what's your alternative if you have a shared_palloc() like thingy? The overhead is going to be crazy if you allocate so granular that you can easily grow, and if you allocate on a coarse level, this isn't going to buy you much? I don't know if you've noticed, but the overhead of palloc is pretty much insane for large memory contexts right now. Consider a bunch of 100-byte tuples that you want to sort, on a 64-bit machine. IIUC, we're going to round the request up to 128 bytes and then add 16 bytes of overhead for 2 pointers before the beginning of the chunk. That's 44% overhead. For small memory contexts like the ones that store syscache entries it's really hard to do any better, but there are much better algorithms (which I am investigating) for cases where you expect to allocate a large amount of memory. 44% overhead on a 1kB context is no big deal; on 1GB it's a big deal. But the question of how efficient palloc, or a hypothetical shared_palloc(), is, is really separate from whether or not we ought to have one. All allocators have some overhead, and people use them anyway because it drastically simplifies programming. And suppose we want to do a parallel external sort with, say, the worker backend pushing tuples into a heap stored in the dynamic shared memory segment and other backends writing them out to tapes and removing them from the heap. You can't do that without some kind of space management, i.e. an allocator. Hm. I'd have thought one would implement that by pushing fixed size amounts of tuples to individual workers, let them sort each chunk in memory and then write the result out to disk. I haven't really investigated external sort algorithms at this point, so it is possible that you are right. 
However, that sounds like the equivalent of constructing runs by loading a batch of tuples, quicksorting them, writing them out, and then moving on to the next batch, a modification to tuplesort.c which I have tried and found to be significantly inferior to the present algorithm. As Tom has been at pains to point out, our current algorithm produces much longer runs, especially if the input data has some intrinsic ordering, which is not an uncommon situation. So while I don't know for sure, I think it's probably unwise to assume that the algorithm we pick won't require the ability to remove tuples from the dsm incrementally, and I don't want the choice of algorithm to be constrained by overly-rigid architectural decisions from the outset. So, for a) what if we, instead of storing plain offsets, have something like:

typedef struct sptr
{
    Offset sptr_self;
    Offset sptr_offset;
} sptr;

I like the name Offset, but I'm firmly convinced that a relative pointer needs to be the same size as a regular pointer. 8 bytes for a pointer is already painfully large in some contexts; 16 is just too much. And then I thought, boy, it sucks not to be able to declare what kind of a thing we're pointing *at* here, but apart from using C++ I see no solution to that problem. I guess we could do something like this:

#define relptr(type) Size

So the compiler wouldn't enforce anything, but at least notationally we'd know what sort of object we were supposedly referencing. There might be some ugly compiler dependent magic we could do. Depending on how we decide to declare offsets. Like (very, very roughly)

#define relptr(type, struct_name, varname) \
    union struct_name##_##varname { \
        type relptr_type; \
        Offset relptr_off; \
    }

And then, for accessing, have:

#define relptr_access(seg, off) \
    ((typeof(off.relptr_type) *) (((char *) seg->base_address) + off.relptr_off))

But boy, that's ugly. It's clever, though, and probably worth further experimentation. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
Hello, My customers and colleagues sometimes (or often?) ask about the following message:

FATAL: the database system is starting up

This message is often output dozens of times during a failover or PITR. The users seem to be worried because the message level is FATAL and they don't know why such a severe message is output during a successful failover and recovery. I can't blame the users, because the message is merely a by-product of pg_ctl's internal ping. Similarly, the below message is output when I stop the standby server normally. Why FATAL as a result of a successful operation? I'm afraid DBAs are annoyed by these messages, as system administration software collects ERROR and more severe messages for daily monitoring.

FATAL: terminating walreceiver process due to administrator command

Shouldn't we lower the severity of those messages or avoid sending them to the server log? How about the following measures?

1. FATAL: the database system is starting up
2. FATAL: the database system is shutting down
3. FATAL: the database system is in recovery mode
4. FATAL: sorry, too many clients already

Report these as FATAL to the client because the client wants to know the reason, but don't output them to the server log because they are not necessary for DBAs (4 is subtle.)

5. FATAL: terminating walreceiver process due to administrator command
6. FATAL: terminating background worker "%s" due to administrator command

Don't output these to the server log. Why are they necessary? For troubleshooting purposes? If necessary, the severity should be LOG (but I wonder why other background processes are not reported...) To suppress server log output, I think we can do as follows. I guess ereport(FATAL) is still needed for easily handling both the client report and process termination.

log_min_messages = PANIC;
ereport(FATAL,
        (errcode(ERRCODE_CANNOT_CONNECT_NOW),
         errmsg("the database system is starting up")));

May I hear your opinions? 
Regards MauMau
Re: [HACKERS] Proposal: variant of regclass
On Thu, Dec 5, 2013 at 9:41 AM, Tom Lane t...@sss.pgh.pa.us wrote: Pavel Golub pa...@microolap.com writes: I personally see two approaches: 1. Implement a GUC variable controlling this behaviour per session 2. Introduce new safe reg* variables, e.g. sregclass, sregtype etc. I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. That seems like a pretty reasonable approach. I don't have a strong opinion on whether it's worth the backward-compatibility break that would ensue from just changing this outright. I admit that I've been annoyed by this behavior more than once, but I've also been beaten by customers enough times to know that backward-compatibility has value to other people in some cases where it does not have such value to me. Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] shared memory message queues
Sorry for my late reply. I checked the part-3 (shm-mq-v1.patc) portion as below. Your comments towards part-1 and part-2 are reasonable for me, so I have no argument on this portion. Even though shm_mq_create() expects the address to be aligned, there is no mechanism to ensure it. How about putting an Assert() here? shm_mq_send_bytes() has an Assert() to check the length to be added to mq_bytes_written. Is the MAXALIGN64() check right on a 32-bit architecture? The sendnow value might be Min(ringsize - used, nbytes - sent), and the ringsize came from mq->mq_ring_size being aligned using MAXALIGN(_DOWN), which has 32-bit width.

/*
 * Update count of bytes written, with alignment padding.  Note
 * that this will never actually insert any padding except at the
 * end of a run of bytes, because the buffer size is a multiple of
 * MAXIMUM_ALIGNOF, and each read is as well.
 */
Assert(sent == nbytes || sendnow == MAXALIGN64(sendnow));
shm_mq_inc_bytes_written(mq, MAXALIGN64(sendnow));

What will happen if the sender tries to send a large chunk that needs to be split into multiple sub-chunks and the receiver concurrently detaches itself from the queue during the writes by the sender? It seems to me the sender gets SHM_MQ_DETACHED and only the earlier half of the chunk remains on the queue, even though its total length was already in the message queue. It may eventually lead to an infinite loop on the receiver side if another receiver appears again later and reads the incomplete chunk. Is this a feasible scenario? If so, a solution might be to prohibit enqueuing anything without a receiver, and to reset the queue when a new receiver is attached. The source code comment in shm_mq_wait_internal() is a little bit misleading to me. It says nothing shall be written without attaching the peer process; however, we have no such checks as long as nsend is less than the queue size.

 * If handle != NULL, the queue can be read or written even before the
 * other process has attached.  We'll wait for it to do so if needed.
 * The handle must be for a background worker initialized with bgw_notify_pid
 * equal to our PID.

Right now, that's all I can comment on. I'll do follow-up code reading over the weekend. Thanks, 2013/11/20 Robert Haas robertmh...@gmail.com: On Tue, Nov 19, 2013 at 12:33 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: * on-dsm-detach-v2.patch It reminded me of the hook registration/invocation mechanism in apache/httpd. It defines five levels for invocation order (REALLY_FIRST, FIRST, MIDDLE, LAST, REALLY_LAST), but these are aliases of integer values; in other words, it just kicks the hooks in order of the priority value associated with each function pointer. This flexibility may make sense. We may want to control the order in which resources are released in a more fine-grained way in the future. For example, isn't it a problem if we have only two levels when something in a queue needs to be released earlier than the queue itself? I'm not 100% certain on this suggestion, because on_shmem_exit is a hook that does not host so many callbacks, thus extensions may implement their own way in the SHMEM_EXIT_EARLY or SHMEM_EXIT_LATE stage of course. I don't really see much point in adding more flexibility here until we need it, but I can imagine that we might someday need it, for reasons that are not now obvious. * shm-toc-v1.patch From my experience, it makes sense to put a magic number on the tail of the toc segment to detect shared-memory usage overruns. It helps to find bugs that are hard to find because of concurrent jobs. Probably, shm_toc_freespace() is a point to put the assertion. How many entries is shm_toc_lookup() assumed to chain? It assumes a linear search from the head of the shm_toc segment, and that is a prerequisite for lock-less reference, isn't it? Is it a problem if a shm_toc hosts many entries, like 100 or 1000? Or is that not an expected usage? 
It is not an expected usage. In typical usage, I expect that the number of TOC entries will be about N+K, where K is a small constant (< 10) and N is the number of cooperating parallel workers. It's possible that we'll someday be in a position to leverage 100 or 1000 parallel workers on the same task, but I don't expect it to be soon. And, actually, I doubt that revising the data structure would pay off at N=100. At N=1000, maybe. At N=10000, probably. But we are *definitely* not going to need that kind of scale any time soon, and I don't think it makes sense to design a complex data structure to handle that case when there are so many more basic problems that need to be solved before we can go there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: I was actually thinking about making Datum (and some other types we have) structs or unions. Currently it's far, far too easy to mix them. We throw away pretty much all of the little typesafety C has by typedef'ing them to integral types with lots of autocasting behaviour. That's intentional; on many ABIs, making Datum a struct would be catastrophic performance-wise because it would not be eligible for simple register pass or return conventions. In any case, the number of bugs I can remember that such a thing would've prevented is negligible. regards, tom lane
Re: [HACKERS] Proposal: variant of regclass
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 5, 2013 at 9:41 AM, Tom Lane t...@sss.pgh.pa.us wrote: I don't think new types are a good idea. If we are afraid to change the behavior of the input converters, what we should do is introduce new functions, eg toregclass(text) returns regclass. That seems like a pretty reasonable approach. I don't have a strong opinion on whether it's worth the backward-compatibility break that would ensue from just changing this outright. I admit that I've been annoyed by this behavior more than once, but I've also been beaten by customers enough times to know that backward-compatibility has value to other people in some cases where it does not have such value to me. I'm getting less enamored of just-change-the-input-behavior myself. The case that occurred to me is, suppose somebody's got a table containing a regclass or regproc column, and he dumps and reloads it. If the input converter silently replaces unknown names by 0, he's at risk of unexpected data loss, if the reload is done before he's created all the referenced objects. Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. Yeah, I was thinking of that too. regards, tom lane
Re: [HACKERS] Why we are going to have to go DirectIO
On Thu, Dec 5, 2013 at 2:54 PM, Claudio Freire klaussfre...@gmail.com wrote: That's a bad idea in the current state of affairs. MM files haven't been designed for that usage, and getting stable performance out of that will be way too difficult. I'm talking about long-term goals here. Either of these two routes would require whole new kernel interfaces to work effectively. Without those new kernel interfaces our current approach is possibly the best we can get. I think the way to use mmap would be to mmap very large chunks, possibly whole tables. We would need some way to control page flushes that doesn't involve splitting mappings and can be efficiently controlled without having the kernel storing arbitrarily large tags on page tables or searching through all the page tables to mark pages flushable. -- greg
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/5/13, 10:25 AM, MauMau wrote: Report these as FATAL to the client because the client wants to know the reason. But don't output them to server log because they are not necessary for DBAs Yeah, this is part of a more general problem, which you have characterized correctly: What is fatal (or error, or warning, ...) to the client isn't necessarily fatal (or error, or warning, ...) to the server or DBA. Fixing this would need a larger enhancement of the logging infrastructure. It's been discussed before, but it's a bit of work.
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
MauMau maumau...@gmail.com writes: Shouldn't we lower the severity of those messages or avoid sending them to the server log? No. They are FATAL so far as the individual session is concerned. Possibly some documentation effort is needed here, but I don't think any change in the code behavior would be an improvement. 1. FATAL: the database system is starting up 2. FATAL: the database system is shutting down 3. FATAL: the database system is in recovery mode 4. FATAL: sorry, too many clients already Report these as FATAL to the client because the client wants to know the reason. But don't output them to server log because they are not necessary for DBAs (4 is subtle.) The notion that a DBA should not be allowed to find out how often #4 is happening is insane. regards, tom lane
Re: [HACKERS] Parallel Select query performance and shared buffers
You could try my lwlock-scalability improvement patches - for some workloads here, the improvements have been rather noticeable. Which version are you testing? I tried your patches from the link below. As you suspected, I didn't see any improvements. I tested it on PostgreSQL 9.2 Stable. http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/REL9_2_STABLE-rwlock-contention On Wed, Dec 4, 2013 at 8:26 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-12-04 20:19:55 +0200, Metin Doslu wrote: - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem disappeared for 8-core machines and came back with 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism? You could try my lwlock-scalability improvement patches - for some workloads here, the improvements have been rather noticeable. Which version are you testing? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] same-address mappings vs. relative pointers
On 2013-12-05 10:17:18 -0500, Robert Haas wrote: On Thu, Dec 5, 2013 at 9:44 AM, Andres Freund and...@2ndquadrant.com wrote: Why? Lots of people have written lots of programs that do just that. Well, but we're a database, not a generic programming library ;) I think we're arguably both. Fair enough. I don't know if you've noticed, but the overhead of palloc is pretty much insane for large memory contexts right now. Yea, I have more than noticed. I actually still have a partial replacement of aset.c lying around. It didn't even have 16 bytes of overhead, although that had required some ugly hacks when wanting to continue allowing several parallel context implementations. Btw, it's even worse if you have lots of allocations above ALLOC_CHUNK_LIMIT; we walk a linear list on reset, and we reset damn frequently - partially because we think it's cheap. Consider a bunch of 100-byte tuples that you want to sort, on a 64-bit machine. IIUC, we're going to round the request up to 128 bytes and then add 16 bytes of overhead for 2 pointers before the beginning of the chunk. I think we add the 16 bytes of overhead before rounding up, but I might misremember. That's 44% overhead. For small memory contexts like the ones that store syscache entries it's really hard to do any better, but there are much better algorithms (which I am investigating) for cases where you expect to allocate a large amount of memory. What I was working on was a slab allocator where, in addition to default sizes like the current ones, you could create specific slabs for certain sizes one knew would be frequently used, which then would a) be faster to allocate from, b) have lower space overhead, c) be guaranteed not to cause fragmentation. While I agree that the memory consumption is one concern, I am more worried about the time it eats up in many workloads. 
So, for a) what if we, instead of storing plain offsets, have something like:

typedef struct sptr
{
    Offset sptr_self;
    Offset sptr_offset;
} sptr;

I like the name Offset, but I'm firmly convinced that a relative pointer needs to be the same size as a regular pointer. 8 bytes for a pointer is already painfully large in some contexts; 16 is just too much. I don't find that all that convincing. In 90% of the cases only a few pointers will be needed; for those, something like the above will be helpful, allow bounds checking and all. For the 10% of other cases, we can still open-code stuff. Quite possibly we would want to avoid using 8 bytes there anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Parallel Select query performance and shared buffers
On 2013-12-05 17:46:44 +0200, Metin Doslu wrote: I tried your patches from the link below. As you suspected, I didn't see any improvements. I tested it on PostgreSQL 9.2 Stable. You tested the correct branch, right? Which commit does git rev-parse HEAD show? But generally, as long as your profile hides all the important information behind the hypervisor's cost, you're going to have a hard time analyzing the problems. You really should try to reproduce the problems on native hardware (as similar to the host hardware as possible), to get accurate data. On CPU bound workloads that information is often transportable to the virtual world. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 12/5/13, 6:07 AM, Simon Riggs wrote: On 5 December 2013 01:55, Peter Eisentraut pete...@gmx.net wrote: On Thu, 2013-11-14 at 12:11 +0530, Amit Kapila wrote: If an application wants to allow these connection parameters to be used, it would need to do PQenableStartServer() first. If it doesn't, those connection parameters will be rejected. Stupid idea: Would it work that we require an environment variable to be set before we allow the standalone_backend connection parameter? That's easy to do, easy to audit, and doesn't require any extra code in the individual clients. I like the idea... should it be in pg_hba.conf ? Or should it be next to listen_addresses in postgresql.conf? No, it's an environment variable.
Re: [HACKERS] Proposal: variant of regclass
Robert Haas robertmh...@gmail.com writes: Another advantage of this approach is that, IIUC, type input functions can't return a NULL value. So 'pg_klass'::regclass could return 0, but not NULL. On the other hand, toregclass('pg_klass') *could* return NULL, which seems conceptually cleaner. BTW, another arguable advantage of fixing this via new functions is that users could write equivalent (though no doubt slower) functions for use in pre-9.4 releases, and thus not need to maintain multiple versions of app code that relies on this behavior. regards, tom lane
Re: [HACKERS] Parallel Select query performance and shared buffers
You tested the correct branch, right? Which commit does git rev-parse HEAD show? I applied the last two patches manually on PostgreSQL 9.2 Stable.
Re: [HACKERS] Why we are going to have to go DirectIO
Greg Stark st...@mit.edu writes: I think the way to use mmap would be to mmap very large chunks, possibly whole tables. We would need some way to control page flushes that doesn't involve splitting mappings and can be efficiently controlled without having the kernel storing arbitrarily large tags on page tables or searching through all the page tables to mark pages flushable. I might be missing something, but AFAICS mmap's API is just fundamentally wrong for this. The kernel is allowed to write-back a modified mmap'd page to the underlying file at any time, and will do so if say it's under memory pressure. You can tell the kernel to sync now, but you can't tell it *not* to sync. I suppose you are thinking that some wart could be grafted onto that API to reverse that, but I wouldn't have a lot of confidence in it. Any VM bug that caused the kernel to sometimes write too soon would result in nigh unfindable data consistency hazards. regards, tom lane
Re: [HACKERS] Parallel Select query performance and shared buffers
From what I've seen so far, the bigger problem than contention in the lwlocks itself is the spinlock protecting the lwlocks... Postgres 9.3.1 also reports spindelay; it seems that there is no contention on spinlocks.

PID 21121 lwlock 0: shacq 0 exacq 33 blk 1 spindelay 0
PID 21121 lwlock 33: shacq 7602 exacq 14688 blk 4381 spindelay 0
PID 21121 lwlock 34: shacq 7826 exacq 15113 blk 3786 spindelay 0
PID 21121 lwlock 35: shacq 7792 exacq 15110 blk 3356 spindelay 0
PID 21121 lwlock 36: shacq 7803 exacq 15125 blk 3075 spindelay 0
PID 21121 lwlock 37: shacq 7822 exacq 15177 blk 2756 spindelay 0
PID 21121 lwlock 38: shacq 7694 exacq 14863 blk 2513 spindelay 0
PID 21121 lwlock 39: shacq 7914 exacq 15320 blk 2400 spindelay 0
PID 21121 lwlock 40: shacq 7855 exacq 15203 blk 2220 spindelay 0
PID 21121 lwlock 41: shacq 7942 exacq 15363 blk 1996 spindelay 0
PID 21121 lwlock 42: shacq 7828 exacq 15115 blk 1872 spindelay 0
PID 21121 lwlock 43: shacq 7820 exacq 15159 blk 1833 spindelay 0
PID 21121 lwlock 44: shacq 7709 exacq 14916 blk 1590 spindelay 0
PID 21121 lwlock 45: shacq 7831 exacq 15134 blk 1619 spindelay 0
PID 21121 lwlock 46: shacq 7744 exacq 14989 blk 1559 spindelay 0
PID 21121 lwlock 47: shacq 7808 exacq 15111 blk 1473 spindelay 0
PID 21121 lwlock 48: shacq 7729 exacq 14929 blk 1381 spindelay 0
Re: [HACKERS] Performance optimization of btree binary search
On 2013-12-05 10:34:16 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: I was actually thinking about making Datum (and some other types we have) structs or unions. Currently it's far, far too easy to mix them. We throw away pretty much all of the little typesafety C has by typedef'ing them to integral types with lots of autocasting behaviour. That's intentional; on many ABIs, making Datum a struct would be catastrophic performance-wise because it would not be eligible for simple register pass or return conventions. Unions should behave more sanely in that regard, though? And it'd be fairly easy to make it an optional thing. In any case, the number of bugs I can remember that such a thing would've prevented is negligible. Cases talked about upthread, where a plain datatype is returned as a Datum instead of using FooGetDatum() and the reverse, would be impossible. I don't think those are that infrequent? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Performance optimization of btree binary search
Andres Freund and...@2ndquadrant.com writes: On 2013-12-05 10:34:16 -0500, Tom Lane wrote: In any case, the number of bugs I can remember that such a thing would've prevented is negligible. Cases talked about upthread, where a plain datatype is returned as a Datum instead of using FooGetDatum() and the reverse, would be impossible. I don't think those are that infrequent? [ shrug... ] The performance changes we're talking about here would have the effect of making the compiler's implicit casts be the right thing anyway. In any case, I don't think you'd have accomplished much by forcing people to use FooGetDatum, unless you can force them to use the *right* FooGetDatum. The errors I can remember seeing in this area were more in the line of choosing the wrong macro. regards, tom lane
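The tradeoff being debated above can be sketched in a few lines of standalone C. This is a hypothetical illustration, not PostgreSQL's actual definitions: `PlainDatum`, `StructDatum`, and the conversion helpers are made-up names that only mirror the real `Datum`/`Int32GetDatum` pattern.

```c
#include <stdint.h>

/* With a plain typedef, any integral expression silently converts to a
 * Datum, so forgetting a conversion macro produces no diagnostic. */
typedef uintptr_t PlainDatum;

/* A single-member struct (or union) stays eligible for register
 * pass/return on most ABIs, but rejects implicit integer conversions:
 * "StructDatum d = 42;" would be a compile error. */
typedef struct
{
    uintptr_t   value;
} StructDatum;

static StructDatum
Int32GetStructDatum(int32_t i)
{
    StructDatum d;

    /* zero-extend through uint32_t, analogous to the real macro */
    d.value = (uintptr_t) (uint32_t) i;
    return d;
}

static int32_t
StructDatumGetInt32(StructDatum d)
{
    return (int32_t) (uint32_t) d.value;
}
```

Whether the struct form actually costs anything is exactly the ABI question Tom raises; the sketch only shows the typesafety side of the argument.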
Re: [HACKERS] Dynamic Shared Memory stuff
On 11/20/2013 09:58 PM, Robert Haas wrote: On Wed, Nov 20, 2013 at 8:32 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: How many allocations? What size will they have typically, minimum and maximum? The facility is intended to be general, so the answer could vary widely by application. The testing that I have done so far suggests that for message-passing, relatively small queue sizes (a few kB, perhaps 1 MB at the outside) should be sufficient. However, applications such as parallel sort could require vast amounts of shared memory. Consider a machine with 1TB of memory performing a 512GB internal sort. You're going to need 512GB of shared memory for that. Hmm. Those two use cases are quite different. For message-passing, you want a lot of small queues, but for parallel sort, you want one huge allocation. I wonder if we shouldn't even try a one-size-fits-all solution. For message-passing, there isn't much need to even use dynamic shared memory. You could just assign one fixed-sized, single-reader multiple-writer queue for each backend. For parallel sort, you'll want to utilize all the available memory and all CPUs for one huge sort. So all you really need is a single huge shared memory segment. If one process is already using that 512GB segment to do a sort, you do *not* want to allocate a second 512GB segment. You'll want to wait for the first operation to finish first. Or maybe you'll want to have 3-4 somewhat smaller segments in use at the same time, but not more than that. * As discussed in the Something fishy happening on frogmouth thread, I don't like the fact that the dynamic shared memory segments will be permanently leaked if you kill -9 postmaster and destroy the data directory. Your test elicited different behavior for the dsm code vs 
the main shared memory segment because it involved running a new postmaster with a different data directory but the same port number on the same machine, and expecting that that new - and completely unrelated - postmaster would clean up the resources left behind by the old, now-destroyed cluster. I tend to view that as a defect in your test case more than anything else, but as I suggested previously, we could potentially change the code to use something like 100 + (port * 100) with a forward search for the control segment identifier, instead of using a state file, mimicking the behavior of the main shared memory segment. I'm not sure we ever reached consensus on whether that was overall better than what we have now. I really think we need to do something about it. To use your earlier example of parallel sort, it's not acceptable to permanently leak a 512 GB segment on a system with 1 TB of RAM. One idea is to create the shared memory object with shm_open, and wait until all the worker processes that need it have attached to it. Then, shm_unlink() it, before using it for anything. That way the segment will be automatically released once all the processes close() it, or die. In particular, kill -9 will release it. (This is a variant of my earlier idea to create a small number of anonymous shared memory file descriptors in postmaster startup with shm_open(), and pass them down to child processes with fork()). I think you could use that approach with SysV shared memory as well, by destroying the segment with shmctl(IPC_RMID) immediately after all processes have attached to it. - Heikki
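The create-attach-unlink lifecycle described above can be sketched with plain POSIX shm calls. This is an illustrative sketch, not code from any patch: the function and segment names are hypothetical, and a real implementation would coordinate the attach phase across processes before unlinking.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a POSIX shared memory object, map it, then shm_unlink() the
 * name before the segment is used.  The kernel reclaims the memory once
 * the last process unmaps it or dies, so even kill -9 cannot leak it. */
static void *
create_self_destructing_segment(const char *name, size_t size)
{
    void       *addr;
    int         fd;

    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t) size) < 0)
    {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    /* In the real scheme, all workers would shm_open() + mmap() the
     * name at this point, before it is unlinked. */
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    /* Remove the name: the segment now lives only as long as some
     * process keeps a mapping of it. */
    shm_unlink(name);

    return (addr == MAP_FAILED) ? NULL : addr;
}
```

The window between shm_open() and shm_unlink() is the remaining leak hazard, which is why the postmaster would want to keep that window short.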
Re: [HACKERS] Feature request: Logging SSL connections
On Thu, Dec 05, 2013 at 09:43:31AM -0500, Peter Eisentraut wrote: On 12/5/13, 8:53 AM, Dr. Andreas Kunert wrote: we were really missing the information in our log files if (and which of) our users are using SSL during their connections. The attached patch is a very simple solution to this problem - it just tests if the ssl pointer in Port is null. If not, it adds SSL to the logfile, otherwise it adds NOSSL. That seems useful. Do we need more information, like whether a client certificate was presented, or what ciphers were used? Yes, please show the ciphersuite and TLS version too. Andreas, you can use my recent \conninfo patch as a template: https://github.com/markokr/postgres/commit/7d1b27ac74643abd15007cc4ec0b56ba92b39d90 Please show the SSL status for walsender connections as well. It's quite important to know whether they are using SSL or not. But I think the 'bits' output is unnecessary, as the cipher strength is implied by the ciphersuite. Perhaps it can be removed from \conninfo too. -- marko
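The shape of the log line being discussed can be shown with a small standalone stand-in. The struct and function names here are hypothetical; in the actual patch under discussion the check is simply whether the `ssl` pointer in the backend's `Port` struct is NULL.

```c
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the backend's per-connection state. */
typedef struct
{
    void       *ssl;            /* non-NULL once a TLS handshake is done */
    const char *cipher;         /* negotiated ciphersuite name */
    const char *tls_version;    /* e.g. "TLSv1.2" */
} MockPort;

/* Build the connection-log suffix suggested in the thread: SSL status
 * plus ciphersuite and TLS version, with no separate 'bits' field since
 * the ciphersuite already implies the cipher strength. */
static void
format_conn_log(const MockPort *port, char *buf, size_t buflen)
{
    if (port->ssl != NULL)
        snprintf(buf, buflen, "SSL (%s, %s)",
                 port->tls_version ? port->tls_version : "unknown",
                 port->cipher ? port->cipher : "unknown");
    else
        snprintf(buf, buflen, "NOSSL");
}
```

Applying the same formatting to walsender connection log lines would cover the replication case Marko raises.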
[HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Tom Lane-2 wrote: MauMau <maumau307@> writes: Shouldn't we lower the severity of, or avoid sending, those messages to the server log? No. They are FATAL so far as the individual session is concerned. Possibly some documentation effort is needed here, but I don't think any change in the code behavior would be an improvement.

1. FATAL: the database system is starting up
2. FATAL: the database system is shutting down
3. FATAL: the database system is in recovery mode
4. FATAL: sorry, too many clients already

Report these as FATAL to the client because the client wants to know the reason. But don't output them to the server log because they are not necessary for DBAs (4 is subtle.) The notion that a DBA should not be allowed to find out how often #4 is happening is insane. Agreed, #4 is definitely DBA territory. ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate.

FATAL: [LOGON] too many clients already

I'd make the category output disabled by default for a long while, then eventually enabled by default, but leave the ability to disable it. Calls that do not supply a category get [N/A] output in category mode. David J.
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
I think this proposal is a bit deadlocked now.

- There are technical concerns about launching a server executable from within a client.
- There are conceptual concerns about promoting an embedded database mode.

On the other hand:

- Everyone would like to have a way to use psql (and other basic clients) in stand-alone mode.

The compromise would be to not launch the server from within the client, but have client and server communicate over external mechanisms (e.g., a Unix-domain socket). The concern about that was that it would open up standalone mode to accidental third-party connections. While there are some ways around that (socket in a private directory), they are not easy and not portable. So standalone mode would become less robust and reliable overall. The only solutions I see are:

1. do nothing
2. do everything (i.e., existing terminal mode plus socket mode plus embedded mode), letting the user work out the differences

Pick one. ;-)
Re: [HACKERS] logical changeset generation v6.7
Hi, On 2013-12-05 22:03:51 +0900, Kyotaro HORIGUCHI wrote: - You assigned HeapTupleGetOid(tuple) into relid, read in several points but with no modification. Nevertheless, you left HeapTupleGetOid not replaced there. I think 'relid' just for repeated reading has far smaller merit compared to the demerit of lowering readability. You'd be better off making them uniform either way. It's primarily to get the line lengths halfway under control. Mm. I'm afraid I couldn't catch your words; do you mean that IsSystemClass() or isTempNamespace() could change the NULL bitmap in the tuple? Huh? No. I just meant that the source code lines are longer if you use HeapTupleGetOid(tuple) instead of just relid. Anyway, that patch has since been committed... = 0002: - You are identifying the wal_level with the expr 'wal_level >= WAL_LEVEL_LOGICAL' but it seems somewhat improper. Hm. Why? It actually does no harm and is somewhat trifling, so I don't assert you should fix it. The reason for the comment is that the greater values for wal_level are undefined at the moment, so strictly speaking, such values should be handled as invalid ones. Note that other checks for wal_level are written the same way. Consider how much bigger this patch would be if every usage of wal_level would need to get changed because a new level had been added. - RelationIsAccessibleInLogicalDecoding and RelationIsLogicallyLogged are identical in code. Together with the name ambiguity, this is quite confusing and a cause of future misuse between these macros, I suppose. Plus the names seem too long. Hm, don't think they are equivalent, rather the contrary. Note one returns false if IsCatalogRelation(), the other true. Oops, I'm sorry. I understand they are not the same. Then I have other questions. The name for the first one, 'RelationIsAccessibleInLogicalDecoding', doesn't seem to represent what its comment reads. | /* True if we need to log enough information to have access via | decoding snapshot. 
*/ Making the macro name for this comment directly, I suppose it would be something like 'NeedsAdditionalInfoInLogicalDecoding' or more directly 'LogicalDecodingNeedsCids' or so.. The comment talks about logging enough information that it is accessible - just as the name says. - In heap_insert, the information conveyed in rdata[3] seems better placed in rdata[2] because of the scarcity of the additional information. Only XLOG_HEAP_CONTAINS_NEW_TUPLE seems to be needed. The same goes for heap_multi_insert and heap_update. Could you explain a bit more what you mean by that? The reason it's a separate rdata entry is that otherwise a full page write will remove the information. Sorry, I missed the comment 'so that an eventual FPW doesn't remove the tuple's data'. Although, given the necessity of removal prevention, overwriting rdata[].buffer, which is required by design (correct?), with InvalidBuffer seems a bit outrageous to me and obfuscates the objective of it. Well, it's added in a separate rdata element. Just as in dozens of other places. - In heap_delete, at the point adding replica identity, same as the above, rdata[3] could be moved into rdata[2], making a new type like 'xl_heap_replident'. Hm. I don't think that'd be a good idea, because we'd then need special-case decoding code for deletes, because the WAL format would be different for inserts/updates and deletes. Hmm. Although one common xl_heap_replident is impractical, splitting a logically single entity into two or more XLogRecDatas also seems not to be a good idea. That's done everywhere. There's basically two reasons to use separate rdata elements: * the buffers are different * the data pointer is different The rdata chain elements don't survive in the WAL. Considering the above, looking at heap_xlog_insert(), you marked xlrec.flags with XLOG_HEAP_CONTAINS_NEW_TUPLE to signal the decoder that the record should have tuple data not removed by FPW. This is the same for the update record. So the redoer(?) 
also can distinguish whether the update record contains extra tuple data or not. But it doesn't know the length of the individual records, so knowing there are two doesn't help. On the other hand, the update record after the patch is longer by sizeof(uint16) regardless of whether 'tuple data' is attached or not. I don't know the value of the size in the WAL stream, but the smaller would be better, maybe. I think that'd make things too complicated without too much gain in comparison to the savings. # Of course, it doesn't matter if the object for OLD can # naturally be missing as English. Well, I think the context makes it clear enough. I'm not a native English speaker, as you know:-) Neither am I ;). undefining XLOG_HEAP_CONTAINS_OLD and using separate macros type 1 | if (xlrec->flags & XLOG_HEAP_CONTAINS_OLD_KEY ||
Re: [HACKERS] Extension Templates S03E11
On 12/1/13, 10:47 PM, Stephen Frost wrote: Having a management system for sets of objects is a *great* idea- and one which we already have through schemas. What we don't have is any kind of versioning system built-in or other metadata about it, nor do we have good tooling which leverages such a versioning or similar system. Extensions provide some of that metadata around schemas and object definitions, Schemas can't manage objects that are not in schemas, so that won't work. It would be great if we could take the dependency tracking mechanism in extensions and expose it separately. I would like to be able to say:

    START PACKAGE foo    -- bad name
    bunch of DDL
    STOP PACKAGE

use it, and later:

    DROP PACKAGE foo;

This mechanism already exists in extensions, but it's combined with a bunch of other things. Separating those things (and naming them separately) might clear a few things up.
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On 2013-12-05 11:39:29 -0500, Peter Eisentraut wrote: I think this proposal is a bit deadlocked now. - There are technical concerns about launching a server executable from within a client. - There are conceptual concerns about promoting an embedded database mode. On the other hand: - Everyone would like to have a way to use psql (and other basic clients) in stand-alone mode. The only solutions I see are: 1. do nothing 2. do everything (i.e., existing terminal mode plus socket mode plus embedded mode), letting the user work out the differences Pick one. ;-) 3) make it an explicit parameter, outside the database DSN, and let the clients contain a tiny bit of explicit code about it. There really aren't that many clients that can use such a mode sensibly. If we ever want to support a real embedded mode, much, much more than this is needed. I don't think we should let that stop us from improving single-user mode. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Proposal: variant of regclass
On 12/5/13, 10:08 AM, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: We could invent some sneaky syntax variants, like 'pg_klass'::regclass errors, but '?pg_klass'::regclass does not. Hmm ... cute idea, but shoehorning it into regoperator might be problematic. You'd have to pick a flag character that wasn't a valid operator character, which lets out '?' as well as a lot of the other mnemonically-reasonable choices. Well, you could pick any letter, I suppose.
Re: [HACKERS] Errors on missing pg_subtrans/ files with 9.3
On Thu, Nov 28, 2013 at 5:15 AM, Andres Freund and...@2ndquadrant.com wrote: Hi, Do you still have the core file around? If so could you 'p *ShmemVariableCache' and 'p *ControlFile'? So sorry, I didn't see this message until just today. Seems it was accidentally archived before hitting my eyeballs. I see that 9.3.2 was released today along with what appears to be some fixes regarding this and similar issues. Sorry if my missing this message held anything up. We still have the core file kicking around, so here's the output at any rate. (gdb) p *ShmemVariableCache $1 = {nextOid = 8036795, oidCount = 6974, nextXid = 10260628, oldestXid = 1673, xidVacLimit = 21673, xidWarnLimit = 2136485320, xidStopLimit = 2146485320, xidWrapLimit = 2147485320, oldestXidDB = 1, latestCompletedXid = 10260621} (gdb) p *ControlFile $2 = {system_identifier = 5942827484423487452, pg_control_version = 937, catalog_version_no = 201306121, state = DB_IN_PRODUCTION, time = 1385234938, checkPoint = 381771294048, prevCheckPoint = 381771293888, checkPointCopy = {redo = 381771293992, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = 1 '\001', nextXidEpoch = 0, nextXid = 10217377, nextOid = 8035577, nextMulti = 13448295, nextMultiOffset = 32161320, oldestXid = 1673, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1385234938, oldestActiveXid = 10217377}, unloggedLSN = 1, minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = 0 '\000', wal_level = 2, MaxConnections = 200, max_prepared_xacts = 0, max_locks_per_xact = 64, maxAlign = 8, floatFormat = 1234567, blcksz = 8192, relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32, toast_max_chunk_size = 1996, enableIntTimes = 1 '\001', float4ByVal = 1 '\001', float8ByVal = 1 '\001', data_checksum_version = 0, crc = 517534097} (gdb) Cheers? 
Re: [HACKERS] Time-Delayed Standbys
On Thu, Dec 5, 2013 at 7:57 AM, Simon Riggs si...@2ndquadrant.com wrote: On 5 December 2013 08:51, Magnus Hagander mag...@hagander.net wrote: Not recalling the older thread, but it seems it breaks on clock drift. I think we can fairly easily make that situation good enough. Just have IDENTIFY_SYSTEM return the current timestamp on the master, and refuse to start if the time difference is too great. Yes, that doesn't catch the case when the machines are in perfect sync when they start up and drift *later*, but it will catch the most common cases I bet. But I think that's good enough that we can accept the feature, given that *most* people will have ntp, and that it's a very useful feature for those people. But we could help people who run into it because of a simple config error.. Or maybe the suggested patch already does this, in which case ignore that part :) I think the very nature of *this* feature is that it doesn't *require* the clocks to be exactly in sync, even though that is the basis for measurement. The setting of this parameter for sane usage would be a minimum of 5 mins, but more likely 30 mins, 1 hour or more. In that case, a few seconds of drift either way makes no real difference to this feature. So IMHO, without prejudice to other features that may be more critically reliant on time synchronisation, we are OK to proceed with this specific feature. Hi all, I saw the comments from all of you. I'm a bit busy with some customer issues (it has been a crazy week), but I'll reply and/or fix your suggestions later. Thanks for all the reviews, and sorry for the delay in replying. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] Proposal: variant of regclass
On Dec 5, 2013, at 7:52 AM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, another arguable advantage of fixing this via new functions is that users could write equivalent (though no doubt slower) functions for use in pre-9.4 releases, and thus not need to maintain multiple versions of app code that relies on this behavior. +1 to this idea. Feels cleanest. David
Re: [HACKERS] Why we are going to have to go DirectIO
On 12/05/2013 07:40 AM, Greg Stark wrote: On Thu, Dec 5, 2013 at 2:54 PM, Claudio Freire klaussfre...@gmail.com wrote: That's a bad idea in the current state of affairs. MM files haven't been designed for that usage, and getting stable performance out of that will be way too difficult. I'm talking about long-term goals here. Either of these two routes would require whole new kernel interfaces to work effectively. Without those new kernel interfaces our current approach is possibly the best we can get. Well, in the long run we'll probably be using persistent RAM. And the geeks who manage that have already said that MMAP is a bad interface for persistent RAM. They haven't defined a good one, though. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Why we are going to have to go DirectIO
On 12/05/2013 05:48 AM, Stephen Frost wrote: * Peter Geoghegan (p...@heroku.com) wrote: On Wed, Dec 4, 2013 at 11:07 AM, Josh Berkus j...@agliodbs.com wrote: But you know what? 2.6, overall, still performs better than any kernel in the 3.X series, at least for Postgres. What about the fseek() scalability issue? Not to mention that the 2.6 which I suspect you're referring to (RHEL) isn't exactly 2.6.. Actually, I've been able to do 35K TPS on commodity hardware on Ubuntu 10.04. I have yet to get above 15K on any Ubuntu running a 3.X kernel. The CPU scheduling on 2.6 just seems to be far better tuned, aside from the IO issues; at 35K TPS, the CPU workload is evenly distributed across cores, whereas on 3.X it lurches from core to core like a drunk in a cathedral. However, the hardware is not identical, and this is on proprietary, not benchmark, workloads, which is why I haven't published anything. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Parallel Select query performance and shared buffers
On Thu, Dec 5, 2013 at 1:03 PM, Metin Doslu me...@citusdata.com wrote: From what I've seen so far, the bigger problem than contention in the lwlocks itself is the spinlock protecting the lwlocks... Postgres 9.3.1 also reports spindelay; it seems that there is no contention on spinlocks. Did you check hugepages?
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
* David Johnston (pol...@yahoo.com) wrote: ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate. I've wanted to do that and have even discussed it with folks in the past; the trick is finding enough round tuits, which is difficult when you start to look at the size of the task... Thanks, Stephen
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/05/2013 10:21 AM, Stephen Frost wrote: * David Johnston (pol...@yahoo.com) wrote: ISTM that instituting some level of categorization for messages would be helpful. Then logging and reporting frameworks would be able to identify and segregate the logs in whatever way they and the configuration deem appropriate. I've wanted to do that and have even discussed it with folks in the past; the trick is finding enough round tuits, which is difficult when you start to look at the size of the task... But ... if we set a firm policy on this, then we could gradually clean up the error messages piecemeal over the next couple of major versions. We could also make sure that any new features complied with the categorization policy. Right now, how to categorize errors is up to each individual patch author, which means that things are all over the place, and get worse with each new feature added. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Problem with displaying wide tables in psql
On Thu, Dec 5, 2013 at 1:09 AM, Sergey Muraviov sergey.k.murav...@gmail.com wrote: And my patch affects the row view only. To help us avoid forgetting about this patch, please add it here: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Regression tests failing if not launched on db regression
On Thu, Dec 5, 2013 at 9:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: It happens that the following regression tests are failing if they are run on a database not named regression: This does not seem like a bug to me, although maybe we'd better update the documentation to specify that you need to use a DB named regression. At the same time, supporting it might not cost anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Josh Berkus j...@agliodbs.com writes: On 12/05/2013 10:21 AM, Stephen Frost wrote: But ... if we set a firm policy on this, then we could gradually clean up the error messages piecemeal over the next couple of major versions. We could also make sure that any new features complied with the categorization policy. Right now, how to categorize errors is up to each individual patch author, which means that things are all over the place, and get worse with each new feature added. I don't think there's that much randomness in is-it-an-ERROR-or-not. What I believe Stephen is talking about is a classification that simply doesn't exist today, namely something around how likely is the message to be of interest to a DBA as opposed to the client application. We currently compose messages almost entirely with the client in mind, and that's as it should be. But we could use some new decoration that's more DBA-oriented to help decide what goes into the postmaster log. Before we could get very far we'd need a better understanding than we have of what cases a DBA might be interested in. To take the specific example that started this thread, there wouldn't be a lot of value IMO in a classification like connection failure messages. I think the OP is probably right that those are often uninteresting --- but as I mentioned, too many clients might become interesting if he's wondering whether he needs to enlarge max_connections. Or password failure cases might become interesting if he starts to suspect breakin attempts. So I'd want to see a design that credibly covers those sorts of needs before we put any large effort into code changes. regards, tom lane
Re: [HACKERS] Regression tests failing if not launched on db regression
Robert Haas robertmh...@gmail.com writes: On Thu, Dec 5, 2013 at 9:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: It happens that the following regression tests are failing if they are run on a database not named regression: This does not seem like a bug to me, although maybe we'd better update the documentation to specify that you need to use a DB named regression. At the same time, supporting it might not cost anything. Well, changing these specific tests today might not be terribly expensive, but what is it that prevents more such tests from being committed next week? Perhaps by somebody who feels current_database() should be included in code coverage, for example? More generally, we never have and never can promise that the regression tests pass regardless of environment. If you turn off enable_seqscan, for instance, you'll get a whole lot of not-terribly-exciting diffs. I see no particular benefit to promising that the name of the regression database isn't significant. regards, tom lane
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/05/2013 10:46 AM, Tom Lane wrote: Before we could get very far we'd need a better understanding than we have of what cases a DBA might be interested in. To take the specific example that started this thread, there wouldn't be a lot of value IMO in a classification like connection failure messages. I think the OP is probably right that those are often uninteresting --- but as I mentioned, too many clients might become interesting if he's wondering whether he needs to enlarge max_connections. Or password failure cases might become interesting if he starts to suspect breakin attempts. So I'd want to see a design that credibly covers those sorts of needs before we put any large effort into code changes. Heck, I'd be happy just to have a class of messages which specifically means "OMG, there's something wrong with the server", that is, a flag for messages which only occur when PostgreSQL encounters a bug, data corruption, or platform error. Right now, I have to suss those out by regex. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com