Re: [HACKERS] Minmax indexes

2013-09-27 Thread Amit Kapila
On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Amit Kapila wrote:
 On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:



   On Windows, patch gives below compilation errors:
   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
 constant expression

 I have fixed all these compile errors (fix attached).  Thanks for
 reporting them.  I'll post a new version shortly.

   Thanks for fixing it. Over the last few days I have spent some time
reading about minmax-equivalent indexes in other databases (Netezza
and Oracle) and going through parts of your proposal. It's a fairly
big patch and needs much more time, but I would like to share the
findings and thoughts I have so far.

First, about the interface and use case: as far as I could tell,
other databases provide this index automatically rather than through a
separate CREATE INDEX command, perhaps because such an index is mainly
useful when the data is ordered, or distributed in a way that benefits
repeatedly executed queries. You have proposed it as a command, which
means the user needs to take care of it; I find that okay for a first
version, and later we could perhaps add optimisations so that it gets
created automatically.
For the page range, if I read correctly, you currently use a #define;
do you want to expose it to the user in some way, such as a GUC, or
keep it internal and assign the right value based on the performance
of different queries?
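
As an illustration of what a compile-time page-range setting implies, here is a
minimal standalone C sketch; the constant and function names are invented for
this example and are not taken from the patch.

#include <stdint.h>

typedef uint32_t BlockNumber;           /* stands in for PostgreSQL's BlockNumber */

#define PAGES_PER_RANGE 128             /* hypothetical compile-time default */

static BlockNumber
heap_block_to_range(BlockNumber heapBlk)
{
    /* range 0 summarizes blocks 0..127, range 1 blocks 128..255, and so on */
    return heapBlk / PAGES_PER_RANGE;
}

Whether this value is a #define, a GUC, or a per-index option only changes
where PAGES_PER_RANGE above comes from; the block-to-range mapping itself stays
the same.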

Operations on this index seem to be very fast: Oracle has it as an
in-memory structure, and I read that in Netezza write operations don't
carry any significant overhead for zone maps compared to other
indexes, so shouldn't we consider leaving it un-WAL-logged?
OTOH, because these structures get created automatically in those
databases that might be okay there, but if we provide it as a command,
the user might be surprised not to find it after a server restart.

Few Questions and observations:
1.
+ When a new heap tuple is inserted in a summarized page range, it is possible
+ to compare the existing index tuple with the new heap tuple.  If the heap
+ tuple is outside the minimum/maximum boundaries given by the index tuple for
+ any indexed column (or if the new heap tuple contains null values but the
+ index tuple indicates there are no nulls), it is necessary to create a new
+ index tuple with the new values.  To do this, a new index tuple is inserted,
+ and the reverse range map is updated to point to it.  The old index tuple is
+ left in place, for later garbage collection.


Is there a reason why we can't update the value in place rather than
inserting a new index tuple? As I understand it, for other indexes
like btree we do this because we might need to roll back, but here,
even if a rollback happens after updating the min or max value, it
will not cause any harm (no tuple loss).

2.
+ If the reverse range map points to an invalid TID, the corresponding page
+ range is not summarized.

3.
It might be better to mention when the range map will point to an
invalid TID; it's not explained in your proposal, yet it is used there
to explain some other things.

4.
"Range reverse map" is good terminology, but wouldn't "range translation
map" be better? I don't mind either way; it's just a thought that came
to mind while understanding the concept of the range reverse map.

5.
/*
 * As above, except that instead of scanning the complete heap, only the given
 * range is scanned.  Scan to end-of-rel can be signalled by passing
 * InvalidBlockNumber as end block number.
 */
double
IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
BlockNumber start_blockno,
BlockNumber numblocks,
IndexBuildCallback callback,
void *callback_state)

In the comment you mention an end block number; which parameter does
it refer to? I could see only start_blockno and numblocks.

6.
Currently you are passing 0 as the start block and InvalidBlockNumber
as the number of blocks; what is the logic behind that?
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
  indexInfo, allow_sync,
  0, InvalidBlockNumber,
  callback, callback_state);

7.
In mmbuildCallback, a tuple is added to the minmax index only if it
falls within the current page range; otherwise this can waste a lot of
the scan if the page range is large (1280 pages, as you mentioned in
one of your mails). Why can't we include it at the end of the scan?
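
To make the question concrete, here is a hedged, self-contained C sketch of the
accumulate-and-flush idea being asked about; none of these names come from the
patch, and plain int values stand in for Datums. The callback widens the
running min/max for the current range and emits one summary tuple only when the
heap scan crosses into the next page range.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define PAGES_PER_RANGE 128

/* stand-in for writing one summary index tuple for a page range */
static void
emit_summary_tuple(BlockNumber range, int min_val, int max_val)
{
    printf("range %u: min=%d max=%d\n", (unsigned) range, min_val, max_val);
}

typedef struct
{
    BlockNumber cur_range;      /* range currently being accumulated */
    int         min_val;
    int         max_val;
    bool        empty;          /* nothing accumulated yet */
} RangeSummaryState;

/* called once per indexed heap value, in heap block order */
static void
summary_build_callback(BlockNumber heapBlk, int value, RangeSummaryState *state)
{
    BlockNumber range = heapBlk / PAGES_PER_RANGE;

    if (!state->empty && range != state->cur_range)
    {
        /* crossed a range boundary: emit the finished summary */
        emit_summary_tuple(state->cur_range, state->min_val, state->max_val);
        state->empty = true;
    }

    if (state->empty)
    {
        state->cur_range = range;
        state->min_val = state->max_val = value;
        state->empty = false;
    }
    else
    {
        if (value < state->min_val)
            state->min_val = value;
        if (value > state->max_val)
            state->max_val = value;
    }
}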


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2013-09-27 Thread Amit Kapila
On Fri, Sep 27, 2013 at 11:49 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Amit Kapila wrote:
 On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:



   On Windows, patch gives below compilation errors:
   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
 constant expression

 I have fixed all these compile errors (fix attached).  Thanks for
 reporting them.  I'll post a new version shortly.

Thanks for fixing it. Over the last few days I have spent some time
 reading about minmax-equivalent indexes in other databases (Netezza
 and Oracle) and going through parts of your proposal. It's a fairly
 big patch and needs much more time, but I would like to share the
 findings and thoughts I have so far.

 First, about the interface and use case: as far as I could tell,
 other databases provide this index automatically rather than through a
 separate CREATE INDEX command, perhaps because such an index is mainly
 useful when the data is ordered, or distributed in a way that benefits
 repeatedly executed queries. You have proposed it as a command, which
 means the user needs to take care of it; I find that okay for a first
 version, and later we could perhaps add optimisations so that it gets
 created automatically.
 For the page range, if I read correctly, you currently use a #define;
 do you want to expose it to the user in some way, such as a GUC, or
 keep it internal and assign the right value based on the performance
 of different queries?

 Operations on this index seem to be very fast: Oracle has it as an
 in-memory structure, and I read that in Netezza write operations don't
 carry any significant overhead for zone maps compared to other
 indexes, so shouldn't we consider leaving it un-WAL-logged?
 OTOH, because these structures get created automatically in those
 databases that might be okay there, but if we provide it as a command,
 the user might be surprised not to find it after a server restart.

 Few Questions and observations:
 1.
 + When a new heap tuple is inserted in a summarized page range, it is possible
 + to compare the existing index tuple with the new heap tuple.  If the heap
 + tuple is outside the minimum/maximum boundaries given by the index tuple for
 + any indexed column (or if the new heap tuple contains null values but the
 + index tuple indicates there are no nulls), it is necessary to create a new
 + index tuple with the new values.  To do this, a new index tuple is inserted,
 + and the reverse range map is updated to point to it.  The old index tuple is
 + left in place, for later garbage collection.


 Is there a reason why we can't update the value in place rather than
 inserting a new index tuple? As I understand it, for other indexes
 like btree we do this because we might need to roll back, but here,
 even if a rollback happens after updating the min or max value, it
 will not cause any harm (no tuple loss).

 2.
 + If the reverse range map points to an invalid TID, the corresponding page
 + range is not summarized.

 3.
 It might be better to mention when the range map will point to an
 invalid TID; it's not explained in your proposal, yet it is used there
 to explain some other things.

 4.
 "Range reverse map" is good terminology, but wouldn't "range translation
 map" be better? I don't mind either way; it's just a thought that came
 to mind while understanding the concept of the range reverse map.

 5.
 /*
  * As above, except that instead of scanning the complete heap, only the given
  * range is scanned.  Scan to end-of-rel can be signalled by passing
  * InvalidBlockNumber as end block number.
  */
 double
 IndexBuildHeapRangeScan(Relation heapRelation,
 Relation indexRelation,
 IndexInfo *indexInfo,
 bool allow_sync,
 BlockNumber start_blockno,
 BlockNumber numblocks,
 IndexBuildCallback callback,
 void *callback_state)

 In the comment you mention an end block number; which parameter does
 it refer to? I could see only start_blockno and numblocks.

 6.
 Currently you are passing 0 as the start block and InvalidBlockNumber
 as the number of blocks; what is the logic behind that?
 return IndexBuildHeapRangeScan(heapRelation, indexRelation,
   indexInfo, allow_sync,
   0, InvalidBlockNumber,
   callback, callback_state);

I got it; I think here it means to scan all the pages.

 7.
 In mmbuildCallback, a tuple is added to the minmax index only if it
 falls within the current page range; otherwise this can waste a lot of
 the scan if the page range is large (1280 pages, as you mentioned in
 one of your mails). Why can't we include it at the end of the scan?


 With Regards,
 Amit Kapila.
 EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix

2013-09-27 Thread David Rowley
On Fri, Sep 27, 2013 at 10:28 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Sep 26, 2013 at 5:18 PM, Robert Haas robertmh...@gmail.com
 wrote:
  On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com
 wrote:
  I think I must have forgot to save it before I emailed it...
 
  Here's the correct version.
 
  Ah ha.  Looks better.
 
  I'm working on getting this committed now.  Aside from some stylistic
  things, I've found one serious functional bug, which is that you need
   to test padding != 0, not padding > 0, when deciding which path to
  take.  No need to send a new patch, I've already fixed it in my local
  copy...


Oops, negative padding. My bad. I was focusing too much on the benchmarks I
think.


 Committed now.  Let me know if I broke anything.


Great, thanks for committing!
Thank you Albe for your review too!

Regards

David.


 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] pgbench filler columns

2013-09-27 Thread Pavan Deolasee
On Thu, Sep 26, 2013 at 7:20 PM, Noah Misch n...@leadboat.com wrote:

 On Thu, Sep 26, 2013 at 03:23:30PM +0530, Pavan Deolasee wrote:

  Should we just fix the comment and say its applicable for all tables
 except
  accounts ?

 Please do.


How about something like this? Patch attached.

Thanks,
Pavan


-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


pgbench_filler_column_notes.patch
Description: Binary data



Re: [HACKERS] Some interesting news about Linux 3.12 OOM

2013-09-27 Thread Daniel Farina
On Wed, Sep 25, 2013 at 8:00 AM, Greg Stark st...@mit.edu wrote:

 On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina dan...@heroku.com wrote:

 Enable the memcg OOM killer only for user faults, where it's really the
 only option available.


 Is this really a big deal? I would expect most faults to be user faults.

 It's certainly a big deal that we need to ensure we can handle ENOMEM from
 syscalls and library functions we weren't expecting to return it. But I
 don't expect it to actually reduce the OOM killing sprees by much.

Hmm, I see what you mean.  I have been reading through the mechanism:
I got too excited about 'allocations by system calls', because I
thought that might mean brk and friends, except that's not much of an
allocation at all, just a reservation.  I think.

There is some interesting stuff coming in along with these patches in
bringing the user-space memcg OOM handlers up to snuff that may make
it profitable to issue SIGTERM to backends when a safety margin is
crossed (too bad the error messages will be confusing in that case).
I was rather hoping that a regular ENOMEM could be injected by this
mechanism the next time a syscall is touched (unknown), but I'm not
confident if this is made easier or not, one way or another.  One
could imagine the kernel injecting such a fault when the amount of
memory being consumed starts to look hairy, but I surmise part of the
impetus for userspace handling of that is to avoid getting into that
particular heuristics game.

Anyway, I did do some extensive study of cgroups and memcg's
implementation in particular and found it not really practical for
Postgres use unless one was happy with lots and lots of database
restarts, and this work still gives me some hope to try again, even if
smaller modifications still seem necessary.




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 01:55, Andres Freund wrote:

Hello,

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065


How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

- Heikki




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:
 On 27.09.2013 01:55, Andres Freund wrote:
 We have had several customers running postgres on bigger machines report
 problems on busy systems. Most recently one where a fully cached
 workload completely stalled in s_lock()s due to the *shared* lwlock
 acquisition in BufferAlloc() around the buffer partition lock.
 
 Increasing the padding to a full cacheline helps making the partitioning
 of the partition space actually effective (before it's essentially
 halved on a read-mostly workload), but that still leaves one with very
 hot spinlocks.
 
 So the goal is to have LWLockAcquire(LW_SHARED) never block unless
 somebody else holds an exclusive lock. To produce enough appetite for
 reading the rest of the long mail, here's some numbers on a
 pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
 
 master + padding: tps = 146904.451764
 master + padding + lwlock: tps = 590445.927065
 
 How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation, lots of the pages
with hot spinlocks were simply the same ones over and over again;
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario, since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.
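
For readers following along, a rough sketch (not copied from the tree) of why
this is so: the buffer tag's hash value picks one of NUM_BUFFER_PARTITIONS
mapping locks, so a handful of hot pages keep mapping to the same few locks no
matter how many partitions exist.

#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 16

static unsigned int
partition_for(uint32_t buftag_hash)
{
    /* the same hot page always hashes to the same partition lock */
    return buftag_hash % NUM_BUFFER_PARTITIONS;
}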

I can run a test on the 4 socket machine if it's unused, but on my 2
socket workstation, at least in our simulation of the original
workload, the improvements were marginal after increasing the
padding to a full cacheline.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 09:21:05 +0200, Andres Freund wrote:
  So the goal is to have LWLockAcquire(LW_SHARED) never block unless
  somebody else holds an exclusive lock. To produce enough appetite for
  reading the rest of the long mail, here's some numbers on a
  pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
  
  master + padding: tps = 146904.451764
  master + padding + lwlock: tps = 590445.927065
  
  How does that compare with simply increasing NUM_BUFFER_PARTITIONS?
 
 Heaps better. In the case causing this investigation lots of the pages
 with hot spinlocks were simply the same ones over and over again,
 partitioning the lockspace won't help much there.
 That's not exactly an uncommon scenario since often enough there's a
 small amount of data hit very frequently and lots more that is accessed
 only infrequently. E.g. recently inserted data and such tends to be very hot.
 
 I can run a test on the 4 socket machine if it's unused, but on my 2
 socket workstation, at least in our simulation of the original
 workload, the improvements were marginal after increasing the
 padding to a full cacheline.

Ok, was free:

padding + 16 partitions:
tps = 147884.648416

padding + 32 partitions:
tps = 141777.841125

padding + 64 partitions:
tps = 141561.539790

padding + 16 partitions + new lwlocks
tps = 601895.580903 (yeah, still reproduces after some sleep!)


Now, the other numbers were best-of-three and these aren't, but I think
it's pretty clear that you're not going to see the same benefits. I am
not surprised...
The current implementation of lwlocks will frequently block others, both
during acquisition and release of locks. What's even worse, fruitlessly
trying to acquire a spinlock will often prevent releasing it, because
we need the spinlock during release.
With the proposed algorithm, even if we need the spinlock during release
of an lwlock because there are queued PGPROCs, we will acquire that
spinlock only after already having released the lock...
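
For illustration, a minimal C sketch of the shared-acquire fast path being
described, written with GCC __sync builtins; this is a sketch of the idea, not
the patch's code, and the names are invented. The lock state is one word: a
high bit marks an exclusive holder, the low bits count shared holders.

#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_FLAG  ((uint32_t) 1 << 31)

typedef struct { volatile uint32_t state; } lw_lock_sketch;

static bool
lw_try_shared(lw_lock_sketch *lock)
{
    /* optimistic: bump the shared count without taking any spinlock */
    uint32_t old = __sync_fetch_and_add(&lock->state, 1);

    if (old & EXCLUSIVE_FLAG)
    {
        /* an exclusive holder beat us; undo and fall back to queueing */
        __sync_fetch_and_sub(&lock->state, 1);
        return false;
    }
    return true;                /* shared lock acquired without blocking */
}

static void
lw_release_shared(lw_lock_sketch *lock)
{
    __sync_fetch_and_sub(&lock->state, 1);
    /* only now, with the lock already released, would a spinlock be taken
     * to deal with queued waiters */
}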

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-27 Thread Sawada Masahiko
On Thu, Sep 26, 2013 at 8:54 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:

 On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao masao.fu...@gmail.com wrote:



 Hmm... when synchronous_transfer is set to data_flush,
 IMO the intuitive behaviors are

 (1) synchronous_commit = on
 A data flush should wait for the corresponding WAL to be
 flushed in the standby

 (2) synchronous_commit = remote_write
 A data flush should wait for the corresponding WAL to be
 written to OS in the standby.

 (3) synchronous_commit = local
 (4) synchronous_commit = off
 A data flush should wait for the corresponding WAL to be
 written locally in the master.


 I thought synchronous_commit and synchronous_transfer are kind of orthogonal
 to each other. synchronous_commit only controls whether and how to wait for
 the standby only when a transaction commits. synchronous_transfer OTOH tells
 how to interpret the standby listed in synchronous_standbys parameter. If
 set to commit then they are synchronous standbys (like today). If set to
 data_flush, they are asynchronous failback safe standby and if set to
 all then they are synchronous failback safe standbys. Well, its confusing
 :-(

 So IMHO in the current state of things, the synchronous_transfer GUC can not
 be changed at a session/transaction level since all backends, including
 background workers must honor the settings to guarantee failback safety.
 synchronous_commit still works the same way, but is ignored if
 synchronous_transfer is set to data_flush because that effectively tells
 us that the standbys listed under synchronous_standbys are really *async*
 standbys with failback safety.


Thank you for the comment. I think it is a good, simple idea.
In your opinion, if synchronous_transfer is set to 'all' and
synchronous_commit is set to 'on', the master waits for the data flush
even if the user sets synchronous_commit to 'local' or 'off'.
For example, when a user wants a transaction to complete quickly, the
user can't do that. Do we leave such a situation as a constraint?

Regards,

---
Sawada Masahiko




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 10:21, Andres Freund wrote:

Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:

On 27.09.2013 01:55, Andres Freund wrote:

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065


How does that compare with simply increasing NUM_BUFFER_PARTITIONS?


Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.


I see. So if only a few buffers are really hot, I'm assuming the problem 
isn't just the buffer partition lock, but also the spinlock on the 
buffer header, and the buffer content lwlock. Yeah, improving LWLocks 
would be a nice wholesale solution to that. I don't see any fundamental 
flaw in your algorithm. Nevertheless, I'm going to throw in a couple of 
other ideas:


* Keep a small 4-5 entry cache of buffer lookups in each backend of most 
recently accessed buffers. Before searching for a buffer in the 
SharedBufHash, check the local cache.


* To pin a buffer, use an atomic fetch-and-add instruction to increase 
the refcount. PinBuffer() also increases usage_count, but you could do 
that without holding a lock; it doesn't need to be accurate.
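
A minimal sketch of that pinning idea, again with GCC builtins and made-up
structure and field names (this is not bufmgr code): the refcount is bumped
with an atomic add, while the usage count is bumped with a plain unlocked
store because it only needs to be approximate.

#include <stdint.h>

#define MAX_USAGE_COUNT 5

typedef struct
{
    volatile uint32_t refcount;     /* pins held by backends */
    volatile uint32_t usage_count;  /* clock-sweep popularity, approximate */
} buf_header_sketch;

static void
pin_buffer(buf_header_sketch *buf)
{
    __sync_fetch_and_add(&buf->refcount, 1);    /* no buffer-header spinlock */

    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;                     /* racy on purpose; accuracy not required */
}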



One problem with your patch is going to be to make it also work without 
the CAS and fetch-and-add instructions. Those are probably present in 
all the architectures we support, but it'll take some effort to get the 
architecture-specific code done. Until it's all done, it would be good 
to be able to fall back to plain spinlocks, which we already have. Also, 
when someone ports PostgreSQL to a new architecture in the future, it 
would be helpful if you wouldn't need to write all the 
architecture-specific code immediately to get it to compile.


Did you benchmark your patch against the compare-and-swap patch I posted 
earlier? 
(http://www.postgresql.org/message-id/519a3587.80...@vmware.com). Just 
on a theoretical level, I would assume your patch to scale better since 
it uses fetch-and-add instead of compare-and-swap for acquiring a shared 
lock. But in practice it might be a wash.


- Heikki




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 09:57:07 +0200, Andres Freund wrote:
 On 2013-09-27 09:21:05 +0200, Andres Freund wrote:
   So the goal is to have LWLockAcquire(LW_SHARED) never block unless
   somebody else holds an exclusive lock. To produce enough appetite for
   reading the rest of the long mail, here's some numbers on a
   pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
   
   master + padding: tps = 146904.451764
   master + padding + lwlock: tps = 590445.927065
   
   How does that compare with simply increasing NUM_BUFFER_PARTITIONS?
  
  Heaps better. In the case causing this investigation lots of the pages
  with hot spinlocks were simply the same ones over and over again,
  partitioning the lockspace won't help much there.
  That's not exactly an uncommon scenario since often enough there's a
  small amount of data hit very frequently and lots more that is accessed
  only infrequently. E.g. recently inserted data and such tends to be very 
  hot.
  
  I can run a test on the 4 socket machine if it's unused, but on my 2
  socket workstation, at least in our simulation of the original
  workload, the improvements were marginal after increasing the
  padding to a full cacheline.
 
 Ok, was free:
 
 padding + 16 partitions:
 tps = 147884.648416
 
 padding + 32 partitions:
 tps = 141777.841125
 
 padding + 64 partitions:
 tps = 141561.539790
 
 padding + 16 partitions + new lwlocks
 tps = 601895.580903 (yeah, still reproduces after some sleep!)

Pgbench numbers for writes on the machine (fsync = off,
synchronous_commit = off):
padding + 16 partitions:
tps = 8903.532163
vs
padding + 16 partitions + new lwlocks
tps = 13324.080491

So, on bigger machines the advantages seem to be there for writes as
well...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-27 Thread Pavan Deolasee
On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko sawada.m...@gmail.comwrote:


 

 Thank you for the comment. I think it is a good, simple idea.
 In your opinion, if synchronous_transfer is set to 'all' and
 synchronous_commit is set to 'on', the master waits for the data flush
 even if the user sets synchronous_commit to 'local' or 'off'.
 For example, when a user wants a transaction to complete quickly, the
 user can't do that. Do we leave such a situation as a constraint?


No, user can still override the transaction commit point wait. So if

synchronous_transfer is set to all:
 - If synchronous_commit is ON - wait at all points
 - If synchronous_commit is OFF - wait only at buffer flush (and other
related to failback safety) points

synchronous_transfer is set to data_flush:
 - If synchronous_commit is either ON or OFF - do not wait at commit points,
but wait at all other points

synchronous_transfer is set to commit:
 - If synchronous_commit is ON - wait at commit point
 - If synchronous_commit is OFF - do not wait at any point

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 11:11:56 +0300, Heikki Linnakangas wrote:
 On 27.09.2013 10:21, Andres Freund wrote:
 Heaps better. In the case causing this investigation lots of the pages
 with hot spinlocks were simply the same ones over and over again,
 partitioning the lockspace won't help much there.
 That's not exactly an uncommon scenario since often enough there's a
 small amount of data hit very frequently and lots more that is accessed
 only infrequently. E.g. recently inserted data and such tends to be very hot.

 I see. So if only a few buffers are really hot, I'm assuming the problem
 isn't just the buffer partition lock, but also the spinlock on the buffer
 header, and the buffer content lwlock. Yeah, improving LWLocks would be a
 nice wholesale solution to that. I don't see any fundamental flaw in your
 algorithm. Nevertheless, I'm going to throw in a couple of other ideas:

 * Keep a small 4-5 entry cache of buffer lookups in each backend of most
 recently accessed buffers. Before searching for a buffer in the
 SharedBufHash, check the local cache.

I thought about that as well, but you'd either need to revalidate after
pinning the buffers, or keep them pinned.
I had a very hacky implementation of that, but it just made the buffer
content locks the top profile spots. Similar issue there.

It might be worthwhile to do this nonetheless - a lock xadd certainly
isn't cheap, even though it's cheaper than a full spinlock - but it's not
trivial.

 * To pin a buffer, use an atomic fetch-and-add instruction to increase the
 refcount. PinBuffer() also increases usage_count, but you could do that
 without holding a lock; it doesn't need to be accurate.

Yes, Pin/UnpinBuffer() are the primary contention points after this
patch. I want to tackle them, but that seems like a separate thing to
do.

I think we should be able to get rid of most or even all LockBufHdr()
calls by
a) introducing PinBufferInternal(), which increases pins but not the
usage count, using an atomic increment. That can replace locking the
header in many cases.
b) making PinBuffer() increment the pin and usage count using a single
64bit atomic add if available, and fitting flags in there as well. Then
back off the usage count if it's too high or even wraps around; that
doesn't hurt much, since we're pinned at that moment.

 One problem with your patch is going to be to make it also work without the
 CAS and fetch-and-add instructions. Those are probably present in all the
 architectures we support, but it'll take some effort to get the
 architecture-specific code done. Until it's all done, it would be good to be
 able to fall back to plain spinlocks, which we already have. Also, when
 someone ports PostgreSQL to a new architecture in the future, it would be
 helpful if you wouldn't need to write all the architecture-specific code
 immediately to get it to compile.

I think most recent compilers have intrinsics for operations like
that. I quite like the API provided by gcc for this kind of thing, and
I think we should model an internal wrapper API similarly. I don't see
any new platform coming that has a compiler without intrinsics.

But yes, you're right, I think we need a spinlock-based fallback for
now, even if it's just because none of us can verify the
implementations on the more obsolete platforms we claim to support. I
just didn't see it as a priority in the PoC.
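
A very rough illustration of that wrapper-plus-fallback idea: one
fetch-and-add-style function that maps to the compiler intrinsic where
available and to a spinlock-protected read-modify-write otherwise. The names,
the HAVE_GCC_ATOMICS test, and the spin_lock()/spin_unlock() prototypes are
placeholders, not a proposal for the actual API.

#include <stdint.h>

#ifdef HAVE_GCC_ATOMICS

static inline uint32_t
atomic_fetch_add_u32(volatile uint32_t *ptr, uint32_t add)
{
    return __sync_fetch_and_add(ptr, add);
}

#else   /* fallback: emulate the atomic op under a spinlock */

extern void spin_lock(volatile int *lock);      /* placeholder s_lock-style primitives */
extern void spin_unlock(volatile int *lock);

static volatile int atomic_emulation_lock;

static inline uint32_t
atomic_fetch_add_u32(volatile uint32_t *ptr, uint32_t add)
{
    uint32_t old;

    spin_lock(&atomic_emulation_lock);
    old = *ptr;
    *ptr = old + add;
    spin_unlock(&atomic_emulation_lock);
    return old;
}

#endif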

 Did you benchmark your patch against the compare-and-swap patch I posted
 earlier? (http://www.postgresql.org/message-id/519a3587.80...@vmware.com).
 Just on a theoretical level, I would assume your patch to scale better since
 it uses fetch-and-add instead of compare-and-swap for acquiring a shared
 lock. But in practice it might be a wash.

I've tried compare-and-swap for shared acquisition and it performed
worse, didn't try your patch though as you seemed to have concluded it's
a wash after doing the unlocked test.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch for fast gin cache performance improvement

2013-09-27 Thread Etsuro Fujita
I wrote:
 I had a look over this patch.  I think this patch is interesting and very
useful.
 Here are my review points:

 8. I think there are no issues in this patch.  However, I have one question:
 how does this patch work in the case where gin_fast_limit/fast_cache_size = 0?
 In this case, in my understanding, this patch inserts new entries into the
 pending list temporarily and immediately moves them to the main GIN data
 structure using ginInsertCleanup().  Am I right?  If so, that is obviously
 inefficient.

Sorry, those expressions were incorrect.  I meant gin_fast_limit > 0 and
fast_cache_size = 0.

Although I asked this question, I have reconsidered these parameters, and it
seems that they not only make the code rather complex but are also a little
confusing to users.  So I'd like to propose introducing only one parameter:
fast_cache_size.  Users that give weight to update performance (the fast
update technique) would set this parameter to a large value, while users that
give weight not only to update performance but also to search performance
would set it to a small value.  What do you think about this?

Thanks,

Best regards,
Etsuro Fujita






Re: [HACKERS] insert throw error when year field len > 4 for timestamptz datatype

2013-09-27 Thread Rushabh Lathia
Sorry for delay in reply.



On Tue, Sep 17, 2013 at 6:23 PM, Haribabu kommi
haribabu.ko...@huawei.comwrote:

  On Tue, 17 September 2013 14:33 Rushabh Lathia wrote:

 On Mon, Sep 16, 2013 at 7:22 PM, Haribabu kommi 
 haribabu.ko...@huawei.com wrote:

  On 14 August 2013 Rushabh Lathia wrote:

  postgres=# create table test ( a timestamptz);

 CREATE TABLE

  -- Date with year 1000

 postgres=#  insert into test values ( 'Sat Mar 11 23:58:48 1000 IST');

 INSERT 0 1

  -- Now try with year 1 it will return error

 postgres=#  insert into test values ( 'Sat Mar 11 23:58:48 1 IST');
 

 ERROR:  invalid input syntax for type timestamp with time zone: Sat
 Mar 11 23:58:48 1 IST 

 LINE 1: insert into test values ( 'Sat Mar 11 23:58:48 1 IST');

 Patch applies cleanly to HEAD. This patch tries to improve the insertion of
 dates whose year value is more than 4 digits in length.

 But it didn't handle all the ways to insert a year field of more than 4
 digits. Please check the following test.


  postgres=# insert into test values ('10001010 10:10:10 IST');

 INSERT 0 1

 postgres=# insert into test values ('100011010 10:10:10 IST');

 ERROR:  invalid input syntax for type timestamp with time zone:
 100011010 10:10:10 IST at character 26

 STATEMENT:  insert into test values ('100011010 10:10:10 IST');

 ERROR:  invalid input syntax for type timestamp with time zone:
 100011010 10:10:10 IST

 LINE 1: insert into test values ('100011010 10:10:10 IST');

 ^

  I feel it is better to provide the functionality of inserting a year
 field of more than 4 digits in all flows.


 +1. Nice catch.


 Here is the latest version of the patch, which handles the functionality in
 all flows.

 Could you test it and share your comments?


 I am getting some other failures with the updated patch also, please check
 the following tests.


 select date 'January 8, 19990';

 select timestamptz 'January 8, 199910 01:01:01 IST';

 INSERT INTO TIMESTAMPTZ_TST VALUES(4, '10001 SAT 8 MAR 10:10:10 IST');


 you can get the test scripts from regress test files of date.sql,
 timetz.sql, timestamp.sql and timestamptz.sql

 and modify according to the patch for verification.


 I feel changing the year value to accept a length (> 4) is not simple.

 In many places, a year field of length more than 4 is not considered.

 Search in the code with “” and correct all related paths.


Right, changing the year value to accept a length (> 4) is not simple, because
in many places a year length of more than 4 is not handled, and most of the
places assume that it will always be 4.

I tried to fix the issue in a couple more places, but I don't feel it is always
going to be safe to assume that we have covered all paths.

Still looking, and wondering whether we can make the change in one simple place
or find some other, smarter way to fix the issue.
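
For illustration only (this is not datetime.c code, and the helper name is
invented): one way a concatenated date field can be split from the right, so
that the day and month stay fixed-width while the year is allowed to be longer
than four digits, which is exactly the assumption the existing paths bake in.

#include <stdlib.h>
#include <string.h>

/* parse "YYYY...MMDD" into its parts; digit validation omitted for brevity */
static int
parse_concatenated_date(const char *field, int *year, int *month, int *day)
{
    size_t len = strlen(field);
    char   yearbuf[16];

    if (len < 6 || len - 4 >= sizeof(yearbuf))
        return -1;

    *day   = (field[len - 2] - '0') * 10 + (field[len - 1] - '0');
    *month = (field[len - 4] - '0') * 10 + (field[len - 3] - '0');

    memcpy(yearbuf, field, len - 4);
    yearbuf[len - 4] = '\0';
    *year = atoi(yearbuf);              /* year may be more than four digits */
    return 0;
}

With this kind of right-anchored split, '10001010' parses as year 1000 and
'100011010' as year 10001, matching the two test inserts shown earlier.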



 


 Regards,

 Hari babu.





-- 
Rushabh Lathia


Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-27 Thread Sawada Masahiko
On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:


 

 Thank you for the comment. I think it is a good, simple idea.
 In your opinion, if synchronous_transfer is set to 'all' and
 synchronous_commit is set to 'on', the master waits for the data flush
 even if the user sets synchronous_commit to 'local' or 'off'.
 For example, when a user wants a transaction to complete quickly, the
 user can't do that. Do we leave such a situation as a constraint?


 No, user can still override the transaction commit point wait. So if

 synchronous_transfer is set to all:
  - If synchronous_commit is ON - wait at all points
  - If synchronous_commit is OFF - wait only at buffer flush (and other
 related to failback safety) points

 synchronous_transfer is set to data_flush:
  - If synchronous_commit is either ON or OFF - do not wait at commit points,
 but wait at all other points

 synchronous_transfer is set to commit:
  - If synchronous_commit is ON - wait at commit point
  - If synchronous_commit is OFF - do not wait at any point


Thank you for the explanation. Understood.
If synchronous_transfer is set to 'all' and the user changes
synchronous_commit to 'off' (or 'local') in a transaction, the master
server waits at buffer flush but doesn't wait at commit points. Right?

In the current patch, synchronous_transfer works in cooperation with
synchronous_commit.
But if the user changes synchronous_commit in a transaction, they are no
longer in cooperation.
So your idea might be better than the current behaviour of
synchronous_transfer.

Regards,

---
Sawada Masahiko




Re: [HACKERS] insert throw error when year field len > 4 for timestamptz datatype

2013-09-27 Thread Haribabu kommi
On 27 September 2013 15:04 Rushabh Lathia wrote:
On Tue, Sep 17, 2013 at 6:23 PM, Haribabu kommi 
haribabu.ko...@huawei.com wrote:
I feel changing the year value to accept a length (> 4) is not simple.
In many places, a year field of length more than 4 is not considered.
Search in the code with  and correct all related paths.

Right, changing the year value to accept a length (> 4) is not simple, because
in many places a year length of more than 4 is not handled, and most of the
places assume that it will always be 4.

I tried to fix the issue in a couple more places, but I don't feel it is always
going to be safe to assume that we have covered all paths.

Still looking, and wondering whether we can make the change in one simple place
or find some other, smarter way to fix the issue.

If the changes needed to deal with all scenarios are very extensive,
I feel it is better to do it only in the scenarios where the use cases need it,
as long as it does not confuse users.
The rest can be documented.
Any other opinions/suggestions welcome.

Regards,
Hari babu.


Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 1:15 PM, Ivan Lezhnjov IV
i...@commandprompt.com wrote:
 Thanks for a detailed response. I attached a patch file that builds on your 
 corrections and introduces some of the edits discussed above.

This patch makes at least five unrelated sets of changes:

1. Attempting to encourage people to consider custom format dumps.
2. Suggesting that superuser access isn't necessary to dump the database.
3. Adding a note that OIDs on user objects are deprecated.
4. Describing the manner in which pg_dump can be used with pg_dumpall
to back up all databases in a cluster.
5. Promoting pigz.

It's not a good idea to bundle so many unrelated changes together into
a single patch, because if there is disagreement about any item on the
list, then the patch doesn't go in.  Moreover, we typically try to
commit one change at a time, so this leaves the eventual committer
with the unenviable task of splitting up the patch for commit.

I think that #3 and #5 are things we shouldn't do at
all.  There are many compression utilities in the world other than
gzip, and everyone has their favorites.  I see no reason why we should
promote one over the other.  Certainly, the fact that you yourself
have found pigz useful is not sufficient evidence that everyone should
prefer it.  And, while it's true that OIDs on user objects are
deprecated, I don't think the pg_dump documentation necessarily needs
to recapitulate this point particularly; hopefully, it's documented
elsewhere; if not, this isn't the right place to mention it.  If,
contrary to my feelings on the matter, we do decide to make a change
in either of these areas, it's unrelated to the rest of this patch and
needs to be submitted and discussed separately.

I think that #2 is probably a worthwhile change in some form.
However, I don't particularly agree with the way you've worded it
here.  This section is trying to give a general overview of how to
back up your whole database; it's not the pg_dump reference page,
which exists elsewhere.  I would suggest that the relevant caveat here
is not so much that you might have a non-superuser account that just
happens to have enough privileges to back up the entire database, but
rather that we shouldn't give people the impression that the only
possible use of pg_dump is as for a full backup.  Because in my
experience, a full-database backup almost always DOES require
superuser privileges; it's the selective-dump case where you can often
get by without them.  I've attached a proposed patch along these lines
for your consideration.

The remaining changes (#1 and #4) can probably be blended together in
a single patch.  However, I think that you need to make more of an
effort to use neutral wording.  The purpose of the documentation is
not to pass judgement on the tools.  Examples:

+   The above example creates a plain text file.  This type of dump can be used
+   to restore a database in full. However there are more sophisticated
+   productnamePostgreSQL/ backup formats which allow for far greater
+   control when working with backups.  One of these is
+   the quotecustom/quote format, which the following more elaborate
+   example creates:

I don't find it helpful to classify the other formats as more
sophisticated or to say that they allow far greater control, or to
call the example elaborate.  What's important is what you can do, and
there's basically one thing: selective restores.  So I think you could
replace all that, the following example, and the paragraph after that
with: The above example creates a plain text file.  pg_dump can also
create backups in other formats, such as the custom format.  This may
be advantageous, since all formats other than the plain text file
format make it possible to selectively restore objects from the dump.
See xref linkend=app-pgrestore for more details.

+   para
+Unfortunately, applicationpg_dumpall/ can only create plaintext
+backups. However, it is currently the only way to backup the globals in you

Similarly, let's not label pg_dumpall's behavior as unfortunate.  I
think it's appropriate to explain the pg_dumpall -g / pg_dump -Fc
combination somewhere in the documentation, as it is widely used and
very useful.  But I don't think it's useful to imply to users that the
tools are inadequate; generally, they're not, improvements that we'd
like to make nonwithstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index ccb76d8..54681a0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -47,8 +47,13 @@ pg_dump replaceable class=parameterdbname/replaceable gt; replaceable cl
that you can perform this backup procedure from any remote host that has
access to the database. But remember that applicationpg_dump/
does not operate with special permissions. In particular, it must
-   have read access to all tables that 

Re: [HACKERS] Extra functionality to createuser

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 1:04 PM, Christopher Browne cbbro...@gmail.com wrote:
 Sitting on my todo list for a while has been to consider the idea of
 adding a bit of additional functionality to createuser.

 One of the functions of CREATE ROLE is to associate the role with
 other roles, thus...

create role my_new_user nosuperuser nocreatedb login
 IN ROLE app_readonly_role, app2_writer_role;

 That isn't something that I can do using createuser; to do that, I
 would need to submit two requests separately:

PGUSER=postgres createuser -D -S -l my_new_user
PGUSER=postgres psql -c "grant app_readonly_role, app2_writer_role
 to my_new_user;"

 I could certainly change over to using psql to do all the work, but it
 would be rather nice if createuser had (say) a -g option which
 allowed specifying the set of roles that should be assigned.

 Thus, the above commands might be replaced by:
PGUSER=postgres createuser -D -S -l -g
 app_readonly_role,app2_writer_role my_new_user

 Would this be worth adding to the ToDo list?

I'd be inclined to favor a patch implementing this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 4:20 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 But how do we ensure that the BMS is allocated in a context?  You'd have
 to switch contexts each time you call bms_add_member.  I don't have a
 good answer to this.

The coding of bms_add_member is pretty funky.  Why doesn't it just
repalloc() the input argument if it's not big enough?  If it did that,
the new allocation would be in the same memory context as the original
one, and we'd not need to care about the memory context when adding an
element to an already non-empty set.
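
To spell out the repalloc() approach being suggested, here is a sketch; it
assumes it would live in bitmapset.c, where Bitmapset, bitmapword,
BITS_PER_BITMAPWORD, BITMAPSET_SIZE(), bms_make_singleton() and repalloc() are
visible, and it leaves out the negative-argument check the real function does.
Because repalloc() keeps the allocation in its original memory context, the
caller no longer needs to worry about which context is current when adding a
member to an existing set.

Bitmapset *
bms_add_member_sketch(Bitmapset *a, int x)
{
    int wordnum = x / BITS_PER_BITMAPWORD;

    if (a == NULL)
        return bms_make_singleton(x);

    if (wordnum >= a->nwords)
    {
        int oldnwords = a->nwords;

        /* grow in place: the enlarged set stays in its original context */
        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(wordnum + 1));
        a->nwords = wordnum + 1;
        memset(&a->words[oldnwords], 0,
               (a->nwords - oldnwords) * sizeof(bitmapword));
    }

    a->words[wordnum] |= ((bitmapword) 1 << (x % BITS_PER_BITMAPWORD));
    return a;
}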

But even if we did decide to switch memory contexts on every call, it
would still be much cleaner than this.  It's only three lines of code
(and about as many instructions) to save and restore
CurrentMemoryContext.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 11:58 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, I think we can rule out value locks that are held for the
 duration of a transaction right away. That's just not going to fly.

 I think I agree with that.  I don't think I remember hearing that proposed.

 I think I might have been unclear - I mean locks that are held for the
 duration of *another* transaction, not our own, as we wait for that
 other transaction to commit/abort. I think that earlier remarks from
 yourself and Andres implied that this would be necessary. Perhaps I'm
 mistaken. Your most recent design proposal doesn't do this, but I
 think that that's only because it restricts the user to a single
 unique index - it would otherwise be necessary to sit on the earlier
 value locks (index tuples belonging to an unfinished transaction)
 pending the completion of some other conflicting transaction, which
 has numerous disadvantages (as described in my it suits my purposes
 to have the value locks be held for only an instant mail to you [1]).

OK, now I understand what you are saying.  I don't think I agree with it.

 I don't have another idea either. In fact, I'd go so far as to say
 that doing any third thing that's better than those two to any
 reasonable person is obviously impossible. But I'd add that we simple
 cannot rollback at read committed, so we're just going to have to hold
 our collective noses and do strange things with visibility.

I don't accept that as a general principle.  We're writing the code;
we can make it behave any way we think best.

 This is something that I haven't given remotely enough thought yet, so
 please take it with a big grain of salt.

I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable.  This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.

 There is certainly value in considering that, and you're right to take
 that tact - it is generally valuable to have a patch be minimally
 invasive. However, ultimately that's just one aspect of any given
 design, an aspect that needs to be weighed against others where there
 is a tension. Obviously in this instance I believe, rightly or
 wrongly, that doing more - adding more infrastructure than might be
 considered strictly necessary - is the least worst thing. Also,
 sometimes the apparent similarity of a design to what we have today is
 illusory - certainly, I think you'd at least agree that the problems
 that bloating during the interim between value locking and row locking
 present are qualitatively different to other problems that bloat
 presents in all existing scenarios.

TBH, no, I don't think I agree with that.  See further below.

 Let me try and explain myself better, with reference to a concrete
 example. Suppose we have a table with a primary key column, A, and a
 unique constraint column, B, and we lock the pk value first and the
 unique constraint value second. I'm assuming your design, but allowing
 for multiple unique indexes because I don't think doing anything less
 will be accepted - promise tuples have some of the same problems, as
 well as some other idiosyncratic ones (see my earlier remarks on
 recovery/freezing [2] for examples of those).

OK, so far I'm right with you.

 So there is a fairly high probability that the pk value on A will be
 unique, and a fairly low probability that the unique constraint value
 on B will be unique, at least in this usage pattern of interest, where
 the user is mostly going to end up updating. Mostly, we insert a
 speculative regular index tuple (that points to a speculative heap
 tuple that we might decide to kill) into the pk column, A, right away,
 and then maybe block pending the resolution of a conflicting
 transaction on the unique constraint column B. I don't think we have
 any reasonable way of not blocking on A - if we go clean it up for the
 wait, that's going to bloat quite dramatically, *and* we have to WAL
 log. In any case you seemed to accept that cleaning up bloat
 synchronously like that was just going to be too expensive. So I
 suppose that rules that out. That just leaves sitting on the value
 lock (that the pk index tuple already inserted effectively is)
 indefinitely, pending the outcome of the first transaction.

Agreed.

 What are the consequences of sitting on that value lock indefinitely?
 Well, xacts are going to block on the pk value much more frequently,
 by simple virtue of the fact that the value locks there are held for a
 long time - they just needed to hear a no answer, which the unique
 constraint was in most cases happy to immediately give, so this is
 totally unnecessary. Contention is now in a certain sense almost as
 bad for every unique index as it is for the weakest link. That's only
 where the problems begin, though, and it isn't necessary for there to
 be bad contention on more than one unique 

Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Florian Pflug
On Sep27, 2013, at 00:55 , Andres Freund and...@2ndquadrant.com wrote:
 So the goal is to have LWLockAcquire(LW_SHARED) never block unless
 somebody else holds an exclusive lock. To produce enough appetite for
 reading the rest of the long mail, here's some numbers on a
 pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
 
 master + padding: tps = 146904.451764
 master + padding + lwlock: tps = 590445.927065
 
 That's rougly 400%.

Interesting. I played with pretty much the same idea two years or so ago.
At the time, I compared a few different LWLock implementations. Those
were AFAIR

  A) Vanilla LWLocks
  B) A + an atomic-increment fast path, very similar to your proposal
  C) B but with a partitioned atomic-increment counter to further
 reduce cache-line contention
  D) A with the spinlock-based queue replaced by a lockless queue

At the time, the improvements seemed to be negligible - they looked great
in synthetic benchmarks of just the locking code, but didn't translate
to improved TPS numbers. Though I think the only version that ever got
tested on more than a handful of cores was C…

My (rather hacked together) benchmarking code can be found here: 
https://github.com/fgp/lockbench.
The different LWLock implementations live in the various pg_lwlock_* subfolders.

Here's a pointer into the relevant thread: 
http://www.postgresql.org/message-id/651002c1-2ec1-4731-9b29-99217cb36...@phlo.org

best regards,
Florian Pflug





Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-27 Thread Karl O. Pinc
Hi Robert,

On 09/27/2013 05:56:52 AM, Robert Haas wrote:

 1. Attempting to encourage people to consider custom format dumps.
  
 What's important is what you can do...

Your critique seems obvious in retrospect.  Sorry you had
to step in here and do my job.  The above point is particularly
salient.  I will try to keep it in mind in the future.

Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein




Re: [HACKERS] logical changeset generation v6.1

2013-09-27 Thread Thom Brown
On 27 September 2013 16:14, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 Attached you can find an updated version of the series taking in some of
 the review comments (the others are queued, not ignored), including:
 * split off things from the big Introduce wal decoding via ... patch
 * fix the bug Steve noticed where CreateLogicalDecodingContext was passed
   the wrong is_init = false where it should have been true
 * A number of smaller bugs I noticed while reviewing
 * Renaming of some variables, including guaranteedlyLogged ;)
 * Comment improvements in decode.c
 * rename pg_receivellog to pg_recvlogical

 I'll work more on the other points in the next days, so far they are
 clear of other big stuff.


 0001 wal_decoding: Allow walsender's to connect to a specific database
 - as before

 0002 wal_decoding: Log xl_running_xact's at a higher frequency than 
 checkpoints are done
 - as before

 0003 wal_decoding: Add information about a tables primary key to struct 
 RelationData
 - as before

 0004 wal_decoding: Add wal_level = logical and log data required for logical 
 decoding
 - splitof patch that contains the wal format changes including the
   addition of a new wal_level option

 0005 wal_decoding: Add option to treat additional tables as catalog tables
 - Option to treat user defined table as a catalog table which means it
   can be accessed during logical decoding from an output plugin

 0006 wal_decoding: Introduce wal decoding via catalog timetravel
 - The guts of changeset extraction, without a user interface

 0007 wal_decoding: logical changeset extraction walsender interface
 - splitof patch containing the walsender changes, which allow to receive
   the changeset data in a streaming fashion, supporting sync rep and
   such fancy things

 0008 wal_decoding: Only peg the xmin horizon for catalog tables during 
 logical decoding
 - split-off optimization which reduces the pain 06 introduces by pegging
   the xmin horizon to the smallest of the logical decoding slots. Now
    it's pegged differently for data tables than for catalog tables

 0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
 - Example output plugin which is also used in tests

 0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for 
 logical changes
 - renamed client for the walsender interface

 0011 wal_decoding: test_logical_decoding: Add extension for easier testing of 
 logical decoding
 - SQL SRF to get data from a decoding slot, also used as a vehicle for
   tests

 0012 wal_decoding: design document v2.4 and snapshot building design doc v0.5

I'm encountering a make error:

install  pg_basebackup '/home/thom/Development/psql/bin/pg_basebackup'
install  pg_receivexlog '/home/thom/Development/psql/bin/pg_receivexlog'
install  pg_recvlogical(X) '/home/thom/Development/psql/bin/pg_receivellog'
/bin/dash: 1: Syntax error: ( unexpected
make[3]: *** [install] Error 2
make[3]: Leaving directory
`/home/thom/Development/postgresql/src/bin/pg_basebackup'
make[2]: *** [install-pg_basebackup-recurse] Error 2

Thom




Re: [HACKERS] logical changeset generation v6.1

2013-09-27 Thread Andres Freund
On 2013-09-27 16:35:53 +0100, Thom Brown wrote:
 On 27 September 2013 16:14, Andres Freund and...@2ndquadrant.com wrote:
  Hi,
 
  Attached you can find an updated version of the series taking in some of
  the review comments (the others are queued, not ignored), including:
  * split off things from the big Introduce wal decoding via ... patch
  * fix the bug Steve noticed where CreateLogicalDecodingContext was passed
the wrong is_init = false where it should have been true
  * A number of smaller bugs I noticed while reviewing
  * Renaming of some variables, including guaranteedlyLogged ;)
  * Comment improvements in decode.c
  * rename pg_receivellog to pg_recvlogical
 
  I'll work more on the other points in the next days, so far they are
  clear of other big stuff.
 
 
  0001 wal_decoding: Allow walsenders to connect to a specific database
  - as before
 
  0002 wal_decoding: Log xl_running_xact's at a higher frequency than 
  checkpoints are done
  - as before
 
  0003 wal_decoding: Add information about a table's primary key to struct 
  RelationData
  - as before
 
  0004 wal_decoding: Add wal_level = logical and log data required for 
  logical decoding
  - split-off patch that contains the wal format changes, including the
addition of a new wal_level option
 
  0005 wal_decoding: Add option to treat additional tables as catalog tables
  - Option to treat user defined table as a catalog table which means it
can be accessed during logical decoding from an output plugin
 
  0006 wal_decoding: Introduce wal decoding via catalog timetravel
  - The guts of changeset extraction, without a user interface
 
  0007 wal_decoding: logical changeset extraction walsender interface
  - split-off patch containing the walsender changes, which allow receiving
the changeset data in a streaming fashion, supporting sync rep and
such fancy things
 
  0008 wal_decoding: Only peg the xmin horizon for catalog tables during 
  logical decoding
  - split-off optimization which reduces the pain 06 introduces by pegging
the xmin horizon to the smallest of the logical decoding slots. Now
    it's pegged differently for data tables than for catalog tables
 
  0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
  - Example output plugin which is also used in tests
 
  0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for 
  logical changes
  - renamed client for the walsender interface
 
  0011 wal_decoding: test_logical_decoding: Add extension for easier testing 
  of logical decoding
  - SQL SRF to get data from a decoding slot, also used as a vehicle for
tests
 
  0012 wal_decoding: design document v2.4 and snapshot building design doc 
  v0.5
 
 I'm encountering a make error:

Gah. Last-minute changes. Always the same... Updated patch attached.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0010-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz
Description: application/patch-gzip



[HACKERS] hstore extension version screwup

2013-09-27 Thread Andrew Dunstan


When adding json support to hstore, I made a major blunder and added the 
new functionality to the existing sql script instead of bumping the 
version, renaming the script and adding an update script.


This was lazy and there's no real excuse, although I will note that it 
was a mistake far too easy to make. Perhaps as a warning indicator we 
should remove write permissions from these files.


Anyway, I have had some discussions with Dimitri, and the best idea 
seems to be that we should do all the above, but in the update script 
use conditional logic that only adds the functions if they aren't 
already there and dependent on the extension. In the release notes we 
should advise anyone who has loaded hstore to run ALTER EXTENSION 
hstore UPDATE TO '1.2';


The minor downside of this is that the upgrade script will depend on 
plpgsql being available. We'll need to note that too, although I don't 
recall the last time I came across a site that didn't have it loaded.


See attached for details of what's proposed.

cheers

andrew




diff --git a/contrib/hstore/hstore--1.1--1.2.sql 
b/contrib/hstore/hstore--1.1--1.2.sql
new file mode 100644
index 000..99b8a16
--- /dev/null
+++ b/contrib/hstore/hstore--1.1--1.2.sql
@@ -0,0 +1,47 @@
+/* contrib/hstore/hstore--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION hstore UPDATE TO '1.2'" to load this file. \quit
+
+
+-- A version of 1.1 was shipped with these objects mistakenly in 9.3.0.
+-- Therefore we only add them if we detect that they aren't already there and 
+-- dependent on the extension.
+
+DO LANGUAGE plpgsql
+
+$$
+
+BEGIN
+
+   PERFORM 1
+   FROM pg_proc p
+   JOIN  pg_depend d
+  ON p.proname = 'hstore_to_json_loose'
+AND d.objid = p.oid
+AND d.refclassid = 'pg_extension'::regclass
+   JOIN pg_extension x
+  ON d.refobjid = x.oid
+AND  x.extname = 'hstore';
+
+   IF NOT FOUND
+   THEN 
+
+CREATE FUNCTION hstore_to_json(hstore)
+RETURNS json
+AS 'MODULE_PATHNAME', 'hstore_to_json'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE CAST (hstore AS json)
+  WITH FUNCTION hstore_to_json(hstore);
+
+CREATE FUNCTION hstore_to_json_loose(hstore)
+RETURNS json
+AS 'MODULE_PATHNAME', 'hstore_to_json_loose'
+LANGUAGE C IMMUTABLE STRICT;
+
+   END IF;
+
+END;
+
+$$;
diff --git a/contrib/hstore/hstore--1.1.sql b/contrib/hstore/hstore--1.2.sql
similarity index 100%
rename from contrib/hstore/hstore--1.1.sql
rename to contrib/hstore/hstore--1.2.sql
diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control
index 4104e17..9daf5e2 100644
--- a/contrib/hstore/hstore.control
+++ b/contrib/hstore/hstore.control
@@ -1,5 +1,5 @@
 # hstore extension
 comment = 'data type for storing sets of (key, value) pairs'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/hstore'
 relocatable = true



Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-27 Thread Robert Haas
On Fri, Sep 27, 2013 at 9:49 AM, Karl O. Pinc k...@meme.com wrote:
 Hi Robert,
 On 09/27/2013 05:56:52 AM, Robert Haas wrote:

 1. Attempting to encourage people to consider custom format dumps.

 What's important is what you can do...

 Your critique seems obvious in retrospect.  Sorry you had
 to step in here and do my job.  The above point is particularly
 salient.  I will try to keep it in mind in the future.

Hey, your reviewing is much appreciated.  I actually didn't notice
your reply before I sent mine; I started writing it yesterday
afternoon and finished this morning, so I wasn't intending to preempt
you in any way.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Minmax indexes

2013-09-27 Thread Jim Nasby

On 9/26/13 2:46 PM, Robert Haas wrote:

On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby j...@nasby.net wrote:

Why would we add additional code complexity when forks do the trick? That
seems like a step backwards, not forward.


Well, they sorta do the trick, but see e.g. commit
ece01aae479227d9836294b287d872c5a6146a11.  I doubt that's the only
code that's poorly-optimized for multiple forks; IOW, every time
someone adds a new fork, there's a system-wide cost to that, even if
that fork is only used in a tiny percentage of the relations that
exist in the system.


Yeah, we obviously kept things simpler when adding forks in order to get the 
feature out the door. There are improvements that need to be made. But IMHO 
that's not a reason to automatically avoid forks; we need to consider the cost of 
improving them vs. what we gain by using them.

Of course there's always some added cost so we shouldn't just blindly use them 
all over the place without considering the fork cost either...


It's tempting to think that we can use the fork mechanism every time
we have multiple logical streams of blocks within a relation and
don't want to figure out a way to multiplex them onto the same
physical file.  However, the reality is that the fork mechanism isn't
up to the job.  I certainly don't want to imply that we shouldn't have
gone in that direction - both the fsm and the vm are huge steps
forward, and we wouldn't have gotten them in 8.4 without that
mechanism.  But they haven't been entirely without their own pain,
too, and that pain is going to grow the more we push in the direction
of relying on forks.


Agreed.

Honestly, I think we actually need more obfuscation between what happens on the 
filesystem and the rest of postgres... we're starting to look at areas where 
that would help. For example, the recent idea of being able to truncate 
individual relation files and not being limited to only truncating the end of 
the relation. My concern in that case is that 1GB is a pretty arbitrary size 
that we happened to pick, so if we're going to go for more efficiency in 
storage we probably shouldn't just blindly stick with 1G (though of course 
initial implementation might do that to reduce complexity, but we better still 
consider where we're headed).


If the only complaint about forks is directory traversal why wouldn't we go
with the well established practice of using multiple directories instead of
glomming everything into one place?


That's not the only complaint about forks - but I support what you're
proposing there anyhow, because it will be helpful to users with lots
of relations regardless of what we do or do not decide to do about
forks.




--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Extra functionality to createuser

2013-09-27 Thread Jim Nasby

On 9/27/13 6:01 AM, Robert Haas wrote:

Thus, the above commands might be replaced by:
PGUSER=postgres createuser -D -S -l -g
app_readonly_role,app2_writer_role my_new_user

Would this be worth adding to the ToDo list?

I'd be inclined to favor a patch implementing this.


+1
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Minmax indexes

2013-09-27 Thread Greg Stark
On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby j...@nasby.net wrote:

 Yeah, we obviously kept things simpler when adding forks in order to get the 
 feature out the door. There are improvements that need to be made. But IMHO 
 that's not a reason to automatically avoid forks; we need to consider the cost 
 of improving them vs. what we gain by using them.


I think this gives short shrift to the decision to introduce forks.
If you go back to the discussion at the time it was a topic of debate
and the argument which won the day is that interleaving different
streams of data in one storage system is exactly what the file system
is designed to do and we would just be reinventing the wheel if we
tried to do it ourselves. I think that makes a lot of sense for things
like the fsm or vm which grow indefinitely and are maintained by a
different piece of code from the main heap.

The tradeoff might be somewhat different for the pieces of a data
structure like a bitmap index or gin index where the code responsible
for maintaining it is all the same.

 Honestly, I think we actually need more obfuscation between what happens on 
 the filesystem and the rest of postgres... we're starting to look at areas 
 where that would help. For example, the recent idea of being able to truncate 
 individual relation files and not being limited to only truncating the end of 
 the relation. My concern in that case is that 1GB is a pretty arbitrary size 
 that we happened to pick, so if we're going to go for more efficiency in 
 storage we probably shouldn't just blindly stick with 1G (though of course 
 initial implementation might do that to reduce complexity, but we better 
 still consider where we're headed).

The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.

But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as deprecated and move tuples out of them until we can just remove
that whole file.
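
To make the "holes in the files" idea concrete: on Linux this would presumably
be done with fallocate() hole punching, roughly as sketched below. The file
name and block offsets are hypothetical, and PostgreSQL does nothing like this
today; the sketch only illustrates the filesystem facility being referred to.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* hypothetical relation segment and a block range known to be dead */
    int     fd = open("base/16384/16385.1", O_WRONLY);
    off_t   offset = 128 * 8192;    /* first dead block */
    off_t   length = 256 * 8192;    /* length of the dead range */

    if (fd < 0)
        return 1;

    /* Deallocate the blocks but keep the apparent file size unchanged. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, length) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}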

-- 
greg




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Merlin Moncure
On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
 On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
  + bool
  + RecoveryMightBeInProgress(void)
  + {
  + /*
  +  * We check shared state each time only until we leave recovery 
  mode. We
  +  * can't re-enter recovery, so there's no need to keep checking 
  after the
  +  * shared variable has once been seen false.
  +  */
  + if (!LocalRecoveryInProgress)
  + return false;
  + else
  + {
  + /* use volatile pointer to prevent code rearrangement */
  + volatile XLogCtlData *xlogctl = XLogCtl;
  +
  + /* Intentionally query xlogctl without spinlocking! */
  + LocalRecoveryInProgress = 
   xlogctl->SharedRecoveryInProgress;
  +
  + return LocalRecoveryInProgress;
  + }
  + }
 
  I don't think it's acceptable to *set* LocalRecoveryInProgress
  here. That should only be done in the normal routine.

 quite right -- that was a major error -- you could bypass the
 initialization call to the xlog with some bad luck.

 I've seen this in profiles since, so I'd appreciate pushing this
 forward.

 roger that -- will push ahead when I get into the office...

 attached is new version fixing some comment typos.

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist.  The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read.   I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.  Aside: can this truly be the only
caller of pg_read_barrier()?
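
For readers following along without the attachment, the shape of the change
being discussed is roughly the following. This is only a sketch, reusing the
identifiers from the function quoted above; whether a read barrier is the
right fence here, or whether a full memory barrier is needed, is exactly what
gets debated downthread.

bool
RecoveryMightBeInProgress(void)
{
    /*
     * Once we have seen recovery finish we can skip the shared read;
     * recovery cannot be re-entered.
     */
    if (!LocalRecoveryInProgress)
        return false;
    else
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;
        bool        result = xlogctl->SharedRecoveryInProgress;

        /* keep later loads in the caller ordered after this one */
        pg_read_barrier();

        /*
         * Deliberately do not cache the result here; only the normal
         * RecoveryInProgress() path, which performs initialization,
         * may update LocalRecoveryInProgress.
         */
        return result;
    }
}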

Also, moving to -hackers from -performance.

merlin


recovery5.patch
Description: Binary data



Re: [HACKERS] Minmax indexes

2013-09-27 Thread Jim Nasby

On 9/27/13 1:43 PM, Greg Stark wrote:

Honestly, I think we actually need more obfuscation between what happens on the 
filesystem and the rest of postgres... we're starting to look at areas where 
that would help. For example, the recent idea of being able to truncate 
individual relation files and not being limited to only truncating the end of 
the relation. My concern in that case is that 1GB is a pretty arbitrary size 
that we happened to pick, so if we're going to go for more efficiency in 
storage we probably shouldn't just blindly stick with 1G (though of course 
initial implementation might do that to reduce complexity, but we better still 
consider where we're headed).

The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.

But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as deprecated and move tuples out of them until we can just remove
that whole file.


Yeah, there's a ton of different things we might do. And dealing with free space is just 
one example... things like the VM give us the ability to detect areas of the heap that 
have gone dormant; imagine if we could seamlessly move that data to its own 
storage, possibly compressing it at the same time. (Yes, I realize there's partitioning 
and tablespaces and compressing filesystems, but those are a lot more work and will never 
be as efficient as what the database itself can do).

Anyway, I think we're all on the same page. We should stop hijacking Alvaro's 
thread... ;)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 14:12, Robert Haas wrote:

On Thu, Sep 26, 2013 at 4:20 PM, Alvaro Herrera
alvhe...@2ndquadrant.com  wrote:

But how do we ensure that the BMS is allocated in a context?  You'd have
to switch contexts each time you call bms_add_member.  I don't have a
good answer to this.


The coding of bms_add_member is pretty funky.  Why doesn't it just
repalloc() the input argument if it's not big enough?  If it did that,
the new allocation would be in the same memory context as the original
one, and we'd not need to care about the memory context when adding an
element to an already non-empty set.

But even if we did decide to switch memory contexts on every call, it
would still be much cleaner than this.  It's only three lines of code
(and about as many instructions) to save and restore
CurrentMemoryContext.


[looks] Huh, yeah, as it stands, bms_add_member() is an accident waiting 
to happen. It's totally non-obvious that it might cause the BMS to jump 
from another memory context to CurrentMemoryContext. Same with 
bms_add_members(). And it would be good to Assert in bms_join that both 
arguments come from the same MemoryContext.
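
A minimal sketch of the repalloc()-based approach Robert describes might look
like this, assuming the existing WORDNUM/BITNUM/BITMAPSET_SIZE macros and
bms_make_singleton() from nodes/bitmapset.c; it is shown only to illustrate
the idea, not as the actual fix:

Bitmapset *
bms_add_member(Bitmapset *a, int x)
{
    int     wordnum,
            bitnum;

    if (x < 0)
        elog(ERROR, "negative bitmapset member not allowed");
    if (a == NULL)
        return bms_make_singleton(x);

    wordnum = WORDNUM(x);
    bitnum = BITNUM(x);

    /* enlarge the set if necessary, staying in a's memory context */
    if (wordnum >= a->nwords)
    {
        int     oldnwords = a->nwords;
        int     i;

        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(wordnum + 1));
        a->nwords = wordnum + 1;
        /* zero out the newly added words */
        for (i = oldnwords; i < a->nwords; i++)
            a->words[i] = 0;
    }

    a->words[wordnum] |= ((bitmapword) 1 << bitnum);
    return a;
}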


Let's fix that...

- Heikki




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 22:00, Merlin Moncure wrote:

On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure mmonc...@gmail.com wrote:

On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure mmonc...@gmail.com wrote:

On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:

On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:

+ bool
+ RecoveryMightBeInProgress(void)
+ {
+ /*
+  * We check shared state each time only until we leave recovery mode. We
+  * can't re-enter recovery, so there's no need to keep checking after the
+  * shared variable has once been seen false.
+  */
+ if (!LocalRecoveryInProgress)
+ return false;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* Intentionally query xlogctl without spinlocking! */
+ LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+
+ return LocalRecoveryInProgress;
+ }
+ }


I don't think it's acceptable to *set* LocalRecoveryInProgress
here. That should only be done in the normal routine.


quite right -- that was a major error -- you could bypass the
initialization call to the xlog with some bad luck.


I've seen this in profiles since, so I'd appreciate pushing this
forward.


roger that -- will push ahead when I get into the office...


attached is new version fixing some comment typos.


Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist.  The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read.   I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.


Can you think of a concrete example of what might go wrong if we simply 
removed the spinlock, and did an unprotected read through the volatile 
pointer? Is there some particular call sequence that might get optimized 
in an unfortunate way? I'm not suggesting that we do that - better safe 
than sorry even if we can't think of any particular scenario - but it 
would help me understand this.


I don't think a read-barrier is enough here. See README.barrier:


When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier().


The RecoveryInProgress() function does just one load, to read 
the variable, and wouldn't even need a barrier by itself. The other load 
or store that needs to be protected by the barrier happens in the 
caller, before or after the function, and we can't say for sure if it's 
a load or a store. So, let's use pg_memory_barrier().



 Aside: can this truly be the only caller
of pg_read_barrier()?


Yep. I only added the first caller of the barriers altogether in the 
xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, 
but it wasn't used until now, in 9.4.


- Heikki




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2013 at 1:28 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Yep. I only added the first caller of the barriers altogether in the
 xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it
 wasn't used until now, in 9.4.

FWIW, it was actually during 9.2 development that Robert first added
the barriers.

-- 
Peter Geoghegan




Re: [HACKERS] logical changeset generation v6

2013-09-27 Thread Steve Singer

On 09/26/2013 02:47 PM, Steve Singer wrote:



I've determined that in this test the walsender seems to be 
hitting this when it is decoding the transactions that are behind the 
slonik commands to add tables to replication (set add table, set add 
sequence).  This is before the SUBSCRIBE SET is submitted.


I've also noticed something else that is strange (but might be 
unrelated).  If I stop my slon process and restart it I get messages 
like:


WARNING:  Starting logical replication from 0/a9321360
ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00

Where 0/A9321360 was sent in the last packet my slon received from the 
walsender before the restart.


If I force it to restart replication from 0/A9320B00 I see data rows that 
I appear to have already seen before the restart.
I think this is happening when my slon is killed after I process the data 
for 0/A9320B00 but before the feedback message is sent. Is this expected?





I've further narrowed this down to something (or some combination of things) 
that the _disorder_replica.altertableaddTriggers(1) stored function does 
(i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)), which is essentially:
* Get an exclusive lock on sl_config_lock
* Get an exclusive lock on the user table in question
* create a trigger (the deny access trigger)
* create a truncate trigger
* create a deny truncate trigger

I am not yet able to replicate the error by issuing the same SQL 
commands from psql, but I must be missing something.


I can replicate this when just using the test_decoding plugin.


Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Andres Freund
On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
 On 27.09.2013 22:00, Merlin Moncure wrote:
 Attached is a simplified patch that replaces the spinlock with a read
 barrier, based on a suggestion made by Andres offlist.  The approach
 has different performance characteristics -- a barrier call is being
 issued instead of a non-blocking read.   I don't have a performance
 test case in hand to prove that's better, so I'm going with Andres's
 approach because it's simpler.
 
 Can you think of a concrete example of what might go wrong if we simply
 removed the spinlock, and did an unprotected read through the volatile
 pointer? Is there some particular call sequence that might get optimized in
 an unfortunate way? I'm not suggesting that we do that - better safe than
 sorry even if we can't think of any particular scenario - but it would help
 me understand this.

I think in some architectures the write from the other side might just
not immediately be visible, so RecoveryInProgress() might return a
spurious true, even though we already finished.
I haven't checked whether there's any callers that would have a problem
with that.

 I don't think a read-barrier is enough here. See README.barrier:
 
 When a memory barrier is being used to separate two
 loads, use pg_read_barrier(); when it is separating two stores, use
 pg_write_barrier(); when it is separating a load and a store (in either
 order), use pg_memory_barrier().

I think the documentation is just worded in an easy-to-misunderstand
way. If you look at the following example, we're fine.
The two stores and two loads it is talking about are what's happening
in one thread of execution. The important part of a read barrier is that
it forces a fresh LOAD, separated from earlier LOADs of the same
address.

 The function RecoveryInProgress() function does just one load, to read the
 variable, and wouldn't even need a barrier by itself. The other load or
 store that needs to be protected by the barrier happens in the caller,
 before or after the function, and we can't say for sure if it's a load or a
 store. So, let's use pg_memory_barrier().

The caller uses a spinlock, so it's guaranteed to write out before the
spinlock is released. A write barrier (the spinlock in the startup
process) should always be paired with a read barrier.
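
To spell that pairing out, here is a generic illustration with hypothetical
struct and function names; pg_write_barrier()/pg_read_barrier() are the real
primitives from storage/barrier.h. In the case at hand the writer is the
startup process storing SharedRecoveryInProgress under its spinlock, and the
reader is a backend calling RecoveryInProgress().

#include "postgres.h"
#include "storage/barrier.h"

typedef struct DemoShared
{
    int     payload;        /* data the writer wants to publish */
    bool    done;           /* flag saying the payload is ready */
} DemoShared;

static void
demo_writer(volatile DemoShared *s)
{
    s->payload = 42;        /* store the data first */
    pg_write_barrier();     /* make the payload store visible before the flag */
    s->done = true;         /* then publish the flag */
}

static int
demo_reader(volatile DemoShared *s)
{
    if (s->done)            /* load the flag */
    {
        pg_read_barrier();  /* keep the payload load after the flag load */
        return s->payload;  /* sees 42, not a stale value */
    }
    return -1;              /* not published yet */
}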

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6

2013-09-27 Thread Andres Freund
Hi Steve,

On 2013-09-27 17:06:59 -0400, Steve Singer wrote:
 I've determined that in this test the walsender seems to be hitting
 this when it is decoding the transactions that are behind the slonik
 commands to add tables to replication (set add table, set add sequence).
 This is before the SUBSCRIBE SET is submitted.
 
 I've also noticed something else that is strange (but might be unrelated).
 If I stop my slon process and restart it I get messages like:
 
 WARNING:  Starting logical replication from 0/a9321360
 ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00
 
 Where 0/A9321360 was sent in the last packet my slon received from the
 walsender before the restart.

Uh, that looks like I fumbled some comparison. Let me check.

 I've further narrowed this down to something (or some combination of things)
 that the _disorder_replica.altertableaddTriggers(1) stored function does
 (i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)), which is essentially:
 * Get an exclusive lock on sl_config_lock
 * Get an exclusive lock on the user table in question
 * create a trigger (the deny access trigger)
 * create a truncate trigger
 * create a deny truncate trigger
 
 I am not yet able to replicate the error by issuing the same SQL commands
 from psql, but I must be missing something.
 
 I can replicate this when just using the test_decoding plugin.

Thanks. That should get me started with debugging. Unless it's possibly
fixed in the latest version already -- one bug fixed there might cause
something like this if the moon stands exactly right.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 14:46:50 +0200, Florian Pflug wrote:
 On Sep27, 2013, at 00:55 , Andres Freund and...@2ndquadrant.com wrote:
  So the goal is to have LWLockAcquire(LW_SHARED) never block unless
  somebody else holds an exclusive lock. To produce enough appetite for
  reading the rest of the long mail, here's some numbers on a
  pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
 
  master + padding: tps = 146904.451764
  master + padding + lwlock: tps = 590445.927065
 
  That's rougly 400%.

 Interesting. I played with pretty much the same idea two years or so ago.
 At the time, I compared a few different LWLock implementations. Those
 were AFAIR

   A) Vanilla LWLocks
   B) A + an atomic-increment fast path, very similar to your proposal
   C) B but with a partitioned atomic-increment counter to further
  reduce cache-line contention
   D) A with the spinlock-based queue replaced by a lockless queue

 At the time, the improvements seemed to be negligible - they looked great
 in synthetic benchmarks of just the locking code, but didn't translate
 to improved TPS numbers. Though I think the only version that ever got
 tested on more than a handful of cores was C…

I think you really need multi-socket systems to see the big benefits
from this. My laptop barely shows any improvements, while my older 2
socket workstation already shows some in workloads that have more
contention than pgbench -S.

From a quick look, you didn't have any sleep-based queueing in at least one
of the variants there? In my tests, that was tremendously important
to improve scaling if there was any contention. Which is not surprising
in the end, because otherwise you essentially have rw-spinlocks which
really aren't suitable for many of the lwlocks we use.

Getting the queueing semantics, including releaseOK, right was what took
me a good amount of time, the atomic ops part was pretty quick...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Andres Freund
Hi,

What confuses me is that pg_read_barrier() is just a compiler barrier on
x86[-64] in barrier.h. According to my knowledge it needs to be an
lfence or the full barrier?
The linked papers from Paul McKenney - which are a great read - seem to
agree?

On 2013-09-27 23:12:17 +0200, Andres Freund wrote:
 On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
  The function RecoveryInProgress() function does just one load, to read the
  variable, and wouldn't even need a barrier by itself. The other load or
  store that needs to be protected by the barrier happens in the caller,
  before or after the function, and we can't say for sure if it's a load or a
  store. So, let's use pg_memory_barrier().
 
 The caller uses a spinlock, so it's guaranteed to write out before the
 spinlock is released. A write barrier (the spinlock in the startup
 process) should always be paired by a read barrier.

s/caller/startup process/

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extra functionality to createuser

2013-09-27 Thread Christopher Browne
Attached is a patch implementing the -g / --roles option for createuser.

I'll be attaching it to the open CommitFest shortly.


createuser.diff
Description: Binary data



Re: [HACKERS] record identical operator - Review

2013-09-27 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote:
 On 09/12/2013 06:27 PM, Kevin Grittner wrote:
 Attached is a patch for a bit of infrastructure I believe to be
 necessary for correct behavior of REFRESH MATERIALIZED VIEW
 CONCURRENTLY as well as incremental maintenance of matviews.

 Here is attempt at a review of the v1 patch.

 There has been a heated discussion on list about if we even want
 this type of operator. I will try to summarize it here

 The arguments against it are
 * that a record_image_identical call on two records that returns
 false can still return true under the equality operator, and the
 records might (or might not) appear to be the same to users
 * Once we expose this as an operator via SQL someone will find
 it and use it and then complain when we change things such as
 the internal representation of a datatype which might break
 their queries

I don't see where that is possible unless they count on the order
of records sorted with the new operators, which would not be
something I recommend.  Changing the internal representation cannot
break simple use of the alternative equality definition as long as
all you cared about was whether the binary storage format of the
values was identical.

 * The differences between = and record_image_identical might not
 be understood by everyone who finds the operator, leading to
 confusion

Either they find it and read the code, or we document it and they
read the docs.

 * Using a criteria other than equality (the = operator provided
 by the datatype) might cause problems if we later provide 'on
 change' triggers for materialized views

I would fight user-defined triggers for matviews tooth and nail.
We need to get incremental maintenance working instead.

 The arguments for this patch are
 * We want the materialized view to return the same value as
 would be returned if the query were executed directly.  This not
 only means that the values should be the same according to a
 datatypes = operator but that they should appear the same 'to
 the eyeball'.

And to functions the user can run against the values.  The example
with the null bitmap for arrays being included when not necessary
is something that isn't directly apparent to the eye, but queries
which use pg_column_size would not get equal results.

 * Supporting the materialized view refresh check via SQL makes a
 lot of sense but doing that requires exposing something via SQL
 * If we adequately document what we mean by
 record_image_identical and the operator we pick for this then
 users shouldn't be surprised at what they get if they use this

We first need to document the existing record comparison operators.
If they read the docs for comparing row_constructors and expect
that to be the behavior they get when they compare records, they
will be surprised.

 This is a good summary.  I think there are a few things that make
 this issue difficult to decide:

 1.  We have to use an operator to give the RMVC (REFRESH
 MATERIALIZED VIEW CONCURRENTLY) SPI query the ability to optimize
 this query.  If we could do this with an SQL or C function, there
 would be less concern about the confusion caused by adding this
 capability.

 (a)  We can't.

 (b)  Why would that be less confusing?

 Question:  If we are comparing based on some primary key, why do
 we need this to optimize?

Because comparing primary keys doesn't tell us whether the old and
new values in the row all match.

 Certainly the primary key index will be used, no?

It currently uses all columns which are part of unique indexes on
the matview which are not partial (i.e., they do not use a WHERE
clause) and which index only columns (not expressions).  It is not
possible to define a primary key on a matview.  There are two
reasons for using these columns:

(a)  It provides a correctness guarantee against having duplicated
rows which have no nulls.  This guarantee is what allows us to use
the differential technique which is faster than retail DELETE and
re-INSERT of everything.

(b)  Since we don't support hash joins of records, and it would
tend to be pretty slow if we did, it allows faster hash joins on
one or more columns which stand a good chance of doing this
efficiently.

 2.  Aggregates are already non-deterministic,

Only some of them, and only with certain source data.

 so some feel that adding this feature doesn't improve much.

It allows a materialized view which contains a column of a type
without a default btree opclass to *use* concurrent refresh; it
makes queries with deterministic results always generate matview
results which exactly match the results of a VIEW using the query if
it were run at the same moment; and for nondeterministic queries,
it provides results consistent with *some* result set which could
have been returned by a view at the same time.  Unless we do
something, none of these things are true.  That seems like a lot to
me.

 The counter-argument is that 

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2013-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas robertmh...@gmail.com wrote:
 I don't have another idea either. In fact, I'd go so far as to say
 that doing any third thing that's better than those two to any
 reasonable person is obviously impossible. But I'd add that we simply
 cannot roll back at read committed, so we're just going to have to hold
 our collective noses and do strange things with visibility.

 I don't accept that as a general principle.  We're writing the code;
 we can make it behave any way we think best.

I presume you're referring to the principle that we cannot throw
serialization failures at read committed. I'd suggest that letting
that happen would upset a lot of people, because it's so totally
unprecedented. A large segment of our user base would just consider
that to be Postgres randomly throwing errors, and would be totally
dismissive of the need to do so, and not without some justification -
no one else does the same. The reality is that the majority of our
users don't even know what an isolation level is. I'm not just talking
about people that use Postgres more casually, such as Heroku
customers. I've personally talked to people who didn't even know what
a transaction isolation level was, that were in a position where they
really, really should have known.

 I doubt that any change to HeapTupleSatisfiesMVCC() will be
 acceptable.  This feature needs to restrain itself to behavior changes
 that only affect users of this feature, I think.

I agree with the principle of what you're saying, but I'm not aware
that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
changes for those not using the feature. Certainly, the standard
regression tests and isolation tests still pass, for what it's worth.
Having said that, I have not thought about it enough to be willing to
actually defend that bit of code. Though I must admit that I am a
little encouraged by the fact that it passes casual inspection.

I am starting to wonder if it's really necessary to have a blessed
update that can see the locked, not-otherwise-visible tuple. Doing
that certainly has its disadvantages, both in terms of code complexity
and in terms of being arbitrarily restrictive. We're going to have to
allow the user to see the locked row after it's updated (the new row
version that we create will naturally be visible to its creating xact)
- is it really any worse that the user can see it before an update (or
a delete)? The user could decide to effectively make the update change
nothing, and see the same thing anyway.

I get why you're averse to doing odd things to visibility - I was too.
I just don't see that we have a choice if we want this feature to work
acceptably with read committed. In addition, as it happens I just
don't see that the general situation is made any worse by the fact
that the user might be able to see the row before an update/delete.
Isn't is also weird to update or delete something you cannot see?

Couldn't EvalPlanQual() be said to be an MVCC violation on similar
grounds? It also reaches into the future. Locking a row isn't really
that distinct from updating it, both in terms of the code footprint
and from a logical perspective.

 It's probably important to avoid having
 them keep recreating speculative tuples and then killing them as long
 as a candidate tuple is available, so that they don't create a dead
 tuple per iteration.  But that seems doable.

I'm not so sure.

 Realize you can generally only kill the heap tuple *before* you have
 the row lock, because otherwise a totally innocent non-HOT update (may
 not update any unique indexed columns at all) will deadlock with your
 session, which I don't think is defensible, and will probably happen
 often if allowed to (after all, this is upsert - users are going to
 want to update their locked rows!).

 I must be obtuse; I don't see why that would deadlock.

If you don't see it, then you aren't being obtuse in asking for
clarification. It's really easy to be wrong about this kind of thing.

If the non-HOT update updates some random row, changing the key
columns, it will lock that random row version. It will then proceed
with value locking (i.e. inserting index tuples in the usual way, in
this case with entirely new values). It might then block on one of the
index tuples we, the upserter, have already inserted (these are our
value locks under your scheme). Meanwhile, we (the upserter) might
have *already* concluded that the *old* heap row that the regular
updater is in the process of rendering invisible is to blame in
respect of some other value in some later unique index, and that *it*
must be locked. Deadlock. This seems very possible if the key values
are somewhat correlated, which is probably generally quite common.

The important observation here is that an updater, in effect, locks
both the old and new sets of values (for non-HOT updates). And as I've
already noted, any practical value locking implementation isn't
going to be able 

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics

2013-09-27 Thread Peter Geoghegan
On Tue, Sep 24, 2013 at 2:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 Various messages are discussing semantics around visibility. I by now
 have a hard time keeping track. So let's keep the discussion of the
 desired semantics to this thread.

Yes, it's pretty complicated.

I meant to comment on this here, but ended up saying some stuff to
Robert about this in the main thread, so I should probably direct you
to that. You were probably right to start a new thread, because I
think we can usefully discuss this topic in parallel, but that's just
what ended up happening.

-- 
Peter Geoghegan

