Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Simon Riggs
On 8 November 2012 23:58, Jeff Janes jeff.ja...@gmail.com wrote:
 On Thu, Nov 8, 2012 at 2:50 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 11/08/2012 11:40 PM, Simon Riggs wrote:

 On 8 November 2012 20:36, Jeff Janes jeff.ja...@gmail.com wrote:

 It does not seem outrageous to me that there would be real-world
 conditions in which invalidations would be sent more than once a
 minute over prolonged periods, so this total starvation seems like a
 bug.

 Yes, it's a bug, but do you really believe the above? In what cases?

 We see lots of traffic on the mailing list about people trying to dump
 several hundred thousand tables, or only being able to create one database
 every two minutes, or finding that truncating hundreds of tables at a time
 over and over again gets slow, etc.  I know little about the internals of
 the invalidation code, but I would think doing that kind of thing must
 generate a lot of them.

OK, so the problem is *any* sinval. I thought you meant one sinval per
object per minute, which seemed much less likely.

I agree one sinval per minute for long periods is actually quite likely.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 8 November 2012 23:20, Hannu Krosing ha...@2ndquadrant.com wrote:
 On 11/08/2012 08:51 PM, Simon Riggs wrote:

 On 8 November 2012 17:07, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Nov 7, 2012 at 10:34 AM, Simon Riggs si...@2ndquadrant.com
 wrote:

 For 9.2 we discussed having COPY setting tuples as frozen. Various
 details apply.
 Earlier threads:
RFC: Making TRUNCATE more MVCC-safe
COPY with hints, rebirth

 I was unhappy with changing the behaviour of TRUNCATE, and still am.
 So the proposal here is to have a specific modifier on TRUNCATE
 command that makes it MVCC safe by throwing a serialization error.

 I don't think I understand the proposal.  Under what circumstances
 would it throw a serialization error?

 If you ask for TRUNCATE SERIALIZABLE, then if someone asks for data in
 the table with a snapshot that can see the earlier data, it will throw a
 serialization error. So it's a new kind of TRUNCATE that is MVCC safe.

 Can't we make it so that the reader with an earlier snapshot sees the data
 from the pre-truncation file?

We could... but that would require keeping a history of relfilenodes
for an object to allow for more than one TRUNCATE event. Tracking all
of that would be hard and I don't personally think it's worth that
effort.

 and we unlink the base file(s) only once nobody has a snapshot that can see
 it?

DELETE does that if that's the semantics you want.

 or are there some subtler problems (I was under the impression that we
 already did this as described above)?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] ERROR: lock 9 is not held

2012-11-09 Thread er
9.3devel / AMD FX-8120 8-core  / centos 6.2 / Linux 2.6.32-279.11.1.el6.x86_64

Hi,

I made a test setup of 9.3devel synchronous replication (git-master as of a few 
hours ago):
two instances on 1 machine:
  master: 93_1  port 6664
  slave:  93_2  port 6665

I do nothing on either server.  I get these logs (note the ERROR lines in the
slave log, every five minutes):

$ tail -F 
/home/aardvark/pg_stuff/pg_installations/pgsql.93_1/data/../logfile.93_1
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
2012-11-09 09:42:49.717 CET 6279 LOG:  database system was shut down at 
2012-11-09 09:42:49 CET
2012-11-09 09:42:49.823 CET 6283 LOG:  autovacuum launcher started
2012-11-09 09:42:49.824 CET 6278 LOG:  database system is ready to accept 
connections
2012-11-09 09:42:53.244 CET 6329 LOG:  standby wal_receiver_01 is now the 
synchronous standby
with priority 1

$ tail -F 
/home/aardvark/pg_stuff/pg_installations/pgsql.93_2/data/../logfile.93_2
cp: cannot stat `/home/aardvark/pg_stuff/archive_dir/00010004': 
No such file or
directory
2012-11-09 09:42:53.144 CET 6328 LOG:  streaming replication successfully 
connected to primary
2012-11-09 09:49:01.124 CET 6323 ERROR:  lock 9 is not held
2012-11-09 09:53:02.427 CET 6323 ERROR:  lock 9 is not held
2012-11-09 09:58:03.845 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:03:05.239 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:07:51.620 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:12:53.019 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:17:54.416 CET 6323 ERROR:  lock 9 is not held
2012-11-09 10:22:55.811 CET 6323 ERROR:  lock 9 is not held


Master configuration:

data_directory = '/home/aardvark/pg_stuff/pg_installations/pgsql.93_1/data'
listen_addresses = '*'
max_connections = 100
shared_buffers = 128MB
wal_level = hot_standby
synchronous_commit = on
checkpoint_segments = 50
archive_mode = on
archive_command = 'cp %p /home/aardvark/pg_stuff/archive_dir/%f  /dev/null'
max_wal_senders = 3
synchronous_standby_names = '*'


Slave configuration:

data_directory = '/home/aardvark/pg_stuff/pg_installations/pgsql.93_2/data'
listen_addresses = '*'
port = 6665
max_connections = 100
shared_buffers = 128MB
wal_level = hot_standby
synchronous_commit = on
checkpoint_segments = 50
archive_mode = on
archive_command = 'cp %p /home/aardvark/pg_stuff/archive_dir/%f  /dev/null'
max_wal_senders = 3
synchronous_standby_names = ''
hot_standby = on
wal_receiver_status_interval = 59
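
To confirm that the sync setup itself looks right, I also check the standard
pg_stat_replication view on the master; this is just the stock view, nothing
specific to my build scripts:

  -- run against the master (port 6664 in this setup)
  SELECT application_name, state, sync_priority, sync_state
    FROM pg_stat_replication;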


It's just an ERROR line in the log, and it may not be a real problem, but in
any case I'd like to know how it is caused.

(I ought to add that I had not run these build scripts for testing syncrep for
a few weeks/months, so I may have missed some developments in setup/config.)

thanks,

Erik Rijkers







Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Hannu Krosing

On 11/09/2012 09:34 AM, Simon Riggs wrote:

 On 8 November 2012 23:20, Hannu Krosing ha...@2ndquadrant.com wrote:

 On 11/08/2012 08:51 PM, Simon Riggs wrote:

 On 8 November 2012 17:07, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Nov 7, 2012 at 10:34 AM, Simon Riggs si...@2ndquadrant.com
 wrote:

 For 9.2 we discussed having COPY setting tuples as frozen. Various
 details apply.
 Earlier threads:
    RFC: Making TRUNCATE more MVCC-safe
    COPY with hints, rebirth

 I was unhappy with changing the behaviour of TRUNCATE, and still am.
 So the proposal here is to have a specific modifier on TRUNCATE
 command that makes it MVCC safe by throwing a serialization error.

 I don't think I understand the proposal.  Under what circumstances
 would it throw a serialization error?

 If you ask for TRUNCATE SERIALIZABLE, then if someone asks for data in
 the table with a snapshot that can see the earlier data, it will throw a
 serialization error. So it's a new kind of TRUNCATE that is MVCC safe.

 Can't we make it so that the reader with an earlier snapshot sees the data
 from the pre-truncation file?

 We could... but that would require keeping a history of relfilenodes
 for an object to allow for more than one TRUNCATE event.

MVCC does keep a history of old relfilenodes.

I thought we were able to read old MVCC versions in pg_class for
this if the snapshot required data matching an older pg_class record.

 Tracking all
 of that would be hard and I don't personally think it's worth that
 effort.

 and we unlink the base file(s) only once nobody has a snapshot that can see
 it?

 DELETE does that if that's the semantics you want.

 or are there some subtler problems (I was under the impression that we
 already did this as described above)?






Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Marti Raudsepp
On Wed, Nov 7, 2012 at 5:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I was unhappy with changing the behaviour of TRUNCATE, and still am.
 So the proposal here is to have a specific modifier on TRUNCATE
 command that makes it MVCC safe by throwing a serialization error.
 That new behaviour should be requestable by adding the SERIALIZABLE
 keyword.
 i.e. TRUNCATE foo SERIALIZABLE;
 This then allows a new style of TRUNCATE, yet without modifying
 behaviour of earlier programs (ugh!).

Personally I think the behavior should be dictated by the *reader*.
The one doing the truncation may not know about the consistency
requirements of particular readers. Especially when you do the
truncate via pg_restore or some other generic tool. And indeed
different readers may have different consistency requirements.

So I'd prefer if it were a GUC variable; readers that accept relaxed
consistency can set truncate_conflict=off
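
To make that concrete, a reader that accepts relaxed consistency would do
something like the following (a sketch; truncate_conflict is the GUC proposed
above, not an existing setting, and the table name is made up):

  SET truncate_conflict = off;   -- proposed GUC: don't raise serialization errors on truncated tables
  SELECT count(*) FROM events;   -- keeps working even if events was concurrently truncated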

Granted, making this part of the TRUNCATE statement does give you
table-level granularity. But if a reader can already handle
serialization conflicts, it doesn't really matter which table they
came from.

But making TRUNCATE behave in a MVCC-safe way, like proposed by Hannu,
would be much better.

Regards,
Marti




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 10:28, Marti Raudsepp ma...@juffo.org wrote:
 On Wed, Nov 7, 2012 at 5:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I was unhappy with changing the behaviour of TRUNCATE, and still am.
 So the proposal here is to have a specific modifier on TRUNCATE
 command that makes it MVCC safe by throwing a serialization error.
 That new behaviour should be requestable by adding the SERIALIZABLE
 keyword.
 i.e. TRUNCATE foo SERIALIZABLE;
 This then allows a new style of TRUNCATE, yet without modifying
 behaviour of earlier programs (ugh!).

 Personally I think the behavior should be dictated by the *reader*.
 The one doing the truncation may not know about the consistency
 requirements of particular readers. Especially when you do the
 truncate via pg_restore or some other generic tool. And indeed
 different readers may have different consistency requirements.

 So I'd prefer if it were a GUC variable; readers that accept relaxed
 consistency can set truncate_conflict=off

I proposed something similar earlier, but Robert said he didn't like
that. The way you've described it here makes more sense, whereas my
parameter name made it seem more arbitrary. So I think that looks like
the way to go.

I'm trying to use this as a way to optimize COPY, so we'd need to make
a Seq Scan return zero rows if the truncatexid is seen as running by
the snapshot, which is the current behaviour. That seems easy enough.

 Granted, making this part of the TRUNCATE statement does give you
 table-level granularity. But if a reader can already handle
 serialization conflicts, it doesn't really matter which table they
 came from.

 But making TRUNCATE behave in a MVCC-safe way, like proposed by Hannu,
 would be much better.

Maybe, but one of the reasons for having a separate TRUNCATE command
rather than DELETE is the immediate removal of space. Changing
TRUNCATE so it suddenly holds on to space for longer will force us to
create a new command that acts like the old TRUNCATE.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [BUG] False indication in pg_stat_replication.sync_state

2012-11-09 Thread Fujii Masao
On Fri, Nov 9, 2012 at 4:06 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Fujii Masao escribió:
 On Fri, Oct 19, 2012 at 10:29 PM, Fujii Masao masao.fu...@gmail.com wrote:

  However, I've forgotten to treat three other places in
  walsender.c and syncrep.c that also do XLogRecPtrIsInvalid() on an
  XLogRecPtr which comes from the WAL receiver.  This new patch includes
  the changes for them.
 
  Good catch.

 Will any committer pick this up?

 If not, please add it to the next commitfest so that we don't forget.

Yep, I added this to the next CF. This is just a bug fix, so please feel free to
pick it up even before the CF.

Regards,

-- 
Fujii Masao




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Robert Haas
On Thu, Nov 8, 2012 at 6:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 8 November 2012 20:36, Jeff Janes jeff.ja...@gmail.com wrote:
 It does not seem outrageous to me that there would be real-world
 conditions in which invalidations would be sent more than once a
 minute over prolonged periods, so this total starvation seems like a
 bug.

 Yes, it's a bug, but do you really believe the above? In what cases?

 It doesn't take a whole lot of DDL to provoke an sinval overrun, if
 the recipient process is just sitting idle and not servicing the
 messages.  I think Jeff's concern is entirely valid.

So, do we need a sinval overrun or just a sinval message to provoke
starvation?  The former would be bad but the latter would be really,
really bad.  IIRC the queue has 4K entries, and IIRC a single DDL
operation might provoke a couple of sinvals, but I'm thinking that
somebody would probably have to be creating 1024 temp tables a minute
to overrun the queue, which is very possible but not necessarily
common.  OTOH, creating 1 temp table a minute would hit a much broader
swath of users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Robert Haas
On Fri, Nov 9, 2012 at 8:22 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Personally I think the behavior should be dictated by the *reader*.
 The one doing the truncation may not know about the consistency
 requirements of particular readers. Especially when you do the
 truncate via pg_restore or some other generic tool. And indeed
 different readers may have different consistency requirements.

 So I'd prefer if it were a GUC variable; readers that accept relaxed
 consistency can set truncate_conflict=off

 I proposed something similar earlier, but Robert said he didn't like
 that. The way you've described it here makes more sense, whereas my
 parameter name made it seem more arbitrary. So I think that looks like
 the way to go.

Hmm, I don't remember saying I didn't like that.  Maybe I disliked
something about a particular proposed implementation?  Actually, I
don't really see the need for this to be customizable at all.  I have
to believe that there is vanishingly little application code that
would care about this change in semantics, so why not just change the
behavior and call it good?

I think the question that hasn't really been adequately answered is:
where and how are we going to track conflicts?  Your previous patch
involved storing an XID in pg_class, but I think we both found that a
bit grotty - it'd probably need special handling for wraparound, and I
think we came up with some related cases that couldn't be handled in
the same way without adding a bunch more XIDs to various places.  I
don't really like the idea of having XIDs floating around in the
system catalogs - it seems like a recipe for bugs, not to mention that
storing ephemeral data in a persistent table seems like a mismatch.

What I've been wondering since this last came up is whether we could
use some variant of the SIREAD locks Kevin introduced for SSI to
handle this case - essentially have the transaction doing the TRUNCATE
make an entry in the lock table that will force a serialization
failure for any backend which accesses the table with a snapshot that
can't see the truncating transaction's XID.  The lock table entry
would need some kind of deferred clean-up, so it doesn't go away until
the locker's XID precedes RecentGlobalXmin.  Of course, an extra lock
table probe for every table access will be unacceptable from a
concurrency perspective, but we could probably optimize most of them
away by only checking the lock table if the pg_class row's own xmin is
new enough that the other backend's MVCC snapshot can't see it.  A
recent update to pg_class doesn't imply the existence of a lock, but
the absence of any recent update to pg_class does imply that no lock
can exist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Simon Riggs
On 9 November 2012 13:42, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Nov 8, 2012 at 6:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 8 November 2012 20:36, Jeff Janes jeff.ja...@gmail.com wrote:
 It does not seem outrageous to me that there would be real-world
 conditions in which invalidations would be sent more than once a
 minute over prolonged periods, so this total starvation seems like a
 bug.

 Yes, it's a bug, but do you really believe the above? In what cases?

 It doesn't take a whole lot of DDL to provoke an sinval overrun, if
 the recipient process is just sitting idle and not servicing the
 messages.  I think Jeff's concern is entirely valid.

 So, do we need a sinval overrun or just a sinval message to provoke
 starvation?  The former would be bad but the latter would be really,
 really bad.  IIRC the queue has 4K entries, and IIRC a single DDL
 operation might provoke a couple of sinvals, but I'm thinking that
 somebody would probably have to be creating 1024 temp tables a minute
 to overrun the queue, which is very possible but not necessarily
 common.  OTOH, creating 1 temp table a minute would hit a much broader
 swath of users.

The point is moot because latches don't work that way anymore.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Robert Haas
On Fri, Nov 9, 2012 at 9:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
 So, do we need a sinval overrun or just a sinval message to provoke
 starvation?  The former would be bad but the latter would be really,
 really bad.  IIRC the queue has 4K entries, and IIRC a single DDL
 operation might provoke a couple of sinvals, but I'm thinking that
 somebody would probably have to be creating 1024 temp tables a minute
 to overrun the queue, which is very possible but not necessarily
 common.  OTOH, creating 1 temp table a minute would hit a much broader
 swath of users.

 The point is moot because latches don't work that way anymore.

One of us is confused, because IIUC Tom just fixed this this morning,
and I'm trying to figure out how many users will be affected by it,
and how seriously.  Like, do we need an immediate minor release?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 14:01, Robert Haas robertmh...@gmail.com wrote:

 I think the question that hasn't really been adequately answered is:
 where and how are we going to track conflicts?  Your previous patch
 involved storing an XID in pg_class, but I think we both found that a
 bit grotty - it'd probably need special handling for wraparound, and I
 think we came up with some related cases that couldn't be handled in
 the same way without adding a bunch more XIDs to various places.  I
 don't really like the idea of having XIDs floating around in the
 system catalogs - it seems like a recipe for bugs, not to mention that
 storing ephemeral data in a persistent table seems like a mismatch.

Yes, the xid only needs to be transient, not in pg_class.

 What I've been wondering since this last came up is whether we could
 use some variant of the SIREAD locks Kevin introduced for SSI to
 handle this case - essentially have the transaction doing the TRUNCATE
 make an entry in the lock table that will force a serialization
 failure for any backend which accesses the table with a snapshot that
 can't see the truncating transaction's XID.  The lock table entry
 would need some kind of deferred clean-up, so it doesn't go away until
 the locker's XID precedes RecentGlobalXmin.  Of course, an extra lock
 table probe for every table access will be unacceptable from a
 concurrency perspective, but we could probably optimize most of them
 away by only checking the lock table if the pg_class row's own xmin is
 new enough that the other backend's MVCC snapshot can't see it.  A
 recent update to pg_class doesn't imply the existence of a lock, but
 the absence of any recent update to pg_class does imply that no lock
 can exist.

I think the xid should still live in relcache, per the patch, but
should live in a transient place (and not pg_class).

We need a fast lookup structure that is expandable to accommodate
arbitrary numbers of truncates. Shared hash table, with some form of
overflow mechanism.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
Jeff,

On 11/09/2012 02:01 AM, Jeff Davis wrote:
 For the sake of simplicity (implementation as well as usability), it
 seems like there is agreement that checksums should be enabled or
 disabled for the entire instance, not per-table.

Agreed. I briefly thought about making it a per-database setting, but
what about shared system catalogs?  Let's keep it simple and have a
single per-cluster switch for now.

 I don't think a GUC entirely makes sense (in its current form, anyway).
 We basically care about 3 states:
   1. Off: checksums are not written, nor are they verified. Pages that
 are newly dirtied have the checksum information in the header cleared.
   2. Enabling: checksums are written for every dirty page, but only
 verified for pages where the checksum is present (as determined by
 information in the page header).
   3. On: checksums are written for every dirty page, and verified for
 every page that's read. If a page does not have a checksum, it's
 corrupt.

Sounds sane, yes.

 And the next question is what commands to add to change state. Ideas:
 
CHECKSUMS ENABLE; -- set state to Enabling
CHECKSUMS DISABLE; -- set state to Off

Yet another SQL command doesn't feel like the right thing for such a
switch. Quick googling revealed that CHECKSUM is a system function in MS
SQL and that MySQL has a CHECKSUM TABLE command. And you never know what
the standards committee will come up with next.

Apart from that, I'd like something more descriptive than just
checksums. Block checksums? Heap checksums? Data checksums?

Regards

Markus Wanner




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Simon Riggs
On 9 November 2012 14:16, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Nov 9, 2012 at 9:02 AM, Simon Riggs si...@2ndquadrant.com wrote:
 So, do we need a sinval overrun or just a sinval message to provoke
 starvation?  The former would be bad but the latter would be really,
 really bad.  IIRC the queue has 4K entries, and IIRC a single DDL
 operation might provoke a couple of sinvals, but I'm thinking that
 somebody would probably have to be creating 1024 temp tables a minute
 to overrun the queue, which is very possible but not necessarily
 common.  OTOH, creating 1 temp table a minute would hit a much broader
 swath of users.

 The point is moot because latches don't work that way anymore.

 One of us is confused, because IIUC Tom just fixed this this morning,
 and I'm trying to figure out how many users will be affected by it,
 and how seriously.  Like, do we need an immediate minor release?

You asked what provokes starvation, and the answer is nothing anymore,
since Tom's commit. No confusion here...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
On 11/09/2012 06:18 AM, Jesper Krogh wrote:
 I would definitely put our system in state 2 of your
 description if it were available.

Hm.. that's an interesting statement.

What's probably worst when switching from OFF to ON is the VACUUM run
that needs to touch every page (provided you haven't ever turned
checksumming on before). Maybe you want to save that step and still get
the additional safety for newly dirtied pages, right?

A use case worth supporting?

Regards

Markus Wanner




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Kevin Grittner
Simon Riggs wrote:
 Robert Haas robertmh...@gmail.com wrote:

 One of us is confused, because IIUC Tom just fixed this this
 morning, and I'm trying to figure out how many users will be
 affected by it, and how seriously. Like, do we need an immediate
 minor release?
 
 You asked what provokes starvation, and the answer is nothing
 anymore, since Tom's commit. No confusion here...

So for all those end users running production with a build compiled
from HEAD after Tom's commit this morning there is no issue. I'm
wondering about those using something else in production. What does
it take for their autovacuum to be stalled?

-Kevin




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So, do we need a sinval overrun or just a sinval message to provoke
 starvation?  The former would be bad but the latter would be really,
 really bad.

The former.  Actually, a signal will be sent as soon as a backend is
determined to be unreasonably far behind, which I think is when the
message queue is half full.  In the AV launcher case, the queue probably
never actually overflows; but the launcher has to wake up to process
messages every so often, and that wakeup cycle is breaking the timeout
management in WaitLatch.

 IIRC the queue has 4K entries, and IIRC a single DDL
 operation might provoke a couple of sinvals, but I'm thinking that
 somebody would probably have to be creating 1024 temp tables a minute
 to overrun the queue, which is very possible but not necessarily
 common.

Well, one DDL typically generates multiple messages --- one for each
catalog row added/modified/removed, roughly speaking.  When I run the
constant create/drop example Jeff posted, I see the AV launcher getting
a catchup signal every few seconds.  I didn't try to determine exactly
how many create/drop cycles that was, but I'm pretty sure it's a lot
less than 1000.
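
The sort of loop in question looks roughly like this (a sketch, not
necessarily Jeff's exact script), with each statement issued separately from
an ordinary client session:

  CREATE TEMP TABLE sinval_test (id serial, payload text);  -- the serial column drags in a sequence's worth of catalog rows
  DROP TABLE sinval_test;
  -- repeated from the client; every create/drop cycle queues a batch of invalidation messages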

regards, tom lane




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Marti Raudsepp
On Fri, Nov 9, 2012 at 4:24 PM, Simon Riggs si...@2ndquadrant.com wrote:
 We need a fast lookup structure that is expandable to accommodate
 arbitrary numbers of truncates. Shared hash table, with some form of
 overflow mechanism.

Surely you only need to remember the last completed truncate for each
relation? The latest one also invalidates any snapshots before earlier
truncates.

Regards,
Marti




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Kevin Grittner
Robert Haas wrote:

 What I've been wondering since this last came up is whether we
 could use some variant of the SIREAD locks Kevin introduced for SSI
 to handle this case - essentially have the transaction doing the
 TRUNCATE make an entry in the lock table that will force a
 serialization failure for any backend which accesses the table with
 a snapshot that can't see the truncating transaction's XID.

It seems to me that the goal would be to make this semantically
identical to the behavior users would see if an unqualified DELETE
were run against the table rather than a TRUNCATE. To wit:

(1) Any attempt to read from the truncated table would not block. The
pg_class version included in the transaction's snapshot would
determine which heap and indexes were accessed. If the reading
transaction were SERIALIZABLE, it would generate a read-write
conflict out to the truncating transaction.

(2) Any attempt to write to the truncated table would block until the
end of the transaction which is doing the truncation. If the
truncating transaction rolls back, it proceeds normally against the
old data. Otherwise: If the transaction is READ COMMITTED, follow the
pg_class update links. At more strict isolation levels, generate a
write conflict error.

I'm not sure where any new use of the predicate locking system would
come into play in that, other than properly handling read-write
conflicts when both transactions were SERIALIZABLE.

This seems like a subset of the issues which one might want to
address by making DDL statements behave in a more strictly MVCC
fashion. Does it make sense to pick those off one at a time, or
should something like this be done only in the context of an overall
plan to deal with all of it?

-Kevin




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Alvaro Herrera
Tom Lane escribió:
 Robert Haas robertmh...@gmail.com writes:

  IIRC the queue has 4K entries, and IIRC a single DDL
  operation might provoke a couple of sinvals, but I'm thinking that
  somebody would probably have to be creating 1024 temp tables a minute
  to overrun the queue, which is very possible but not necessarily
  common.
 
 Well, one DDL typically generates multiple messages --- one for each
 catalog row added/modified/removed, roughly speaking.  When I run the
 constant create/drop example Jeff posted, I see the AV launcher getting
 a catchup signal every few seconds.  I didn't try to determine exactly
 how many create/drop cycles that was, but I'm pretty sure it's a lot
 less than 1000.

Just creating the sequence for the serial column means 16 pg_attribute
tuples.  There are also two pg_class entries, one more pg_attribute, two
pg_type entries, a bunch of pg_depend entries ... I doubt it's less than
30 catalog tuples, all things considered.  Double that for the drop.  So
for a 4k-entry queue that needs to get 50% full, that's only ~35 temp
table creations like that.

I hadn't realized sequences used so many pg_attribute entries.
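
A rough way to see that for yourself (a sketch; the temp table name is made up):

  CREATE TEMP TABLE tmp_check (id serial, payload text);
  SELECT c.relname, count(*)       -- pg_attribute rows behind the table and its serial sequence
    FROM pg_attribute a
    JOIN pg_class c ON c.oid = a.attrelid
   WHERE a.attrelid IN ('tmp_check'::regclass, 'tmp_check_id_seq'::regclass)
   GROUP BY c.relname;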

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] AutoVacuum starvation from sinval messages

2012-11-09 Thread Robert Haas
On Fri, Nov 9, 2012 at 10:08 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Tom Lane escribió:
 Robert Haas robertmh...@gmail.com writes:

  IIRC the queue has 4K entries, and IIRC a single DDL
  operation might provoke a couple of sinvals, but I'm thinking that
  somebody would probably have to be creating 1024 temp tables a minute
  to overrun the queue, which is very possible but not necessarily
  common.

 Well, one DDL typically generates multiple messages --- one for each
 catalog row added/modified/removed, roughly speaking.  When I run the
 constant create/drop example Jeff posted, I see the AV launcher getting
 a catchup signal every few seconds.  I didn't try to determine exactly
 how many create/drop cycles that was, but I'm pretty sure it's a lot
 less than 1000.

 Just creating the sequence for the serial column means 16 pg_attribute
 tuples.  There are also two pg_class entries, one more pg_attribute, two
 pg_type entries, a bunch of pg_depend entries ... I doubt it's less than
 30 catalog tuples, all things considered.  Double that for the drop.  So
 for a 4k-entry queue that needs to get 50% full, that's only ~35 temp
 table creations like that.

 I hadn't realized sequences used so many pg_attribute entries.

Hmm.  So, are we going to force a minor release for this, or do we
think it's not serious enough to warrant that?

I'm not expressing an opinion either way, just asking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 14:55, Marti Raudsepp ma...@juffo.org wrote:
 On Fri, Nov 9, 2012 at 4:24 PM, Simon Riggs si...@2ndquadrant.com wrote:
 We need a fast lookup structure that is expandable to accommodate
 arbitrary numbers of truncates. Shared hash table, with some form of
 overflow mechanism.

 Surely you only need to remember the last completed truncate for each
 relation?

Yes

 The latest one also invalidates any snapshots before earlier
 truncates.

1 per table, arbitrary number of tables

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP checksums patch

2012-11-09 Thread Robert Haas
On Thu, Nov 8, 2012 at 9:17 PM, Christopher Browne cbbro...@gmail.com wrote:
 I see one thing to be concerned about, there...

 I imagine it would not be a totally happy thing if the only way to switch it
 on/off was to use Slony or Londiste to replicate into a database with the
 opposite setting.  (e.g. - This implies that built-in replication may only
 replicate into a database with the identical checksum configuration.)

Sure, I agree.  I don't think it should stay that way forever, but
removing the burden of dealing with this issue from the initial commit
would likely allow that commit to happen this release cycle, perhaps
even in the next CommitFest.  And then we'd have half a loaf, which is
better than none, and we could deal with the issues of switching it on
and off as a further enhancement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 14:22, Kevin Grittner kgri...@mail.com wrote:
 Robert Haas wrote:

 What I've been wondering since this last came up is whether we
 could use some variant of the SIREAD locks Kevin introduced for SSI
 to handle this case - essentially have the transaction doing the
 TRUNCATE make an entry in the lock table that will force a
 serialization failure for any backend which accesses the table with
 a snapshot that can't see the truncating transaction's XID.

 It seems to me that the goal would be to make this semantically
 identical to the behavior users would see if an unqualified DELETE
 were run against the table rather than a TRUNCATE.

Unqualified DELETE already runs that way. TRUNCATE is a different
command for a reason. Making TRUNCATE like something we already have
seems not very useful to me, not least because it breaks existing
applications.

 This seems like a subset of the issues which one might want to
 address by making DDL statements behave in a more strictly MVCC
 fashion. Does it make sense to pick those off one at a time, or
 should something like this be done only in the context of an overall
 plan to deal with all of it?

TRUNCATE is not DDL, plus I have no interest in this other than
speeding up COPY.

Scope creep just kills features.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Merlin Moncure
On Fri, Nov 9, 2012 at 8:22 AM, Kevin Grittner kgri...@mail.com wrote:
 Robert Haas wrote:

 What I've been wondering since this last came up is whether we
 could use some variant of the SIREAD locks Kevin introduced for SSI
 to handle this case - essentially have the transaction doing the
 TRUNCATE make an entry in the lock table that will force a
 serialization failure for any backend which accesses the table with
 a snapshot that can't see the truncating transaction's XID.

 It seems to me that the goal would be to make this semantically
 identical to the behavior users would see if an unqualified DELETE
 were run against the table rather than a TRUNCATE. To wit:

but, triggers would not fire, right?

merlin




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Kevin Grittner
Merlin Moncure wrote:
 Kevin Grittner kgri...@mail.com wrote:
 Robert Haas wrote:

 It seems to me that the goal would be to make this semantically
 identical to the behavior users would see if an unqualified DELETE
 were run against the table rather than a TRUNCATE.
 
 but, triggers would not fire, right?

Right. Perhaps identical was too strong a word. I was referring to
the aspect under consideration here -- making it serializable in
line with other MVCC operations.

If we're not talking about making conflicts with other transactions
behave just the same as an unqualified DELETE from a user
perspective, I'm not sure what the goal is, exactly. Obviously we
would be keeping the guts of the implementation the same (swapping in
a new, empty heap).

-Kevin




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 15:34, Kevin Grittner kgri...@mail.com wrote:

 If we're not talking about making conflicts with other transactions
 behave just the same as an unqualified DELETE from a user
 perspective, I'm not sure what the goal is, exactly.

Reasonable question.

My goal is to allow COPY to load frozen tuples without causing MVCC violations.

Altering TRUNCATE so it behaves perfectly from an MVCC/Serializable
perspective is a much bigger, and completely different goal, as well
as something I don't see as desirable anyway for at least 2 good
reasons, as explained. IMHO if people want MVCC/Serializable
semantics, use DELETE, possibly spending time to make unqualified
DELETE do some fancy TRUNCATE-like tricks with relfilenodes.

Forcing a tightly scoped proposal into a much wider one will just kill
this and leave it blocked.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 15:46, Simon Riggs si...@2ndquadrant.com wrote:

 Altering TRUNCATE so it behaves perfectly from an MVCC/Serializable
 perspective is a much bigger, and completely different goal, as well
 as something I don't see as desirable anyway for at least 2 good
 reasons, as explained. IMHO if people want MVCC/Serializable
 semantics, use DELETE, possibly spending time to make unqualified
 DELETE do some fancy TRUNCATE-like tricks with relfilenodes.

We spent a lot of time in 9.2 making TRUNCATE/reload of a table just
work, rather than implementing a REPLACE command.

ISTM strange to throw away all that effort, changing behaviour of
TRUNCATE and thus forcing the need for a REPLACE command after all.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] DEALLOCATE IF EXISTS

2012-11-09 Thread Vik Reykja
On Tue, Oct 9, 2012 at 4:44 PM, Vik Reykja vikrey...@gmail.com wrote:

 On Tue, Oct 9, 2012 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Sébastien Lardière slardi...@hi-media.com writes:
  Indeed, brackets were not correct, it's better now (I think), and it corrects
 some comments.

 Still wrong ... at the very least you missed copyfuncs/equalfuncs.
 In general, when adding a field to a struct, it's good practice to
 grep for all uses of that struct.


 I don't see Sébastien's message, but I made the same mistake in my patch.
 Another one is attached with copyfuncs and equalfuncs.  I did a grep for
 DeallocateStmt and I don't believe I have missed anything else.

 Also, I'm changing the subject so as not to hijack this thread any further.



I am taking no comments to mean no objections and have added this to the
next commitfest.
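
For reference, the behaviour the patch is after (per the subject line; a
sketch, not lifted from the patch itself):

  PREPARE p1 AS SELECT 1;
  DEALLOCATE IF EXISTS p1;   -- drops the prepared statement
  DEALLOCATE IF EXISTS p1;   -- proposed: no error even though p1 is already gone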


Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Kevin Grittner
Simon Riggs wrote:

 This seems like a subset of the issues which one might want to
 address by making DDL statements behave in a more strictly MVCC
 fashion. Does it make sense to pick those off one at a time, or
 should something like this be done only in the context of an
 overall plan to deal with all of it?
 
 TRUNCATE is not DDL

You're right, I should have said utility commands.

 I have no interest in this other than speeding up COPY.

I would love to have that!

 Scope creep just kills features.

Well, I wasn't saying it should all be *done* at the same time, but
this is not the only utility command which could benefit from such an
effort, and if each one is done with no consideration of what it
takes for them all to be done, we could wind up with something that
doesn't hang together very coherently. Perhaps this one could
serve as a pilot, to identify issues and help develop such a plan.

-Kevin




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Kevin Grittner
Simon Riggs wrote:

 My goal is to allow COPY to load frozen tuples without causing MVCC
 violations.

OK. That wasn't initially clear to me.

 Forcing a tightly scoped proposal into a much wider one will just
 kill this and leave it blocked.

The goal is important enough and narrow enough to merit the approach
you're talking about, IMV, at least. Sorry I initially misunderstood
what you were going for.

-Kevin




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 9 November 2012 15:34, Kevin Grittner kgri...@mail.com wrote:
 If we're not talking about making conflicts with other transactions
 behave just the same as an unqualified DELETE from a user
 perspective, I'm not sure what the goal is, exactly.

 Reasonable question.

 My goal is to allow COPY to load frozen tuples without causing MVCC 
 violations.

If that's the goal, I question why you're insisting on touching
TRUNCATE's behavior.  We already have the principle that TRUNCATE is
like DELETE except not concurrent-safe.  Why not just invent a
non-concurrent-safe option to COPY that loads prefrozen tuples into a
new heap, and call it good?  There will be visibility oddness from that
definition, sure, but AFAICS there will be visibility oddness from what
you're talking about too.  You'll just have expended a very great deal
of effort to make the weirdness a bit different.  Even if the TRUNCATE
part of it were perfectly clean, the load prefrozen tuples part won't
be --- so I'm not seeing the value of changing TRUNCATE.
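
At the SQL level that might look something like this (the FREEZE option and
the file path are purely illustrative, not a worked-out syntax proposal):

  BEGIN;
  TRUNCATE foo;                                -- or CREATE TABLE foo ... in the same transaction
  COPY foo FROM '/tmp/foo.dat' WITH (FREEZE);  -- hypothetical option: write the tuples prefrozen
  COMMIT;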

regards, tom lane




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Robert Haas
On Fri, Nov 9, 2012 at 9:22 AM, Kevin Grittner kgri...@mail.com wrote:
 (1) Any attempt to read from the truncated table would not block. The
 pg_class version included in the transaction's snapshot would
 determine which heap and indexes were accessed.

Well, the thing is, you can't actually do this.  When the transaction
commits, we truncate the main forks of the old heap and index and
remove all of the supplemental forks.  The main forks are finally
removed for good at the next checkpoint cycle.  To make this work,
we'd have to keep around the old heap and index until there were no
longer any MVCC snapshots that could see them.  That might be useful
as an option, but it would need a bunch of additional mechanism, and
it doesn't seem desirable as a default behavior because it could defer
disk space reclamation indefinitely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Robert Haas
On Fri, Nov 9, 2012 at 11:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 My goal is to allow COPY to load frozen tuples without causing MVCC 
 violations.

 If that's the goal, I question why you're insisting on touching
 TRUNCATE's behavior.  We already have the principle that TRUNCATE is
 like DELETE except not concurrent-safe.  Why not just invent a
 non-concurrent-safe option to COPY that loads prefrozen tuples into a
 new heap, and call it good?  There will be visibility oddness from that
 definition, sure, but AFAICS there will be visibility oddness from what
 you're talking about too.  You'll just have expended a very great deal
 of effort to make the weirdness a bit different.  Even if the TRUNCATE
 part of it were perfectly clean, the load prefrozen tuples part won't
 be --- so I'm not seeing the value of changing TRUNCATE.

I don't object to the idea of giving COPY a way to load prefrozen
tuples, but I think you might be missing the point here otherwise.
Right now, if you CREATE or TRUNCATE a table, copy a bunch of data
into it, and then commit, another transaction that took a snapshot
before your commit can subsequently look at that table and it will NOT
see your newly-loaded data.  What it will see instead is an empty
table.  This is, of course, wrong: it ought to fail with a
serialization error.  It is very possible that the table has never
been empty at the conclusion of a completed transaction: it might have
contained data before the TRUNCATE, and it might again contain data by
the time the truncating transaction commits.  Yet, we see it as empty,
which is not MVCC-compliant.
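
Spelled out as two sessions (a sketch; the table name and file path are made up):

  -- Session A: take a snapshot before B commits, without touching t yet
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  SELECT 1;                        -- first query fixes A's snapshot

  -- Session B:
  BEGIN;
  TRUNCATE t;
  COPY t FROM '/tmp/new.dat';      -- reload the table
  COMMIT;

  -- Session A again, still using its old snapshot:
  SELECT count(*) FROM t;          -- returns 0: neither the old rows nor the new ones
  COMMIT;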

If we were to make COPY pre-freeze the data when the table was created
or truncated in the same transaction, it would alter the behavior in
this situation, and from an application perspective, only this
situation.  Now, instead of seeing the table as empty, you'd see the
new contents.  This is also not MVCC-compliant, and I guess the
concern when we have talked about this topic before is that changing
from one wrong behavior to another, not-backward-compatible wrong behavior
might not be the friendliest thing to do.  We could decide we don't
care and just break it.  Or we could try to make it through a
serialization error, as Simon is proposing here, which seems like the
tidiest solution.  Or we could keep the old heap around until there
are no more snapshots that can need it, which is a bit scary since
we'd be eating double disk-space in the meantime, but it would
certainly be useful to some users, I think.

Just having an option to preload frozen tuples dodges all of these
issues by throwing our hands up in the air, but it does have the
advantage of being more general.  Even if we do that I'm not sure it
would be a bad thing to try to solve this issue in a somewhat more
principled way, but it would surely reduce the urgency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Josh Berkus
Jeff,

 I don't think a GUC entirely makes sense (in its current form, anyway).
 We basically care about 3 states:

Huh?  Why would a GUC not make sense?  How else would you make sure that
checksums were on when you started the system?

   1. Off: checksums are not written, nor are they verified. Pages that
 are newly dirtied have the checksum information in the header cleared.
   2. Enabling: checksums are written for every dirty page, but only
 verified for pages where the checksum is present (as determined by
 information in the page header).
   3. On: checksums are written for every dirty page, and verified for
 every page that's read. If a page does not have a checksum, it's
 corrupt.

Well, large databases would tend to be stuck permanently in Enabling,
because the user would never vacuum old cold partitions in order to
checksum them.  So we need to be prepared for this to be the end state
for a lot of databases.

In fact, we'd need three settings for the checksum GUC:

OFF -- don't checksum anything, equal to state (1) above

WRITES -- checksum pages which are being written anyway, but ignore
tables which aren't touched.  Permanent Enabling state.

ALL -- checksum everything you can.  particularly, autovacuum would
checksum any table which was not already checksummed at the next vacuum
of that table.  Goal is to get to state 3 above.

 Does it make sense to store this information in pg_control? That doesn't
 require adding any new file, and it has the benefit that it's already
 checksummed. It's available during recovery and can be made available
 pretty easily in the places where we write data.
 
 And the next question is what commands to add to change state. Ideas:
 
CHECKSUMS ENABLE; -- set state to Enabling
CHECKSUMS DISABLE; -- set state to Off

Don't like this, please make it a GUC.

 And then to get to the On state, you have to run a system-wide VACUUM
 while in the Enabling state. Or, if the above syntax causes problems,
 we can make all of these into VACUUM options.

As there's no such thing as system-wide vacuum, we're going to have to
track whether a table is fully checksummed in the system catalogs.
We'll also need:

VACUUM ( CHECKSUM ON )

... which would vacuum an entire table, skipping no pages and writing
checksums for every page, unless the table were marked fully checksummed
already, in which case it would do a regular vacuum.
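
So per table, the workflow would be roughly (using the proposed syntax above;
the table name is made up):

  VACUUM (CHECKSUM ON) measurements_2011q3;  -- vacuums the whole table, skipping no pages, writing checksums for every page
  -- the table would then be flagged as fully checksummed in the catalogs, per the proposal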

Once a table was flagged as all checksummed, then the system could
start producing errors (or warnings?) whenever a page with a missing
checksum was found.

Hmmm, better to have a 2nd GUC:

checksum_fail_action = WARNING | ERROR

... since some people want the write or read to fail, and others just
want to see it in the logs.

So, thinking about it, state (3) is never the state of an entire
installation; it's always the state of individual tables.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-09 Thread Jeff Janes
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 Are you sure the server you are dumping out of is HEAD?

 I experimented a bit with dumping/restoring 16000 tables matching
 Bruce's test case (ie, one serial column apiece).  The pg_dump profile
 seems fairly flat, without any easy optimization targets.  But
 restoring the dump script shows a rather interesting backend profile:

 samples  %        image name   symbol name
 30861    39.6289  postgres     AtEOXact_RelationCache
 9911     12.7268  postgres     hash_seq_search
...

 There are at least three ways we could whack that mole:

 * Run the psql script in --single-transaction mode, as I was mumbling
 about the other day.  If we were doing AtEOXact_RelationCache only once,
 rather than once per CREATE TABLE statement, it wouldn't be a problem.
 Easy but has only a narrow scope of applicability.

That is effective when loading into 9.3 (assuming you make
max_locks_per_transaction large enough).  But when loading into < 9.3,
using --single-transaction will evoke the quadratic behavior in the
resource owner/lock table and make things worse rather than better.


But there is still the question of how people can start using 9.3 if
they can't use pg_upgrade, or use the pg_dump half of the dump/restore
in order to get there.

It seems to me that pg_upgrade takes some pains to ensure that no one
else attaches to the database during its operation.  In that case, is
it necessary to run the entire dump in a single transaction in order
to get a consistent picture?  The attached crude patch allows pg_dump
to not use a single transaction (and thus not accumulate a huge number
of locks) by using the --pg_upgrade flag.

This seems to remove the quadratic behavior of running pg_dump against
pre-9.3 servers.  It is linear up to 30,000 tables with a single
serial column, at about 1.5 msec per table.

I have no evidence other than a gut feeling that this is a safe thing to do.

I've also tested Tatsuo-san's group-LOCK TABLE patch against this
case, and it is minimal help.  The problem is that there is no syntax
for locking sequences, so they cannot be explicitly locked as a group
but rather are implicitly locked one by one and so still suffer from
the quadratic behavior.

Cheers,

Jeff


pg_dump_for_upgrade.patch
Description: Binary data



Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Jeff Davis
On Fri, 2012-11-09 at 15:42 +0100, Markus Wanner wrote:
 On 11/09/2012 06:18 AM, Jesper Krogh wrote:
  I would definately stuff our system in state = 2 in your
  description if it was available.
 
 Hm.. that's an interesting statement.
 
 What's probably worst when switching from OFF to ON is the VACUUM run
 that needs to touch every page (provided you haven't ever turned
 checksumming on before). Maybe you want to save that step and still get
 the additional safety for newly dirtied pages, right?
 
 A use case worth supporting?

One problem is telling which pages are protected and which aren't. We
can have a couple bits in the header indicating that a checksum is
present, but it's a little disappointing to have only a few bits
protecting a 16-bit checksum.

Also, I think that people will want to have a way to protect their old
data somehow.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Jeff Davis
On Thu, 2012-11-08 at 23:33 -0300, Alvaro Herrera wrote:
 There's no such thing as a system-wide VACUUM.  The most you can get is
 a database-wide VACUUM, which means you'd have to store the state
 per-database somewhere (presumably the pg_database catalog), and perhaps
 pg_control could have it as a system-wide value that's computed as the
 minimum of all database states (so it stays enabling until all
 databases have upgraded to on).

That's a good point. Maybe this should be done as an offline operation
using a command-line utility?

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
On 11/09/2012 07:53 PM, Jeff Davis wrote:
 One problem is telling which pages are protected and which aren't. We
 can have a couple bits in the header indicating that a checksum is
 present, but it's a little disappointing to have only a few bits
 protecting a 16-bit checksum.

Given your description of option 2 I was under the impression that each
page already has a bit indicating whether or not the page is protected
by a checksum. Why do you need more bits than that?

 Also, I think that people will want to have a way to protect their old
 data somehow.

Well, given that this specific set of users is not willing to go through a
rewrite of each and every page of their databases, it's hard to see how we
can protect their old data better.

However, we certainly need to provide the option to go through the
rewrite for other users, who are well willing to bite that bullet.

From a user's perspective, the trade-off seems to be: if you want your
old data to be covered by checksums, you need to go through an expensive
VACUUM run that touches every page in your database.

If you don't want to or cannot do that, you can still turn on
checksumming for newly written pages. You won't get full protection and
it's hard to tell which data is protected and which is not, but it's still
better than no checksumming at all. Especially for huge databases, that
might be a reasonable compromise.

One could even argue that this just leads to a prolonged migration and
that, with time, the remaining VACUUM step becomes less and less frightening.

Do you see any real foot-guns or other show-stoppers for permanently
allowing that in-between-state?

Or do we have other viable options that prolong the migration and thus
spread the load better over time?

Regards

Markus Wanner




Re: [HACKERS] My first patch! (to \df output)

2012-11-09 Thread Jon Erdman
On Oct 27, 2012, at 10:45 AM, Pavel Stehule pavel.steh...@gmail.com wrote:

 Hello
 
 2012/10/27 Jon Erdman postgre...@thewickedtribe.net:
 
 Hello Hackers!
 
 So, currently the only way to see if a function is security definer or not 
 is to directly query pg_proc. This is both irritating, and I think perhaps 
 dangerous since security definer functions can be  so powerful. I thought 
 that rectifying that would make an excellent first patch, and I was bored 
 today here in Prague since pgconf.eu is now over...so here it is. :)
 
 This patch adds a column to the output of \df titled Security with values 
 of definer or invoker based on the boolean secdef column from pg_proc. 
 I've also included a small doc patch to match. This patch is against master 
 from git. Comments welcome!
 
 I just realized I didn't address regression tests, so I guess this is not 
 actually complete yet. I should have time for that next week after I get 
 back to the states.
 
 I would also like to start discussion about perhaps adding a couple more 
 things to \df+, specifically function execution permissions (which are also 
 exposed nowhere outside the catalog to my knowledge), and maybe search_path 
 since that's related to secdef. Thoughts?
 
 I prefer show this in \dt+ for column Security - and for other
 functionality maybe new statement.

I'm assuming you meant \df+, and I've changed it accordingly. With this 
change there is now nothing to change in the regression tests, so please 
consider this my formal and complete submission. 

describe.patch
Description: Binary data


Is there anything else I need to do to get this considered?

Oh, in case anyone is interested, here's what the query now looks like and the 
new output:

jerdman=# \df+ public.akeys
* QUERY **
SELECT n.nspname as "Schema",
  p.proname as "Name",
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",
 CASE
  WHEN p.proisagg THEN 'agg'
  WHEN p.proiswindow THEN 'window'
  WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'
  ELSE 'normal'
 END as "Type",
 CASE
  WHEN prosecdef THEN 'definer'
  ELSE 'invoker'
 END AS "Security",
 CASE
  WHEN p.provolatile = 'i' THEN 'immutable'
  WHEN p.provolatile = 's' THEN 'stable'
  WHEN p.provolatile = 'v' THEN 'volatile'
 END as "Volatility",
  pg_catalog.pg_get_userbyid(p.proowner) as "Owner",
  l.lanname as "Language",
  p.prosrc as "Source code",
  pg_catalog.obj_description(p.oid, 'pg_proc') as "Description"
FROM pg_catalog.pg_proc p
 LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
 LEFT JOIN pg_catalog.pg_language l ON l.oid = p.prolang
WHERE p.proname ~ '^(akeys)$'
  AND n.nspname ~ '^(public)$'
ORDER BY 1, 2, 4;
**

                                                       List of functions
 Schema | Name  | Result data type | Argument data types |  Type  | Security | Volatility |  Owner  | Language | Source code  | Description
--------+-------+------------------+---------------------+--------+----------+------------+---------+----------+--------------+-------------
 public | akeys | text[]           | hstore              | normal | invoker  | immutable  | jerdman | c        | hstore_akeys |
(1 row)

--
Jon T Erdman
Postgresql Zealot





Re: [HACKERS] My first patch! (to \df output)

2012-11-09 Thread Jon Erdman

Oops! Here it is in the proper diff format. I didn't have my env set up 
correctly :(



describe.patch
Description: Binary data


--
Jon T Erdman
Postgresql Zealot





Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Just having an option to preload frozen tuples dodges all of these
 issues by throwing our hands up in the air, but it does have the
 advantage of being more general.  Even if we do that I'm not sure it
 would be a bad thing to try to solve this issue in a somewhat more
 principled way, but it would surely reduce the urgency.

Yeah.  ISTM the whole point of TRUNCATE is I don't care about
serializability for this operation, give me efficiency instead.
So I see nothing wrong with a (non-default) option for COPY that
similarly trades away some semantic guarantees for efficiency's sake.
There are an awful lot of bulk-load scenarios where people will gladly
take that trade, and are not very interested in halfway points either.

regards, tom lane




Re: [HACKERS] TRUNCATE SERIALIZABLE and frozen COPY

2012-11-09 Thread Simon Riggs
On 9 November 2012 16:27, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 9 November 2012 15:34, Kevin Grittner kgri...@mail.com wrote:
 If we're not talking about making conflicts with other transactions
 behave just the same as an unqualified DELETE from a user
 perspective, I'm not sure what the goal is, exactly.

 Reasonable question.

 My goal is to allow COPY to load frozen tuples without causing MVCC 
 violations.

 If that's the goal, I question why you're insisting on touching
 TRUNCATE's behavior.  We already have the principle that TRUNCATE is
 like DELETE except not concurrent-safe.  Why not just invent a
 non-concurrent-safe option to COPY that loads prefrozen tuples into a
 new heap, and call it good?  There will be visibility oddness from that
 definition, sure, but AFAICS there will be visibility oddness from what
 you're talking about too.  You'll just have expended a very great deal
 of effort to make the weirdness a bit different.  Even if the TRUNCATE
 part of it were perfectly clean, the load prefrozen tuples part won't
 be --- so I'm not seeing the value of changing TRUNCATE.

This is a wonderful thought and I wish I'd thought of it. My digression
via truncate now annoys me.

Yes, there are objections and I've read what Robert has said. An
explicit new option is perfectly entitled to introduce new behaviour
and won't cause a problem with existing applications. I personally
don't care about the serializable stuff here, and neither do most others. They
just want a way to load new data quickly. It is important to me that
we do things in well-principled ways, but the task at hand is data
loading, not perfect visibility guarantees.

So what we're talking about here is a new mode for COPY that, when
requested, will pre-freeze tuples when loading into a newly
created/truncated table. If the table isn't newly created/truncated
then we'll just ignore it and continue. I see no need to throw an
error, since that will just cause annoying usability issues.

COPY FREEZE here we come, with extensive docs to explain the trade-off
the user is accepting.
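
A sketch of the intended usage; the exact option spelling here is an
assumption, only the behaviour described above is what's being proposed:

    BEGIN;
    -- The table must be newly created or truncated in this same transaction;
    -- otherwise, per the proposal, the FREEZE request is simply ignored and
    -- a normal COPY is performed.
    TRUNCATE big_table;
    COPY big_table FROM '/path/to/data.csv' WITH (FORMAT csv, FREEZE);
    COMMIT;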

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-09 Thread Bruce Momjian
On Thu, Nov  8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
 On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian br...@momjian.us wrote:
  On Thu, Nov  8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
  On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian br...@momjian.us wrote:
   As a followup to Magnus's report that pg_upgrade was slow for many
   tables, I did some more testing with many tables, e.g.:
  
  ...
  
   Any ideas?  I am attaching my test script.
 
  Have you reviewed the thread at:
  http://archives.postgresql.org/pgsql-performance/2012-09/msg3.php
  ?
 
  There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
 
  I am actually now dumping git head/9.3, so I assume all the problems we
  know about should be fixed.
 
 Are sure the server you are dumping out of is head?

Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
head, and got these results:

           pg_dump           restore
           9.2      git      9.2       git

     1      0.13     0.11      0.07      0.07
  1000      4.37     3.98      4.32      5.28
  2000     12.98    12.19     13.64     14.25
  4000     47.85    50.14     61.31     70.97
  8000    210.39   183.00    302.67    294.20
 16000    901.53   769.83   1399.25   1359.09

As you can see, there is very little difference between 9.2 and git
head, except maybe at the 16k level for pg_dump.  

Is there some slowdown with a mismatched version dump/reload?  I am
attaching my test script.

 Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
 dump 16,000 tables (schema only) like your example, and it is
 definitely quadratic.

Are you using a SERIAL column for the tables?  I am, and Peter
Eisentraut reported that was a big slowdown.

 But using head's pg_dump to dump tables out of head's server, it only
 took 24.95 seconds, and the quadratic term is not yet important,
 things still look linear.

Again, using SERIAL?

 But even the 179.11 seconds is several times faster than your report
 of 757.8, so I'm not sure what is going on there.  I don't think my
 laptop is particularly fast:
 
 Intel(R) Pentium(R) CPU B960 @ 2.20GHz

I am using server-grade hardware, Xeon E5620 2.4GHz:

http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012

 Is the next value, increment, etc. for a sequence stored in a catalog,
 or are they stored in the 8kb file associated with each sequence?  If

Each sequence is stored in its own 1-row 8k table:

test=> CREATE SEQUENCE seq;
CREATE SEQUENCE

test=> SELECT * FROM seq;
-[ RECORD 1 ]-+
sequence_name | seq
last_value| 1
start_value   | 1
increment_by  | 1
max_value | 9223372036854775807
min_value | 1
cache_value   | 1
log_cnt   | 0
is_cycled | f
is_called | f

 they are stored in the file, then it is a shame that pg_dump goes to the
 effort of extracting that info if pg_upgrade is just going to
 overwrite it anyway.

Actually, pg_upgrade needs pg_dump to restore all those sequence values.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
:

. traprm

export QUIET=$((QUIET + 1))

> /rtmp/out

for CYCLES in 1 1000 2000 4000 8000 16000
do
    echo $CYCLES >> /rtmp/out

    for DIR in /pgtoprel /pgtop
    do  echo $DIR >> /rtmp/out
        cd $DIR
        pginstall

        # need for +16k
        pipe sed 's/#max_locks_per_transaction = 64/max_locks_per_transaction = 500/' /u/pg/data/postgresql.conf
        pipe sed 's/shared_buffers = 128MB/shared_buffers = 1GB/' /u/pg/data/postgresql.conf
        pipe sed 's/#work_mem = 1MB/work_mem = 500MB/' /u/pg/data/postgresql.conf
        pipe sed 's/#maintenance_work_mem = 16MB/maintenance_work_mem = 500MB/' /u/pg/data/postgresql.conf
        pgrestart
        sleep 2

        for JOT in $(jot $CYCLES); do echo "CREATE TABLE test$JOT (x SERIAL);"; done | PGOPTIONS="-c synchronous_commit=off" sql test
        echo pg_dump creation >> /rtmp/out
        /usr/bin/time --output=/rtmp/out --append --format '%e' aspg pg_dump --schema-only test > $TMP/1
        newdb test
        echo pg_dump restore >> /rtmp/out
        PGOPTIONS="-c synchronous_commit=off" /usr/bin/time --output=/rtmp/out --append --format '%e' sql test < $TMP/1
        pgstop
        sleep 2
    done
done

bell




Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Jeff Davis
On Fri, 2012-11-09 at 20:48 +0100, Markus Wanner wrote:
 Given your description of option 2 I was under the impression that each
 page already has a bit indicating whether or not the page is protected
 by a checksum. Why do you need more bits than that?

The bit indicating that a checksum is present may be lost due to
corruption.

 However, we certainly need to provide the option to go through the
 rewrite for other users, who are well willing to bite that bullet.

That's the use case that I've been focusing on, but perhaps you are
right that it's not the only important one.

 Do you see any real foot-guns or other show-stoppers for permanently
 allowing that in-between-state?

The biggest problem that I see is a few bits indicating the presence of
a checksum may be vulnerable to more kinds of corruption.

Regards,
Jeff Davis





Re: [HACKERS] WIP checksums patch

2012-11-09 Thread Jeff Davis
On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote:
 Yeah.  I definitely think that we could shed an enormous amount of
 complexity by deciding that this is, for now, an option that can only
 be selected at initdb time.  That would remove approximately 85% of
 everything I've ever disliked about this patch - without, I think,
 precluding the possibility of improving things later.

That's certainly true, but it introduces one large problem: upgrading
would not work, which (in the past few releases) we've treated as a
major showstopper for many features.

If there is really no other good way to do it, then that might be
reasonable. But it seems within grasp to at least offer an offline way
to set checksums.

 It also occurred to me that another way to reduce the scope of this
 change would be to have a first version that does CRCs only for SLRU
 pages.  That would be useful for verifying the integrity of some of
 our most critical data (pg_clog) and be a useful building block toward
 a more complete implementation.

That also breaks upgrade, right?

Regards,
Jeff Davis





Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-09 Thread Andres Freund
On 2012-11-08 12:30:11 -0500, Peter Eisentraut wrote:
 On 11/7/12 9:17 PM, Bruce Momjian wrote:
  As a followup to Magnus's report that pg_upgrade was slow for many
  tables, I did some more testing with many tables, e.g.:
 
  CREATE TABLE test991 (x SERIAL);
 
  I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
 
  tables    pg_dump    restore    pg_upgrade   (increase)
       0       0.30       0.24         11.73   (-)
    1000       6.46       6.55         28.79   (2.45x)
    2000      29.82      20.96         69.75   (2.42x)
    4000      95.70     115.88        289.82   (4.16x)
    8000     405.38     505.93       1168.60   (4.03x)
   16000    1702.23    2197.56       5022.82   (4.30x)

 I can reproduce these numbers, more or less.  (Additionally, it ran out
 of shared memory with the default setting when dumping the 8000 tables.)

 But this issue seems to be entirely the fault of sequences being
 present.  When I replace the serial column with an int, everything
 finishes within seconds and scales seemingly linearly.

I don't know the pg_dump code at all but I would guess that without the
serial there are no dependencies, so the whole dependency sorting
business doesn't need to do very much...

Greetings,

Andres Freund




Re: [HACKERS] WIP checksums patch

2012-11-09 Thread Jeff Davis
On Fri, 2012-11-09 at 10:18 -0500, Robert Haas wrote:
 Sure, I agree.  I don't think it should stay that way forever, but
 removing the burden of dealing with this issue from the initial commit
 would likely allow that commit to happen this release cycle, perhaps
 even in the next CommitFest.  And then we'd have half a loaf, which is
 better than none, and we could deal with the issues of switching it on
 and off as a further enhancement.

Just after sending the last email, I realized that it can be separated
into separate commits fairly naturally, I think. So, I agree with you
that we should focus on an initdb setting for the next commitfest and
try for at least an offline migration tool (if not online) later.

Regards,
Jeff Davis





[HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Tom Lane
During normal running, operations such as btree page splits are
extremely careful about the order in which they acquire and release
buffer locks, if they're doing something that concurrently modifies
multiple pages.

During WAL replay, that all goes out the window.  Even if an individual
WAL-record replay function does things in the right order for standard
cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
of the referenced pages gets treated as a full-page image, we are left
with no guarantee whatsoever about what order the pages are restored in.
That never mattered when the code was originally designed, but it sure
matters during Hot Standby when other queries might be able to see the
intermediate states.

I can't prove that this is the cause of bug #7648, but it's fairly easy
to see that it could explain the symptom.  You only need to assume that
the page-being-split had been handled as a full-page image, and that the
new right-hand page had gotten allocated by extending the relation.
Then there will be an interval just after RestoreBkpBlocks does its
thing where the updated left-hand sibling is in the index and is not
locked in any way, but its right-link points off the end of the index.
If a few indexscans come along before the replay process gets to
continue, you'd get exactly the reported errors.

I'm inclined to think that we need to fix this by getting rid of
RestoreBkpBlocks per se, and instead having the per-WAL-record restore
routines dictate when each full-page image is restored (and whether or
not to release the buffer lock immediately).  That's not going to be a
small change unfortunately :-(

regards, tom lane




Re: [HACKERS] WIP checksums patch

2012-11-09 Thread Jeff Davis
On Mon, 2012-10-01 at 10:22 -0700, Josh Berkus wrote:
  I think that's OK, because it's still protected by the WAL CRC, and
  there's no expectation that the checksum is correct in shared buffers,
  and the correct checksum should be set on the next checkpoint. Just an
  observation.
 
 We'd need to document that emphatically.  Otherwise folks running on ZFS
 and/or FusionIO with atomic writes (and, in the future, BTRFS) will
 assume that they can turn full_page_writes off and checksums on, and
 clearly that won't work with the current code.  I think that's an
 acceptable limitation, I just think we need to document it carefully,
 and maybe throw a warning if people start up in that configuration.

What situation are you concerned about here? I think that COW
filesystems should still be safe with full_page_writes off, right?

The checksum is calculated before every write, and the COW filesystems
do atomic writes, so the checksums should always be fine. What am I
missing?

Regards,
Jeff Davis





Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Andres Freund
On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
 I can't prove that this is the cause of bug #7648, but it's fairly easy
 to see that it could explain the symptom.  You only need to assume that
 the page-being-split had been handled as a full-page image, and that the
 new right-hand page had gotten allocated by extending the relation.
 Then there will be an interval just after RestoreBkpBlocks does its
 thing where the updated left-hand sibling is in the index and is not
 locked in any way, but its right-link points off the end of the index.
 If a few indexscans come along before the replay process gets to
 continue, you'd get exactly the reported errors.

Sounds plausible.

 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

I wonder if we couldn't instead fix it by ensuring the backup blocks
are emitted in the right order at the inserting
location. That would just need some care about the order of
XLogRecData blocks.
I am pretty unfamiliar with the nbtree locking but I seem to remember
that we should be fine if we always restore from left to right?

Greetings,

Andres Freund




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Daniel Farina
On Fri, Nov 9, 2012 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 During normal running, operations such as btree page splits are
 extremely careful about the order in which they acquire and release
 buffer locks, if they're doing something that concurrently modifies
 multiple pages.

 During WAL replay, that all goes out the window.  Even if an individual
 WAL-record replay function does things in the right order for standard
 cases, RestoreBkpBlocks has no idea what it's doing.  So if one or more
 of the referenced pages gets treated as a full-page image, we are left
 with no guarantee whatsoever about what order the pages are restored in.
 That never mattered when the code was originally designed, but it sure
 matters during Hot Standby when other queries might be able to see the
 intermediate states.

 I can't prove that this is the cause of bug #7648,

(I was the reporter of 7648)

To lend slightly more circumstantial evidence in support of this, I
also happened to note that the relfile in question was the last
segment and it was about a quarter full, so the access attempt was
definitely at the extreme outermost edge of the index most generally.

--
fdr




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-09 Thread Jeff Janes
On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian br...@momjian.us wrote:

 I did some more research and realized that I was not using --schema-only
 like pg_upgrade uses.  With that setting, things look like this:

...

For profiling pg_dump in isolation, you should also specify
--binary-upgrade.  I was surprised that it makes a big difference,
slowing it down by about 2 fold.

Cheers,

Jeff




Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-09 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2012-11-09 18:24:25 -0500, Tom Lane wrote:
 I'm inclined to think that we need to fix this by getting rid of
 RestoreBkpBlocks per se, and instead having the per-WAL-record restore
 routines dictate when each full-page image is restored (and whether or
 not to release the buffer lock immediately).  That's not going to be a
 small change unfortunately :-(

 I wonder if we couldn't instead fix it by ensuring the backup blocks
 are in the right order in the backup blocks at the inserting
 location. That would just need some care about the order of
 XLogRecData blocks.

I don't think that's a good way to go.  In the first place, if we did
that the fix would require incompatible changes in the contents of WAL
streams.  In the second place, there are already severe constraints on
the positioning of backup blocks to ensure that WAL records can be
uniquely decoded (the section of access/transam/README about WAL coding
touches on this) --- I don't think it's a good plan to add still more
constraints there.  And in the third place, the specific problem we're
positing here results from a failure to hold the buffer lock for a
full-page image until after we're done restoring a *non* full-page image
represented elsewhere in the same WAL record.  In general, of the set
of pages touched by a WAL record, any arbitrary subset of them might be
converted to FPIs during XLogInsert; but the replay-time locking
requirements are going to be the same regardless of that.  So AFAICS,
any design in which RestoreBkpBlocks acts independently of the
non-full-page-image updates is just broken.

regards, tom lane




Re: [HACKERS] Further pg_upgrade analysis for many tables

2012-11-09 Thread Jeff Janes
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian br...@momjian.us wrote:
 On Thu, Nov  8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
 On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian br...@momjian.us wrote:
 
  I am actually now dumping git head/9.3, so I assume all the problems we
  know about should be fixed.

 Are sure the server you are dumping out of is head?

 Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
 head, and got these results:

            pg_dump           restore
            9.2      git      9.2       git

      1      0.13     0.11      0.07      0.07
   1000      4.37     3.98      4.32      5.28
   2000     12.98    12.19     13.64     14.25
   4000     47.85    50.14     61.31     70.97
   8000    210.39   183.00    302.67    294.20
  16000    901.53   769.83   1399.25   1359.09

For pg_dump, there are 4 possible combinations, not just two.  You can
use 9.2's pg_dump to dump from a 9.2 server, use git's pg_dump to dump
from a 9.2 server, use git's pg_dump to dump from a git server, or use
9.2's pg_dump to dump from a git server (although that last one isn't
very relevant).


 As you can see, there is very little difference between 9.2 and git
 head, except maybe at the 16k level for pg_dump.

 Is there some slowdown with a mismatched version dump/reload?  I am
 attaching my test script.

Sorry, from the script I can't really tell what versions are being
used for what.



 Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
 dump 16,000 tables (schema only) like your example, and it is
 definitely quadratic.

 Are you using a SERIAL column for the tables?  I am, and Peter
 Eisentraut reported that was a big slowdown.

Yes, I'm using the same table definition as your example.



 But using head's pg_dump to dump tables out of head's server, it only
 took 24.95 seconds, and the quadratic term is not yet important,
 things still look linear.

 Again, using SERIAL?

Yep.

 Is the next value, increment, etc. for a sequence stored in a catalog,
 or are they stored in the 8kb file associated with each sequence?  If

 Each sequence is stored in its own 1-row 8k table:

 test=> CREATE SEQUENCE seq;
 CREATE SEQUENCE

 test=> SELECT * FROM seq;
 -[ RECORD 1 ]-+
 sequence_name | seq
 last_value| 1
 start_value   | 1
 increment_by  | 1
 max_value | 9223372036854775807
 min_value | 1
 cache_value   | 1
 log_cnt   | 0
 is_cycled | f
 is_called | f

 they are stored in the file, then it is a shame that pg_dump goes to the
 effort of extracting that info if pg_upgrade is just going to
 overwrite it anyway.

 Actually, pg_upgrade needs pg_dump to restore all those sequence values.

I did an experiment where I had pg_dump just output dummy values
rather than hitting the database.  Once pg_upgrade moves the relation
files over, the dummy values disappear and are set back to their
originals.  So I think that pg_upgrade depends on pg_dump only in a
trivial way--they need to be there, but it doesn't matter what they
are.

Cheers,

Jeff




Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-09 Thread Amit kapila
On Thursday, November 08, 2012 7:56 PM Amit Kapila wrote:
On Thursday, November 08, 2012 1:45 AM Robert Haas wrote:
 On Wed, Nov 7, 2012 at 2:50 PM, Josh Berkus j...@agliodbs.com wrote:
  Well, Magnus' proposed implementation supposed that the existing
 values
  *have* been loaded into the current session.  I agree that with some
  locking and yet more code you could implement it without that.  But
 this
  still doesn't seem to offer any detectable benefit over value-per-
 file.
 
  Well, value-per-file is ugly (imagine you've set 40 different
 variables
  that way) but dodges a lot of complicated issues.  And I suppose
 ugly
  doesn't matter, because the whole idea of the auto-generated files is
  that users aren't supposed to look at them anyway.

 That's pretty much how I feel about it, too.  I think value-per-file
 is an ugly wimp-out that shouldn't really be necessary to solve this
 problem.  It can't be that hard to rewrite a file where every like is
 of the form:

 key = 'value'

 I also believe that it should be possible to rewrite a file without loading
 values into the current session.
 One possible solution, if we assume that the file is of fixed format and each
 record (key = 'value') is of fixed length, can be:

 1. While writing the .auto file, it will always assume that the .auto file
 contains all config parameters.
   Now, as this .auto file is of fixed format and fixed record size, it can
 directly write a given record to its particular position.
 2. To handle locking issues, we can follow an approach similar to what GIT
 is doing for editing conf files (using .lock file):
   a. copy the latest content of .auto to .auto.lock
   b. make all the changes to auto.lock file.
   c. at the end of command rename the auto.lock file to .auto file
   d. otherwise if SQL COMMAND/function failed in-between we can delete the
 .auto.lock file
3. Two backends trying to write to the .auto file:
   we can use .auto.lock as the lock by trying to create it in
   exclusive mode as the first step of the command. If it already
   exists then the backend needs to wait.

Please let me know if there are any objections or problems with the above
method of implementation; otherwise I will go ahead and prepare the patch
for the coming CF.

For the initial version I will use a function as the syntax to provide this
feature.
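
As an illustration only (the function name and argument list below are
placeholders assumed for this sketch; nothing has been decided):

    -- Hypothetical function-based interface for the initial version.
    -- Internally it would follow the .auto.lock steps above: create
    -- .auto.lock exclusively, write the full fixed-format contents, then
    -- rename .auto.lock over .auto (or remove it if the command fails).
    SELECT pg_set_autoconf_value('checkpoint_segments', '32');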

With Regards,
Amit Kapila.
