Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2014-12-30 Thread Michael Paquier
On Tue, Dec 23, 2014 at 5:54 PM, Oskari Saarenmaa o...@ohmu.fi wrote:

 If the short-lived lock is the only blocker for this feature at the
 moment could we just require an additional qualifier for CONCURRENTLY
 (FORCE?) until the lock can be removed, something like:
 =# [blah]

FWIW, I'd just keep CONCURRENTLY alone, with no fancy additional
keywords, even if we cheat on it, as long as the documentation makes it
clear that an exclusive lock is taken for a very short time, far
shorter than what a normal REINDEX would hold it, btw.

 It's not optimal, but currently there's no way to reindex a primary key
 anywhere close to concurrently and a short lock would be a huge
 improvement over the current situation.
Yep.
-- 
Michael




Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-12-30 Thread Guillaume Lelarge
Sorry for my very late answer. It's been a tough month.

2014-11-27 0:00 GMT+01:00 Bruce Momjian br...@momjian.us:

 On Mon, Nov  3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
  It looked to me that the formula, when descending from a previously
  stressed state, would be:
 
  greatest((1 + checkpoint_completion_target) * checkpoint_segments,
  wal_keep_segments) + 1 +
  2 * checkpoint_segments + 1

 I don't think we can assume checkpoint_completion_target is at all
 reliable enough to base a maximum calculation on, assuming anything
 above the maximum is cause of concern and something to inform the admins
 about.

 Assuming checkpoint_completion_target is 1 for maximum purposes, how
 about:

 max(2 * checkpoint_segments, wal_keep_segments) + 2 *
 checkpoint_segments + 2


That seems like something I could agree on. At least, it makes sense, and it
works for my customers. Although I'm wondering why + 2, and not + 1. It seems
Jeff and you agree on this, so I may have misunderstood something.
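
Just to check that I'm reading the formula right, here are my own
back-of-the-envelope numbers, assuming the default checkpoint_segments = 3,
wal_keep_segments = 0 and 16MB segments:

  max(2 * 3, 0) + 2 * 3 + 2  =  6 + 6 + 2  =  14 segments, i.e. about 224 MB

so anything staying well above that ceiling would be what we warn admins
about.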


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] Compression of full-page-writes

2014-12-30 Thread Jeff Davis
On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
 Speeding up the CRC calculation obviously won't help with the WAL volume 
 per se, ie. you still generate the same amount of WAL that needs to be 
 shipped in replication. But then again, if all you want to do is to 
 reduce the volume, you could just compress the whole WAL stream.

Was this point addressed? How much benefit is there to compressing the
data before it goes into the WAL stream versus after?

Regards,
Jeff Davis






Re: [HACKERS] Coverity and pgbench

2014-12-30 Thread Tatsuo Ishii
 Has anybody looked into the problems in pgbench pointed out by Coverity? If
 not, I would like to work on fixing them, because I need to write patches
 for -f option related issues anyway.

Done.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Coverity and pgbench

2014-12-30 Thread Michael Paquier
On Tue, Dec 30, 2014 at 8:39 PM, Tatsuo Ishii is...@postgresql.org wrote:
 Has anybody looked into the problems in pgbench pointed out by Coverity? If
 not, I would like to work on fixing them, because I need to write patches
 for -f option related issues anyway.

 Done.
Just wondering, where are those reports located? It would have been
good to point out where the leaks actually were..
-- 
Michael




Re: [HACKERS] Coverity and pgbench

2014-12-30 Thread Tatsuo Ishii
 Just wondering, where are those reports located? It would have been
 good to point out where the leaks actually were..

The report is a closed one. You need to be a member of the PostgreSQL
project on Coverity to look into the report. If you want to join the
project, you need to contact one of the admins (Stephen, Peter or
Magnus).

...Or I could email you the pgbench report part if you wish.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




[HACKERS] Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-30 Thread Alexey Vasiliev
 Hello.

Thanks, I understand now which part of the code to look at. I hope I've
made the right changes this time.

To avoid modifying the current pg_usleep calculation, I changed the
restore_command_retry_interval value to seconds (not milliseconds). In this
case, the minimum value is 1 second.
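
Just to illustrate the unit question (this is only a sketch of mine, not the
patch code; pg_usleep() takes microseconds, so the unit of the parameter only
changes the multiplier):

/* Illustration only, not the patch code. */
long	retry_ms = 5000;				/* interval kept in milliseconds */
long	retry_s = 5;					/* same interval kept in seconds */

pg_usleep(retry_ms * 1000L);			/* ms -> microseconds */
pg_usleep(retry_s * 1000000L);			/* s  -> microseconds */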


Mon, 29 Dec 2014 00:15:03 +0900 from Michael Paquier michael.paqu...@gmail.com:
On Sat, Dec 27, 2014 at 3:42 AM, Alexey Vasiliev  leopard...@inbox.ru  wrote:
 Thanks for suggestions.

 Patch updated.

Cool, thanks. I just had an extra look at it.

+This is useful, if I using for restore of wal logs some
+external storage (like AWS S3) and no matter what the slave database
+will lag behind the master. The problem, what for each request to
+AWS S3 need to pay, what is why for N nodes, which try to get next
+wal log each 5 seconds will be bigger price, than for example each
+30 seconds.
I reworked this portion of the docs; it is rather incorrect, as the
documentation should not use first-person subjects, and I don't
believe that referencing any commercial products is a good thing in
this context.

+# specifies an optional timeout after nonzero code of restore_command.
+# This can be useful to increase/decrease number of a restore_command calls.
This is still referring to a timeout. That's not good. And the name of
the parameter at the top of this comment block is missing.

+static int restore_command_retry_interval = 5000L;
I think that it would be more appropriate to set that to 5000 and
multiply by 1L. I am also wondering about having a better lower bound,
like 100ms, to avoid some abuse of this feature in the retries?

+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("\"%s\" must be bigger zero",
+								"restore_command_retry_interval")));
I'd rather rewrite that to "must have a strictly positive value".

-		 * Wait for more WAL to arrive. Time out after 5 seconds,
+		 * Wait for more WAL to arrive. Time out after
+		 * restore_command_retry_interval (5 seconds by default),
 		 * like when polling the archive, to react to a trigger
 		 * file promptly.
 		 */
 		WaitLatch(&XLogCtl->recoveryWakeupLatch,
 				  WL_LATCH_SET | WL_TIMEOUT,
-				  5000L);
+				  restore_command_retry_interval);
I should have noticed earlier, but in its current state your patch
actually does not work. What you are doing here is tuning the time the
process waits for WAL to arrive from streaming. In your case, what you
want to control is the retry time for restore_command in archive
recovery, no?
-- 
Michael




-- 
Alexey Vasiliev
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index ef78bc0..38420a5 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,26 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   /listitem
  /varlistentry
 
+ varlistentry id=restore-command-retry-interval xreflabel=restore_command_retry_interval
+  termvarnamerestore_command_retry_interval/varname (typeinteger/type)
+  indexterm
+primaryvarnamerestore_command_retry_interval/ recovery parameter/primary
+  /indexterm
+  /term
+  listitem
+   para
+If varnamerestore_command/ returns nonzero exit status code, retry
+command after the interval of time specified by this parameter.
+Default value is literal5s/.
+   /para
+   para
+This is useful, if I using for restore of wal logs some
+external storage and no matter what the slave database
+will lag behind the master.
+   /para
+  /listitem
+ /varlistentry
+
 /variablelist
 
   /sect1
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..5b63f60 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,11 @@
 #
 #recovery_end_command = ''
 #
+# specifies an optional retry interval of restore_command command, if previous return nonzero exit status code.
+# This can be useful to increase/decrease number of a restore_command calls.
+#
+#restore_command_retry_interval = 5s
+#
 #---
 # RECOVERY TARGET PARAMETERS
 

Re: [HACKERS] Compression of full-page-writes

2014-12-30 Thread Michael Paquier
On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
 Speeding up the CRC calculation obviously won't help with the WAL volume
 per se, ie. you still generate the same amount of WAL that needs to be
 shipped in replication. But then again, if all you want to do is to
 reduce the volume, you could just compress the whole WAL stream.

 Was this point addressed?
Compressing the whole record is interesting for multi-insert records,
but as we need to keep the compressed data in a pre-allocated buffer
until WAL is written, we can only compress things within a given size
range. The point is, even if we define a lower bound, compression is
going to perform badly with an application that generates, for example,
many small records that are just above the lower bound...
Unsurprisingly, for small records this was bad:
http://www.postgresql.org/message-id/cab7npqsc97o-ue5paxfmukwcxe_jioyxo1m4a0pmnmyqane...@mail.gmail.com
Now, are there still people interested in seeing the amount of time
spent in the CRC calculation depending on the record length? Isn't
that worth discussing on the CRC thread, btw? I'd imagine that it would
be simple to evaluate the effect of the CRC calculation within a
single process using a bit of getrusage.
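
Something like this rough sketch (mine, and purely illustrative; the CRC
loop itself is left as a stub) would give per-process CPU numbers:

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Rough sketch: measure user CPU time spent around a CRC loop. */
static double
user_seconds(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1000000.0;
}

int
main(void)
{
	double	start = user_seconds();

	/* ... run the CRC calculation over records of the length of interest ... */

	printf("user CPU spent: %.6f s\n", user_seconds() - start);
	return 0;
}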

 How much benefit is there to compressing the data before it goes into the WAL 
 stream versus after?
Here is a good list:
http://www.postgresql.org/message-id/20141212145330.gk31...@awork2.anarazel.de
Regards,
-- 
Michael




Re: [HACKERS] Compression of full-page-writes

2014-12-30 Thread Andres Freund
On 2014-12-30 21:23:38 +0900, Michael Paquier wrote:
 On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis pg...@j-davis.com wrote:
  On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
  Speeding up the CRC calculation obviously won't help with the WAL volume
  per se, ie. you still generate the same amount of WAL that needs to be
  shipped in replication. But then again, if all you want to do is to
  reduce the volume, you could just compress the whole WAL stream.
 
  Was this point addressed?
 Compressing the whole record is interesting for multi-insert records,
 but as we need to keep the compressed data in a pre-allocated buffer
 until WAL is written, we can only compress things within a given size
 range. The point is, even if we define a lower bound, compression is
 going to perform badly with an application that generates, for example,
 many small records that are just above the lower bound...
 Unsurprisingly, for small records this was bad:

So why are you bringing it up? That's not an argument for anything,
except not doing it in such a simplistic way.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-30 Thread Bernd Helmle



--On 29. Dezember 2014 12:55:11 -0500 Tom Lane t...@sss.pgh.pa.us wrote:


Given the lack of previous complaints, this probably isn't backpatching
material, but it sure seems like a bit of attention to consistency
would be warranted here.


Now that i read it i remember a client complaining about this some time 
ago. I forgot about it, but i think there's value in it to backpatch.


--
Thanks

Bernd




Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-30 Thread Michael Paquier
On Tue, Dec 30, 2014 at 9:33 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev leopard...@inbox.ru wrote:
 To avoid modifying the current pg_usleep calculation, I changed the
 restore_command_retry_interval value to seconds (not milliseconds). In this
 case, the minimum value is 1 second.
 Er, what's the problem with not changing 100L to 1000L? The unit of
 your parameter is ms AFAIK.
Of course, I meant in the previous version of the patch, not the current
one. Wouldn't it be useful to allow, for example, retry intervals on
the order of 100ms~300ms for some cases?
-- 
Michael




Re: [HACKERS] Additional role attributes superuser review

2014-12-30 Thread Stephen Frost
* Amit Kapila (amit.kapil...@gmail.com) wrote:
 On Tue, Dec 30, 2014 at 6:52 AM, Stephen Frost sfr...@snowman.net wrote:
   There is one issue that occurs to me, however.  We're talking about
   pg_dump, but what about pg_dumpall?  In particular, I don't think the
   DUMP privilege should provide access to pg_authid, as that would allow
   the user to bypass the privilege system in some environments by using
   the hash to log in as a superuser.  Now, I don't encourage using
   password based authentication, especially for superuser accounts, but
   lots of people do.  The idea with these privileges is to allow certain
   operations to be performed by a non-superuser while preventing trivial
   access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
   achieve that.
  
   Ugh, hadn't thought about that. :(
 
  I don't think it kills the whole idea, but we do need to work out how to
  address it.
 
 One way to address this might be to allow this only for superusers,

Huh?  The point here is to have a role account which isn't a superuser
who can perform these actions.

 another could be have a separate privilege (DBA) or a role which is
 not a superuser, however can be used to perform such tasks.

I'm pretty sure we've agreed that having a catch-all role attribute like
'DBA' is a bad idea because we'd likely be adding privileges to it later
which would expand the rights of users with that attribute, possibly
beyond what is intended.

  I don't really like 'xlog' as a permission.  BINARY BACKUP is
  interesting but if we're going to go multi-word, I would think we would
  do PITR BACKUP or something like FILESYSTEM BACKUP or similar.  I'm not
  really a big fan of the two-word options though.
 
 How about PHYSICAL BACKUP (for basebackup) and LOGICAL BACKUP
 for pg_dump?

Again, this makes it look like the read-all right would be specific to
users executing pg_dump, which is incorrect and misleading.  PHYSICAL
might also imply the ability to do pg_basebackup.

 This is normally the terminology I have seen people using for these
 backup's.

I do like the idea of trying to find a correlation in the documentation
for these rights, though I'm not sure we'll be able to.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-30 Thread Heikki Linnakangas

On 12/30/2014 09:40 AM, Abhijit Menon-Sen wrote:

Hi.

I'm re-attaching the two patches as produced by format-patch. I haven't
listed any reviewers. It's either just Andres, or maybe a lot of people.

Is anyone in a position to try out the patches on MSVC and see if they
build and work sensibly, please? (Otherwise it may be better to remove
those bits from the patch for now.)


A couple of quick comments:

bswap32 is unused on little-endian systems. That will give a compiler 
warning.


pg_comp_crc32c_sse processes one byte at a time, until the pointer is 
4-byte aligned. Then it processes 8 bytes at a time. So it fetches the 
8-byte chunks from only 4-byte aligned addresses. Is that intentional? 
If unaligned access performs well, why bother with the initial 
byte-at-a-time processing at all?


- Heikki





Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-30 Thread Tom Lane
Bernd Helmle maili...@oopsware.de writes:
 --On 29. Dezember 2014 12:55:11 -0500 Tom Lane t...@sss.pgh.pa.us wrote:
 Given the lack of previous complaints, this probably isn't backpatching
 material, but it sure seems like a bit of attention to consistency
 would be warranted here.

 Now that i read it i remember a client complaining about this some time 
 ago. I forgot about it, but i think there's value in it to backpatch.

Hm.  Last night I wrote the attached draft patch, which I was intending
to apply to HEAD only.  The argument against back-patching is basically
that this might change the interpretation of scripts that had been
accepted silently before.  For example
\set ECHO_HIDDEN NoExec
will now select noexec mode whereas before you silently got on mode.
In one light this is certainly a bug fix, but in another it's just
definitional instability.

If we'd gotten a field bug report we might well have chosen to back-patch,
though, and perhaps your client's complaint counts as that.

Opinions anyone?

regards, tom lane

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index d29dfa2..bdfb67c 100644
*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
*** EOF
*** 173,180 
Echo the actual queries generated by command\d/command and other backslash
commands. You can use this to study applicationpsql/application's
internal operations. This is equivalent to
!   setting the variable varnameECHO_HIDDEN/varname from within
!   applicationpsql/application.
/para
/listitem
  /varlistentry
--- 173,179 
Echo the actual queries generated by command\d/command and other backslash
commands. You can use this to study applicationpsql/application's
internal operations. This is equivalent to
!   setting the variable varnameECHO_HIDDEN/varname to literalon/.
/para
/listitem
  /varlistentry
*** EOF
*** 333,340 
quietly. By default, it prints welcome messages and various
informational output. If this option is used, none of this
happens. This is useful with the option-c/option option.
!   Within applicationpsql/application you can also set the
!   varnameQUIET/varname variable to achieve the same effect.
/para
/listitem
  /varlistentry
--- 332,339 
quietly. By default, it prints welcome messages and various
informational output. If this option is used, none of this
happens. This is useful with the option-c/option option.
!   This is equivalent to setting the variable varnameQUIET/varname
!   to literalon/.
/para
/listitem
  /varlistentry
*** bar
*** 2884,2891 
  termvarnameECHO_HIDDEN/varname/term
  listitem
  para
! When this variable is set and a backslash command queries the
! database, the query is first shown. This way you can study the
  productnamePostgreSQL/productname internals and provide
  similar functionality in your own programs. (To select this behavior
  on program start-up, use the switch option-E/option.)  If you set
--- 2883,2891 
  termvarnameECHO_HIDDEN/varname/term
  listitem
  para
! When this variable is set to literalon/ and a backslash command
! queries the database, the query is first shown.
! This feature helps you to study
  productnamePostgreSQL/productname internals and provide
  similar functionality in your own programs. (To select this behavior
  on program start-up, use the switch option-E/option.)  If you set
*** bar
*** 3046,3061 
/term
  listitem
  para
! When literalon/, if a statement in a transaction block
  generates an error, the error is ignored and the transaction
! continues. When literalinteractive/, such errors are only
  ignored in interactive sessions, and not when reading script
! files. When literaloff/ (the default), a statement in a
  transaction block that generates an error aborts the entire
! transaction. The on_error_rollback-on mode works by issuing an
  implicit commandSAVEPOINT/ for you, just before each command
! that is in a transaction block, and rolls back to the savepoint
! on error.
  /para
  /listitem
/varlistentry
--- 3046,3061 
/term
  listitem
  para
! When set to literalon/, if a statement in a transaction block
  generates an error, the error is ignored and the transaction
! continues. When set to literalinteractive/, such errors are only
  ignored in interactive sessions, and not when reading script
! files. When unset or set to literaloff/, a statement in a
  

Re: [HACKERS] Additional role attributes superuser review

2014-12-30 Thread Magnus Hagander
On Mon, Dec 29, 2014 at 11:01 PM, Stephen Frost sfr...@snowman.net wrote:


 * Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
   I'd suggest it's called DUMP if that's what it allows, to keep it
 separate
   from the backup parts.
 
  Makes sense to me.

 I'm fine calling it 'DUMP', but for different reasons.

 We have no (verifiable) idea what client program is being used to
 connect and therefore we shouldn't try to tie the name of the client
 program to the permission.

 That said, a 'DUMP' privilege which allows the user to dump the contents
 of the entire database is entirely reasonable.  We need to be clear in
 the documentation though- such a 'DUMP' privilege is essentially
 granting USAGE on all schemas and table-level SELECT on all tables and

sequences (anything else..?).  To be clear, a user with this privilege
 can utilize that access without using pg_dump.


Well, it would not give you full USAGE - granted USAGE on a schema, you can
execute functions in there for example (provided permissions). This
privilege would not do that, it would only give you COPY access. (And also
COPY != SELECT in the way that the rule system applies, I think? And this
one could be for COPY only)

But other than that I agree - for *all* these privileges, it needs to be
clearly documented that they are not limited to a specific client side
application, even if their name happens to be similar to one.


One other point is that this shouldn't imply any other privileges, imv.
 I'm specifically thinking of BYPASSRLS- that's independently grantable
 and therefore should be explicitly set, if it's intended.  Things


I think BYPASSRLS would have to be implicitly granted by the DUMP
privilege. Without that, the DUMP privilege is more or less meaningless
(for anybody who uses RLS - but if they don't use RLS, then having it
include BYPASSRLS makes no difference). Worse than that, you may end up
with a dump that you cannot restore.

Similar concerns would exist for the existing REPLICATION role for example
- that one clearly lets you bypass RLS as well, just not with a SQL
statement.


should work 'sanely' with any combination of the two options.
 Similairly, DUMP shouldn't imply BACKUP or visa-versa.  In particular,
 I'd like to see roles which have only the BACKUP privilege be unable to
 directly read any data (modulo things granted to PUBLIC).  This would
 allow an unprivileged user to manage backups, kick off ad-hoc ones, etc,
 without being able to actually access any of the data (this would
 require the actual backup system to have a similar control, but that's
 entirely possible with more advanced SANs and enterprise backup
 solutions).


So you're saying a privilege that would allow you to do
pg_start_backup()/pg_stop_backup() but *not* actually use pg_basebackup?
That would be EXCLUSIVEBACKUP or something like that, to be consistent
with existing terminology though.



  That seems really bad names, IMHO. Why? Because we use WAL and XLOG
   throughout documentation and parameters and code to mean *the same
 thing*.
   And here they'd suddenly mean different things. If we need them as
 separate
   privileges, I think we need much better names. (And a better
 description -
   what is xlog operations really?)
  
 
  Fair enough, ultimately what I was trying to address is the following
  concern raised by Alvaro:
 
  To me, what this repeated discussion on this particular BACKUP point
  says, is that the ability to run pg_start/stop_backend and the xlog
  related functions should be a different privilege, i.e. something other
  than BACKUP; because later we will want the ability to grant someone the
  ability to run pg_dump on the whole database without being superuser,
  and we will want to use the name BACKUP for that.  So I'm inclined to
  propose something more specific for this like WAL_CONTROL or
  XLOG_OPERATOR, say.

 Note that the BACKUP role attribute was never intended to cover the
 pg_dump use-case.  Simply the name of it caused confusion though.  I'm
 not sure if adding a DUMP role attribute is sufficient enough to address
 that confusion, but I haven't got a better idea either.


We need to separate the logical backups (pg_dump) from the physical ones
(start/stop+filesystem and pg_basebackup). We might also need to separate
the two different ways of doing physical backups.

Personally I think using the DUMP name makes that a lot more clear. Maybe
we need to avoid using BACKUP alone as well, to make sure it doesn't go the
other way - using BASEBACKUP and EXCLUSIVEBACKUP for those two different
ones perhaps?


 When indeed, what it meant was to have the following separate (effectively
  merging #2 and #3):
 
  1) ability to pg_dump
  2) ability to start/stop backups *and* ability to execute xlog related
  functions.


We probably also need to define what those xlog related functions
actually are. pg_current_xlog_location() is definitely an xlog related
function, but does it need the 

Re: [HACKERS] Serialization exception : Who else was involved?

2014-12-30 Thread Olivier MATROT
Indeed, NOTICE is wrong because it would doom the transaction that sets the 
flag if it should be later PREPARED.
I think that reporting the PIDs and the current activity of each process would 
be nice. DeadLockReport() is using pgstat_get_backend_current_activity() to 
get the process activity.
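
Something along these lines is what I have in mind (purely a sketch of my
own; "victim" and the SERIALIZABLEXACT fields I poke at here are
illustrative, not checked against the sources):

/*
 * Sketch only: at the point in predicate.c where the victim transaction is
 * flagged with SXACT_FLAG_DOOMED, also log who doomed whom and what the
 * victim was doing at the time.
 */
victim->flags |= SXACT_FLAG_DOOMED;

ereport(DEBUG1,
		(errmsg("transaction %u (pid %d) doomed by pid %d",
				victim->topXid, victim->pid, MyProcPid),
		 errdetail("Victim activity: %s",
				   pgstat_get_backend_current_activity(victim->pid, false))));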

I'll see what I could come up with.

Thanks,
om

-----Original Message-----
From: Noah Misch [mailto:n...@leadboat.com] 
Sent: Saturday, 27 December 2014 10:51
To: Olivier MATROT
Cc: pgsql-hackers@postgresql.org
Subject: Re: Serialization exception : Who else was involved?

On Tue, Dec 02, 2014 at 11:17:43AM +0100, Olivier MATROT wrote:
 Serialization conflict detection is done in 
 src/backend/storage/lmgr/predicate.c, where transactions that are 
 doomed to fail are marked as such with the SXACT_FLAG_DOOMED flag.
  
 I simply added elog(...) calls with the NOTIFY level, each time the 
 flag is set, compiled the code and give it a try.

 I would like to see this useful and simple addition in a future 
 version of PostgreSQL.
 Is it in the spirit of what is done when it comes to ease the work of 
 the developer ?
 May be the level I've chosen is not appropriate ?

I would value extra logging designed to help users understand the genesis of 
serialization failures.  A patch the community would adopt will probably have 
more complexity than your quick elog(NOTICE, ...) addition.  I don't have a 
clear picture of what the final patch should be, but the following are some 
thoughts to outline the problem space.  See [1] for an earlier discussion.
The logging done in DeadLockReport() is a good baseline; it would be best to 
consolidate a similar level of detail and report it all as part of the main 
serialization failure report.  That may prove impractical.  If transaction TA 
marks transaction TB's doom, TA can be long gone by the time TB reports its 
serialization failure.  TA could persist the details needed for that future 
error report, but that may impose costs out of proportion with the benefit.
If so, we can fall back on your original idea of emitting a message in the 
command that performs the flag flip.  That would need a DEBUG error level, 
though.  Sending a NOTICE to a client when its transaction dooms some other 
transaction would add noise in the wrong place.

Thanks,
nm

[1] 
http://www.postgresql.org/message-id/flat/AANLkTikF-CR-52nWAo2VG_348aTsK_+0i=chbpnqd...@mail.gmail.com




Re: [HACKERS] Additional role attributes superuser review

2014-12-30 Thread Stephen Frost
* Magnus Hagander (mag...@hagander.net) wrote:
 On Mon, Dec 29, 2014 at 11:01 PM, Stephen Frost sfr...@snowman.net wrote:
  That said, a 'DUMP' privilege which allows the user to dump the contents
  of the entire database is entirely reasonable.  We need to be clear in
  the documentation though- such a 'DUMP' privilege is essentially
  granting USAGE on all schemas and table-level SELECT on all tables and
 
 sequences (anything else..?).  To be clear, a user with this privilege
  can utilize that access without using pg_dump.
 
 Well, it would not give you full USAGE - granted USAGE on a schema, you can
 execute functions in there for example (provided permissions). This
 privilege would not do that, 

The approach I was thinking was to document and implement this as
implicitly granting exactly USAGE and SELECT rights, no more (not
BYPASSRLS) and no less (yes, the role could execute functions).  I agree
that doing so would be strictly more than what pg_dump actually requires
but it's also what we actually have support for in our privilege system.

 it would only give you COPY access. (And also
 COPY != SELECT in the way that the rule system applies, I think? And this
 one could be for COPY only)

COPY certainly does equal SELECT rights..  We haven't got an independent
COPY privilege and I don't think it makes sense to invent one.  It
sounds like you're suggesting that we add a hack directly into COPY for
this privilege, but my thinking is that the right place to change for
this is in the privilege system and not hack it into individual
commands..  I'm also a bit nervous that we'll end up painting ourselves
into a corner if we hack this to mean exactly what pg_dump needs today.

Lastly, I've been considering other use-cases for this privilege beyond
the pg_dump one (thinking about auditing, for example).

 But other than that I agree - for *all* these privileges, it needs to be
 clearly documented that they are not limited to a specific client side
 application, even if their name happens to be similar to one.

Right.

 One other point is that this shouldn't imply any other privileges, imv.
  I'm specifically thinking of BYPASSRLS- that's independently grantable
  and therefore should be explicitly set, if it's intended.  Things
 
 I think BYPASSRLS would have to be implicitly granted by the DUMP
 privilege. Without that, the DUMP privilege is more or less meaningless
 (for anybody who uses RLS - but if they don't use RLS, then having it
 include BYPASSRLS makes no difference). Worse than that, you may end up
 with a dump that you cannot restore.

The dump would fail, as I mentioned before.  I don't see why BYPASSRLS
has to be implicitly granted by the DUMP privilege and can absolutely
see use-cases for it to *not* be.  Those use-cases would require passing
pg_dump the --enable-row-security option, but that's why it's there.

Implementations which don't use RLS are not impacted either way, so we
don't need to consider them.  Implementations which *do* use RLS, in my
experience, would certainly understand and expect BYPASSRLS to be
required if they want this role to be able to dump all data out
regardless of the RLS policies.  What does implicitly including
BYPASSRLS in this privilege gain us?  A slightly less irritated DBA who
forgot to include BYPASSRLS and ended up getting a pg_dump error because
of it.  On the other hand, it walls off any possibility of using this
privilege for roles who shouldn't be allowed to bypass RLS or for
pg_dumps to be done across the entire system for specific RLS policies.

 Similar concerns would exist for the existing REPLICATION role for example
 - that one clearly lets you bypass RLS as well, just not with a SQL
 statement.

I'm not sure that I see the point you're making here.  Yes, REPLICATION
allows you to do a filesystem copy of the entire database and that
clearly bypasses RLS and *all* of our privilege system.  I'm suggesting
that this role attribute work as an implicit grant of privileges we
already have.  That strikes me as easy to document and very clear for
users.

 should work 'sanely' with any combination of the two options.
  Similairly, DUMP shouldn't imply BACKUP or visa-versa.  In particular,
  I'd like to see roles which have only the BACKUP privilege be unable to
  directly read any data (modulo things granted to PUBLIC).  This would
  allow an unprivileged user to manage backups, kick off ad-hoc ones, etc,
  without being able to actually access any of the data (this would
  require the actual backup system to have a similar control, but that's
  entirely possible with more advanced SANs and enterprise backup
  solutions).
 
 So you're saying a privilege that would allow you to do
 pg_start_backup()/pg_stop_backup() but *not* actually use pg_basebackup?

Yes.

 That would be EXCLUSIVEBACKUP or something like that, to be consistent
 with existing terminology though.

Ok.  I agree that working out the specific naming is difficult and would
like to focus 

[HACKERS] Re[2]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2014-12-30 Thread Alexey Vasiliev
Tue, 30 Dec 2014 21:39:33 +0900 from Michael Paquier michael.paqu...@gmail.com:
 On Tue, Dec 30, 2014 at 9:33 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Tue, Dec 30, 2014 at 9:10 PM, Alexey Vasiliev leopard...@inbox.ru 
  wrote:
  To avoid modifying the current pg_usleep calculation, I changed the
  restore_command_retry_interval value to seconds (not milliseconds). In this
  case, the minimum value is 1 second.
  Er, what's the problem with not changing 100L to 1000L? The unit of
  your parameter is ms AFAIK.
 Of course, I meant in the previous version of the patch, not the current
 one. Wouldn't it be useful to allow, for example, retry intervals on
 the order of 100ms~300ms for some cases?
 -- 
 Michael
 
 
 

Thanks, patch changed. 

As I understand it, now = (pg_time_t) time(NULL); returns the time in seconds,
which is why I multiply the value by 1000 to compare it with
restore_command_retry_interval in milliseconds.

I am not sure about very small retry intervals; in my cases I need an
interval bigger than 5 seconds (20-40 seconds). Right now I limit this value
to be at least 100 milliseconds.
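
So the retry check is essentially this (my rough paraphrase, variable names
approximate, not the exact patch code):

/* Rough paraphrase of the retry logic, not the exact patch code. */
pg_time_t	now = (pg_time_t) time(NULL);	/* seconds */

/* restore_command_retry_interval is in milliseconds */
if ((now - last_fail_time) * 1000 >= restore_command_retry_interval)
{
	/* enough time has elapsed, try restore_command again */
}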

-- 
Alexey Vasiliev

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index ef78bc0..52cb47d 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,29 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   /listitem
  /varlistentry
 
+ varlistentry id=restore-command-retry-interval xreflabel=restore_command_retry_interval
+  termvarnamerestore_command_retry_interval/varname (typeinteger/type)
+  indexterm
+primaryvarnamerestore_command_retry_interval/ recovery parameter/primary
+  /indexterm
+  /term
+  listitem
+   para
+If varnamerestore_command/ returns nonzero exit status code, retry
+command after the interval of time specified by this parameter.
+Default value is literal5s/.
+   /para
+   para
+This is useful, if I using for restore of wal logs some
+external storage and request each 5 seconds for wal logs
+can be useless and cost additional money.
+Increasing this parameter allow to increase retry interval of time,
+if not found new wal logs inside external storage and no matter
+what the slave database will lag behind the master.
+   /para
+  /listitem
+ /varlistentry
+
 /variablelist
 
   /sect1
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..5b63f60 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,11 @@
 #
 #recovery_end_command = ''
 #
+# specifies an optional retry interval of restore_command command, if previous return nonzero exit status code.
+# This can be useful to increase/decrease number of a restore_command calls.
+#
+#restore_command_retry_interval = 5s
+#
 #---
 # RECOVERY TARGET PARAMETERS
 #---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e54..67a873c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int	recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int 	restore_command_retry_interval = 5000;
 
 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,28 @@ readRecoveryCommandFile(void)
 	(errmsg_internal(trigger_file = '%s',
 	 TriggerFile)));
 		}
+		else if (strcmp(item-name, restore_command_retry_interval) == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item-value, restore_command_retry_interval, GUC_UNIT_MS,
+		   hintmsg))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg(parameter \%s\ requires a temporal value,
+restore_command_retry_interval),
+		 hintmsg ? errhint(%s, _(hintmsg)) : 0));
+			ereport(DEBUG2,
+	(errmsg_internal(restore_command_retry_interval = '%s', item-value)));
+
+			if (restore_command_retry_interval  100)
+			{
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg(\%s\ must have a bigger 100 milliseconds value,
+	restore_command_retry_interval)));
+			}
+		}
 		else if (strcmp(item-name, recovery_min_apply_delay) == 0)
 		{
 			const char *hintmsg;
@@ -10495,13 +10518,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	 * machine, so we've exhausted all the options for
 	 * 

Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-30 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160


Andrew Gierth asked:
 this is to send a simple SELECT via PQexec

 Why not PQexec(conn, ) ?

Because I want to leave a good clue for debugging, so 
DBAs are better able to figure out where a mystery slew of 
queries is coming from. The query is: SELECT 'DBD::Pg ping test'

Which also means the inverse is true: simple blank queries 
are guaranteed to *not* be coming from DBD::Pg.
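
At the libpq level the check is roughly this (a sketch of the idea only, not
the actual DBD::Pg code, which does this from its XS layer):

#include <stdio.h>
#include <libpq-fe.h>

/* Rough sketch: "ping" an existing connection with a recognizable query. */
static int
connection_is_alive(PGconn *conn)
{
	PGresult   *res;
	int			ok;

	res = PQexec(conn, "SELECT 'DBD::Pg ping test'");
	ok = (res != NULL && PQresultStatus(res) == PGRES_TUPLES_OK);
	PQclear(res);

	/* PQexec may also have noticed the backend going away. */
	return ok && PQstatus(conn) == CONNECTION_OK;
}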


- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201412301041
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAlSix/YACgkQvJuQZxSWSsjILwCdHnkhYC1i+LJZkNUWjfTi5yG+
FHwAn007+arJIw62gIUO20+SxnzRT4ub
=9Rym
-END PGP SIGNATURE-






Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-30 Thread Abhijit Menon-Sen
At 2014-12-30 16:05:50 +0200, hlinnakan...@vmware.com wrote:

 A couple of quick comments:
 
 bswap32 is unused on little-endian systems. That will give a
 compiler warning.

Huh. I don't get a warning, even when I add -Wunused to the build flags.
But since you mention it, it would be better to write the function thus:

static inline uint32 cpu_to_le32(uint32 x)
{
#ifndef WORDS_BIGENDIAN
	return x;
#elif defined(__GNUC__) || defined(__clang__)
	return __builtin_bswap32(x);
#else
	return  ((x << 24) & 0xff000000) |
			((x <<  8) & 0x00ff0000) |
			((x >>  8) & 0x0000ff00) |
			((x >> 24) & 0x000000ff);
#endif
}

 pg_comp_crc32c_sse […] fetches the 8-byte chunks from only 4-byte
 aligned addresses. Is that intentional?

Thanks for spotting that. I had meant to change the test to & 7. But
again, now that you mention it, I'm not sure it's necessary. The CRC32*
instructions don't have the usual SSE alignment requirements, and I see
that the Linux kernel (among other implementations) processes eight bytes
at a time from the start of the buffer and then processes the remaining
bytes one at a time. I'll do a bit more research and post an update.
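
For reference, the kernel-style loop I mean looks roughly like this (a sketch
of mine assuming the SSE4.2 intrinsics from <nmmintrin.h>, not the code in
the patch as posted):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>		/* SSE4.2 CRC32 intrinsics; build with -msse4.2 */

/* Sketch: 8 bytes at a time from the start of the buffer, then the tail. */
static uint32_t
crc32c_sse42(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len >= 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, p, 8);	/* no alignment assumption */
		crc = (uint32_t) _mm_crc32_u64(crc, chunk);
		p += 8;
		len -= 8;
	}
	while (len-- > 0)
		crc = _mm_crc32_u8(crc, *p++);

	return crc;
}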

Thanks for having a look.

-- Abhijit




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-30 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 3:53 PM, Kevin Grittner kgri...@ymail.com wrote:
 I tend to build out applications on top of functions and the
 inability to set isolation mode inside a function confounds me
 from using anything but 'read committed'.

 Hey, no problem -- just set default_transaction_isolation =
 'serializable' in your postgresql.conf file and never override
 isolation level and you'll never have to use an explicit table lock
 or SELECT FOR UPDATE again. While you're at it, set
 default_transaction_read_only = on and only override it for
 transactions that (might) need to update and you'll have dodged the
 worst of the performance problems from serializable transactions.

ok...wow.  That's pretty convincing, honestly.  Your objective is
sensible and clear.

So basically there are/were two concrete concerns:  compatibility
break and loss of reported detail.  You've convinced me that
principled serialization code shouldn't be impacted negatively by
changing the returned sqlstate (although, perhaps, we're being a bit
too cavalier in terms of the amount of unprincipled code out there --
particularly looping pl/pgsql exception handlers).

Ideally, you'd sneak more detail about the specifics of the error into
errdetail or someplace as Jim is suggesting.  That'd address the other
point.  So, given those things, I'm mollified, FWIW.

As an aside, the main reason I don't run with serializable isn't
performance paranoia or OCD management of locking semantics in the
90%+ of code that doesn't need that treatment; it's because I build
code (well, chaining DML and such) in the server, always, and have no
way of managing isolation level inside of functions because by the
time the function body is invoked the snapshot is already generated.
I can make read committed transactions work with appropriate locking
but there is no way to downgrade serializable transactions in flight.
IOW, lack of proper stored procedures is what's holding me back.

merlin




Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-30 Thread David Johnston
On Tue, Dec 30, 2014 at 8:54 AM, Adrian Klaver adrian.kla...@aklaver.com
wrote:

 On 12/30/2014 07:43 AM, David G Johnston wrote:

 Tom Lane-2 wrote

 Bernd Helmle <mailings@> writes:

 --On 29. Dezember 2014 12:55:11 -0500 Tom Lane <tgl@.pa> wrote:

 Given the lack of previous complaints, this probably isn't backpatching
 material, but it sure seems like a bit of attention to consistency
 would be warranted here.


  Now that i read it i remember a client complaining about this some time
 ago. I forgot about it, but i think there's value in it to backpatch.


 Hm.  Last night I wrote the attached draft patch, which I was intending
 to apply to HEAD only.  The argument against back-patching is basically
 that this might change the interpretation of scripts that had been
 accepted silently before.  For example
 \set ECHO_HIDDEN NoExec
 will now select noexec mode whereas before you silently got on mode.
 In one light this is certainly a bug fix, but in another it's just
 definitional instability.

 If we'd gotten a field bug report we might well have chosen to
 back-patch,
 though, and perhaps your client's complaint counts as that.

 Opinions anyone?


 -0.5 for back patching

 The one thing supporting this is that we'd potentially be fixing scripts
 that are broken but don't know it yet.  But the downside of changing
 active settings for working scripts - even if they are only accidentally
 working - is enough to counter that for me.  Being more liberal in our
 acceptance of input is more feature than bug fix even if we document that
 we accept more items.


 It is more about being consistent than liberal. Personally I think a
 situation where for one variable 0 = off but for another 0 = on is a bug


​I can sorta buy the consistency angle but what will seal it for me is
script portability - the ability to write a script and instructions using
the most current release and have it run on previous versions without
having to worry about this kind of incompatibility.

So, +1 for back patching from me.

David J.​


[HACKERS] pg_ctl {start,restart,reload} bad handling of stdout file descriptor

2014-12-30 Thread Luis Menina
Hi,

I've been trying to run some pg_ctl commands inside a Python script,
and saw that some of them were deadlocking. It seems that the
commands that start postgres handle stdout in a way that blocks
the caller. Redirecting stdout to /dev/null or to a file using the -l
option allows me to work around the problem, but fixing it upstream
would be nice.

Here's a simple python program that reproduces the problem, and should deadlock.
Customize your data path first, of course.

Regards,
-- 
Luis MENINA / Software Engineer
Web: www.anevia.com
Anevia: 1 rue René Anjolvy, 94250 Gentilly, France
#! /usr/bin/env python

import subprocess

PGDATADIR = '/mnt/streams1/pgsql/localdb/data'

# WORKS !!! (because of stdout redirection)
#cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR + ' > /dev/null'
#cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR + ' -l /tmp/postgres.log'

# BROKEN !!!
#cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR + ' 2>&1'
cmd = 'sudo -u postgres /usr/lib/postgresql/9.3/bin/pg_ctl restart -w -s -m fast -D ' + PGDATADIR

print cmd

print "Running subprocess.check_output"
subprocess.check_output(cmd, shell=True)



Re: [HACKERS] pg_ctl {start,restart,reload} bad handling of stdout file descriptor

2014-12-30 Thread Luis Menina
I knew I'd forget something...
I'm running Postgres 9.3.5 on Debian/Linux.

Regards,
--
Luis MENINA / Software Engineer
Web: www.anevia.com
Anevia: 1 rue René Anjolvy, 94250 Gentilly, France




Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-30 Thread Jeff Janes
On Sun, Dec 28, 2014 at 12:45 PM, Peter Geoghegan p...@heroku.com wrote:

 On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis pg...@j-davis.com wrote:
  Do others have similar numbers? I'm quite surprised at how little
  work_mem seems to matter for these plans (HashJoin might be a different
  story though). I feel like I made a mistake -- can someone please do a
  sanity check on my numbers?

 I have seen external sorts that were quicker than internal sorts
 before. With my abbreviated key patch, under certain circumstances
 external sorts are faster, while presumably the same thing is true of
 int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
 that was marginally faster than a multi-gigabyte one that fit the
 entire sort in memory. It probably has something to do with caching
 effects dominating over the expense of more comparisons, since higher
 work_mem settings that still resulted in an external sort were slower
 than the 10MB setting.

 I was surprised by this too, but it has been independently reported by
 Jeff Janes.


I don't recall (at the moment) seeing our external sort actually faster
than quick-sort, but I've very reliably seen external sorts get faster with
less memory than with more.  It is almost certainly a CPU caching issue.
Very large simple binary heaps are horrible on the CPU cache. And for
sort-by-reference values, quick sort is also pretty bad.

With a slow enough data bus between the CPU and main memory, I don't doubt
that a 'tapesort' with small work_mem could actually be faster than
quicksort with large work_mem.  But I don't recall seeing it myself.  But
I'd be surprised that a tapesort as currently implemented would be faster
than a quicksort if the tapesort is using just one byte less memory than
the quicksort is.

But to Jeff Davis's question, yes, tapesort is not very sensitive to
work_mem, and to the extent it is sensitive it is in the other direction of
more memory being bad.  Once work_mem is so small that it takes multiple
passes over the data to do the merge, then small memory would really be a
problem.  But on modern hardware you have to get pretty absurd settings
before that happens.

Cheers,

Jeff


Re: [HACKERS] Documentation of bt_page_items()'s ctid field

2014-12-30 Thread Heikki Linnakangas

On 12/30/2014 04:08 AM, Peter Geoghegan wrote:

Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.


How much detail on the b-tree internals do we want to put in the 
pageinspect documentation? I can see that being useful, but should we 
also explain e.g. that the first item on each (non-rightmost) page is 
the high key?



+  Note that structfieldctid/ addresses a heap tuple if the
+  page under consideration is a B-Tree leaf page.  Otherwise, for
+  internal B-Tree pages structfieldctid/ addresses a page in
+  the B-Tree itself (excluding the root page if and only if there
+  has not yet been a root page split, as in the example above).
+  These internally referenced pages are child pages, and may
+  themselves be leaf pages or internal pages.


I had a hard time understanding the remark about the root page. But in 
any case, if you look at the flags set e.g. with bt_page_stats(), the 
root page is flagged as also being a leaf page, when it is the only page 
in the index. So the root page is considered also a leaf page in that case.


I'd suggest saying the same thing (or more) with fewer words:

In a B-tree leaf page, structfieldctid/ points to a heap tuple. In 
an internal page, it points to another page in the index itself, and the 
offset number part (the second number) of the ctid field is ignored.


- Heikki





Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-30 Thread Greg Stark
On Mon, Dec 29, 2014 at 4:17 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Serialization errors only exist as a concession
 to concurrency and performance.  Again, they should be returned as
 sparsely as possible


I think this is fuzzy thinking. Serialization *errors* themselves are
a concession to concurrency and performance, but the actual signaling
of the errors is just a consequence of the serialization failure
having occurred. If you can find a way to avoid the serialization failure
then yay, but you can't just not report it and call that a win.

The point of returning a serialization failure is to report when a
transaction sees something that can't be possible in a serialized
execution. It's not that retrying might make the error go away; it's
that the error shouldn't have been possible in a serialized execution,
so the code shouldn't have to be prepared for it.

So if you get a unique violation or an RI violation and the
transaction didn't previously verify that the constraint wouldn't be
violated then it's perfectly ok to return a constraint violation
error. It doesn't matter whether retrying might avoid the error since
there was some serialized order in which the constraint was violated.

The case at issue is things like upsert where the code already
verified that no violation should occur. Upsert is a typical case but
it would be equally valid if it's an RI constraint and the transaction
verified that the RI constraint is satisfied before inserting or
updating. If a concurrent transaction breaks the constraint for this
insert and the insert gets a constraint violation then it ought to get
a serialization failure since the constraint failure can't occur in a
serialized execution order.

But shouldn't the Read-Write dependency graph already be picking this
up? If you've previously read a row and someone updates it and then
you try to lock it for update, it should be considering this a
serialization failure.

-- 
greg




Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-30 Thread Andres Freund
On 2014-12-30 21:36:19 +0530, Abhijit Menon-Sen wrote:
 Thanks for spotting that. I had meant to change the test to & 7. But
 again, now that you mention it, I'm not sure it's necessary. The CRC32*
 instructions don't have the usual SSE alignment requirements, and I see
 that the Linux kernel (among other implementations) processes eight bytes
 at a time from the start of the buffer and then processes the remaining
 bytes one at a time. I'll do a bit more research and post an update.

I'd done some quick and dirty benchmarking when writing my initial
crc32c POC and I found that four byte aligned accesses were faster than
ones not. But it really just was running it a couple times, nothing more.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-12-30 Thread Jeff Janes
On Tue, Dec 30, 2014 at 12:35 AM, Guillaume Lelarge guilla...@lelarge.info
wrote:

 Sorry for my very late answer. It's been a tough month.

 2014-11-27 0:00 GMT+01:00 Bruce Momjian br...@momjian.us:

 On Mon, Nov  3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
   It looked to me that the formula, when descending from a previously
   stressed state, would be:
 
   greatest((1 + checkpoint_completion_target) * checkpoint_segments,
   wal_keep_segments) + 1 +
   2 * checkpoint_segments + 1

 I don't think we can assume checkpoint_completion_target is at all
 reliable enough to base a maximum calculation on, assuming anything
 above the maximum is cause of concern and something to inform the admins
 about.

 Assuming checkpoint_completion_target is 1 for maximum purposes, how
 about:

 max(2 * checkpoint_segments, wal_keep_segments) + 2 *
 checkpoint_segments + 2


 Seems something I could agree on. At least, it makes sense, and it works
 for my customers. Although I'm wondering why + 2, and not + 1. It seems
 Jeff and you agree on this, so I may have misunderstood something.


From hazy memory, one +1 comes from the currently active WAL file, which
exists but is not counted towards either wal_keep_segments nor towards
recycled files.  And the other +1 comes from the formula for how many
recycled files to retain, which explicitly has a +1 in it.
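
To make the arithmetic concrete, here is a rough sketch of that bound as a
query, treating checkpoint_completion_target as 1 for the worst case (it
just restates the formula against the current settings):

SELECT greatest(2 * cs, wks) + 2 * cs + 2 AS max_xlog_segments
FROM (SELECT current_setting('checkpoint_segments')::int AS cs,
             current_setting('wal_keep_segments')::int   AS wks) AS s;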

Cheers,

Jeff


Re: [HACKERS] Documentation of bt_page_items()'s ctid field

2014-12-30 Thread Peter Geoghegan
On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 How much detail on the b-tree internals do we want to put in the pageinspect
 documentation? I can see that being useful, but should we also explain e.g.
 that the first item on each (non-rightmost) page is the high key?

Maybe we should. I see no reason not to, and I think that it makes
sense to explain things at that level without going into flags and so
on. But don't forget that that isn't quite the full story if we're
going to talk about high keys at all; we must also explain minus
infinity keys, alongside any explanation of the high key:

 * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
 * minus infinity: this routine will always claim it is less than the
 * scankey.  The actual key value stored (if any, which there probably isn't)
 * does not matter.  This convention allows us to implement the Lehman and
 * Yao convention that the first down-link pointer is before the first key.
 * See backend/access/nbtree/README for details.

In particular, this means that the key data is garbage, which is
something I've also seen causing confusion [1].

I would like to make it easier for competent non-experts on the B-Tree
code to eyeball a B-Tree with pageinspect, and be reasonably confident
that things add up. In order for such people to know that something is
wrong, we should explain what right looks like in moderate detail.
So, as I said, I feel an exact explanation of flags is unnecessary,
but tend to agree that a brief reference to both page highkeys and
minus infinity keys is appropriate, since users of the function will
see them all the time.

 I had a hard time understanding the remark about the root page. But in any
 case, if you look at the flags set e.g. with bt_page_stats(), the root page
 is flagged as also being a leaf page, when it is the only page in the index.
 So the root page is considered also a leaf page in that case.

I think that a better way of handling that originally would have been
to make root-ness a separate property from leaf-ness/internal-ness.
Too late for that now, I suppose.

 I'd suggest saying the same thing (or more) with fewer words:

 In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In an
 internal page, it points to another page in the index itself, and the offset
 number part (the second number) of the ctid field is ignored.

That seems good. What do you think of the attached revision?

[1] 
http://www.postgresql.org/message-id/20140828.110824.1195843073079055852.t-is...@sraoss.co.jp
-- 
Peter Geoghegan
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 6f51dc6..ffc620a 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -192,6 +192,16 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
   7 | (0,7)   |  12 | f | f| 29 27 00 00
   8 | (0,8)   |  12 | f | f| 2a 27 00 00
 </screen>
+  In a B-tree leaf page, <structfield>ctid</> points to a heap
+  tuple.  In an internal page, it points to another page in the
+  index itself, and the offset number part (the second number) of
+  the <structfield>ctid</> field is ignored.  Note that the first
+  item on any non-rightmost page is the page <quote>high
+  key</quote>, meaning its <structfield>data</> serves as an upper
+  bound on all items appearing on the page.  Also, on non-leaf
+  pages, the first data item (the first item that is not a high
+  key) is a <quote>minus infinity</quote> item, implying that the
+  value of <structfield>data</> is garbage.
  </para>
 </listitem>
   </varlistentry>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Documentation of bt_page_items()'s ctid field

2014-12-30 Thread Heikki Linnakangas

On 12/30/2014 10:07 PM, Peter Geoghegan wrote:

On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

How much detail on the b-tree internals do we want to put in the pageinspect
documentation? I can see that being useful, but should we also explain e.g.
that the first item on each (non-rightmost) page is the high key?


Maybe we should. I see no reason not to, and I think that it makes
sense to explain things at that level without going into flags and so
on. But don't forget that that isn't quite the full story if we're
going to talk about high keys at all; we must also explain minus
infinity keys, alongside any explanation of the high key:


Yeah, good point.


  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * minus infinity: this routine will always claim it is less than the
  * scankey.  The actual key value stored (if any, which there probably isn't)
  * does not matter.  This convention allows us to implement the Lehman and
  * Yao convention that the first down-link pointer is before the first key.
  * See backend/access/nbtree/README for details.

In particular, this means that the key data is garbage, which is
something I've also seen causing confusion [1].


In practice, we never store any actual key value for the minus 
infinity key. I guess the code would ignore it if it was there, but it 
would make more sense to explain that the first data key on an internal 
page does not have a key value. If there is a value there, it's a sign 
that something's wrong.



I would like to make it easier for competent non-experts on the B-Tree
code to eyeball a B-Tree with pageinspect, and be reasonably confident
that things add up. In order for such people to know that something is
wrong, we should explain what right looks like in moderate detail.


Makes sense.


I had a hard time understanding the remark about the root page. But in any
case, if you look at the flags set e.g. with bt_page_stats(), the root page
is flagged as also being a leaf page, when it is the only page in the index.
So the root page is considered also a leaf page in that case.


I think that a better way of handling that originally would have been
to make root-ness a separate property from leaf-ness/internal-ness.


Hmm, yeah, bt_page_stats() currently returns 'l' in the type column when 
(BTP_ROOT | BTP_LEAF).
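
For instance, a quick sketch (assuming pageinspect is installed and the
index is small enough that the root is its only leaf page):

CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE TABLE tiny (i int);
INSERT INTO tiny SELECT generate_series(1, 10);
CREATE INDEX tiny_i_idx ON tiny (i);
-- Block 0 is the metapage; block 1 is the root, which is also the only
-- leaf here, so the type column reports 'l' rather than 'r'.
SELECT type, live_items FROM bt_page_stats('tiny_i_idx', 1);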


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Documentation of bt_page_items()'s ctid field

2014-12-30 Thread Peter Geoghegan
On Tue, Dec 30, 2014 at 12:21 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 In practice, we never store any actual key value for the minus infinity
 key. I guess the code would ignore it if it was there, but it would make
 more sense to explain that the first data key on an internal page does not
 have a key value. If there is a value there, it's a sign that something's
 wrong.

Well, it depends on what your exact definition of a key is, I suppose.
Here is what I see for a B-Tree's root page (it's a few hundred KiBs
in size):

postgres=# select * from bt_page_items('ix_order_custid', 3);
 itemoffset |  ctid  | itemlen | nulls | vars |  data
++-+---+--+-
  1 | (1,1)  |   8 | f | f|
  2 | (2,1)  |  16 | f | f| 65 02 00 00 00 00 00 00
  3 | (4,1)  |  16 | f | f| d2 04 00 00 00 00 00 00
  4 | (5,1)  |  16 | f | f| 28 07 00 00 00 00 00 00
  5 | (6,1)  |  16 | f | f| b9 09 00 00 00 00 00 00

*** SNIP ***

It's storing an index tuple, but not a key value itself. I guess that
the equivocating comments (that I pasted here before, that appear over
_bt_compare()) about what is stored made me use the term garbage, but
off hand I'm not aware of how that could actually happen either.
Perhaps it's an obsolete comment, and the data field will always be
empty for minus infinity items?

I'll leave the exact description of this state of the data field to you.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl {start,restart,reload} bad handling of stdout file descriptor

2014-12-30 Thread Tom Lane
Luis Menina lmen...@anevia.com writes:
 I've been trying to run some pg_ctl commands inside a python script,
 and saw that some of them were deadlocking. It seems that the
 commands that start postgres handle stdout in a way that blocks
 the caller. Redirecting stdout to /dev/null or to a file using the -l
 option allows me to work around the problem, but fixing it upstream
 would be nice.

I think this is just pilot error, not a bug.  When you launch the
postmaster via pg_ctl and don't supply a -l option, the postmaster's
stdout remains connected to wherever pg_ctl's stdout went to --- which,
in this instance, is a pipe leading to the python script's process.
So even after pg_ctl exits, subprocess.check_output() sees the pipe as
still having live writers, and it keeps waiting for more input (which
may indeed be forthcoming, if the postmaster prints log data to stdout).
You can verify this by issuing pg_ctl stop from another terminal
and noting that the python script exits when the postmaster shuts down.

This can actually be useful behavior; for instance you might want to
collect the postmaster's stdout via a python script.  So we are certainly
not going to call it a bug and break it.  What's perhaps more debatable is
whether it should be the default ... but it's been that way for a decade
or two, so changing the default would probably annoy far more people than
it would help.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: pg_event_trigger_dropped_objects: Add name/args output columns

2014-12-30 Thread Alvaro Herrera
Alvaro Herrera wrote:
 pg_event_trigger_dropped_objects: Add name/args output columns
 
 These columns can be passed to pg_get_object_address() and used to
 reconstruct the dropped objects identities in a remote server containing
 similar objects, so that the drop can be replicated.

This is causing buildfarm member dunlin to fail:

== pgsql.build/src/test/regress/regression.diffs 
===
*** /buildfarm/root/HEAD/pgsql.build/src/test/regress/expected/privileges.out   
Tue Dec 30 13:05:24 2014
--- /buildfarm/root/HEAD/pgsql.build/src/test/regress/results/privileges.out
Tue Dec 30 13:10:49 2014
***
*** 1323,1328 
--- 1323,1329 
  
  DROP SCHEMA testns CASCADE;
  NOTICE:  drop cascades to table testns.acltest1
+ ERROR:  requested object address for object type that cannot support it
  SELECT d.* -- check that entries went away
FROM pg_default_acl d LEFT JOIN pg_namespace n ON defaclnamespace = n.oid
WHERE nspname IS NULL AND defaclnamespace != 0;

No other member so far has reported a problem here.  Not sure if that's
the strangest bit, or the fact that dunlin is reporting anything in the
first place.  I can reproduce the problem quite simply by creating an
event trigger on sql_drop and then trying to drop an object not supported
by getObjectIdentityParts -- in this case it's a default ACL for tables
in the schema being dropped.
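
For reference, a rough sketch of that reproduction (the trigger and
function names are made up; the point is just that an sql_drop trigger is
in place while a default ACL is dropped along with its schema):

CREATE FUNCTION note_drops() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects() LOOP
        RAISE NOTICE 'dropped: % %', obj.object_type, obj.object_identity;
    END LOOP;
END;
$$;
CREATE EVENT TRIGGER note_drops_trg ON sql_drop
    EXECUTE PROCEDURE note_drops();

CREATE SCHEMA testns;
ALTER DEFAULT PRIVILEGES IN SCHEMA testns GRANT SELECT ON TABLES TO public;
DROP SCHEMA testns CASCADE;   -- the error appears here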

My hope was that this check would help us detect new object types that
did not have a working get_object_address representation, but since
there are already cases that don't have it, it's currently bogus.  We
could add the default ACL support without too much effort, I think, but
it would take even less effort to just remove the check.

Another possible option is to mark ALTER DEFAULT PRIVILEGES as an
unsupported command in event triggers.  I'm not sure why we have it,
particularly since we don't have GRANT and REVOKE ... but then my DDL
deparsing patch is supposed to add support for GRANT and REVOKE, so I'm
not really proposing this.

Any strong opinions?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-30 Thread Andrew Dunstan


On 12/30/2014 09:20 AM, Tom Lane wrote:

Bernd Helmle maili...@oopsware.de writes:

--On 29. Dezember 2014 12:55:11 -0500 Tom Lane t...@sss.pgh.pa.us wrote:

Given the lack of previous complaints, this probably isn't backpatching
material, but it sure seems like a bit of attention to consistency
would be warranted here.

Now that I read it I remember a client complaining about this some time
ago. I forgot about it, but I think there's value in backpatching it.

Hm.  Last night I wrote the attached draft patch, which I was intending
to apply to HEAD only.  The argument against back-patching is basically
that this might change the interpretation of scripts that had been
accepted silently before.  For example
\set ECHO_HIDDEN NoExec
will now select noexec mode whereas before you silently got on mode.
In one light this is certainly a bug fix, but in another it's just
definitional instability.

If we'd gotten a field bug report we might well have chosen to back-patch,
though, and perhaps your client's complaint counts as that.

Opinions anyone?

r


I got caught by this with ON_ERROR_ROLLBACK on 9.3 just this afternoon 
before remembering this thread. So there's a field report :-)


+0.75 for backpatching (It's hard to imagine someone relying on the bad 
behaviour, but you never know).


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_event_trigger_dropped_objects: Add name/args output columns

2014-12-30 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Alvaro Herrera wrote:
 pg_event_trigger_dropped_objects: Add name/args output columns

 This is causing buildfarm member dunlin to fail:
 ...
 No other member so far has reported a problem here.  Not sure if that's
 the strangest bit, or the fact that dunlin is reporting anything in the
 first place.  I can reproduce the problem quite simply by creating an
 event trigger on sql_drop and then trying to drop an object not supported
 by getObjectIdentityParts -- in this case it's a default ACL for tables
 in the schema being dropped.

But is there any reason to think the failure on dunlin has anything to do
with default ACLs?  I think you'd better work on understanding why there
is a platform dependency here, before you consider either removing the
regression test case or adding support for object types that it shouldn't
be hitting.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Failure on markhor with CLOBBER_CACHE_ALWAYS for test brin

2014-12-30 Thread Michael Paquier
Hi all.

markhor has run for the first time in 8 days, and there is something
in range e703261..72dd233 making the brin regression test crash.
See here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-30%2020%3A58%3A49
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POLA violation with \c service=

2014-12-30 Thread David Fetter
On Wed, Dec 17, 2014 at 08:14:04AM -0500, Andrew Dunstan wrote:
 
 On 12/17/2014 04:11 AM, Heikki Linnakangas wrote:
 On 12/17/2014 10:03 AM, Albe Laurenz wrote:
 David Fetter wrote:
 I've noticed that psql's  \c function handles service= requests in a
 way that I can only characterize as broken.
 
 This came up in the context of connecting to a cloud hosting service
 named after warriors or a river or something, whose default hostnames
 are long, confusing, and easy to typo, so I suspect that service= may
 come up more often going forward than it has until now.
 
 For example, when I try to use
 
 \c service=foo
 
 It will correctly figure out which database I'm trying to connect to,
 but fail to notice that it's on a different host, port, etc., and
 hence fail to connect with a somewhat unhelpful error message.
 
 I can think of a few approaches for fixing this:
 
 0.  Leave it broken.
 1.  Disable service= requests entirely in \c context, and error out
 if attempted.
 2.  Ensure that \c actually uses all of the available information.
 
 Is there another one I missed?
 
 If not, which of the approaches seems reasonable?
 
 #2 is the correct solution, #1 a band aid.
 
 It would be handy, if \c service=foo actually worked. We should do #3.
 If the database name is actually a connection string, or a service
 specification, it should not re-use the hostname and port from previous
 connection, but use the values from the connection string or service file.
 
 Yeah, that's the correct solution. It should not be terribly difficult to
 create a test for a conninfo string in the dbname parameter. That's what
 libpq does after all. We certainly don't want psql to have to try to
 interpret the service file. psql just needs to let libpq do its work in this
 situation.

This took a little longer to get time to polish than I'd hoped, but
please find attached a patch which:

- Correctly connects to service= and postgres(ql)?:// with \c
- Disallows tab completion in the above cases

I'd like to see about having tab completion actually work correctly in
at least the service= case, but that's a matter for a follow-on patch.

Thanks to Andrew Dunstan for the original patch, and to Andrew Gierth
for his help getting it into shape.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index c7a17d7..4ca50b3 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1610,6 +1610,8 @@ do_connect(char *dbname, char *user, char *host, char 
*port)
PGconn *o_conn = pset.db,
   *n_conn;
char   *password = NULL;
+   boolkeep_password = true;
+   boolhas_connection_string = false;
 
if (!o_conn && (!dbname || !user || !host || !port))
{
@@ -1623,14 +1625,32 @@ do_connect(char *dbname, char *user, char *host, char 
*port)
return false;
}
 
-   if (!dbname)
-   dbname = PQdb(o_conn);
if (!user)
user = PQuser(o_conn);
+   else if (strcmp(user, PQuser(o_conn)) != 0)
+   keep_password = false;
+
if (!host)
host = PQhost(o_conn);
+   else if (strcmp(host, PQhost(o_conn)) != 0)
+   keep_password = false;
+
if (!port)
port = PQport(o_conn);
+   else if (strcmp(port, PQport(o_conn)) != 0)
+   keep_password = false;
+
+   has_connection_string = recognized_connection_string(dbname);
+
+   if (has_connection_string)
+   keep_password = false;
+
+   /*
+* Unlike the previous stanzas, changing only the dbname shouldn't
+* trigger throwing away the password.
+*/
+   if (!dbname)
+   dbname = PQdb(o_conn);
 
/*
 * If the user asked to be prompted for a password, ask for one now. If
@@ -1646,9 +1666,13 @@ do_connect(char *dbname, char *user, char *host, char 
*port)
{
password = prompt_for_password(user);
}
-   else if (o_conn && user && strcmp(PQuser(o_conn), user) == 0)
+   else if (o_conn  keep_password)
{
-   password = pg_strdup(PQpass(o_conn));
+   password = PQpass(o_conn);
+   if (password && *password)
+   password = pg_strdup(password);
+   else
+   password = NULL;
}
 
while (true)
@@ -1656,23 +1680,28 @@ do_connect(char *dbname, char *user, char *host, char 
*port)
 #define PARAMS_ARRAY_SIZE  8
const char **keywords = pg_malloc(PARAMS_ARRAY_SIZE * 
sizeof(*keywords));
const char **values = pg_malloc(PARAMS_ARRAY_SIZE * 
sizeof(*values));
+ 

Re: [HACKERS] Failure on markhor with CLOBBER_CACHE_ALWAYS for test brin

2014-12-30 Thread Alvaro Herrera
Michael Paquier wrote:
 Hi all.
 
 markhor has run for the first time in 8 days, and there is something
 in range e703261..72dd233 making the brin regression test crash.
 See here:
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-30%2020%3A58%3A49

This shows that the crash was in the object_address test, not brin.
Will research.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-30 Thread Peter Geoghegan
On Mon, Dec 29, 2014 at 11:52 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Correct, I haven't seen any problems with approach #1

That helps with debugging #2, then. That's very helpful.

 Generally the problem will occur early on in the process, and if not then it
 will not occur at all.  I think that is because the table starts out empty,
 and so a lot of insertions collide with each other.  Once the table is more
 thoroughly populated, most queries take the CONFLICT branch and therefore two
 insertion-branches are unlikely to collide.

 At its simplest, I just use the count_upsert.pl script and your patch and
 forget all the rest of the stuff from my test platform.

I can reproduce this on my laptop now. I think that building at -O2
and without assertions helps. I'm starting to work through debugging
it.

I threw together a quick script for getting pg_xlogdump into a
Postgres table (a nice use of the new pg_lsn type). It's here:

https://github.com/petergeoghegan/jjanes_upsert/blob/master/pg_xlogdump2csv.py

It tells a story. Looking at the last segment before shutdown when the
problem occurred, I see:

postgres=# select count(*), tx from my_xlogdump group by tx having
count(*) > 4 order by 1;
 count |   tx
---+-
 5 | 1917836
 5 | 1902576
 5 | 1909746
 5 | 1901586
 5 | 1916971
 6 | 1870077
39 | 1918004
   119 | 1918003
  2246 |   0
(9 rows)

postgres=# select max(tx::text::int4) from my_xlogdump ;
   max
-
 1918004
(1 row)

So the last two transactions (1918003 and 1918004) get into some kind
of live-lock situation, it looks like. Or at least something that
causes them to produce significantly more WAL records than other xacts
due to some kind of persistent problem with conflicts.

Here is where the earlier of the two problematic transactions has its
first record:

postgres=# select * from my_xlogdump where tx = '1918003' order by
r_lsn asc limit 1;
 rmgr | len_rec | len_tot |   tx|   r_lsn|  prev_lsn  |
   descr
--+-+-+-+++
 Heap |   3 | 203 | 1918003 | 0/1783BB70 | 0/1783BB48 | INSERT
off 33 blkref #0: rel 1663/16471/12502 blk
(1 row)

After and including that record, until the trouble spot up to and
including shutdown, here is the rmgr breakdown:

postgres=# select count(*),  rmgr from my_xlogdump where r_lsn >=
'0/1783BB70' group by rmgr order by 1;
 count |rmgr
---+-
 1 | XLOG -- 1 CHECKPOINT_SHUTDOWN record
 2 | Transaction -- commit records for the two xacts
20 | Heap2-- all are CLEAN remxid records, tx is 0
76 | Heap  -- All from our two xacts...
80 | Btree  -- All from XID 1918003 only
(5 rows)

So looks like a bad interaction with VACUUM. Maybe it's a problem with
VACUUM interlocking. That was my first suspicion, FWIW.

I'll need to do more investigating, but I can provide a custom format
dump of the table, in case anyone wants to look at what I have here in
detail. I've uploaded it to:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/files/my_xlogdump.custom.dump.table
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes superuser review

2014-12-30 Thread Amit Kapila
On Tue, Dec 30, 2014 at 6:35 PM, Stephen Frost sfr...@snowman.net wrote:
 * Amit Kapila (amit.kapil...@gmail.com) wrote:
  On Tue, Dec 30, 2014 at 6:52 AM, Stephen Frost sfr...@snowman.net
wrote:

  another could be to have a separate privilege (DBA) or a role which is
  not a superuser but can be used to perform such tasks.

 I'm pretty sure we've agreed that having a catch-all role attribute like
 'DBA' is a bad idea because we'd likely be adding privileges to it later
 which would expand the rights of users with that attribute, possibly
 beyond what is intended.


Yes, that could happen, but do you want to say that is the only reason
to consider server-level roles (such as DBA) a bad idea?  Can't we make
users aware of such things via documentation, so that there will be
some onus on users to grant such a privilege with care?

On looking around, it seems many of the databases provide such
roles
https://dbbulletin.wordpress.com/2013/05/29/backup-privileges-needed-to-backup-databases/

In the link, though they are talking about physical (file-level) backup,
there is mention of such roles in other databases.

The other way, as discussed on this thread, is to use something like a
DUMPALL (or DUMPFULL) privilege, which also sounds like a good
way, apart from the fact that the same privilege can be used for
non-dump purposes to extract information from the database/cluster.


   I don't really like 'xlog' as a permission.  BINARY BACKUP is
   interesting but if we're going to go multi-word, I would think we
would
   do PITR BACKUP or something FILESYSTEM BACKUP or similar.  I'm not
   really a big fan of the two-word options though.
 
  How about PHYSICAL BACKUP (for basebackup) and LOGICAL BACKUP
  for pg_dump?

 Again, this makes it look like the read-all right would be specific to
 users executing pg_dump, which is incorrect and misleading.  PHYSICAL
 might also imply the ability to do pg_basebackup.


That's right; however, having unrelated privileges for doing a physical
backup via pg_basebackup and via the pg_start*/pg_stop* method also
doesn't sound like the best way (it can be slightly difficult for
users to correlate after reading the docs).  Don't you think in this case
we should have some form of hierarchy for privileges (e.g. a user
with the privilege to run pg_basebackup can also perform the
backup via the pg_start*/pg_stop* method, or some other way to relate
both forms of physical backup)?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2014-12-30 Thread Robert Haas
On Fri, Dec 26, 2014 at 7:57 AM, David Rowley dgrowle...@gmail.com wrote:
 1. Do we need to keep the 128 byte aggregate state size for machines without
 128 bit ints? This has been reduced to 48 bytes in the patch, which is in
 favour of code being compiled with a compiler which has 128 bit ints.  I kind
 of think that we need to keep the 128 byte estimates for compilers that
 don't support int128, but I'd like to hear any counter arguments.

I think you're referring to the estimated state size in pg_aggregate
here, and I'd say it's probably not a big deal one way or the other.
Presumably, at some point, 128-bit arithmetic will become common
enough that we'll want to change that estimate, but I don't know
whether we've reached that point or not.

 2. References to int16 meaning 16 bytes. I'm really in two minds about this,
 it's quite nice to keep the natural flow, int4, int8, int16, but I can't
 help thinking that this will confuse someone one day. I think it'll be a long
 time before it confused anyone if we called it int128 instead, but I'm not
 that excited about seeing it renamed either. I'd like to hear what others
 have to say... Is there a chance that some sql standard in the distant
 future will have HUGEINT and we might regret not getting the internal names
 nailed down?

Yeah, I think using int16 to mean 16-bytes will be confusing to
someone almost immediately.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] orangutan seizes up during isolation-check

2014-12-30 Thread Robert Haas
On Sun, Dec 28, 2014 at 4:58 PM, Noah Misch n...@leadboat.com wrote:
 I wondered whether to downgrade FATAL to LOG in back branches.  Introducing a
 new reason to block startup is disruptive for a minor release, but having the
 postmaster deadlock at an unpredictable later time is even more disruptive.  I
 am inclined to halt startup that way in all branches.

Jeepers.  I'd rather not do that.  From your report, this problem has
been around for years.  Yet, as far as I know, it's bothering very few
real users, some of whom might be far more bothered by the postmaster
suddenly failing to start.  I'm fine with a FATAL in master, but I'd
vote against doing anything that might prevent startup in the
back-branches without more compelling justification.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2014-12-30 Thread Guillaume Lelarge
2014-12-12 14:58 GMT+01:00 Heikki Linnakangas hlinnakan...@vmware.com:

 On 12/10/2014 04:32 PM, Dennis Kögel wrote:

 Hi,

 On 04.09.2014 at 17:50, Jehan-Guillaume de Rorthais 
 j...@dalibo.com wrote:

 For a few months now, we have occasionally seen .ready files appearing on some
 slave
 instances in various contexts. The two I have in mind are under 9.2.x.
 […]
 So it seems that, for some reason, these old WALs were forgotten by the
 restartpoint mechanism when they should have been recycled/deleted.


 On 08.10.2014 at 11:54, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 1. Where do the FF files come from? In 9.2, FF-segments are not supposed
 to be created, ever. […]
 2. Why are the .done files sometimes not being created?




 We’ve encountered behaviour which seems to match what has been described
 here: On Streaming Replication slaves, there is an odd piling up of old
 WALs and .ready files in pg_xlog, going back several months.

 The fine people on IRC have pointed me to this thread, and have
 encouraged me to revive it with our observations, so here we go:

 Environment:

 Master,  9.2.9
 |- Slave S1, 9.2.9, on the same network as the master
 '- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not*
 a cascading replication)

 wal_keep_segments M=100 S1=100 S2=30
 checkpoint_segments M=100 S1=30 S2=30
 wal_level hot_standby (all)
 archive_mode on (all)
 archive_command on both slaves: /bin/true
 archive_timeout 600s (all)


 - On both slaves, we have „ghost“ WALs and corresponding .ready files
 (currently 600 of each on S2, slowly becoming a disk space problem)

 - There’s always gaps in the ghost WAL names, often roughly 0x20, but not
 always

 - The slave with the „bad“ network link has significantly more of these
 files, which suggests that disturbances of the Streaming Replication
 increase chances of triggering this bug; OTOH, the presence of a name gap
 pattern suggests the opposite

 - We observe files named *FF as well


 As you can see in the directory listings below, this setup is *very* low
 traffic, which may explain the pattern in WAL name gaps (?).

 I’ve listed the entries by time, expecting to easily match WALs to their
 .ready files.
 There sometimes is an interesting delay between the WAL’s mtime and the
 .ready file — especially for *FF, where there’s several days between the
 WAL and the .ready file.

 - Master:   http://pgsql.privatepaste.com/52ad612dfb
 - Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
 - Slave S2: http://pgsql.privatepaste.com/a693a8d7f4


 I’ve only skimmed through the thread; my understanding is that there were
 several patches floating around, but nothing was committed.
 If there’s any way I can help, please let me know.


 Yeah. It wasn't totally clear how all this should work, so I got
 distracted with other stuff and dropped the ball; sorry.

 I'm thinking that we should change the behaviour on master so that the
 standby never archives any files from older timelines, only the new one
 that it generates itself. That will solve the immediate problem of old WAL
 files accumulating, and bogus .ready files appearing in the standby.
 However, it will not solve the bigger problem of how do you ensure that all
 WAL files are archived, when you promote a standby server. There is no
 guarantee on that today anyway, but this will make it even less reliable,
 because it will increase the chances that you miss a file on the old
 timeline in the archive, after promoting. I'd argue that that's a good
 thing; it makes the issue more obvious, so you are more likely to encounter
 it in testing, and you won't be surprised in an emergency. But I've started
 a new thread on that bigger issue, hopefully we'll come up with a solution (
 http://www.postgresql.org/message-id/548af1cb.80...@vmware.com).

 Now, what do we do with the back-branches? I'm not sure. Changing the
 behaviour in back-branches could cause nasty surprises. Perhaps it's best
 to just leave it as it is, even though it's buggy.


As long as master is fixed, I don't actually care. But I agree with Dennis
that it's hard to see what's been committed with all the different issues
found, and if any commits were done, in which branch. I'd like to be able
to tell my customers: update to this minor release to see if it's fixed,
but I can't even do that.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-12-30 Thread Guillaume Lelarge
2014-12-30 18:45 GMT+01:00 Jeff Janes jeff.ja...@gmail.com:

 On Tue, Dec 30, 2014 at 12:35 AM, Guillaume Lelarge 
 guilla...@lelarge.info wrote:

 Sorry for my very late answer. It's been a tough month.

 2014-11-27 0:00 GMT+01:00 Bruce Momjian br...@momjian.us:

 On Mon, Nov  3, 2014 at 12:39:26PM -0800, Jeff Janes wrote:
  It looked to me that the formula, when descending from a previously
 stressed
  state, would be:
 
  greatest((1 + checkpoint_completion_target) * checkpoint_segments,
  wal_keep_segments) + 1 +
  2 * checkpoint_segments + 1

 I don't think we can assume checkpoint_completion_target is at all
 reliable enough to base a maximum calculation on, assuming anything
 above the maximum is cause of concern and something to inform the admins
 about.

 Assuming checkpoint_completion_target is 1 for maximum purposes, how
 about:

 max(2 * checkpoint_segments, wal_keep_segments) + 2 *
 checkpoint_segments + 2


 Seems something I could agree on. At least, it makes sense, and it works
 for my customers. Although I'm wondering why + 2, and not + 1. It seems
 Jeff and you agree on this, so I may have misunderstood something.


 From hazy memory, one +1 comes from the currently active WAL file, which
 exists but is not counted towards either wal_keep_segments or
 recycled files.  And the other +1 comes from the formula for how many
 recycled files to retain, which explicitly has a +1 in it.



OK, that seems much better. Thanks, Jeff.



-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com