Re: [HACKERS] RangeType internal use
Horiguchi-san,

On 06-02-2015 PM 04:34, Kyotaro HORIGUCHI wrote:
> Hi, from nearby:)

Thank you!

>> I wonder why I cannot find a way to get a range type for a given (sub-)
>> type. I would like to build a RangeType from Datum's of lower and upper
>> bounds, much like how construct_array() builds an ArrayType from a Datum
>> array of elements given the elements' type info.
>>
>> Is there some way I do not seem to know? If not, would it be worthwhile
>> to make something like construct_range() that returns a RangeType given
>> Datum's of lower and upper bounds and subtype info?
>
> make_range needs the range type itself.
>
> On the SQL interface, you can get the range type corresponding to a base
> type by looking up the pg_range catalog.
>
> SELECT rngtypid::regtype, rngsubtype::regtype
>   FROM pg_range WHERE rngsubtype = 'int'::regtype;
>
>  rngtypid  | rngsubtype
> -----------+------------
>  int4range | integer
>
> But there's only one syscache for this catalog, which takes the range
> type id. So the reverse resolution rngsubtype -> rngtypid seems not
> available. TypeCache has only comparison function info as a surely
> available element related to range types, but this wouldn't help. I
> think scanning the entire cache is not allowable even if possible.
>
> Perhaps what is needed is adding a RANGESUBTYPE syscache, but I don't
> know whether that is allowable or not.
>
> Thoughts?

Actually, I'm wondering: if there is a one-to-one mapping from range type
to subtype (and vice versa?), then this should be OK. But if not (that is,
the designers of range types thought there is not necessarily such a
mapping), then perhaps we could add, say, an rngtypeisdefault flag to
pg_range.

Perhaps the following is not too pretty:

+/*
+ * get_rangetype_for_type
+ *
+ * returns a TypeCacheEntry for a range type of a given (sub-) type.
+ */
+TypeCacheEntry *
+get_rangetype_for_type(Oid subtypid)
+{
+	Relation	relation;
+	SysScanDesc scan;
+	HeapTuple	rangeTuple;
+	Oid			rngsubtype;
+	Oid			rngtypid = InvalidOid;
+
+	relation = heap_open(RangeRelationId, AccessShareLock);
+
+	scan = systable_beginscan(relation, InvalidOid, false,
+							  NULL, 0, NULL);
+
+	while ((rangeTuple = systable_getnext(scan)) != NULL)
+	{
+		rngsubtype = ((Form_pg_range) GETSTRUCT(rangeTuple))->rngsubtype;
+
+		if (rngsubtype == subtypid)
+		{
+			rngtypid = ((Form_pg_range) GETSTRUCT(rangeTuple))->rngtypid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	heap_close(relation, AccessShareLock);
+
+	return (rngtypid != InvalidOid
+			? lookup_type_cache(rngtypid, TYPECACHE_RANGE_INFO) : NULL);
+}

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RangeType internal use
Hi, from nearby:)

> I wonder why I cannot find a way to get a range type for a given (sub-)
> type. I would like to build a RangeType from Datum's of lower and upper
> bounds. Much like how construct_array() builds an ArrayType from a Datum
> array of elements given elements' type info.
>
> Is there some way I do not seem to know? If not, would it be worthwhile
> to make something like construct_range() that returns a RangeType given
> Datum's of lower and upper bounds and subtype info?

make_range needs the range type itself.

On the SQL interface, you can get the range type corresponding to a base
type by looking up the pg_range catalog:

SELECT rngtypid::regtype, rngsubtype::regtype
  FROM pg_range WHERE rngsubtype = 'int'::regtype;

 rngtypid  | rngsubtype
-----------+------------
 int4range | integer

But there's only one syscache for this catalog, which takes the range
type id. So the reverse resolution rngsubtype -> rngtypid seems not
available. TypeCache has only comparison function info as a surely
available element related to range types, but this wouldn't help. I
think scanning the entire cache is not allowable even if possible.

Perhaps what is needed is adding a RANGESUBTYPE syscache, but I don't
know whether that is allowable or not.

Thoughts?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] PATCH: Reducing lock strength of trigger and foreign key DDL
On Fri, Jan 30, 2015 at 10:26 PM, Andreas Karlsson wrote:
> On 01/30/2015 07:48 AM, Michael Paquier wrote:
>> Looking at the latest patch, it seems that in
>> AlterTableGetLockLevel@tablecmds.c we ought to put AT_ReAddConstraint,
>> AT_AddConstraintRecurse and AT_ProcessedConstraint under the same
>> banner as AT_AddConstraint. Thoughts?
>
> Good point. I think moving those would be a good thing even though it is
> technically not necessary for AT_AddConstraintRecurse, since that one
> should only be related to check constraints.

Andreas, are you planning to send an updated patch?

--
Michael
Re: [HACKERS] Early Setup of instrumentation information in pg_stat_statements
On Thu, Feb 5, 2015 at 8:58 PM, Amit Kapila wrote:
>
> Currently in pg_stat_statements, the setup to track
> instrumentation/totaltime information is done after ExecutorStart().
> Can we do this before ExecutorStart()?

On further analysis, I found that it is currently done after
ExecutorStart() because we want to allocate the Instrumentation info in
the per-query context, which is allocated in ExecutorStart(). But I
wonder why it can't be done inside ExecutorStart() itself, after the
per-query context becomes available.

Currently the backend (standard_ExecutorRun()) takes care of updating
the stats/instrumentation info for a plugin (pg_stat_statements), while
the info is initialized by the plugin itself. Wouldn't it be better for
both operations to be done by the backend, since the backend knows best
when it is appropriate to initialize or update, and the plugin needs
only to indicate that it wants stats?

Does anyone see a problem with doing it the above way, or can think of
a better way to make this information (that the plugin needs stats)
available in the ExecutorStart phase?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
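The protocol proposed above can be sketched in standalone C. All names here (QueryDesc fields, function names) are invented for illustration and are not PostgreSQL's actual structures: the plugin only raises a "needs stats" flag before executor startup, and the core then allocates the instrumentation in its own startup phase and updates it during the run.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Sketch of the proposed division of labour (hypothetical names, not
 * PostgreSQL's): the plugin merely sets a flag, and the core executor
 * both allocates the counter once its per-query context exists and
 * accounts elapsed time on the plugin's behalf.
 */
typedef struct
{
    bool    needs_stats;    /* set by the plugin before startup */
    double *total_time;     /* allocated and updated by the core */
} QueryDesc;

/* plugin hook: runs at the start of executor startup */
static void
plugin_pre_executor_start(QueryDesc *qd)
{
    qd->needs_stats = true;
}

/* core: allocate instrumentation once the per-query context exists */
static void
executor_start(QueryDesc *qd)
{
    plugin_pre_executor_start(qd);
    if (qd->needs_stats)
        qd->total_time = calloc(1, sizeof(double));
}

/* core: account elapsed time so the plugin never touches timing */
static void
executor_run(QueryDesc *qd, double elapsed)
{
    if (qd->total_time)
        *qd->total_time += elapsed;
}
```

The point of the sketch is that the plugin never has to guess when allocation is safe; the core performs it at the one moment it knows the context is ready.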
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
Sorry, I misunderstood that.

> > At Wed, 4 Feb 2015 19:22:39 +0900, Fujii Masao wrote in
> >
> >> On Wed, Feb 4, 2015 at 4:58 PM, Kyotaro HORIGUCHI wrote:
> >> > I'm very sorry for the confused report. The problem was found in
> >> > 9.4.0 and the diagnosis was mistakenly done on master.
> >> >
> >> > 9.4.0 has no problem of feedback delay caused by slow xlog
> >> > receiving on pg_basebackup mentioned in the previous mail. But
> >> > the current master still has this problem.
> >>
> >> Seems walreceiver has the same problem. No?
> >
> > pg_receivexlog.c would have the same problem since it uses the
> > same function as pg_basebackup.c.
> >
> > The correspondent of HandleCopyStream in walreceiver is
> > WalReceiverMain, and it doesn't seem to have the same kind of
> > loop shown below. It seems to surely send feedback per one
> > record.
> >
> > | r = stream_reader();
> > | while (r > 0)
> > | {
> > |     ... wal record processing stuff without sending feedback ...
> > |     r = stream_reader();
> > | }
>
> WalReceiverMain() has similar code, as follows.
>
> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
> if (len != 0)
> {
>     for (;;)
>     {
>         if (len > 0)
>         {
>
>             len = walrcv_receive(0, &buf);
>         }
>     }
> }

The loop is a bit different but eventually the same as far as this
discussion is concerned.

408> len = walrcv_receive(NAPTIME_PER_CYCLE, &buf);
409> if (len != 0)
410> {
415>     for (;;)
416>     {
417>         if (len > 0)
418>         {
425>             XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
426>         }
427-438>     else { break; }
439>         len = walrcv_receive(0, &buf);
440>     }
441> }

XLogWalRcvProcessMsg doesn't send feedback when processing a 'w' (WAL
record) packet, so the same thing as in pg_basebackup and
pg_receivexlog will occur in walreceiver. Exiting the for (;;) loop via
TimestampDifferenceExceeds just before line 439 is the equivalent of
the measure I proposed for receivelog.c, but calling
XLogWalRcvSendHSFeedback(false) there is seemingly simpler (though I
feel a bit uncomfortable with the latter). Or other measures?
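The fix being discussed, stripped of PostgreSQL specifics, amounts to tracking the time of the last feedback inside the tight receive loop and sending one as soon as the interval has elapsed, rather than waiting for the stream to go idle. The names below (FeedbackState, maybe_send_feedback) are invented for this sketch; the real code would use GetCurrentTimestamp()/feGetCurrentTimestamp() and the configured status interval:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the proposed measure: call this once per
 * processed WAL message inside the receive loop.  Time is passed in
 * explicitly so the logic is testable; returns true when a feedback
 * message was (would have been) sent.
 */
typedef struct
{
    long last_feedback;   /* time of last feedback sent, in ms */
    long interval;        /* configured feedback interval, in ms */
    int  feedback_count;  /* number of feedback messages sent */
} FeedbackState;

static bool
maybe_send_feedback(FeedbackState *state, long now)
{
    if (now - state->last_feedback >= state->interval)
    {
        /* ... send the status/hot-standby feedback packet here ... */
        state->last_feedback = now;
        state->feedback_count++;
        return true;
    }
    return false;
}
```

Whether this check lives inside the loop (sending directly, as with XLogWalRcvSendHSFeedback(false)) or breaks out of the loop first is exactly the design choice raised above; the timing logic is the same either way.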
regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Thu, Feb 5, 2015 at 11:58 PM, Michael Paquier wrote:
> An updated patch is attached.

I just noticed that the patch I sent was incorrect:
- The parameter name was still wal_availability_check_interval and not
  wal_retrieve_retry_interval
- The documentation was incorrect.
Please use the attached patch instead for further review.

--
Michael

From 06a4d3d1f5fe4362d7be1404cbf0b45b74fea69f Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval

This parameter aids to control at which timing WAL availability is
checked when a node is in recovery, particularly when successive
failures happen when fetching WAL archives, or when fetching WAL
records from a streaming source. Default value is 5s.
---
 doc/src/sgml/recovery-config.sgml               | 17 +++++++
 src/backend/access/transam/recovery.conf.sample |  9 ++++
 src/backend/access/transam/xlog.c               | 47 ++++++++++++-----
 src/backend/utils/adt/timestamp.c               | 38 ++++++++++++++
 src/include/utils/timestamp.h                   |  2 ++
 5 files changed, 99 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0c64ff2..d4babbd 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,23 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
+
+     wal_retrieve_retry_interval (integer)
+
+       wal_retrieve_retry_interval recovery parameter
+
+       This parameter specifies the amount of time to wait when a failure
+       occurred when reading WAL from a source (be it via streaming
+       replication, local pg_xlog or WAL archive) for a node
+       in standby mode, or when WAL is expected from a source. Default
+       value is 5s.
+
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index b777400..458308c 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,15 @@
 #
 #recovery_end_command = ''
 #
+#
+# wal_retrieve_retry_interval
+#
+# specifies an optional interval to wait for WAL to become available when
+# a failure occurred when reading WAL from a source for a node in standby
+# mode, or when WAL is expected from a source.
+#
+#wal_retrieve_retry_interval = '5s'
+#
 #---
 # RECOVERY TARGET PARAMETERS
 #---
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..111e53d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int wal_retrieve_retry_interval = 5000;

 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -4881,6 +4882,26 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("trigger_file = '%s'",
 									 TriggerFile)));
 		}
+		else if (strcmp(item->name, "wal_retrieve_retry_interval") == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item->value, &wal_retrieve_retry_interval, GUC_UNIT_MS,
+						   &hintmsg))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a temporal value",
+								"wal_retrieve_retry_interval"),
+						 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+			ereport(DEBUG2,
+					(errmsg_internal("wal_retrieve_retry_interval = '%s'",
+									 item->value)));
+
+			if (wal_retrieve_retry_interval <= 0)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("\"%s\" must have a value strictly positive",
+								"wal_retrieve_retry_interval")));
+		}
 		else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
 		{
 			const char *hintmsg;
@@ -10340,8 +10361,8 @@ static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
-	static pg_time_t last_fail_time = 0;
-	pg_time_t now;
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz last_fail_time = now;

 	/*---
 	 * Standby mode is implemented by a state machine:
@@ -10490,15 +10511,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * machine, so we've exhausted all the options for
 					 * obtaining the requested WAL. We're going to loop back
 					 * and retry from the archive, but if it hasn't been long
-					 * since last attempt, sleep 5 seconds to avoid
-					 * busy-waiting.
+					 * since last attempt, sleep the amount of time specified
+					 * by wal_r
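Stripped of PostgreSQL's TimestampTz and latch machinery, the retry logic the patch introduces reduces to "sleep only the remainder of the interval, measured from the last failure." A minimal sketch (function name and millisecond arithmetic are simplifications for illustration, not the patch's code):

```c
/*
 * Sketch of the retry-wait computation: given the current time, the
 * time of the last failed WAL fetch attempt, and the configured
 * wal_retrieve_retry_interval (all in ms), return how long to sleep
 * before retrying.  Zero means enough time has already passed.
 */
static long
retry_sleep_ms(long now, long last_fail_time, long retry_interval)
{
    long elapsed = now - last_fail_time;

    if (elapsed >= retry_interval)
        return 0;               /* retry immediately */
    return retry_interval - elapsed;
}
```

This is what replaces the previously hard-coded 5-second sleep: with a configurable interval, a failure followed by other work still retries as soon as the interval has elapsed since the last failure, not a fixed 5 s later.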
[HACKERS] pg_basebackup, tablespace mapping and path canonicalization
Hi

I stumbled on what appears to be inconsistent handling of double slashes
in tablespace paths when using pg_basebackup with the
-T/--tablespace-mapping option:

ibarwick:postgresql (master)$ mkdir /tmp//foo-old
ibarwick:postgresql (master)$ $PG_HEAD/psql 'dbname=postgres port=9595'
psql (9.5devel)
Type "help" for help.

postgres=# CREATE TABLESPACE foo LOCATION '/tmp//foo-old';
CREATE TABLESPACE
postgres=# \db
            List of tablespaces
    Name    |  Owner   |   Location
------------+----------+--------------
 foo        | ibarwick | /tmp/foo-old
 pg_default | ibarwick |
 pg_global  | ibarwick |
(3 rows)

So far so good. However, attempting to take a base backup (on the same
machine) and remap the tablespace directory:

ibarwick:postgresql (master)$ $PG_HEAD/pg_basebackup -p9595 --pgdata=/tmp//backup --tablespace-mapping=/tmp//foo-old=/tmp//foo-new

produces the following message:

pg_basebackup: directory "/tmp/foo-old" exists but is not empty

which, while undeniably true, is unexpected and could potentially
encourage someone to hastily delete "/tmp/foo-old" after confusing it
with the new directory. The double slash in the old tablespace path is
the culprit:

ibarwick:postgresql (master)$ $PG_HEAD/pg_basebackup -p9595 --pgdata=/tmp//backup --tablespace-mapping=/tmp/foo-old=/tmp//foo-new
NOTICE: pg_stop_backup complete, all required WAL segments have been archived

The documentation does state:

    To be effective, olddir must exactly match the path specification of
    the tablespace as it is currently defined.

which I understood to mean that e.g. tildes would not be expanded, but
it's somewhat surprising that the path is not canonicalized in the same
way it is pretty much everywhere else (including in "CREATE
TABLESPACE"). The attached patch adds the missing canonicalization; I
can't see any reason not to do this.

Thoughts?
Regards

Ian Barwick

--
 Ian Barwick                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
new file mode 100644
index fbf7106..349bd90
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
*** tablespace_list_append(const char *arg)
*** 199,204 ****
--- 199,207 ----
  		exit(1);
  	}

+ 	canonicalize_path(cell->old_dir);
+ 	canonicalize_path(cell->new_dir);
+
  	if (tablespace_dirs.tail)
  		tablespace_dirs.tail->next = cell;
  	else
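To see why the patch makes "/tmp//foo-old" and "/tmp/foo-old" compare equal: canonicalize_path() (PostgreSQL's real helper, in src/port/path.c) collapses runs of duplicate separators, among other transformations. A minimal sketch of just that one aspect (this is not the real function, which also handles Windows separators, trailing slashes, and "." / ".." components):

```c
#include <string.h>

/*
 * Collapse consecutive '/' characters in-place, e.g.
 * "/tmp//foo-old" -> "/tmp/foo-old".  Illustrative only; the real
 * canonicalize_path() does considerably more.
 */
static void
collapse_double_slashes(char *path)
{
    char *src = path;
    char *dst = path;

    while (*src)
    {
        *dst++ = *src;
        if (*src == '/')
        {
            /* skip any additional consecutive slashes */
            while (src[1] == '/')
                src++;
        }
        src++;
    }
    *dst = '\0';
}
```

After both sides of the mapping pass through this kind of normalization, the stored tablespace location and the user-supplied olddir match regardless of how many slashes the user typed.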
Re: [HACKERS] Table-level log_autovacuum_min_duration
On Fri, Feb 6, 2015 at 4:50 AM, Naoya Anzai wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world: tested, failed
>> Implements feature:      tested, failed
>> Spec compliant:          tested, failed
>> Documentation:           tested, failed
> I'm sorry, I just sent it by mistake.
> All of them have passed.

That's fine. I think you used the UI on the commitfest app, and it is
not intuitive that you need to check those boxes at first sight when
using it for the first time.

--
Michael
[HACKERS] No-rewrite timestamp<->timestamptz conversions
On Tue, Nov 05, 2013 at 05:02:58PM -0800, Josh Berkus wrote:
> I'd also love some way of doing a no-rewrite conversion between
> timestamp and timestamptz, based on the assumption that the original
> values are UTC time. That's one I encounter a lot.

It was such a conversion that motivated me to add the no-rewrite ALTER
TABLE ALTER TYPE support in the first place. Interesting.

Support for it didn't end up in any submitted patch due to a formal
problem: a protransform function shall only consult IMMUTABLE facts,
but we posit that timezone==UTC is a STABLE observation. However, a
protransform function can easily simplify the immutable expression
"tscol AT TIME ZONE 'UTC'", avoiding a rewrite. See attached patch.
Examples:

begin;
create table t (c timestamptz);
set client_min_messages = debug1;
-- rewrite: depends on timezone GUC
alter table t alter c type timestamp;
-- rewrite: depends on timezone GUC
alter table t alter c type timestamptz;
-- no rewrite: always UTC+0
alter table t alter c type timestamp using c at time zone 'UTC';
-- no rewrite: always UTC+0
alter table t alter c type timestamptz using c at time zone 'Etc/Universal';
-- rewrite: always UTC+0 in the present day, but not historically
alter table t alter c type timestamp using c at time zone 'Atlantic/Reykjavik';
-- rewrite: always UTC+0 in the present day, but not historically
alter table t alter c type timestamptz using c at time zone 'Africa/Lome';
-- no rewrite: always UTC+0
alter table t alter c type timestamp using c at time zone 'GMT';
-- rewrite: always UTC+1
alter table t alter c type timestamptz using c at time zone '1 hour'::interval;
-- no rewrite: always UTC+0
alter table t alter c type timestamp using c at time zone '0 hour'::interval;
rollback;

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 67e0cf9..723c670 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -27,6 +27,7 @@
 #include "funcapi.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
+#include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "parser/scansup.h"
 #include "utils/array.h"
@@ -4874,6 +4875,87 @@ interval_part(PG_FUNCTION_ARGS)
 }

+/* timestamp_zone_transform()
+ * If the zone argument of a timestamp_zone() or timestamptz_zone() call is a
+ * plan-time constant denoting a zone equivalent to UTC, the call will always
+ * return its second argument unchanged. Simplify the expression tree
+ * accordingly. Civil time zones almost never qualify, because jurisdictions
+ * that follow UTC today have not done so continuously.
+ */
+Datum
+timestamp_zone_transform(PG_FUNCTION_ARGS)
+{
+	Node	   *func_node = (Node *) PG_GETARG_POINTER(0);
+	FuncExpr   *expr = (FuncExpr *) func_node;
+	Node	   *ret = NULL;
+	Node	   *zone_node;
+
+	Assert(IsA(expr, FuncExpr));
+	Assert(list_length(expr->args) == 2);
+
+	zone_node = (Node *) linitial(expr->args);
+
+	if (IsA(zone_node, Const) && !((Const *) zone_node)->constisnull)
+	{
+		text	   *zone = DatumGetTextPP(((Const *) zone_node)->constvalue);
+		char		tzname[TZ_STRLEN_MAX + 1];
+		char	   *lowzone;
+		int			type,
+					abbrev_offset;
+		pg_tz	   *tzp;
+		bool		noop = false;
+
+		/*
+		 * If the timezone is forever UTC+0, the FuncExpr function call is a
+		 * no-op for all possible timestamps. This passage mirrors code in
+		 * timestamp_zone().
+		 */
+		text_to_cstring_buffer(zone, tzname, sizeof(tzname));
+		lowzone = downcase_truncate_identifier(tzname,
+											   strlen(tzname),
+											   false);
+		type = DecodeTimezoneAbbrev(0, lowzone, &abbrev_offset, &tzp);
+		if (type == TZ || type == DTZ)
+			noop = (abbrev_offset == 0);
+		else if (type == DYNTZ)
+		{
+			/*
+			 * An abbreviation of a single-offset timezone ought not to be
+			 * configured as a DYNTZ, so don't bother checking.
+			 */
+		}
+		else
+		{
+			long		tzname_offset;
+
+			tzp = pg_tzset(tzname);
+			if (tzp && pg_get_timezone_offset(tzp, &tzname_offset))
+				noop = (tzname_offset == 0);
+		}
+
+		if (noop)
+		{
+			Node	   *timestamp = (Node *) lsecond(expr->args);
+
+			/* Strip any existing RelabelType node(
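The decision the transform function makes can be illustrated without PostgreSQL's timezone machinery: the rewrite may be skipped only if the zone is *permanently* UTC+0. The real patch resolves the zone through DecodeTimezoneAbbrev()/pg_tzset() against the IANA database; the tiny table below is a stand-in for that lookup, invented purely for this sketch:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative stand-in for the IANA-database lookup: each entry says
 * whether the zone has had the same offset for all time, and what
 * that offset is.  Not part of PostgreSQL.
 */
typedef struct
{
    const char *name;
    bool        fixed_offset;   /* same offset for all time? */
    int         offset_secs;    /* offset when fixed */
} ZoneInfo;

static const ZoneInfo zones[] = {
    {"UTC", true, 0},
    {"GMT", true, 0},
    {"Etc/Universal", true, 0},
    {"Atlantic/Reykjavik", false, 0},   /* UTC+0 today, not historically */
    {"Africa/Lome", false, 0},          /* likewise */
};

/* Would "col AT TIME ZONE zone" be a no-op for every timestamp? */
static bool
zone_is_always_utc(const char *name)
{
    for (size_t i = 0; i < sizeof(zones) / sizeof(zones[0]); i++)
    {
        if (strcmp(zones[i].name, name) == 0)
            return zones[i].fixed_offset && zones[i].offset_secs == 0;
    }
    return false;               /* unknown zone: must keep the rewrite */
}
```

This mirrors the "rewrite" vs. "no rewrite" split in the examples above: only zones that are fixed at offset zero for all of history qualify, which is why 'Atlantic/Reykjavik' still forces a rewrite even though it observes UTC+0 today.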
Re: [HACKERS] Table-level log_autovacuum_min_duration
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, failed
> Implements feature:      tested, failed
> Spec compliant:          tested, failed
> Documentation:           tested, failed

I'm sorry, I just sent it by mistake.
All of them have passed.

---
Naoya Anzai
[HACKERS] RangeType internal use
Hi,

I wonder why I cannot find a way to get a range type for a given (sub-)
type. I would like to build a RangeType from Datum's of lower and upper
bounds, much like how construct_array() builds an ArrayType from a Datum
array of elements given the elements' type info.

Is there some way I do not seem to know? If not, would it be worthwhile
to make something like construct_range() that returns a RangeType given
Datum's of lower and upper bounds and subtype info?

Thanks,
Amit
Re: [HACKERS] Table-level log_autovacuum_min_duration
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature:      tested, failed
Spec compliant:          tested, failed
Documentation:           tested, failed

Hi, I'm Naoya Anzai. I've been working as a PostgreSQL support engineer
for 6 years. I am a newcomer to reviewing, and my programming skill is
not so high, but the following members also participate in this review.
(We say "Two heads are better than one." :))

 Akira Kurosawa
 Taiki Kondo
 Huong Dangminh

So I believe reviewing patches is not difficult for us.

This is a review of the Table-level log_autovacuum_min_duration patch:
http://www.postgresql.org/message-id/cab7npqtbqsbegvb8coh01k7argys9kbuv8dr+aqgonfvb8k...@mail.gmail.com

Submission review

The patch applies cleanly to HEAD, and it works fine on Fedora release
20. There is no regression test, but I think that is no problem because
the other parameters are not tested either.

Usability review

pg_dump/pg_restore support is OK. I think this feature is a nice idea
and I also want this feature.

Feature test

I executed the following commands after setting
"log_autovacuum_min_duration" to 1000 in the GUC. (The "bar" table had
already been created with no options.)

CREATE TABLE foo ( ... ) WITH ( log_autovacuum_min_duration = 0 );
ALTER TABLE bar SET ( log_autovacuum_min_duration = 0 );

Then, for the "foo" and "bar" tables only, the autovacuum log was
printed out even when the elapsed time of autovacuum was less than
1000ms. This behavior was expected and there was no crash or failed
assertion, so it looks good to me.

But when I executed the following command, for the "buzz" table the
autovacuum log was still printed out when the elapsed time exceeded
1000ms.

CREATE TABLE buzz ( ... ) WITH ( log_autovacuum_min_duration = -1 );

I expect the autovacuum log NOT to be printed out even if the elapsed
time is more than 1000ms in this case. Is my expectation wrong?

In my opinion, there is a use case where the autovacuum log should
always be printed out for all tables except certain specified tables. I
think there are people who want to use it like this, but in the current
situation they cannot. What is your opinion?

Performance review

Not reviewed from this point of view.

Coding review

I think the description of log_autovacuum_min_duration in reloptions.c
(line 215) should match the other parameters (as in guc.c). So it
should be "Sets the minimum execution time above which autovacuum
actions will be logged." rather than "Log autovacuum execution for
given threshold".

There is no new function defined in this patch, and the modified code
is not OS-dependent, so I think it will work fine on Windows/BSD etc.

Architecture review

About the modification of gram.y: I think it is not good that
log_min_duration is initialized to zero in gram.y when the "FREEZE"
option is set, because no VACUUM query ever uses this parameter. I
think log_min_duration should always be initialized to -1.

Regards,

Naoya Anzai (NEC Solution Innovators, Ltd.)

The new status of this patch is: Waiting on Author
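The -1 question above comes down to how the table-level setting is resolved against the GUC. A standalone sketch of the "inherit" reading, the behavior the reviewer observed (function names are invented for illustration, not the patch's code):

```c
/*
 * Sketch of one possible override semantics (hypothetical, not the
 * patch's actual code): a non-negative table-level
 * log_autovacuum_min_duration overrides the GUC, while -1 falls back
 * to the GUC.  Under this reading there is no per-table way to say
 * "never log" when the GUC is enabled -- exactly the gap the review
 * points out.
 */
static int
effective_log_min_duration(int table_setting, int guc_setting)
{
    if (table_setting >= 0)
        return table_setting;   /* per-table override */
    return guc_setting;         /* -1: inherit the GUC */
}

/* Should this autovacuum run be logged? */
static int
should_log(int elapsed_ms, int table_setting, int guc_setting)
{
    int threshold = effective_log_min_duration(table_setting, guc_setting);

    return threshold >= 0 && elapsed_ms >= threshold;
}
```

Under these semantics, a table with -1 is logged whenever the GUC threshold is met, which matches the "buzz" behavior described above; giving -1 the meaning "never log for this table" would require distinguishing "unset" from "-1" in the reloption.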
[HACKERS] HEADS UP: PGCon 2015 major schedule changes
Hello,

By request, the format of PGCon 2015 will differ significantly from
previous years. Our goal is to give you more of what you want while
still keeping the stuff you've always liked.

In June 2015, PGCon will be structured as follows:

 Unconference:       16-17 June 2015 (Tue afternoon & all day Wed)
 Beginner Tutorials: 17 June 2015 (Wed)
 Talks:              18-19 June 2015 (Thu-Fri)
 Advanced Tutorial:  20 June 2015 (Sat)

The big changes are:

- The unconference moved to weekdays and is now 1.5 days (was one day,
  on Saturday)
- Tutorials are split between beginner and advanced, and are now on
  Wednesday & Saturday (was Tuesday & Wednesday)

Why? The unconference has become a bigger and more significant part of
PGCon for PostgreSQL contributors. It has moved to earlier in the week
to coordinate with other developer meetings, in order to expand
participation in development discussions and meetings around PGCon.
Additionally, the shift of some tutorials to Saturday allows tutorials
to involve key PostgreSQL contributors without schedule conflicts.
Unfortunately, this meant moving something else to Saturday, at least
for this year.

We considered moving the talks to earlier in the week, but we felt that
our changes were already disruptive and wanted to minimize the effects
this late change may have on people who have already booked travel and
accommodation. To those affected, we apologize and hope that this new
structure will benefit everyone.

--
Dan Langille
http://langille.org/
Re: [HACKERS] Redesigning checkpoint_segments
On 2/5/15 4:53 PM, Josh Berkus wrote:
>> Actually, perhaps we should have a boolean setting that just implies
>> min=max, instead of having a configurable minimum? That would cover all
>> of those reasons pretty well. So we would have a "max_wal_size" setting,
>> and a boolean "preallocate_all_wal = on | off". Would anyone care for
>> the flexibility of setting a minimum that's different from the maximum?
>
> I do, actually. Here's the case I want it for:
>
> I have a web application which gets all of its new data as uncoordinated
> batch updates from customers. Since it's possible for me to receive
> several batch updates at once, I set max_wal_size to 16GB, roughly the
> size of 8 batch updates. But I don't want the WAL that big all the time
> because it slows down backup snapshots. So I set min_wal_size to 2GB,
> roughly the size of one batch update.
>
> That's an idiosyncratic case, but I can imagine more of them out there.
>
> I wouldn't be opposed to min_wal_size = -1 meaning "same as
> max_wal_size" though.

+1 for min_wal_size. Like Josh, I can think of instances where this
would be good.

--
- David Steele
da...@pgmasters.net
Re: [HACKERS] Redesigning checkpoint_segments
On 02/05/2015 01:42 PM, Heikki Linnakangas wrote:
> There are a few reasons for making the minimum configurable:

Any thoughts on what the default minimum should be, if the default max
is 1.1GB/64?

> 1. Creating new segments when you need them is not free, so if you have
> a workload with occasional very large spikes, you might want to prepare
> for them. The auto-tuning will accommodate for the peak usage, but it's
> a moving average so if the peaks are infrequent enough, it will shrink
> the size down between the spikes.
>
> 2. To avoid running out of disk space on a write to WAL (which leads to
> a PANIC). In particular, if you have the WAL on the same filesystem as
> the data, pre-reserving all the space required for WAL makes it much
> more likely that when you run out of disk space, you run out when
> writing regular data, not WAL.
>
> 3. Unforeseen issues with the auto-tuning. It might not suit everyone,
> so it's nice that you can still get the old behaviour by setting
> min=max.
>
> Actually, perhaps we should have a boolean setting that just implies
> min=max, instead of having a configurable minimum? That would cover all
> of those reasons pretty well. So we would have a "max_wal_size" setting,
> and a boolean "preallocate_all_wal = on | off". Would anyone care for
> the flexibility of setting a minimum that's different from the maximum?

I do, actually. Here's the case I want it for:

I have a web application which gets all of its new data as uncoordinated
batch updates from customers. Since it's possible for me to receive
several batch updates at once, I set max_wal_size to 16GB, roughly the
size of 8 batch updates. But I don't want the WAL that big all the time
because it slows down backup snapshots. So I set min_wal_size to 2GB,
roughly the size of one batch update.

That's an idiosyncratic case, but I can imagine more of them out there.

I wouldn't be opposed to min_wal_size = -1 meaning "same as
max_wal_size" though.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Redesigning checkpoint_segments
>>> Default of 4 for min_wal_size?
>>
>> I assume you mean 4 segments; why not 3 as currently? As long as the
>> system has the latitude to ratchet it up when needed, there seems to
>> be little advantage to raising the minimum. Of course I guess there
>> must be some advantage or Heikki wouldn't have made it configurable,
>> but I'd err on the side of keeping this one small. Hopefully the
>> system that automatically adjusts this is really smart, and a large
>> min_wal_size is superfluous for most people.
>
> Keep in mind that the current is actually 7, not three (3*2+1). So 3
> would be a significant decrease. However, I don't feel strongly about
> it either way. I think that there is probably a minimum reasonable
> value > 1, but I'm not sure what it is.

Good point. OK, 4 works for me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Redesigning checkpoint_segments
On Thu, Feb 5, 2015 at 4:42 PM, Heikki Linnakangas wrote:
> Actually, perhaps we should have a boolean setting that just implies
> min=max, instead of having a configurable minimum? That would cover all
> of those reasons pretty well. So we would have a "max_wal_size" setting,
> and a boolean "preallocate_all_wal = on | off". Would anyone care for
> the flexibility of setting a minimum that's different from the maximum?

I like the way you have it now better. If we knew for certain that
there were no advantage in configuring a value between 0 and the
maximum, that would be one thing, but we don't and can't know that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Redesigning checkpoint_segments
On 02/05/2015 11:28 PM, Robert Haas wrote:
> On Thu, Feb 5, 2015 at 2:11 PM, Josh Berkus wrote:
>> Default of 4 for min_wal_size?
>
> I assume you mean 4 segments; why not 3 as currently?  As long as the
> system has the latitude to ratchet it up when needed, there seems to
> be little advantage to raising the minimum.  Of course I guess there
> must be some advantage or Heikki wouldn't have made it configurable,
> but I'd err on the side of keeping this one small.  Hopefully the
> system that automatically adjusts this is really smart, and a large
> min_wal_size is superfluous for most people.

There are a few reasons for making the minimum configurable:

1. Creating new segments when you need them is not free, so if you have a
workload with occasional very large spikes, you might want to prepare for
them.  The auto-tuning will accommodate the peak usage, but it's a moving
average, so if the peaks are infrequent enough, it will shrink the size
down between the spikes.

2. To avoid running out of disk space on write to WAL (which leads to a
PANIC).  In particular, if you have the WAL on the same filesystem as the
data, pre-reserving all the space required for WAL makes it much more
likely that when you do run out of disk space, you run out while writing
regular data, not WAL.

3. Unforeseen issues with the auto-tuning.  It might not suit everyone,
so it's nice that you can still get the old behaviour by setting min=max.

Actually, perhaps we should have a boolean setting that just implies
min=max, instead of having a configurable minimum?  That would cover all
of those reasons pretty well.  So we would have a "max_wal_size" setting,
and a boolean "preallocate_all_wal = on | off".  Would anyone care for
the flexibility of setting a minimum that's different from the maximum?

- Heikki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
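As a concrete illustration of Heikki's point 3 — note the GUC names and values below come from the proposal under discussion, not from any released PostgreSQL, so this is illustrative only — pinning the minimum to the maximum in postgresql.conf would pre-reserve all WAL up front, restoring the old fixed-size behaviour:

```ini
# Illustrative sketch: proposed (not yet released) GUCs from this thread.
# Pinning min to max pre-allocates all WAL, i.e. the old behaviour.
max_wal_size = 1GB
min_wal_size = 1GB
```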
Re: [HACKERS] Redesigning checkpoint_segments
On 02/05/2015 01:28 PM, Robert Haas wrote:
> On Thu, Feb 5, 2015 at 2:11 PM, Josh Berkus wrote:
>> Except that, when setting up servers for customers, one thing I pretty
>> much always do for them is temporarily increase checkpoint_segments for
>> the initial data load.  So having Postgres do this automatically would
>> be a feature, not a bug.
>
> Right!
>
>> I say we go with ~~ 1GB.  That's an 8X increase over the current default
>> size for the maximum.
>
> Sounds great.
>
>> Default of 4 for min_wal_size?
>
> I assume you mean 4 segments; why not 3 as currently?  As long as the
> system has the latitude to ratchet it up when needed, there seems to
> be little advantage to raising the minimum.  Of course I guess there
> must be some advantage or Heikki wouldn't have made it configurable,
> but I'd err on the side of keeping this one small.  Hopefully the
> system that automatically adjusts this is really smart, and a large
> min_wal_size is superfluous for most people.

Keep in mind that the current is actually 7, not three (3*2+1).  So 3
would be a significant decrease.  However, I don't feel strongly about
it either way.  I think that there is probably a minimum reasonable
value > 1, but I'm not sure what it is.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Redesigning checkpoint_segments
On Thu, Feb 5, 2015 at 2:11 PM, Josh Berkus wrote: > Except that, when setting up servers for customers, one thing I pretty > much always do for them is temporarily increase checkpoint_segments for > the initial data load. So having Postgres do this automatically would > be a feature, not a bug. Right! > I say we go with ~~ 1GB. That's an 8X increase over current default > size for the maximum Sounds great. > Default of 4 for min_wal_size? I assume you mean 4 segments; why not 3 as currently? As long as the system has the latitude to ratchet it up when needed, there seems to be little advantage to raising the minimum. Of course I guess there must be some advantage or Heikki wouldn't have made it configurable, but I'd err on the side of keeping this one small. Hopefully the system that automatically adjusts this is really smart, and a large min_wal_size is superfluous for most people. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSL renegotiation and other related woes
On 01/26/2015 12:14 PM, Andres Freund wrote:
> Hi,
>
> When working on getting rid of ImmediateInterruptOK I wanted to verify
> that ssl still works correctly.  Turned out it didn't.  But neither did
> it in master.  Turns out there's two major things we do wrong:
>
> 1) We ignore the rule that once called and returning
> SSL_ERROR_WANTS_(READ|WRITE), SSL_read()/write() have to be called
> again with the same parameters.  Unfortunately that rule doesn't mean
> just that the same parameters have to be passed in, but also that we
> can't just constantly switch between _read()/write().  Especially
> nonblocking backend code (i.e. walsender) and the whole frontend code
> violate this rule.

I don't buy that.  Googling around, and looking at the man pages, I just
don't see anything to support that claim.  I believe it is supported to
switch between reads and writes.

> 2) We start renegotiations in be_tls_write() while in nonblocking mode,
> but don't properly retry to handle socket readyness.  There's a loop
> that retries handshakes twenty times (???), but what actually is needed
> is to call SSL_get_error() and ensure that there's actually data
> available.

Yeah, that's just crazy.  Actually, I don't think we need to call
SSL_do_handshake() at all.  At least on modern OpenSSL versions.  OpenSSL
internally uses a state machine, and SSL_read() and SSL_write() calls
will complete the handshake just as well.

> 2) is easy enough to fix, but 1) is pretty hard.  Before anybody says
> that 1) isn't an important rule: It reliably causes connection aborts
> within a couple renegotiations.  The higher the latency the higher the
> likelihood of aborts.  Even locally it doesn't take very long to abort.
>
> Errors usually are something like "SSL connection has been closed
> unexpectedly" or "SSL Error: sslv3 alert unexpected message" and a host
> of other similar messages.  There's a couple reports of those in the
> archives and I've seen many more in client logs.

Yeah, I can reproduce that.  There's clearly something wrong.
I believe this is an OpenSSL bug.  I traced the "unexpected message"
error to this code in OpenSSL's ssl3_read_bytes() function:

    case SSL3_RT_APPLICATION_DATA:
        /*
         * At this point, we were expecting handshake data, but have
         * application data.  If the library was running inside ssl3_read()
         * (i.e. in_read_app_data is set) and it makes sense to read
         * application data at this point (session renegotiation not yet
         * started), we will indulge it.
         */
        if (s->s3->in_read_app_data &&
            (s->s3->total_renegotiations != 0) &&
            (((s->state & SSL_ST_CONNECT) &&
              (s->state >= SSL3_ST_CW_CLNT_HELLO_A) &&
              (s->state <= SSL3_ST_CR_SRVR_HELLO_A)
             ) || ((s->state & SSL_ST_ACCEPT) &&
                   (s->state <= SSL3_ST_SW_HELLO_REQ_A) &&
                   (s->state >= SSL3_ST_SR_CLNT_HELLO_A)
                  )
            )) {
            s->s3->in_read_app_data = 2;
            return (-1);
        } else {
            al = SSL_AD_UNEXPECTED_MESSAGE;
            SSLerr(SSL_F_SSL3_READ_BYTES, SSL_R_UNEXPECTED_RECORD);
            goto f_err;
        }

So that quite clearly throws an error, *unless* the original application
call was SSL_read().  As you also concluded, OpenSSL doesn't like it when
SSL_write() is called when it's in renegotiation.  I think that's
OpenSSL's fault and it should "indulge" the application data message even
in SSL_write().

Can we work around that easily?  I believe we can.  The crucial part is
that we mustn't let SSL_write() process any incoming application data.
We can achieve that if we always call SSL_read() to drain such data
before calling SSL_write().  But that's not quite enough; if we're in
renegotiation, SSL_write() might still try to read more data from the
socket that has arrived after the SSL_read() call.  Fortunately, we can
easily prevent that by hacking pqsecure_raw_read() to always return
EWOULDBLOCK when it's called through SSL_write().

The attached patch does that.  I've been running your pg_receivexlog
test case with this for about 15 minutes without any errors now, with
ssl_renegotiation_limit=50kB, while before it errored out within seconds.
In theory, I guess we should do similar hacks in the server, and always
call SSL_read() before SSL_write(), but it seems to work without them.
Or maybe not; OpenSSL server and client code is not symmetric, so perhaps
it works in server mode without that.

PS. The server code starts renegotiation when it's 1kB from reaching
ssl_renegotiation_limit.  I increased that to 32kB, because in testing I
got some "SSL failed to renegotiate connection before limit expired"
errors before I did that.

PPS. I did all testing of this patch with my sleeping logic
simplification patch applied
(http://www.postgresql.org/message-id/54d3821e.9060...@vmware.com).  It
applies without it too, and I
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Fujii Masao wrote:
> +        TimestampDifference(start_time, stop_time, &secs, &microsecs);
> +        pg_usleep(interval_msec * 1000L - (1000000L * secs + 1L * microsecs));
>
> What if a large interval is set and a signal arrives during the sleep?
> I'm afraid that a signal cannot terminate the sleep on some platforms.
> This problem exists even now because pg_usleep is used, but the sleep
> time is just 5 seconds, so it's not so bad.  But the patch allows a user
> to set a large sleep time.

Yes, and I thought that we could live with that for this patch...  Now
that you mention it, something similar to what recoveryPausesHere() does
would be quite good to ease the shutdown of an interrupted process, even
more so than now.  So let's do that.

> Shouldn't we use WaitLatch or split the pg_usleep like
> recoveryPausesHere() does?

I'd vote for the latter, as we would still need to calculate a timestamp
difference in any case, so it feels easier to do that in the new single
API as this patch does (a routine useful for plugins as well).  Now I
will not fight if people think that using recoveryWakeupLatch is better.

An updated patch is attached.  This patch contains as well a fix for
something that was mentioned upthread but not added in the latest
version: wal_availability_check_interval should be used when waiting for
a WAL record from a stream, and for a segment when fetching from
archives.  The last version did only the former, not the latter.

-- 
Michael

From 9c3a7fa35538993c1345a40fe0973332a76bdb81 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_availability_check_interval

This parameter helps control the timing at which WAL availability is
checked when a node is in recovery, particularly when successive
failures happen when fetching WAL archives, or when fetching WAL records
from a streaming source.  Default value is 5s.
--- doc/src/sgml/recovery-config.sgml | 21 +++ src/backend/access/transam/recovery.conf.sample | 10 ++ src/backend/access/transam/xlog.c | 47 + src/backend/utils/adt/timestamp.c | 38 src/include/utils/timestamp.h | 2 ++ 5 files changed, 104 insertions(+), 14 deletions(-) diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml index 0c64ff2..7fd51ce 100644 --- a/doc/src/sgml/recovery-config.sgml +++ b/doc/src/sgml/recovery-config.sgml @@ -145,6 +145,27 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows + + wal_availability_check_interval (integer) + +wal_availability_check_interval recovery parameter + + + + +This parameter specifies the amount of time to wait when +WAL is not available for a node in recovery. Default value is +5s. + + +A node in recovery will wait for this amount of time if +restore_command returns nonzero exit status code when +fetching new WAL segment files from archive or when a WAL receiver +is not able to fetch a WAL record when using streaming replication. + + + + diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample index b777400..70d3946 100644 --- a/src/backend/access/transam/recovery.conf.sample +++ b/src/backend/access/transam/recovery.conf.sample @@ -58,6 +58,16 @@ # #recovery_end_command = '' # +# +# wal_availability_check_interval +# +# specifies an optional interval to check for availability of WAL when +# recovering a node. This interval of time represents the frequency of +# retries if a previous command of restore_command returned nonzero exit +# status code or if a walreceiver did not stream completely a WAL record. 
+# +#wal_availability_check_interval = '5s' +# #--- # RECOVERY TARGET PARAMETERS #--- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 629a457..4f4efca 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -235,6 +235,7 @@ static TimestampTz recoveryTargetTime; static char *recoveryTargetName; static int recovery_min_apply_delay = 0; static TimestampTz recoveryDelayUntilTime; +static int wal_availability_check_interval = 5000; /* options taken from recovery.conf for XLOG streaming */ static bool StandbyModeRequested = false; @@ -4881,6 +4882,26 @@ readRecoveryCommandFile(void) (errmsg_internal("trigger_file = '%s'", TriggerFile))); } + else if (strcmp(item->name, "wal_availability_check_interval") == 0) + { + const char *hintmsg; + + if (!parse_int(item->value, &wal_availability_check_interval, GUC_UNIT_MS, + &hintmsg)) +ereport(ERROR, + (errcode(
Re: [HACKERS] s_lock.h default definitions are rather confused
On 2015-01-15 17:59:40 +0100, Andres Freund wrote: > On 2015-01-15 11:56:24 -0500, Tom Lane wrote: > > Andres Freund writes: > > > On 2015-01-15 10:57:10 -0500, Tom Lane wrote: > > >> While I'll not cry too hard when we decide to break C89 compatibility, > > >> I don't want it to happen accidentally; so having a pretty old-school > > >> compiler in the farm seems important to me. > > > > > I'd worked on setting up a modern gcc (or was it clang?) with the > > > appropriate flags to warn about !C89 stuff some time back, but failed > > > because of configure bugs. > > > > My recollection is that there isn't any reasonable way to get gcc to > > warn about C89 violations as such. -ansi -pedantic is not very fit > > for the purpose. > > It was clang, which has -Wc99-extensions/-Wc11-extensions. gcc-5 now has: * A new command-line option -Wc90-c99-compat has been added to warn about features not present in ISO C90, but present in ISO C99. * A new command-line option -Wc99-c11-compat has been added to warn about features not present in ISO C99, but present in ISO C11. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila wrote:
>> /*
>> + * We recheck the actual size even if pglz_compress() report success,
>> + * because it might be satisfied with having saved as little as one byte
>> + * in the compressed data.
>> + */
>> + *len = (uint16) compressed_len;
>> + if (*len >= orig_len - 1)
>> +     return false;
>> + return true;
>> +}
>
> As per the latest code, when compression is 'on' we introduce 2
> additional bytes in the header of each block image for storing the
> raw_length of the compressed block.
> In order to achieve compression while accounting for these two
> additional bytes, we must ensure that the compressed length is less
> than the original length - 2.
> So, IIUC the above condition should rather be
>
> if (*len >= orig_len - 2)
>     return false;
> return true;
>
> The attached patch contains this.  It also has a cosmetic change -
> renaming compressBuf to uncompressBuf as it is used to store the
> uncompressed page.

Agreed on both things.  Just looking at your latest patch after some
time to let it cool down, I noticed a couple of things.

 #define MaxSizeOfXLogRecordBlockHeader \
 (SizeOfXLogRecordBlockHeader + \
- SizeOfXLogRecordBlockImageHeader + \
+ SizeOfXLogRecordBlockImageHeader, \
+ SizeOfXLogRecordBlockImageCompressionInfo + \

There is a comma here instead of a sum sign.  We should really sum up
all those sizes to evaluate the maximum size of a block header.

+ * Permanently allocate readBuf uncompressBuf.  We do it this way,
+ * rather than just making a static array, for two reasons:

This comment is just weird; "readBuf and uncompressBuf" is more
appropriate.

+ * We recheck the actual size even if pglz_compress() report success,
+ * because it might be satisfied with having saved as little as one byte
+ * in the compressed data. We add two bytes to store raw_length with the
+ * compressed image. So for compression to be effective compressed_len should
+ * be atleast < orig_len - 2

This comment block should be reworked, and misses a dot at its end.
I rewrote it like that, hopefully that's clearer: + /* +* We recheck the actual size even if pglz_compress() reports success and see +* if at least 2 bytes of length have been saved, as this corresponds to the +* additional amount of data stored in WAL record for a compressed block +* via raw_length. +*/ In any case, those things have been introduced by what I did in previous versions... And attached is a new patch. -- Michael From 2b0c54bc7c4bfb2494cf8b0394d56635b01d7c5a Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 25 Nov 2014 14:24:26 +0900 Subject: [PATCH] Support compression for full-page writes in WAL Compression is controlled with a new parameter called wal_compression. This parameter can be changed at session level to control WAL compression. --- contrib/pg_xlogdump/pg_xlogdump.c | 17 ++- doc/src/sgml/config.sgml | 29 + src/backend/access/transam/xlog.c | 1 + src/backend/access/transam/xloginsert.c | 161 ++ src/backend/access/transam/xlogreader.c | 71 +--- src/backend/utils/misc/guc.c | 9 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/access/xlogreader.h | 7 +- src/include/access/xlogrecord.h | 49 ++-- src/include/pg_config.h.in| 4 +- 11 files changed, 297 insertions(+), 53 deletions(-) diff --git a/contrib/pg_xlogdump/pg_xlogdump.c b/contrib/pg_xlogdump/pg_xlogdump.c index c1bfbc2..3ebaac6 100644 --- a/contrib/pg_xlogdump/pg_xlogdump.c +++ b/contrib/pg_xlogdump/pg_xlogdump.c @@ -363,14 +363,14 @@ XLogDumpCountRecord(XLogDumpConfig *config, XLogDumpStats *stats, * takes up BLCKSZ bytes, minus the "hole" length. * * XXX: We peek into xlogreader's private decoded backup blocks for the - * hole_length. It doesn't seem worth it to add an accessor macro for + * length of block. It doesn't seem worth it to add an accessor macro for * this. 
*/ fpi_len = 0; for (block_id = 0; block_id <= record->max_block_id; block_id++) { if (XLogRecHasBlockImage(record, block_id)) - fpi_len += BLCKSZ - record->blocks[block_id].hole_length; + fpi_len += record->blocks[block_id].bkp_len; } /* Update per-rmgr statistics */ @@ -465,9 +465,16 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record) blk); if (XLogRecHasBlockImage(record, block_id)) { -printf(" (FPW); hole: offset: %u, length: %u\n", - record->blocks[block_id].hole_offset, - record->blocks[block_id].hole_length); +if (record->blocks[block_id].is_compressed) + printf(" (FPW compressed); hole offset: %u, " + "compressed length: %u, original length: %u\n", + record->blocks[block_id].h
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 01:18 PM, Marko Tiikkaja wrote:
> On 2/5/15 4:48 PM, Jan Wieck wrote:
>> What the script does is to encode a small string with pgp_sym_encrypt()
>> and then repeatedly try to decrypt it with different "wrong" passwords.
>> The expected error message for that is of course "Wrong key or corrupt
>> data".  Every now and then, I get a different error message.  Things
>> I've seen are:
>> "Not text data"
>
> That's not unexpected; the check for whether the data is text or not
> appears to happen quite early in the process of decoding.  So it's
> enough to get to that point without anything being obviously broken.

I suspected something like that.

> In addition to the two errors above, it doesn't appear to be too
> difficult to see PXE_MBUF_SHORT_READ, which would give you ERROR:
> Corrupt data.  I wonder why that error message is different, though.

From reading the code as far as I did, I expected to see that, but
haven't seen it yet.

>> "pgcrypto bug"
>
> That doesn't look too good, but I can't reproduce it against 9.3.6
> either.

Let me improve the script to a point where it can run for a long time
in the background and collect all different error cases as examples of
encrypted data and wrong password.

Thanks so far.

Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Redesigning checkpoint_segments
On 02/04/2015 04:16 PM, David Steele wrote:
> On 2/4/15 3:06 PM, Robert Haas wrote:
>>> Hmmm, I see your point.  I spend a lot of time on AWS and in
>>> container-world, where disk space is a lot more constrained.  However,
>>> it probably makes more sense to recommend non-default settings for
>>> that environment, since it requires non-default settings anyway.
>>>
>>> So, 384MB?
>> That's certainly better, but I think we should go further.  Again,
>> you're not committed to using this space all the time, and if you are
>> using it you must have a lot of write activity, which means you are
>> not running on a tin can and a string.  If you have a little tiny
>> database, say 100MB, running on a little-tiny Amazon instance,
>> handling a small number of transactions, you're going to stay close to
>> wal_min_size anyway.  Right?
>
> The main exception I can think of is when using dump/restore to upgrade
> instead of pg_upgrade.  This would generate a lot of WAL for what could
> otherwise be a low-traffic database.

Except that, when setting up servers for customers, one thing I pretty
much always do for them is temporarily increase checkpoint_segments for
the initial data load.  So having Postgres do this automatically would
be a feature, not a bug.

I say we go with ~~ 1GB.  That's an 8X increase over the current default
size for the maximum.

Default of 4 for min_wal_size?

On 02/04/2015 07:37 PM, Amit Kapila wrote:
> On Thu, Feb 5, 2015 at 3:11 AM, Josh Berkus wrote:
>> If we did add one, I'd suggest calling it "wal_size_limit" or
>> something similar.
>
> I think both the names (max_wal_size and wal_size_limit) seem to
> indicate the same thing.  A few more suggestions:
> typical_wal_size, wal_size_soft_limit?

Again, you're suggesting more complicated (and difficult to translate,
and for that matter misleading) names in order to work around a future
feature which nobody is currently working on, and we may never have.
Let's keep clear and simple parameter names which most people can understand, instead of making things complicated for the sake of complexity. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible problem with pgcrypto
On 02/05/2015 10:58 AM, Tom Lane wrote:
> Jan Wieck writes:
>> I have encountered a small instability in the behavior of pgcrypto's
>> pgp_sym_decrypt() function.  Attached is a script that can reproduce
>> the problem.  It may have to be run repeatedly because the symptom
>> occurs rather seldom.
>> What the script does is to encode a small string with pgp_sym_encrypt()
>> and then repeatedly try to decrypt it with different "wrong" passwords.
>> The expected error message for that is of course "Wrong key or corrupt
>> data".  Every now and then, I get a different error message.  Things
>> I've seen are:
>
> Have you tested this with this week's releases?  We fixed some
> memory-mishandling bugs in pgcrypto ...

The posted script reproduces the symptom in today's checkout of master
as well as REL9_4_STABLE.

Jan

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Fujii Masao wrote:

I wrote:
>> This is an inspiration from lz4 APIs.  Wouldn't it be buggy for a
>> compression algorithm to return a size of 0 bytes as compressed or
>> decompressed length btw?  We could as well make it return a negative
>> value when a failure occurs if you feel more comfortable with it.

> I feel that's better.  Attached is the updated version of the patch.
> I changed pg_lzcompress and pg_lzdecompress so that they return -1
> when failure happens.  Also I applied some cosmetic changes to the
> patch (e.g., shortened the long names of the newly-added macros).
> Barring any objection, I will commit this.

I just had a look at your updated version, ran some sanity tests, and
things look good to me.  The new names of the macros at the top of
tuptoaster.c are clearer as well.

-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Whether this is a realistic expectation given the wording of the SQL-MED > standard is unclear. I've only got a draft as of 2011-12 handy, but I'm not seeing anything in the standard that cares one bit about the value of the options specified for the FDW. All that's said is "Both the option name and the permissible ranges of option values of a generic option are defined by the foreign-data wrappers." In my view, that means that if we give FDWs the ability to control what's displayed for their options, we're well within the requirements of the spec. > I'm also concerned that if we take this on board as being a security > concern, it will mean that any time we make an effort to push some > construct we didn't before over to the remote end, we have to worry about > whether it would be a security breach to allow the local user to cause > that code to execute on the remote end. It's tough enough worrying about > semantic-equivalence issues without mixing hostile-user scenarios in. I agree that we don't want to go there but I don't see this as leading us in that direction. > So I would rather say that the baseline security expectation is that > granting a user mapping should be presumed to be tantamount to granting > direct access to the remote server with that login info. In that context, > being able to see the password should not be considered to be any big deal. I don't agree with this, however. Being able to execute arbitrary SQL on the remote side through a specific interface does not equate to having the password and direct access to the remote server. Even if it did, there are good reasons to not expose passwords, even to the user whose password it is. We take steps to avoid exposing user's passwords and password hashes in core and I do not see this case as any different. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Possible problem with pgcrypto
On 2/5/15 4:48 PM, Jan Wieck wrote:
> What the script does is to encode a small string with pgp_sym_encrypt()
> and then repeatedly try to decrypt it with different "wrong" passwords.
> The expected error message for that is of course "Wrong key or corrupt
> data".  Every now and then, I get a different error message.  Things
> I've seen are:
> "Not text data"

That's not unexpected; the check for whether the data is text or not
appears to happen quite early in the process of decoding.  So it's
enough to get to that point without anything being obviously broken.

In addition to the two errors above, it doesn't appear to be too
difficult to see PXE_MBUF_SHORT_READ, which would give you ERROR:
Corrupt data.  I wonder why that error message is different, though.

> "pgcrypto bug"

That doesn't look too good, but I can't reproduce it against 9.3.6
either.

.m

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal : REINDEX xxx VERBOSE
Robert Haas writes: > We've got a mix of styles for extensible options right now: That we do. > So COPY puts the options at the very end, but EXPLAIN and VACUUM put > them right after the command name. I prefer the latter style and > would vote to adopt it here. Meh. Options-at-the-end seems by far the most sensible style to me. The options-right-after-the-keyword style is a mess, both logically and from a parsing standpoint, and the only reason we have it at all is historical artifact (ask Bruce about the origins of VACUUM ANALYZE over a beer sometime). Still, I can't help noticing that I'm being outvoted. I'll shut up now. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal : REINDEX xxx VERBOSE
On Wed, Feb 4, 2015 at 8:24 PM, Tom Lane wrote:
> Kyotaro HORIGUCHI writes:
>> The phrase "{INDEX | TABLE |..} name" seems to me indivisible as
>> target specification.  IMHO, the options for VACUUM and so on are
>> placed *just after* the command name, not *before* the target.
>>
>> If this is right, the syntax would be like this.
>>
>> REINDEX [ (option [, option ...] ) ] {INDEX | TABLE | etc } name
>>
>> What do you think about this?
>
> I think this is wrong and ugly.  INDEX etc are part of the command name.

I don't think so.  I think they're part of the target.  We have one
manual page for REINDEX, not five separate reference pages for REINDEX
INDEX, REINDEX TABLE, REINDEX SCHEMA, REINDEX DATABASE, and REINDEX
SYSTEM.  If we really wanted to, we could probably even support this:

REINDEX INDEX foo, TABLE bar, TABLE baz;

We've got a mix of styles for extensible options right now:

EXPLAIN [ ( option [, ...] ) ] statement
COPY table_name [ ( column_name [, ...] ) ]
    FROM { 'filename' | PROGRAM 'command' | STDIN }
    [ [ WITH ] ( option [, ...] ) ]
VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE } [, ...] ) ]
    [ table_name [ (column_name [, ...] ) ] ]

So COPY puts the options at the very end, but EXPLAIN and VACUUM put
them right after the command name.  I prefer the latter style and would
vote to adopt it here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
Tom Lane-2 wrote:
> Stephen Frost <sfrost@...> writes:
>> * Robert Haas (robertmhaas@...) wrote:
>>> On Thu, Feb 5, 2015 at 10:48 AM, Stephen Frost <sfrost@...> wrote:
>>>> And I thought this was about FDW options and not about dblink,
>>>> really..
>>>
>>> The OP is pretty clearly asking about dblink.
>>
>> I was just pointing out that it was an issue that all FDWs suffer
>> from, since we don't have any way for an FDW to say "don't show this
>> option", as discussed.
>
> The dblink example is entirely uncompelling, given that as you said
> somebody with access to a dblink connection could execute ALTER USER
> on the far end.

So let's fix that loophole as well...

> So I would rather say that the baseline security expectation is that
> granting a user mapping should be presumed to be tantamount to granting
> direct access to the remote server with that login info.  In that
> context, being able to see the password should not be considered to be
> any big deal.

Is there any provision whereby "USAGE" would restrict the person so
granted from viewing any particulars even though they can call/name the
item being granted, and then require "SELECT" privileges to actually
view any of the associated settings?

Regardless, the OP's described behavior of suppressing user options
normally but then showing them upon being granted USAGE on the server
seems strange.

David J.

--
View this message in context: http://postgresql.nabble.com/GRANT-USAGE-on-FOREIGN-SERVER-exposes-passwords-tp5836652p5836826.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
Stephen Frost writes:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Thu, Feb 5, 2015 at 10:48 AM, Stephen Frost wrote:
>>> And I thought this was about FDW options and not about dblink, really..
>
>> The OP is pretty clearly asking about dblink.
>
> I was just pointing out that it was an issue that all FDWs suffer from,
> since we don't have any way for an FDW to say "don't show this option",
> as discussed.

The dblink example is entirely uncompelling, given that as you said somebody with access to a dblink connection could execute ALTER USER on the far end. It's much more credible to imagine that someone might be given a mapping for a postgres_fdw, oracle_fdw, etc foreign server with just a few foreign tables, with the expectation that that couldn't be parlayed into unfettered access to the remote server. Whether this is a realistic expectation given the wording of the SQL-MED standard is unclear.

I'm also concerned that if we take this on board as being a security concern, it will mean that any time we make an effort to push some construct we didn't before over to the remote end, we have to worry about whether it would be a security breach to allow the local user to cause that code to execute on the remote end. It's tough enough worrying about semantic-equivalence issues without mixing hostile-user scenarios in.

So I would rather say that the baseline security expectation is that granting a user mapping should be presumed to be tantamount to granting direct access to the remote server with that login info. In that context, being able to see the password should not be considered to be any big deal.

regards, tom lane
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Feb 5, 2015 at 10:48 AM, Stephen Frost wrote:
> >> If she's got dblink access, she can run arbitrary
> >> SQL queries on the remote server anyway, which is all the password
> >> would let her do. Also, she could use dblink to run ALTER ROLE foo
> >> PASSWORD '...' on the remote server, and then she'll *definitely* know
> >> the password.
> >
> > And I thought this was about FDW options and not about dblink, really..
>
> The OP is pretty clearly asking about dblink.

I was just pointing out that it was an issue that all FDWs suffer from, since we don't have any way for an FDW to say "don't show this option", as discussed.

Thanks,

Stephen
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
On Thu, Feb 5, 2015 at 10:48 AM, Stephen Frost wrote:
>> If she's got dblink access, she can run arbitrary
>> SQL queries on the remote server anyway, which is all the password
>> would let her do. Also, she could use dblink to run ALTER ROLE foo
>> PASSWORD '...' on the remote server, and then she'll *definitely* know
>> the password.
>
> And I thought this was about FDW options and not about dblink, really..

The OP is pretty clearly asking about dblink.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Possible problem with pgcrypto
Jan Wieck writes:
> I have encountered a small instability in the behavior of pgcrypto's
> pgp_sym_decrypt() function. Attached is a script that can reproduce the
> problem. It may have to be run repeatedly because the symptom occurs
> rather seldom.

> What the script does is to encode a small string with pgp_sym_encrypt()
> and then repeatedly try to decrypt it with different "wrong" passwords.
> The expected error message for that is of course
> "Wrong key or corrupt data".

> Every now and then, I get a different error message. Things I've seen are:

Have you tested this with this week's releases? We fixed some memory-mishandling bugs in pgcrypto ...

regards, tom lane
[HACKERS] Possible problem with pgcrypto
Hi,

I have encountered a small instability in the behavior of pgcrypto's pgp_sym_decrypt() function. Attached is a script that can reproduce the problem. It may have to be run repeatedly because the symptom occurs rather seldom.

What the script does is to encode a small string with pgp_sym_encrypt() and then repeatedly try to decrypt it with different "wrong" passwords. The expected error message for that is of course "Wrong key or corrupt data". Every now and then, I get a different error message. Things I've seen are:

"Not text data"
"pgcrypto bug"

This seems to be triggered by a combination of the random data included in the encrypted data as well as the wrong password, because for a given instance of encrypted data only certain passwords cause this symptom.

I wonder if this may actually be a bug in pgcrypto, or if this is an error inherent in the way the encrypted data is encoded, i.e. that the decryption algorithm cannot really tell what is wrong and just sometimes gets a little further in the attempt to decrypt.

Jan

-- Jan Wieck
Senior Software Engineer
http://slony.info

pgcrypto_test.sh Description: application/shellscript
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Feb 3, 2015 at 6:26 PM, Noah Yetter wrote:
> > The obvious objection is, "well you should just use foreign tables instead
> > of dblink()". I'll cut a long story short by saying that doesn't work for
> > us. We are using postgres_fdw to allow our analysts to run queries against
> > AWS Redshift and blend those results with tables in our OLTP schema. If you
> > know anything about Redshift, or about analysts, you'll realize immediately
> > why foreign tables are not a viable solution. Surely there are many others
> > in a similar position, where the flexibility offered by dblink() makes it
> > preferable to fixed foreign tables.
> >
> > S... what gives? This seems like a really obvious security hole. I've
> > searched the mailing list archives repeatedly and found zero discussion of
> > this issue.
>
> Maybe this is an impertinent question, but why do you care if the user
> has the password?

Eh. Password-reuse risk, policies, regulations and auditing all come to mind.

> If she's got dblink access, she can run arbitrary
> SQL queries on the remote server anyway, which is all the password
> would let her do. Also, she could use dblink to run ALTER ROLE foo
> PASSWORD '...' on the remote server, and then she'll *definitely* know
> the password.

And I thought this was about FDW options and not about dblink, really..

> I would suggest not relying on password authentication in this
> situation. Instead, use pg_hba.conf to restrict connections by IP and
> SSL mode, and maybe consider SSL certificate authentication.

That's not actually an option here though, is it? dblink_connect requires password-based authentication, unless you're a superuser (which I'm pretty sure Noah Y would prefer these folks not be..). Further, I don't think you get to control whatever the pg_hba.conf is on the Redshift side..

I agree with the general sentiment that it'd be better to use other authentication methods (SSL certificates or Kerberos credentials), but we'd need to provide a way for those to work for non-superusers. Kerberos credential-forwarding comes to mind, but I don't know of anyone who is currently working on that, and I doubt it'd work with Redshift anyway.

> All that having been said, it wouldn't be crazy to try to invent a
> system to lock this down, but it *would* be complicated. An
> individual FDW can call its authentication-related options anything it
> likes; they do not need to be called 'password'. So we'd need a way
> to identify which options should be hidden from untrusted users, and
> then a bunch of mechanism to do that.

Agreed, we'd need to provide a way for FDWs to specify which options should be hidden and which shouldn't be. For my 2c, I do think that'd be worthwhile to do. We let users change their own passwords with ALTER USER too, but they don't get to view it (or even the hash of it) in pg_authid.

Thanks,

Stephen
[HACKERS] Early Setup of instrumentation information in pg_stat_statements
Currently in pg_stat_statements, the setup to track instrumentation/totaltime information is done after ExecutorStart(). Can we do this before ExecutorStart()? In particular, I am referring to the code below:

static void
pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
	..
	standard_ExecutorStart(queryDesc, eflags);
	..
	if (pgss_enabled() && queryDesc->plannedstmt->queryId != 0)
	{
		..
		if (queryDesc->totaltime == NULL)
		{
			..
			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
			..
		}
	}
}

The reason I am asking is that to track instrumentation information (like buffer usage parameters) in the case of parallel sequential scan, we need to pass this information at the start of the backend workers, which are started in the ExecutorStart() phase, and at that time there is no information available which can guarantee that such information is needed by the plugin (we have queryId stored in the planned stmt, but I think that is not sufficient to decide). This works well for the Explain statement, as that has the information for instrumentation available before the ExecutorStart() phase.

Please suggest if there is a better way to make this information available in the ExecutorStart() phase?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
Robert Haas writes:
> All that having been said, it wouldn't be crazy to try to invent a
> system to lock this down, but it *would* be complicated. An
> individual FDW can call its authentication-related options anything it
> likes; they do not need to be called 'password'. So we'd need a way
> to identify which options should be hidden from untrusted users, and
> then a bunch of mechanism to do that.

It's also debatable whether this wouldn't be a violation of the SQL standard. I see nothing in the SQL-MED spec authorizing filtering of the information_schema.user_mapping_options view.

We actually are doing some filtering of values in user_mapping_options, but it's all-or-nothing so far as the options for any one mapping go. That's still not exactly supportable per spec, but it's probably less of a violation than option-by-option filtering would be. It also looks like that filtering differs in corner cases from what the regular pg_user_mappings view does, which is kinda silly. In particular I think we should try to get rid of the explicit provision for superuser access.

I was hoping Peter would weigh in on what his design considerations were for these views ...

regards, tom lane
Re: [HACKERS] [GENERAL] 4B row limit for CLOB tables
> That's assuming that toasting is evenly spread between tables. In my
> experience, that's not a great bet...

Time to create a test:

SELECT chunk_id::bigint/100000 as id_range, count(*), count(*)/(100000::float) density
FROM (SELECT chunk_id
      FROM pg_toast.pg_toast_39000165
      WHERE chunk_id < 1000000000 AND chunk_seq = 0) f
GROUP BY id_range
ORDER BY id_range;

The machine in question was restored in parallel in Sept 2013 as part of an upgrade from 8.4. It has about 2000 tables, so it's definitely not dominated by a couple of tables. Progress towards oid wraparound is about 25.6%.

With minimal effort, I found 2 bad examples, and I'm sure I can easily find more. I attached the results for those two. There were runs of 1,100,000+ and 600,000+ chunk_ids where more than 99% of the chunk_ids are taken. After restore completion, oid densities averaged less than 20 per 100,000 and 400 per 100,000 respectively. The only reason those runs seem to be so short is because the tables were much smaller back then. I expect that next time I dump and restore (necessary for upgrading OS versions due to the collation issue), I'm going to have runs closer to 20,,000.

> ... this "fix" would actually make things enormously worse. With the
> single counter feeding all tables, you at least have a reasonable
> probability that there are not enormously long runs of consecutive OIDs in
> any one toast table. With a sequence per table, you are nearly guaranteed
> that there are such runs, because inserts into other tables don't create a
> break.

It makes each toast table independent (and far less likely to wrap). It would wrap when sum(mods on THIS toast table) > 2^32. Right now the function looks like: sum(mods on ALL toast tables in cluster) + sum(created normal tables in cluster * k) + sum(created temp tables in cluster * k) + [...] > 2^32, where k is the average number of ids consumed for pg_class, pg_type, etc...
In the case of an insert-only table (which is a common use case for partitions), the id would only wrap when the TOAST table was “full”. On the other hand, currently it would wrap into its pg_restored section when the combined oid-consuming operations on the cluster surpassed 4 billion.

That being said, I'm certainly not attached to that solution. My real argument is that although it's not a problem today, we are only about 5 years from it being a problem for large installs, and the first time you'll hear about it is after someone has a 5 minute production outage on a database that's been taking traffic for 2 years.

- Matt K.

 id_range | count | density
----------+-------+---------
      390 | 92188 | 0.92188
      391 | 99186 | 0.99186
      392 | 99826 | 0.99826
      393 | 99101 | 0.99101
      394 | 99536 | 0.99536
      395 | 99796 | 0.99796
      396 | 99321 | 0.99321
      397 | 99768 | 0.99768
      398 | 99744 | 0.99744
      399 | 99676 | 0.99676
      400 | 98663 | 0.98663
      401 | 40690 | 0.4069
      403 |    92 | 0.00092
      404 |   491 | 0.00491
      407 |    74 | 0.00074
      408 |    54 | 0.00054
      415 |   152 | 0.00152
      416 |    47 | 0.00047
      419 |    59 | 0.00059
      422 |     2 | 2e-05
      423 |    14 | 0.00014
      424 |     5 | 5e-05
      425 |    11 | 0.00011
      426 |     7 | 7e-05
      427 |     5 | 5e-05
      428 |     6 | 6e-05
      517 |     5 | 5e-05
      518 |     9 | 9e-05
      519 |     6 | 6e-05
      520 |    12 | 0.00012
      521 |    17 | 0.00017
      522 |     5 | 5e-05
      588 |    15 | 0.00015
      589 |    10 | 0.0001
      590 |    19 | 0.00019
      591 |    12 | 0.00012
      592 |    12 | 0.00012
      593 |     2 | 2e-05
      617 |     4 | 4e-05
      618 |     9 | 9e-05
      619 |     7 | 7e-05
      620 |    14 | 0.00014
      621 |     5 | 5e-05
      622 |    11 | 0.00011
      682 |     8 | 8e-05
      683 |    13 | 0.00013
      684 |    17 | 0.00017
      685 |     6 | 6e-05
      686 |    17 | 0.00017
      687 |     4 | 4e-05
      767 |     5 | 5e-05
      768 |    10 | 0.0001
      769 |     9 | 9e-05
      770 |     2 | 2e-05
      771 |    14 | 0.00014
      772 |     2 | 2e-05
      773 |    11 | 0.00011
      774 |    13 | 0.00013
      775 |    10 | 0.0001
      776 |     3 | 3e-05
      914 |     7 | 7e-05
      915 |     7 | 7e-05
      916 |     1 | 1e-05
      917 |     3 | 3e-05
      918 |     3 | 3e-05
      919 |     5 | 5e-05
      920 |     4 | 4e-05
      921 |     9 | 9e-05
      922 |     9 | 9e-05
      923 |     1 | 1e-05
(70 rows)

 id_range | count | density
----------+-------+---------
      402 | 96439 | 0.96439
      403 | 99102 | 0.99102
      404 | 98787 | 0.98787
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
On Tue, Feb 3, 2015 at 6:26 PM, Noah Yetter wrote:
> The obvious objection is, "well you should just use foreign tables instead
> of dblink()". I'll cut a long story short by saying that doesn't work for
> us. We are using postgres_fdw to allow our analysts to run queries against
> AWS Redshift and blend those results with tables in our OLTP schema. If you
> know anything about Redshift, or about analysts, you'll realize immediately
> why foreign tables are not a viable solution. Surely there are many others
> in a similar position, where the flexibility offered by dblink() makes it
> preferable to fixed foreign tables.
>
> S... what gives? This seems like a really obvious security hole. I've
> searched the mailing list archives repeatedly and found zero discussion of
> this issue.

Maybe this is an impertinent question, but why do you care if the user has the password? If she's got dblink access, she can run arbitrary SQL queries on the remote server anyway, which is all the password would let her do. Also, she could use dblink to run ALTER ROLE foo PASSWORD '...' on the remote server, and then she'll *definitely* know the password.

I would suggest not relying on password authentication in this situation. Instead, use pg_hba.conf to restrict connections by IP and SSL mode, and maybe consider SSL certificate authentication.

All that having been said, it wouldn't be crazy to try to invent a system to lock this down, but it *would* be complicated. An individual FDW can call its authentication-related options anything it likes; they do not need to be called 'password'. So we'd need a way to identify which options should be hidden from untrusted users, and then a bunch of mechanism to do that.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Redesigning checkpoint_segments
On 02/05/2015 04:47 PM, Andres Freund wrote:
> On 2015-02-05 09:42:37 -0500, Robert Haas wrote:
>> I previously proposed 100 segments, or 1.6GB. If that seems too
>> large, how about 64 segments, or 1.024GB? I think there will be few
>> people who can't tolerate a gigabyte of xlog under peak load, and an
>> awful lot who will benefit from it.
>
> It'd be quite easier to go there if we'd shrink back to the min_size
> after a while, after having peaked above it. IIUC the patch doesn't do
> that?

It doesn't actively go and remove files once they've already been recycled, but if the system stays relatively idle for several checkpoints, the WAL usage will shrink down again. That's the core idea of the patch.

If the system stays completely or almost completely idle, that won't happen though, because then it will never switch to a new segment, so none of the segments become old enough to be removed.

- Heikki
Re: [HACKERS] Redesigning checkpoint_segments
On 2015-02-05 09:42:37 -0500, Robert Haas wrote:
> I previously proposed 100 segments, or 1.6GB. If that seems too
> large, how about 64 segments, or 1.024GB? I think there will be few
> people who can't tolerate a gigabyte of xlog under peak load, and an
> awful lot who will benefit from it.

It'd be quite easier to go there if we'd shrink back to the min_size after a while, after having peaked above it. IIUC the patch doesn't do that?

Admittedly it's not easy to come up with an algorithm that doesn't cause superfluous file removals. Initiating wal files isn't cheap.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Simplify sleeping while reading/writing from client
Looking again at the code after Andres' interrupt-handling patch series, I got confused by the fact that there are several wait-retry loops in different layers, and reading and writing works slightly differently. I propose the attached, which pulls all the wait-retry logic up to secure_read() and secure_write(). This makes the code a lot more understandable. - Heikki From 4e73dbc72d279c2cb66af0a16357eab17f1c740b Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 5 Feb 2015 16:44:13 +0200 Subject: [PATCH 1/1] Simplify waiting logic in reading from / writing to client. The client socket is always in non-blocking mode, and if we actually want blocking behaviour, we emulate it by sleeping and retrying. But we retry loops at different layers for reads and writes, which was confusing. To simplify, remove all the sleeping and retrying code from the lower levels, from be_tls_read and secure_raw_read and secure_raw_write, and put all the logic in secure_read() and secure_write(). --- src/backend/libpq/be-secure-openssl.c | 81 +-- src/backend/libpq/be-secure.c | 143 ++ src/backend/libpq/pqcomm.c| 3 +- src/include/libpq/libpq-be.h | 4 +- 4 files changed, 79 insertions(+), 152 deletions(-) diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c index d5f9712..39587e8 100644 --- a/src/backend/libpq/be-secure-openssl.c +++ b/src/backend/libpq/be-secure-openssl.c @@ -511,14 +511,11 @@ be_tls_close(Port *port) * Read data from a secure connection. 
*/ ssize_t -be_tls_read(Port *port, void *ptr, size_t len) +be_tls_read(Port *port, void *ptr, size_t len, int *waitfor) { ssize_t n; int err; - int waitfor; - int latchret; -rloop: errno = 0; n = SSL_read(port->ssl, ptr, len); err = SSL_get_error(port->ssl, n); @@ -528,39 +525,15 @@ rloop: port->count += n; break; case SSL_ERROR_WANT_READ: + *waitfor = WL_SOCKET_READABLE; + errno = EWOULDBLOCK; + n = -1; + break; case SSL_ERROR_WANT_WRITE: - /* Don't retry if the socket is in nonblocking mode. */ - if (port->noblock) - { -errno = EWOULDBLOCK; -n = -1; -break; - } - - waitfor = WL_LATCH_SET; - - if (err == SSL_ERROR_WANT_READ) -waitfor |= WL_SOCKET_READABLE; - else -waitfor |= WL_SOCKET_WRITEABLE; - - latchret = WaitLatchOrSocket(MyLatch, waitfor, port->sock, 0); - - /* - * We'll, among other situations, get here if the low level - * routine doing the actual recv() via the socket got interrupted - * by a signal. That's so we can handle interrupts once outside - * openssl, so we don't jump out from underneath its covers. We - * can check this both, when reading and writing, because even - * when writing that's just openssl's doing, not a 'proper' write - * initiated by postgres. - */ - if (latchret & WL_LATCH_SET) - { -ResetLatch(MyLatch); -ProcessClientReadInterrupt(true); /* preserves errno */ - } - goto rloop; + *waitfor = WL_SOCKET_WRITEABLE; + errno = EWOULDBLOCK; + n = -1; + break; case SSL_ERROR_SYSCALL: /* leave it to caller to ereport the value of errno */ if (n != -1) @@ -595,12 +568,10 @@ rloop: * Write data to a secure connection. 
*/ ssize_t -be_tls_write(Port *port, void *ptr, size_t len) +be_tls_write(Port *port, void *ptr, size_t len, int *waitfor) { ssize_t n; int err; - int waitfor; - int latchret; /* * If SSL renegotiations are enabled and we're getting close to the @@ -653,7 +624,6 @@ be_tls_write(Port *port, void *ptr, size_t len) } } -wloop: errno = 0; n = SSL_write(port->ssl, ptr, len); err = SSL_get_error(port->ssl, n); @@ -663,30 +633,15 @@ wloop: port->count += n; break; case SSL_ERROR_WANT_READ: + *waitfor = WL_SOCKET_READABLE; + errno = EWOULDBLOCK; + n = -1; + break; case SSL_ERROR_WANT_WRITE: - - waitfor = WL_LATCH_SET; - - if (err == SSL_ERROR_WANT_READ) -waitfor |= WL_SOCKET_READABLE; - else -waitfor |= WL_SOCKET_WRITEABLE; - - latchret = WaitLatchOrSocket(MyLatch, waitfor, port->sock, 0); - - /* - * Check for interrupts here, in addition to secure_write(), - * because an interrupted write in secure_raw_write() will return - * here, and we cannot return to secure_write() until we've - * written something. - */ - if (latchret & WL_LATCH_SET) - { -ResetLatch(MyLatch); -ProcessClientWriteInterrupt(true); /* preserves errno */ - } - - goto wloop; + *waitfor = WL_SOCKET_WRITEABLE; + errno = EWOULDBLOCK; + n = -1; + break; case SSL_ERROR_SYSCALL: /* leave it to caller to ereport the value of errno */ if (n != -1) diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c index c2c1842..4e7acbe 100644 --- a/src/backend/libpq/be-secure.c +++ b/src/backend/libpq/be-secure.c @@ -127,30 +127,45 @@ ssize_t secure_read(Port *port, void *
Re: [HACKERS] Redesigning checkpoint_segments
On Wed, Feb 4, 2015 at 4:41 PM, Josh Berkus wrote:
>> That's certainly better, but I think we should go further. Again,
>> you're not committed to using this space all the time, and if you are
>> using it you must have a lot of write activity, which means you are
>> not running on a tin can and a string. If you have a little tiny
>> database, say 100MB, running on a little-tiny Amazon instance,
>> handling a small number of transactions, you're going to stay close to
>> wal_min_size anyway. Right?
>
> Well, we can test that.
>
> So what's your proposed size?

I previously proposed 100 segments, or 1.6GB. If that seems too large, how about 64 segments, or 1.024GB? I think there will be few people who can't tolerate a gigabyte of xlog under peak load, and an awful lot who will benefit from it.

-- Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Unlikely-to-happen crash in ecpg driver caused by NULL-pointer check not done
On Tue, Feb 03, 2015 at 10:44:17PM +0900, Michael Paquier wrote:
> All those things are addressed in the patch attached.

Fixed a typo and committed. Thanks Michael for fixing and Heikki for reviewing.

Michael

-- Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello, >/* >+* We recheck the actual size even if pglz_compress() report success, >+* because it might be satisfied with having saved as little as one byte >+* in the compressed data. >+*/ >+ *len = (uint16) compressed_len; >+ if (*len >= orig_len - 1) >+ return false; >+ return true; >+} As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for storing raw_length of the compressed block. In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed length is less than original length - 2. So , IIUC the above condition should rather be If (*len >= orig_len -2 ) return false; return true; The attached patch contains this. It also has a cosmetic change- renaming compressBuf to uncompressBuf as it is used to store uncompressed page. Thank you, Rahila Syed -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Michael Paquier Sent: Wednesday, January 07, 2015 9:32 AM To: Rahila Syed Cc: PostgreSQL mailing lists Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes On Wed, Jan 7, 2015 at 12:51 AM, Rahila Syed wrote: > Following are some comments, Thanks for the feedback. >>uint16 hole_offset:15, /* number of bytes in "hole" */ > Typo in description of hole_offset Fixed. That's "before hole". >>for (block_id = 0; block_id <= record->max_block_id; block_id++) >>- { >>- if (XLogRecHasBlockImage(record, block_id)) >>- fpi_len += BLCKSZ - > record->blocks[block_id].hole_length; >>- } >>+ fpi_len += record->blocks[block_id].bkp_len; > > IIUC, if condition, /if(XLogRecHasBlockImage(record, block_id))/ is > incorrectly removed from the above for loop. Fixed. >>typedef struct XLogRecordCompressedBlockImageHeader > I am trying to understand the purpose behind declaration of the above > struct. 
IIUC, it is defined in order to introduce new field uint16 > raw_length and it has been declared as a separate struct from > XLogRecordBlockImageHeader to not affect the size of WAL record when > compression is off. > I wonder if it is ok to simply memcpy the uint16 raw_length in the > hdr_scratch when compression is on and not have a separate header > struct for it neither declare it in existing header. raw_length can be > a locally defined variable is XLogRecordAssemble or it can be a field > in registered_buffer struct like compressed_page. > I think this can simplify the code. > Am I missing something obvious? You are missing nothing. I just introduced this structure for a matter of readability to show the two-byte difference between non-compressed and compressed header information. It is true that doing it my way makes the structures duplicated, so let's simply add the compression-related information as an extra structure added after XLogRecordBlockImageHeader if the block is compressed. I hope this addresses your concerns. >> /* >> * Fill in the remaining fields in the XLogRecordBlockImageHeader >> * struct and add new entries in the record chain. >> */ > >> bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE; > > This code line seems to be misplaced with respect to the above comment. > Comment indicates filling of XLogRecordBlockImageHeader fields while > fork_flags is a field of XLogRecordBlockHeader. > Is it better to place the code close to following condition? > if (needs_backup) > { Yes, this comment should not be here. I replaced it with the comment in HEAD. >>+ *the original length of the >>+ * block without its page hole being deducible from the compressed >>+ data >>+ * itself. > IIUC, this comment before XLogRecordBlockImageHeader seems to be no > longer valid as original length is not deducible from compressed data > and rather stored in header. Aah, true. 
This was originally present in the header of PGLZ that has been removed to make it available for frontends. Updated patches are attached.

Regards,
-- Michael

Support-compression-for-full-page-writes-in-WAL_v15.patch Description: Support-compression-for-full-page-writes-in-WAL_v15.patch
Re: [HACKERS] File based Incremental backup v9
Hi Marco,

On Sunday 01 February 2015 00:47:24 Marco Nenciarini wrote:
> You can find the updated patch attached to this message.

I've been testing the v9 patch with checksums enabled and I end up with a lot of warnings like these ones:

WARNING: page verification failed, calculated checksum 47340 but expected 47342
WARNING: page verification failed, calculated checksum 16649 but expected 16647
WARNING: page verification failed, calculated checksum 13567 but expected 13565
WARNING: page verification failed, calculated checksum 14110 but expected 14108
WARNING: page verification failed, calculated checksum 40990 but expected 40988
WARNING: page verification failed, calculated checksum 46242 but expected 46244

I can reproduce the problem with the following script:

WORKDIR=/home/fcanovai/tmp
psql -c "CREATE DATABASE pgbench"
pgbench -i -s 100 --foreign-keys pgbench
mkdir $WORKDIR/tbsp
psql -c "CREATE TABLESPACE tbsp LOCATION '$WORKDIR/tbsp'"
psql -c "ALTER DATABASE pgbench SET TABLESPACE tbsp"

Regards,
Francesco

-- Francesco Canovai - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
francesco.cano...@2ndquadrant.it | www.2ndQuadrant.it
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Tue, Jan 6, 2015 at 11:09 AM, Michael Paquier wrote:
> On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao wrote:
>> On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote:
>> The patch 1 cannot be applied to the master successfully because of
>> recent change.
> Yes, that's caused by ccb161b. Attached are rebased versions.
>
>>> - The real stuff comes with patch 2, that implements the removal of
>>> PGLZ_Header, changing the APIs of compression and decompression to pglz to
>>> not have anymore toast metadata, this metadata being now localized in
>>> tuptoaster.c. Note that this patch protects the on-disk format (tested with
>>> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
>>> compression and decompression look like with this patch, simply performing
>>> operations from a source to a destination:
>>> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
>>>                            const PGLZ_Strategy *strategy);
>>> extern int32 pglz_decompress(const char *source, char *dest,
>>>                              int32 compressed_size, int32 raw_size);
>>> The return value of those functions is the number of bytes written in the
>>> destination buffer, and 0 if the operation failed.
>>
>> So it's guaranteed that 0 is never returned in the success case? I'm not sure
>> if that case can really happen, though.
> This is an inspiration from the lz4 APIs. Wouldn't it be buggy for a
> compression algorithm to return a size of 0 bytes as compressed or
> decompressed length btw? We could as well make it return a negative
> value when a failure occurs if you feel more comfortable with it.

I feel that's better. Attached is the updated version of the patch. I changed pg_lzcompress and pg_lzdecompress so that they return -1 when a failure happens. I also applied some cosmetic changes to the patch (e.g., shortened the long names of the newly-added macros). Barring any objection, I will commit this.
Regards,

--
Fujii Masao

*** a/src/backend/access/heap/tuptoaster.c
--- b/src/backend/access/heap/tuptoaster.c
***************
*** 35,43 ****
  #include "access/tuptoaster.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
  #include "miscadmin.h"
  #include "utils/fmgroids.h"
- #include "utils/pg_lzcompress.h"
  #include "utils/rel.h"
  #include "utils/typcache.h"
  #include "utils/tqual.h"
--- 35,43 ----
  #include "access/tuptoaster.h"
  #include "access/xact.h"
  #include "catalog/catalog.h"
+ #include "common/pg_lzcompress.h"
  #include "miscadmin.h"
  #include "utils/fmgroids.h"
  #include "utils/rel.h"
  #include "utils/typcache.h"
  #include "utils/tqual.h"
***************
*** 45,50 ****
--- 45,70 ----
  #undef TOAST_DEBUG

+ /*
+  * The information at the start of the compressed toast data.
+  */
+ typedef struct toast_compress_header
+ {
+ 	int32		vl_len_;		/* varlena header (do not touch directly!) */
+ 	int32		rawsize;
+ } toast_compress_header;
+
+ /*
+  * Utilities for manipulation of header information for compressed
+  * toast entries.
+  */
+ #define TOAST_COMPRESS_HDRSZ		((int32) sizeof(toast_compress_header))
+ #define TOAST_COMPRESS_RAWSIZE(ptr)	(((toast_compress_header *) ptr)->rawsize)
+ #define TOAST_COMPRESS_RAWDATA(ptr) \
+ 	(((char *) ptr) + TOAST_COMPRESS_HDRSZ)
+ #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
+ 	(((toast_compress_header *) ptr)->rawsize = len)
+
  static void toast_delete_datum(Relation rel, Datum value);
  static Datum toast_save_datum(Relation rel, Datum value,
  			   struct varlena * oldexternal, int options);
***************
*** 53,58 ****
--- 73,79 ----
  static bool toastid_valueid_exists(Oid toastrelid, Oid valueid);
  static struct varlena *toast_fetch_datum(struct varlena * attr);
  static struct varlena *toast_fetch_datum_slice(struct varlena * attr,
  						 int32 sliceoffset, int32 length);
+ static struct varlena *toast_decompress_datum(struct varlena * attr);
  static int toast_open_indexes(Relation toastrel,
  			   LOCKMODE lock,
  			   Relation **toastidxs,
*************** heap_tuple_untoast_attr(struct varlena * attr)
*** 138,148 ****
  	/* If it's compressed, decompress it */
  	if (VARATT_IS_COMPRESSED(attr))
  	{
! 		PGLZ_Header *tmp = (PGLZ_Header *) attr;
!
! 		attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 		SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 		pglz_decompress(tmp, VARDATA(attr));
  		pfree(tmp);
  	}
  }
--- 159,166 ----
  	/* If it's compressed, decompress it */
  	if (VARATT_IS_COMPRESSED(attr))
  	{
! 		struct varlena *tmp = attr;
! 		attr = toast_decompress_datum(tmp);
  		pfree(tmp);
  	}
  }
*************** heap_tuple_untoast_attr(struct varlena * attr)
*** 163,173 ****
  		/*
  		 * This is a compressed value inside of the main tuple
  		 */
! 		PGLZ_Header *tmp = (PGLZ_Header *) attr;
!
! 		attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 		SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
! 		pglz_decompress(tmp, VARDATA(attr));