date:20141112

On Wed, Nov 12, 2014 at 10:06 PM, Petr Jelinek  wrote:
> Brief list of changes:
>  - the commit timestamp record now stores timestamp, lsn and nodeid
Now that not only the commit timestamp is stored, calling that "commit
timestamp", "committs" or "commit_timestamp" is strange, no? If this
patch is moving toward being a more complex information provider,
calling it commit information or commit data is more adapted, no?
Documentation would need a fresh brush as well in this case.

>  - added code to disallow turning track_commit_timestamp on with too small
> pagesize
>  - the get interfaces error out when track_commit_timestamp is off
OK, that's sane.

>  - if the xid passed to get interface is out of range -infinity timestamp is
> returned (I think it's bad idea to throw errors here as the valid range is
> not static and same ID can start throwing errors between calls
> theoretically)
Already mentioned by Jim in a previous mail: this would be better as NULL.

>  - renamed the sql interfaces to pg_xact_commit_timestamp,
> pg_xact_commit_timestamp_data and pg_last_committed_xact, they don't expose
> the nodeid atm, I personally am not big fan of the "xact" but it seems more
> consistent with existing naming
pg_xact_commit_timestamp and pg_xact_commit_timestamp_data are
overlapping. What's wrong with a single function able to return the
whole set (node ID, commit timetamp, commit LSN)? Let's say
pg_xact_commit_information or pg_xact_commit_data. Already mentioned,
but I also find using a SRF able to return all the available
information from a given XID value quite useful. And this does not
conflict with what is proposed currently, you would need just to call
the function with XID + number of entries wanted to get a single one.
Comments from other folks about that?

>  - documented pg_resetxlog changes and make all the pg_resetxlog options
> alphabetically ordered
>  - added WAL logging of the track_commit_timestamp GUC
>  - added alternative expected output of the regression test so that it works
> with make installcheck when track_commit_timestamp is on
>  - added C interface to set default nodeid for current backend
>  - several minor comment and naming adjustments mostly suggested by Michael
Thanks for those adjustments.

Then more input about the latest patch:
1) This block is not needed, option -e is listed twice:
The -o, -x, -e,
-   -m, -O,
-   and -l
+   -m, -O, -l
+   and -e
2) Very small thing: a couple of files have no newlines at the end,
among them committs.conf and test_committs/Makefile.
3) pg_last_committed_xact and not pg_last_xact_commit_information or similar?
4) storage.sgml needs to be updated with the new folder pg_committs
5) Er.. node ID is missing in pg_last_committed_xact, no?
6) This XXX notice can be removed:
+   /*
+* Return empty if the requested value is older than what we have or
+* newer than newest we have.
+*
+* XXX: should this be error instead?
+*/
We are moving toward returning invalid information in the SQL
interface when the information is not in history instead of an error,
no? (Note that I am still a partisan of an error message to let the
caller know that commit info history does not have the information
requested).
7) Note that TransactionTreeSetCommitTsData still never sets do_xlog
at true and that WriteSetTimestampXlogRec never gets called. So no
information is WAL-logged with this patch. Wouldn't that be useful for
standbys as well? Perhaps I am missing why this is disabled? This code
should be activated IMO or it would be just untested.
8) As a more general point, the node ID stuff makes me uncomfortable
and is just added on top of the existing patch without much
thinking... So I am really skeptical about it. The need here is to
pass on demand a int8 that is a node ID that can only be set through a
C interface, so only extensions could play with it. The data passed to
a WAL record is always built and determined by the system and entirely
transparent to the user, inserting user-defined data like that
inconsistent with what we've been doing until now, no?

Also, a question particularly for BDR and Slony folks: do you
sometimes add a new node using the base backup of an existing node :)
See what I come up with: a duplication of this new node ID system with
the already present system ID, no?
Similarly, the LSN is added to the WAL record containing the commit
timestamp, but cannot the LSN of the WAL record containing the commit
timestamp itself be used as a point of reference for a better
ordering? That's not exactly the same as the LSN of the transaction
commit, still it provides a WAL-based reference.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race in "tablespace" test on Windows

2014-11-12 Thread Amit Kapila

On Thu, Nov 13, 2014 at 8:46 AM, Noah Misch  wrote:
>
> On Tue, Nov 11, 2014 at 10:21:26AM +0530, Amit Kapila wrote:
> > On Sat, Nov 8, 2014 at 10:34 AM, Noah Misch  wrote:
> > > Here is a briefer command sequence exhibiting the same problem:
> > >
> > > To make this work as well on Windows as it does elsewhere, DROP
TABLESPACE
> > > would need to wait for other backends to close relevant unlinked
files.
> > > Perhaps implement "wait_unlinked_files(const char *dirname)" to poll
> > unlinked,
> > > open files until they disappear.  (An attempt to open an unlinked file
> > reports
> > > ERROR_ACCESS_DENIED.  It might be tricky to reliably distinguish this
> > cause
> > > from other causes of that error, but it should be possible.)
> >
> > I think the proposed mechanism can work but the wait can be very long
> > (untill the backend holding descriptor executes another command).
>
> The DROP TABLESPACE could send a catchup interrupt.
>

Yeah, that can work.

> > Can we think of some other solution like in Drop Tablespace instead of
> > checking if directory is empty, check if there is no object that belongs
> > to database/cluster, then allow to forcibly delete that directory
someway.
>
> I'm not aware of a way to forcibly delete the directory.  One could rename
> files to the tablespace top-level directory just before unlinking them.
Since
> DROP TABLESPACE never removes that directory, their continued presence
there
> would not pose a problem.  (Compare use of the rename-before-unlink trick
in
> RemoveOldXlogFiles().)  That adds the overhead of an additional system
call to
> every unlink, which might be acceptable.  It may be possible to rename
after
> unlink, as-needed in DROP TABLESPACE.
>

Right.  I think we can discuss further about which approach is better,
once someone decides to work on this issue.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] tracking commit timestamps

On Thu, Nov 13, 2014 at 7:56 AM, Jim Nasby  wrote:
> On 11/12/14, 7:06 AM, Petr Jelinek wrote:
>>
>>   - if the xid passed to get interface is out of range -infinity timestamp
>> is returned (I think it's bad idea to throw errors here as the valid range
>> is not static and same ID can start throwing errors between calls
>> theoretically)
>
>
> Wouldn't NULL be more appropriate?
Definitely. Defining a given value for information not valid is awkward.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Tue, Nov 11, 2014 at 3:52 PM, Peter Geoghegan  wrote:
> I'm pretty puzzled by this. Other than our "agree to disagree and
> defer to committer" position on the question of whether or not more
> than one suggestion can come from a single RTE, which you were fine
> with before [1], I have only restored the core/contrib separation to a
> state recently suggested by Robert as the best and simplest all around
> [2].
> Did I miss something else?
My point is: I am not sure I can be defined as a reviewer of this
patch or take any credit in this patch review knowing that the latest
version submitted is a simple rebase of the version I did my first
review on. Hence, code speaking, this patch is in the same state as
when it has been firstly submitted.
Thanks,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

Thanks for reviewing the patch!

On Thu, Nov 13, 2014 at 4:05 AM, Alvaro Herrera
 wrote:
> Fujii Masao wrote:
>
>> --- 127,152 
>>When this option is used, pg_receivexlog will 
>> report
>>a flush position to the server, indicating when each segment has 
>> been
>>synchronized to disk so that the server can remove that segment 
>> if it
>> !  is not otherwise needed. --synchronous option 
>> must
>> ! be specified when making pg_receivexlog run as
>> ! synchronous standby by using replication slot. Otherwise WAL data
>> ! cannot be flushed frequently enough for this to work correctly.
>>   
>> 
>>
>
> Whitespace damage here.

Fixed.

>> + printf(_("  --synchronous  flush transaction log in real 
>> time\n"));
>
> "in real time" sounds odd.  How about "flush transaction log
> immediately after writing", or maybe "have transaction log writes be
> synchronous".

The former sounds better to me. So I chose it.

>> --- 781,791 
>>   now = feGetCurrentTimestamp();
>>
>>   /*
>> !  * Issue sync command as soon as there are WAL data which
>> !  * has not been flushed yet if synchronous option is true.
>>*/
>>   if (lastFlushPosition < blockpos &&
>> ! walfile != -1 && synchronous)
>
> I'd put the "synchronous" condition first in the if(), and start the
> comment with it rather than putting it at the end.  Both seem weird.

Fixed, i.e., moved the "synchronous" condition first in the if()'s test
and also moved the comment "If synchronous option is true" also first
in the comment.

>>   progname, 
>> current_walfile_name, strerror(errno));
>>   goto error;
>>   }
>>   lastFlushPosition = blockpos;
>> !
>> ! /*
>> !  * Send feedback so that the server sees the latest 
>> WAL locations
>> !  * immediately if synchronous option is true.
>> !  */
>> ! if (!sendFeedback(conn, blockpos, now, false))
>> ! goto error;
>> ! last_status = now;
>
> I'm not clear about this comment .. why does it say "if synchronous
> option is true" when it's not checking the condition?

I added that comment because the code exists with the if() block
checking "synchronous" condition. But it seems confusing. Just removed
that part from the comment.

Attached is the updated version of the patch.

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/ref/pg_receivexlog.sgml
--- b/doc/src/sgml/ref/pg_receivexlog.sgml
***
*** 49,54  PostgreSQL documentation
--- 49,61 

  

+Unlike the standby's WAL receiver, pg_receivexlog
+flushes WAL data only when WAL file is closed, by default.
+--synchronous option must be specified to flush WAL data
+in real time and ensure it's safely flushed to disk.
+   
+ 
+   
 The transaction log is streamed over a regular
 PostgreSQL connection, and uses the replication
 protocol. The connection must be made with a superuser or a user
***
*** 86,106  PostgreSQL documentation
   
  
   
--F interval
---fsync-interval=interval
-
- 
- Specifies the maximum time to issue sync commands to ensure the
- received WAL file is safely flushed to disk, in seconds. The default
- value is zero, which disables issuing fsyncs except when WAL file is
- closed. If -1 is specified, WAL file is flushed as
- soon as possible, that is, as soon as there are WAL data which has
- not been flushed yet.
- 
-
-   
- 
-  
-n
--no-loop

--- 93,98 
***
*** 135,150  PostgreSQL documentation
   When this option is used, pg_receivexlog will report
   a flush position to the server, indicating when each segment has been
   synchronized to disk so that the server can remove that segment if it
!  is not otherwise needed.  When using this parameter, it is important
!  to make sure that pg_receivexlog cannot become the
!  synchronous standby through an incautious setting of
!  ; it does not flush
!  data frequently enough for this to work correctly.
  

   
  
   
-v
--verbose

--- 127,152 
   When this option is used, pg_receivexlog will report
   a flush position to the server, indicating when each segment has been
   synchronized to disk so that the server can remove that segment if it
!  is not otherwise needed. --synchronous option must
!  be specified when making pg_receivexlog run as
!  synchronous standby by using r

Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

On Wed, Nov 12, 2014 at 9:30 PM, Fujii Masao  wrote:
> On Wed, Nov 12, 2014 at 12:40 AM, Tom Lane  wrote:
>> Fujii Masao  writes:
>>> On Tue, Nov 11, 2014 at 10:14 PM, Robert Haas  wrote:
 Not to kibitz too much after-the-fact, but wouldn't it be better to
 give this a name that has "GIN" in it somewhere?
>>
>>> Maybe. gin_pending_list_cleanup_size? gin_pending_list_limit? Better name?
>>
>> gin_pending_list_limit sounds good to me.
>
> OK, barring any objection, I will rename the reloption and GUC to
> gin_pending_list_limit.

Done.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race in "tablespace" test on Windows

2014-11-12 Thread Noah Misch

On Tue, Nov 11, 2014 at 10:21:26AM +0530, Amit Kapila wrote:
> On Sat, Nov 8, 2014 at 10:34 AM, Noah Misch  wrote:
> > Here is a briefer command sequence exhibiting the same problem:
> >
> > CREATE TABLESPACE testspace LOCATION '...somewhere...';
> > CREATE TABLE atable (c int) tablespace testspace;
> > SELECT COUNT(*) FROM atable;-- open heap
> > \c -
> > ALTER TABLE atable SET TABLESPACE pg_default;
> > DROP TABLESPACE testspace;  -- bug: fails sometimes
> > DROP TABLESPACE testspace;  -- second one ~always works
> > DROP TABLE atable;
> >
> 
> For me, it doesn't get success even second time, I am getting
> the same error until I execute some command on first session
> which means till first session has processed the invalidation
> messages.
> 
> postgres=# Drop tablespace tbs;
> ERROR:  tablespace "tbs" is not empty
> postgres=# Drop tablespace tbs;
> ERROR:  tablespace "tbs" is not empty
> 
> I have tested this on Windows 7.

The behavior you see makes sense if you have a third, idle backend.  I had
only the initial backend and the "\c"-created second one.

> > To make this work as well on Windows as it does elsewhere, DROP TABLESPACE
> > would need to wait for other backends to close relevant unlinked files.
> > Perhaps implement "wait_unlinked_files(const char *dirname)" to poll
> unlinked,
> > open files until they disappear.  (An attempt to open an unlinked file
> reports
> > ERROR_ACCESS_DENIED.  It might be tricky to reliably distinguish this
> cause
> > from other causes of that error, but it should be possible.)
> 
> I think the proposed mechanism can work but the wait can be very long
> (untill the backend holding descriptor executes another command).

The DROP TABLESPACE could send a catchup interrupt.

> Can we think of some other solution like in Drop Tablespace instead of
> checking if directory is empty, check if there is no object that belongs
> to database/cluster, then allow to forcibly delete that directory someway.

I'm not aware of a way to forcibly delete the directory.  One could rename
files to the tablespace top-level directory just before unlinking them.  Since
DROP TABLESPACE never removes that directory, their continued presence there
would not pose a problem.  (Compare use of the rename-before-unlink trick in
RemoveOldXlogFiles().)  That adds the overhead of an additional system call to
every unlink, which might be acceptable.  It may be possible to rename after
unlink, as-needed in DROP TABLESPACE.

> > I propose to add
> > this as a TODO, then bandage the test case with s/^\\c -$/RESET ROLE;/.
> 
> Yeah, this make sense.

Done.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_multixact not getting truncated

Jim Nasby wrote:
> On 11/10/14, 12:16 AM, Josh Berkus wrote:
> >On 11/09/2014 08:00 PM, Josh Berkus wrote:
> >On 11/08/2014 01:46 PM, Andres Freund wrote:
> >>>I'm these days suggesting that people should add manual vacuuming for
> "older" relations during off peak hours on busy databases. There's too
> many sites which service degrades noticeably during a full table vacuum.
> >>Me too: https://github.com/pgexperts/flexible-freeze
> >
> >It turns out that not even a program of preventative scheduled vacuuming
> >helps.  This is because the template0 database anchors the minmxid and
> >prevents it from being advanced until autovacuum gets around to that
> >database, at whatever the minmxid threshold is.
> 
> How did template0 even get a MultiXact? That sounds like they're really 
> abusing the template databases. :( (Do keep in mind that MXID 1 is a special 
> value.)

No, it's normal -- template0 does not have a multixact in any tuple's
xmax, but datminxid is set to the value that is current when it is
frozen.

> BTW, the only reason I know of not to set both min_age parameters to
> zero is to prevent loss of forensic information. If that's not a
> concern you can always just set them to zero. Even if it is a concern,
> I suspect that the forensic info you could gather from a MultiXact is
> a lot more limited than for an XID, so it's probably pretty safe
> setting that to zero.

Freezing tuples too early could cause useless dirtying of pages; if the
tuple is deleted, updated or locked again after being frozen, you end up
with more writes.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

Andreas Karlsson wrote:
> On 11/13/2014 02:03 AM, Andreas Karlsson wrote:
> >Here is version 2 of the patch which detects the presence of gcc/clang
> >style 128-bit integers and has been cleaned up to a reviewable state. I
> >have not added support for any other compilers since I found no
> >documentation 128-bit support with icc or MSVC. I do not have access to
> >any Windows machines either.
> >
> >A couple of things I was not sure about was the naming of the new
> >functions and if I should ifdef the size of the aggregate state in the
> >catalog or not.

configure is a generated file.  If your patch touches it but not
configure.in, there is a problem.

> diff --git a/configure b/configure



-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] On the warpath again about ill-considered inclusion nests

Tom Lane wrote:
> I noticed that the recent custom-path commit completely ignored my
> advice about not including executor headers into planner headers or
> vice versa.  On the way to fixing that, I was dismayed to discover
> that the RLS patch has utterly bollixed all semblance of modularization
> of the headers.  src/include/rewrite/rowsecurity.h, which one would
> reasonably think to be a rewriter header (nevermind its header comment
> to the contrary), nonetheless includes execnodes.h (executor stuff)
> and relation.h (planner stuff), neither of which a rewriter header
> has any business including.  And if that weren't bad enough, it's
> been included into utils/rel.h (relcache), which is close enough
> to guaranteeing that all planner and executor symbols are visible
> in every darn module we've got.  Might as well just put everything
> we have in postgres.h and abandon all pretense of modularity.

I noticed the RLS side of things a week ago as well, and wasn't very
pleased about it.  I don't know about an axe, but we do need some
serious cleanup.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2014-11-12 Thread Andreas Karlsson


On 11/13/2014 02:03 AM, Andreas Karlsson wrote:

Here is version 2 of the patch which detects the presence of gcc/clang
style 128-bit integers and has been cleaned up to a reviewable state. I
have not added support for any other compilers since I found no
documentation 128-bit support with icc or MSVC. I do not have access to
any Windows machines either.

A couple of things I was not sure about was the naming of the new
functions and if I should ifdef the size of the aggregate state in the
catalog or not.


The correct file is attached in the message.

--
Andreas Karlsson
diff --git a/configure b/configure
new file mode 100755
index c4f70e8..f11f1c8
*** a/configure
--- b/configure
*** _ACEOF
*** 13734,13739 
--- 13734,13751 
  
  fi
  
+ ac_fn_c_check_type "$LINENO" "__int128_t" "ac_cv_type___int128_t" "#include 
+ "
+ ac_fn_c_check_type "$LINENO" "__uint128_t" "ac_cv_type___uint128_t" "#include 
+ "
+ if test "x$ac_cv_type___int128_t" = xyes && test "x$ac_cv_type___uint128_t" = xyes; then :
+ 
+ cat >>confdefs.h <<_ACEOF
+ #define HAVE_GCC_INT128_T 1
+ _ACEOF
+ 
+ fi
+ 
  
  # We also check for sig_atomic_t, which *should* be defined per ANSI
  # C, but is missing on some old platforms.
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index d61af92..d53905c
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*** static void apply_typmod(NumericVar *var
*** 402,407 
--- 402,410 
  static int32 numericvar_to_int4(NumericVar *var);
  static bool numericvar_to_int8(NumericVar *var, int64 *result);
  static void int8_to_numericvar(int64 val, NumericVar *var);
+ #ifdef HAVE_INT128
+ static void int16_to_numericvar(int128 val, NumericVar *var);
+ #endif
  static double numeric_to_double_no_overflow(Numeric num);
  static double numericvar_to_double_no_overflow(NumericVar *var);
  
*** numeric_float4(PG_FUNCTION_ARGS)
*** 2639,2644 
--- 2642,2650 
   * Actually, it's a pointer to a NumericAggState allocated in the aggregate
   * context.  The digit buffers for the NumericVars will be there too.
   *
+  * On platforms which support 128-bit integers some aggergates instead use a
+  * 128-bit integer based transition datatype to speed up calculations.
+  *
   * --
   */
  
*** numeric_accum_inv(PG_FUNCTION_ARGS)
*** 2897,2902 
--- 2903,2967 
  	PG_RETURN_POINTER(state);
  }
  
+ #ifdef HAVE_INT128
+ typedef struct Int16AggState
+ {
+ 	bool	calcSumX2;	/* if true, calculate sumX2 */
+ 	int64	N;			/* count of processed numbers */
+ 	int128	sumX;		/* sum of processed numbers */
+ 	int128	sumX2;		/* sum of squares of processed numbers */
+ } Int16AggState;
+ 
+ /*
+  * Prepare state data for a 128-bit aggregate function that needs to compute
+  * sum, count and optionally sum of squares of the input.
+  */
+ static Int16AggState *
+ makeInt16AggState(FunctionCallInfo fcinfo, bool calcSumX2)
+ {
+ 	Int16AggState *state;
+ 	MemoryContext agg_context;
+ 	MemoryContext old_context;
+ 
+ 	if (!AggCheckCallContext(fcinfo, &agg_context))
+ 		elog(ERROR, "aggregate function called in non-aggregate context");
+ 
+ 	old_context = MemoryContextSwitchTo(agg_context);
+ 
+ 	state = (Int16AggState *) palloc0(sizeof(Int16AggState));
+ 	state->calcSumX2 = calcSumX2;
+ 
+ 	MemoryContextSwitchTo(old_context);
+ 
+ 	return state;
+ }
+ 
+ /*
+  * Accumulate a new input value for 128-bit aggregate functions.
+  */
+ static void
+ do_int16_accum(Int16AggState *state, int128 newval)
+ {
+ 	if (state->calcSumX2)
+ 		state->sumX2 += newval * newval;
+ 
+ 	state->sumX += newval;
+ 	state->N++;
+ }
+ 
+ /*
+  * Remove an input value from the aggregated state.
+  */
+ static void
+ do_int16_discard(Int16AggState *state, int128 newval)
+ {
+ 	if (state->calcSumX2)
+ 		state->sumX2 -= newval * newval;
+ 
+ 	state->sumX -= newval;
+ 	state->N--;
+ }
+ #endif
  
  /*
   * Integer data types all use Numeric accumulators to share code and
*** numeric_accum_inv(PG_FUNCTION_ARGS)
*** 2905,2915 
--- 2970,2996 
   * for the sum(X*X) value.  Hence, we use int2_accum and int4_accum only
   * for stddev/variance --- there are faster special-purpose accumulator
   * routines for SUM and AVG of these datatypes.
+  *
+  * Similarily we can, where available, use 128-bit integer accumulators
+  * for sum(X) for int8 and sum(X*X) for int2 and int4, but not sum(X*X)
+  * for int8.
   */
  
  Datum
  int2_accum(PG_FUNCTION_ARGS)
  {
+ #ifdef HAVE_INT128
+ 	Int16AggState *state;
+ 
+ 	state = PG_ARGISNULL(0) ? NULL : (Int16AggState *) PG_GETARG_POINTER(0);
+ 
+ 	/* Create the state data on the first call */
+ 	if (state == NULL)
+ 		state = makeInt16AggState(fcinfo, true);
+ 
+ 	if (!PG_ARGISNULL(1))
+ 		do_int16_accum(state, (in128) PG_GETARG_INT16(1));
+ #else
  	NumericAggState *state;
  
  	state = PG_ARGISNULL(0) ? NULL : (N

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On Thu, Nov 13, 2014 at 10:26 AM, Michael Paquier
 wrote:
> On Thu, Nov 13, 2014 at 9:31 AM, Andres Freund  wrote:
>> I don't recall what the problem with just swapping the names was - but
>> I'm pretty sure there was one... Hm. The index relation oids are
>> referred to by constraints and dependencies. That's somewhat
>> solvable. But I think there was something else as well...
> The reason given 2 years ago for not using relname was the fast that
> the oid of the index changes, and to it be refered by some pg_depend
> entries:
Feel free to correct: "and that it could be referred".
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On Thu, Nov 13, 2014 at 9:31 AM, Andres Freund  wrote:
> I don't recall what the problem with just swapping the names was - but
> I'm pretty sure there was one... Hm. The index relation oids are
> referred to by constraints and dependencies. That's somewhat
> solvable. But I think there was something else as well...
The reason given 2 years ago for not using relname was the fast that
the oid of the index changes, and to it be refered by some pg_depend
entries:
http://www.postgresql.org/message-id/20121208133730.ga6...@awork2.anarazel.de
http://www.postgresql.org/message-id/12742.1354977...@sss.pgh.pa.us
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

Andres Freund wrote:
> On 2014-11-12 18:23:38 -0500, Robert Haas wrote:

> > > The problem is that it's very hard to avoid the wrong index's
> > > relfilenode being used when swapping the relfilenodes between two
> > > indexes.
> > 
> > How about storing both the old and new relfilenodes in the same pg_class 
> > entry?
> 
> That's quite a cool idea
> 
> [think a bit]
> 
> But I think it won't work realistically. We have a *lot* of
> infrastructure that refers to indexes using it's primary key.

Hmm, can we make the relmapper do this job instead of having another
pg_class column?  Essentially the same sketch Robert proposed, instead
we would initially set relfilenode=0 and have all onlookers use the
relmapper to obtain the correct relfilenode; switching to the new
relfilenode can be done atomically, and un-relmap the index once the
process is complete.

The difference from what Robert proposes is that the transient state is
known to cause failures for anyone not prepared to deal with it, so it
should be easy to spot what places need adjustment.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2014-11-12 Thread Andreas Karlsson


Hi,

Here is version 2 of the patch which detects the presence of gcc/clang 
style 128-bit integers and has been cleaned up to a reviewable state. I 
have not added support for any other compilers since I found no 
documentation 128-bit support with icc or MSVC. I do not have access to 
any Windows machines either.


A couple of things I was not sure about was the naming of the new 
functions and if I should ifdef the size of the aggregate state in the 
catalog or not.


--
Andreas Karlsson

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 4:54 PM, Peter Geoghegan  wrote:
> Attached patch moves the Levenshtein distance implementation into core.

Oops. Somehow managed to send a *.patch.swp file.  :-)

Here is the actual patch.

-- 
Peter Geoghegan
From b7df918f1a52107637600f3b22d1cff18bd07ae1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 30 Nov 2013 23:15:00 -0800
Subject: [PATCH 1/2] Move Levenshtein distance implementation into core

The fuzzystmatch Levenshtein distance implementation is moved from
/contrib to core.  However, the related SQL-callable functions may still
only be used with the fuzzystmatch extension installed -- the
fuzzystmatch definitions become simple forwarding stubs.

It is not anticipated that the user-facing SQL functions will be moved
into core in the future, due to the MAX_LEVENSHTEIN_STRLEN restriction.
An in-core Levenshtein distance implementation is only anticipated to be
helpful in building diagnostic messages.
---
 contrib/fuzzystrmatch/Makefile|   3 -
 contrib/fuzzystrmatch/fuzzystrmatch.c |  81 ---
 contrib/fuzzystrmatch/levenshtein.c   | 403 --
 src/backend/utils/adt/Makefile|   2 +
 src/backend/utils/adt/levenshtein.c   | 394 +
 src/backend/utils/adt/varlena.c   |  24 ++
 src/include/utils/builtins.h  |   5 +
 7 files changed, 481 insertions(+), 431 deletions(-)
 delete mode 100644 contrib/fuzzystrmatch/levenshtein.c
 create mode 100644 src/backend/utils/adt/levenshtein.c

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 024265d..0327d95 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -17,6 +17,3 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
-
-# levenshtein.c is #included by fuzzystrmatch.c
-fuzzystrmatch.o: fuzzystrmatch.c levenshtein.c
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index 7a53d8a..62e650f 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -154,23 +154,6 @@ getcode(char c)
 /* These prevent GH from becoming F */
 #define NOGHTOF(c)	(getcode(c) & 16)	/* BDH */
 
-/* Faster than memcmp(), for this use case. */
-static inline bool
-rest_of_char_same(const char *s1, const char *s2, int len)
-{
-	while (len > 0)
-	{
-		len--;
-		if (s1[len] != s2[len])
-			return false;
-	}
-	return true;
-}
-
-#include "levenshtein.c"
-#define LEVENSHTEIN_LESS_EQUAL
-#include "levenshtein.c"
-
 PG_FUNCTION_INFO_V1(levenshtein_with_costs);
 Datum
 levenshtein_with_costs(PG_FUNCTION_ARGS)
@@ -180,8 +163,20 @@ levenshtein_with_costs(PG_FUNCTION_ARGS)
 	int			ins_c = PG_GETARG_INT32(2);
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, ins_c, del_c, sub_c));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, ins_c,
+ del_c, sub_c));
 }
 
 
@@ -191,8 +186,20 @@ levenshtein(PG_FUNCTION_ARGS)
 {
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
-
-	PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven(s_data, s_bytes, t_data, t_bytes, 1, 1,
+	   1));
 }
 
 
@@ -206,8 +213,20 @@ levenshtein_less_equal_with_costs(PG_FUNCTION_ARGS)
 	int			del_c = PG_GETARG_INT32(3);
 	int			sub_c = PG_GETARG_INT32(4);
 	int			max_d = PG_GETARG_INT32(5);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, ins_c, del_c, sub_c, max_d));
+	const char *s_data;
+	const char *t_data;
+	int			s_bytes,
+t_bytes;
+
+	/* Extract a pointer to the actual character data */
+	s_data = VARDATA_ANY(src);
+	t_data = VARDATA_ANY(dst);
+	/* Determine length of each string in bytes and characters */
+	s_bytes = VARSIZE_ANY_EXHDR(src);
+	t_bytes = VARSIZE_ANY_EXHDR(dst);
+
+	PG_RETURN_INT32(varstr_leven_less_equal(s_data, s_bytes, t_data, t_bytes,
+			ins_c, del_c, sub_c, max_d));
 }
 
 
@@ -218,8 +237,20 @@ levenshtein_less_equal(PG_FUNCTION_ARGS)
 	text	   *src = PG_GETARG_TEXT_PP(0);
 	text	   *dst = PG_GETARG_TEXT_PP(1);
 	int			max_d = PG_GETARG_INT32(2);
-
-	PG_RETURN_INT32(levenshtein_less_equal_internal(src, dst, 1, 1, 1, max_d));
+	const char *s_data;
+

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 1:13 PM, Peter Geoghegan  wrote:
> On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas  wrote:
>> I agree with your proposed approach to moving Levenshtein into core.
>> However, I think this should be separated into two patches, one of
>> them moving the Levenshtein functionality into core, and the other
>> adding the new treatment for missing column errors.  If you can do
>> that relatively soon, I'll make an effort to get the refactoring patch
>> committed in the near future.  Once that's done, we can focus in on
>> the interesting part of the patch, which is the actual machinery for
>> suggesting alternatives.
>
> Okay, thanks. I think I can do that fairly soon.

Attached patch moves the Levenshtein distance implementation into core.

You're missing patch 2 of 2 here, because I have yet to incorporate
your feedback on the HINT itself -- when I've done that, I'll post a
newly rebased patch 2/2, with those items taken care of. As you
pointed out, there is no reason to wait for that.

-- 
Peter Geoghegan


.0001-Move-Levenshtein-distance-implementation-into-core.patch.swp
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On 2014-11-12 18:23:38 -0500, Robert Haas wrote:
> On Wed, Nov 12, 2014 at 4:39 PM, Andres Freund  wrote:
> > On 2014-11-12 16:11:58 -0500, Robert Haas wrote:
> >> On Wed, Nov 12, 2014 at 4:10 PM, Robert Haas  wrote:
> >> > On Thu, Nov 6, 2014 at 9:50 AM, Peter Eisentraut  wrote:
> >> >> If REINDEX cannot work without an exclusive lock, we should invent some
> >> >> other qualifier, like WITH FEWER LOCKS.
> >> >
> >> > What he said.
> >
> > I'm unconvinced. A *short* exclusive lock (just to update two pg_class
> > row), still gives most of the benefits of CONCURRENTLY.
> 
> I am pretty doubtful about that.  It's still going to require you to
> wait for all transactions to drain out of the table while new ones are
> blocked from entering.  Which sucks.  Unless all of your transactions
> are very short, but that's not necessarily typical.

Yes, it sucks. But it beats not being able to reindex a relation with a
primary key (referenced by a fkey) without waiting several hours by a
couple magnitudes. And that's the current situation.

> > The problem is that it's very hard to avoid the wrong index's
> > relfilenode being used when swapping the relfilenodes between two
> > indexes.
> 
> How about storing both the old and new relfilenodes in the same pg_class 
> entry?

That's quite a cool idea

[think a bit]

But I think it won't work realistically. We have a *lot* of
infrastructure that refers to indexes using it's primary key. I don't
think we want to touch all those places to also disambiguate on some
other factor. All the relevant APIs are either just passing around oids
or relcache entries.

There's also the problem that we'd really need two different pg_index
rows to make things work. Alternatively we can duplicate the three
relevant columns (indisready, indislive, indislive) in there for the
different filenodes. But that's not entirely pretty.

> 1. Take a snapshot.
> 2. Index all the tuples in that snapshot.
> 3. Publish the new relfilenode to an additional pg_class column,
> relnewfilenode or similar.
> 4. Wait until everyone can see step #3.

Here all backends need to update both indexes, right? And all the
indexing infrastructure can't deal with that without having separate
oids & relcache entries.

> 5. Rescan the table and add any missing tuples to the index.
> 6. Set some flag in pg_class to mark the relnewfilenode as active and
> relfilenode as not to be used for queries.
> 7. Wait until everyone can see step #6.
> 8. Set some flag in pg_class to mark relfilenode as not even to be opened.
> 9. Wait until everyone can see step #8.
> 10. Drop old relfilenode.
> 11. Clean up by setting relfilenode = relnewfilenode, relfilenode = 0.

Even that one isn't trivial - how do you deal with the fact that
somebody looking at updating newrelfilenode might, in the midst of
processing, see newrelfilenode = 0?

I've earlier come up with a couple possible solutions, but I
unfortunately found holes in all of them. And if I can find holes in
them, there surely are more :(.

I don't recall what the problem with just swapping the names was - but
I'm pretty sure there was one... Hm. The index relation oids are
referred to by constraints and dependencies. That's somewhat
solvable. But I think there was something else as well...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] On the warpath again about ill-considered inclusion nests

I noticed that the recent custom-path commit completely ignored my
advice about not including executor headers into planner headers or
vice versa.  On the way to fixing that, I was dismayed to discover
that the RLS patch has utterly bollixed all semblance of modularization
of the headers.  src/include/rewrite/rowsecurity.h, which one would
reasonably think to be a rewriter header (nevermind its header comment
to the contrary), nonetheless includes execnodes.h (executor stuff)
and relation.h (planner stuff), neither of which a rewriter header
has any business including.  And if that weren't bad enough, it's
been included into utils/rel.h (relcache), which is close enough
to guaranteeing that all planner and executor symbols are visible
in every darn module we've got.  Might as well just put everything
we have in postgres.h and abandon all pretense of modularity.

This needs to be cleaned up.  If you don't want me doing it with
an axe, better put some suggestions forward.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

On Wed, Nov 12, 2014 at 9:49 PM, Robert Haas  wrote:
> On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai  wrote:
>> Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?
>
> There are certainly things that are parallel-safe that are not
> immutable.  It might be the case that everything immutable is
> parallel-safe.
FWIW, when working on the concept of expression and clause
shippability for Postgres-XC (aka the possibility to pass it safely to
another backend, but in another PG node in this case), we discussed
similar things and if I recall correctly we even discussed about
adding a flag to pg_proc to define if a function was shippable or not.
Finally what we finished with was not adding a new flag and use as
rule that all the immutable functions can be safely shipped, and
others not, even some stable functions that *could* be safe. Maybe
Ashutosh has more comments on that, my memory may be failing.
In the end, I think that you would finish with something similar.
My 2c.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Teaching pg_dump to use NOT VALID constraints


On 11/10/14, 12:00 PM, Simon Riggs wrote:

On 10 November 2014 17:33, Alvaro Herrera  wrote:


  pg_dump --no-revalidaton

will add "NOT VALID" onto the recreation SQL for any FKs, but only for
ones that were already known to be valid.


Well.  Constraints that haven't been validated already have a NOT VALID
emitted by ruleutils.c, yes?  So what this patch does is add such a
clause for all *other* constraints.  Right?  In other words what it aims
to do is speed up loading of data by skipping the validation step on
restore.  Is that right?


Correct. CHECK constraints are added onto main table so they validate at load.


ISTM we could have the default pg_dump behavior emit NOT VALID
constraints, and add VALIDATE CONSTRAINT commands at the end; that way
the database is usable sooner but the constraints end up marked as
validated by the time the dump is finished.


Yes, may be an even better idea. We'd still want the --no-revalidation
option, AFAICS.

FKs are already "at the end". Perhaps we should add another
"validation" section?

I like the idea, just not sure how long it would take.


Isn't the real use-case here that if constraints were valid when you dumped 
then we shouldn't have to *any* re-validate when we load? (Though, we'd have to 
be careful of that with CHECK because that can call user code...)
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_multixact not getting truncated


On 11/10/14, 12:16 AM, Josh Berkus wrote:

On 11/09/2014 08:00 PM, Josh Berkus wrote:
On 11/08/2014 01:46 PM, Andres Freund wrote:

I'm these days suggesting that people should add manual vacuuming for

"older" relations during off peak hours on busy databases. There's too
many sites which service degrades noticeably during a full table vacuum.

Me too: https://github.com/pgexperts/flexible-freeze


It turns out that not even a program of preventative scheduled vacuuming
helps.  This is because the template0 database anchors the minmxid and
prevents it from being advanced until autovacuum gets around to that
database, at whatever the minmxid threshold is.


How did template0 even get a MultiXact? That sounds like they're really abusing 
the template databases. :( (Do keep in mind that MXID 1 is a special value.)

Regarding linking the two settings, I agree with others that XIDs and MXIDs are 
basically completely independent (as your customer apparently has discovered). 
If you set both of the min_age parameters fairly small then it doesn't matter 
which max limit (the table_age parameters) you hit; you'll get a full scan and 
the low min_age limits will mean you'll get good freezing of both.

The only other thing I can think of would be having yet another set of minimum 
age limits that come into play when you're doing a full scan as opposed to a 
partial one, but that seems like overkill to me. I guess another option would 
be to get more aggressive depending on the size of pg_multixact/...

BTW, the only reason I know of not to set both min_age parameters to zero is to 
prevent loss of forensic information. If that's not a concern you can always 
just set them to zero. Even if it is a concern, I suspect that the forensic 
info you could gather from a MultiXact is a lot more limited than for an XID, 
so it's probably pretty safe setting that to zero.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] On partitioning

On 11/12/14, 5:27 PM, Robert Haas wrote:

Maybe as anyarray, but I think pg_node_tree
>>might even be better.  That can also represent data of some arbitrary
>>type, but it doesn't enforce that everything is uniform.

>
>Of course, the more general you make it, the more likely that it'll be
>impossible to optimize well.

The point for me is just that range and list partitioning probably
need different structure, and hash partitioning, if we want to support
that, needs something else again.  Range partitioning needs an array
of partition boundaries and an array of child OIDs.  List partitioning
needs an array of specific values and a child table OID for each.
Hash partitioning needs something probably quite different.  We might
be able to do it as a pair of arrays - one of type anyarray and one of
type OID - and meet all needs that way.

Another issue is I don't know that we could support multi-key partitions with 
something like an anyarray. Perhaps that's OK as a first pass, but I expect 
it'll be one of the next things folks ask for.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Unintended restart after recovery error

On Wed, Nov 12, 2014 at 4:52 PM, Antonin Houska  wrote:
> Fujii Masao  wrote:
>
>> On Wed, Nov 12, 2014 at 6:52 PM, Antonin Houska  wrote:
>> > While looking at postmaster.c:reaper(), one problematic case occurred to 
>> > me.
>> >
>> >
>> > 1. Startup process signals PMSIGNAL_RECOVERY_STARTED.
>> >
>> > 2. Checkpointer process is forked and immediately dies.
>> >
>> > 3. reaper() catches this failure, calls HandleChildCrash() and thus sets
>> > FatalError to true.
>> >
>> > 4. Startup process exits with non-zero status code too - either due to 
>> > SIGQUIT
>> > received from HandleChildCrash or due to some other failure of the startup
>> > process itself. However, FatalError is already set, because of the previous
>> > crash of the checkpointer. Thus reaper() does not set RecoveryError.
>> >
>> > 5. As RecoverError failed to be set to true, postmaster will try to restart
>> > the cluster, although it apparently should not.
>>
>> Why shouldn't postmaster restart the cluster in that case?
>>
>
> At least for the behavior to be consistent with simpler cases of failed
> recovery (e.g. any FATAL error in StartupXLOG), which end up not restarting
> the cluster.

It's true that if the startup process dies we don't try to restart,
but it's also true that if the checkpointer dies we do try to restart.
I'm not sure why this specific situation should be an exception to
that general rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] On partitioning

On Wed, Nov 12, 2014 at 5:06 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> I thought putting the partition boundaries into pg_inherits was a
>> strange choice.  I'd put it in pg_class, or in pg_partition if we
>> decide to create that.
>
> Yeah.  I rather doubt that we want this mechanism to be very closely
> tied to the existing inheritance features.  If we do that, we are
> going to need a boatload of error checks to prevent people from breaking
> partitioned tables by applying the sort of twiddling that inheritance
> allows.

Well, as I said upthread, I think it would be a pretty poor idea to
imagine that the first version of this feature is going to obsolete
everything we've done with inheritance.  Are we going to reinvent the
machinery to make inheritance children get scanned when the parent
does?  Reinvent Merge Append?

>> Maybe as anyarray, but I think pg_node_tree
>> might even be better.  That can also represent data of some arbitrary
>> type, but it doesn't enforce that everything is uniform.
>
> Of course, the more general you make it, the more likely that it'll be
> impossible to optimize well.

The point for me is just that range and list partitioning probably
need different structure, and hash partitioning, if we want to support
that, needs something else again.  Range partitioning needs an array
of partition boundaries and an array of child OIDs.  List partitioning
needs an array of specific values and a child table OID for each.
Hash partitioning needs something probably quite different.  We might
be able to do it as a pair of arrays - one of type anyarray and one of
type OID - and meet all needs that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On Wed, Nov 12, 2014 at 4:39 PM, Andres Freund  wrote:
> On 2014-11-12 16:11:58 -0500, Robert Haas wrote:
>> On Wed, Nov 12, 2014 at 4:10 PM, Robert Haas  wrote:
>> > On Thu, Nov 6, 2014 at 9:50 AM, Peter Eisentraut  wrote:
>> >> If REINDEX cannot work without an exclusive lock, we should invent some
>> >> other qualifier, like WITH FEWER LOCKS.
>> >
>> > What he said.
>
> I'm unconvinced. A *short* exclusive lock (just to update two pg_class
> row), still gives most of the benefits of CONCURRENTLY.

I am pretty doubtful about that.  It's still going to require you to
wait for all transactions to drain out of the table while new ones are
blocked from entering.  Which sucks.  Unless all of your transactions
are very short, but that's not necessarily typical.

> The problem is that it's very hard to avoid the wrong index's
> relfilenode being used when swapping the relfilenodes between two
> indexes.

How about storing both the old and new relfilenodes in the same pg_class entry?

1. Take a snapshot.
2. Index all the tuples in that snapshot.
3. Publish the new relfilenode to an additional pg_class column,
relnewfilenode or similar.
4. Wait until everyone can see step #3.
5. Rescan the table and add any missing tuples to the index.
6. Set some flag in pg_class to mark the relnewfilenode as active and
relfilenode as not to be used for queries.
7. Wait until everyone can see step #6.
8. Set some flag in pg_class to mark relfilenode as not even to be opened.
9. Wait until everyone can see step #8.
10. Drop old relfilenode.
11. Clean up by setting relfilenode = relnewfilenode, relfilenode = 0.

This is basically CREATE INDEX CONCURRENTLY (without the first step
where we out-wait people who might create now-invalid HOT chains,
because that can't arise with a REINDEX of an existing index) plus
DROP INDEX CONCURRENTLY.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW


On 11/12/14, 9:47 AM, Tom Lane wrote:

Heikki Linnakangas  writes:

On 11/12/2014 05:20 PM, Tom Lane wrote:

On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly.   That has the advantage that any code
that doesn't get updated will fail clearly and reliably.



Yeah, I'm leaning to that approach as well. It's made more ugly by the
fact that you sometimes need a cleanup lock on the buffer, so the caller
will somehow need to specify whether to get a cleanup lock or a normal
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK.
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code
that's not updated to break even more obviously, at compile-time.


Yeah, I was considering suggesting changing the name of the mode too.
+1 for solving the lock-type problem with 2 modes.

(You could also consider leaving RBM_ZERO in place with its current
semantics, but I think what you've shown here is that there is no
safe way to use it, so probably that's not what we should do.)


If we're tweaking modes, can we avoid zeroing the buffer if we're about to dump 
a full page image into it? Presumably that means we'd have to keep the page 
locked until the image is written...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] tracking commit timestamps


On 11/12/14, 7:06 AM, Petr Jelinek wrote:

  - if the xid passed to get interface is out of range -infinity timestamp is 
returned (I think it's bad idea to throw errors here as the valid range is not 
static and same ID can start throwing errors between calls theoretically)


Wouldn't NULL be more appropriate?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] On partitioning

Robert Haas  writes:
> I thought putting the partition boundaries into pg_inherits was a
> strange choice.  I'd put it in pg_class, or in pg_partition if we
> decide to create that.

Yeah.  I rather doubt that we want this mechanism to be very closely
tied to the existing inheritance features.  If we do that, we are
going to need a boatload of error checks to prevent people from breaking
partitioned tables by applying the sort of twiddling that inheritance
allows.

> Maybe as anyarray, but I think pg_node_tree
> might even be better.  That can also represent data of some arbitrary
> type, but it doesn't enforce that everything is uniform.

Of course, the more general you make it, the more likely that it'll be
impossible to optimize well.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Add CREATE support to event triggers

On 2014-11-12 16:36:30 -0500, Robert Haas wrote:
> On Mon, Nov 10, 2014 at 9:02 PM, Jim Nasby  wrote:
> > +1. Adding columns is a PITA, you have to manually ensure you do it on all
> > slaves first.
> >
> > Drop is somewhat worse, because you have to do it on the master first,
> > opposite of the (more usual) case of adding a column.
> >
> > RENAME is a complete disaster.
> >
> > Handing scripts to your replication system to execute isn't a very good
> > alternative either; it assumes that you actually have a script (bad
> > assumption with ORMs), and that you have a reasonable way to get that script
> > to wherever you run your replication system.
> 
> I don't disagree with any of that, but running the command on the
> master and then propagating it to the slaves where it may succeed or
> fail - and if it fails, you won't know unless you're watching the logs
> on those machines, and, oh by the way, replication will also be broken
> - is not good either.

That's already the situation today with all the logical replication
solutions. They *constantly* break in the field. Most commonly because
of DDL differences.

I don't understand why you think it's likely for logical replication to
break due to this? You mean because deparse yielded a invalid statement?
In a normal single master setup there really shouldn't be scenarios
where that happens? Except bugs - but as you know we had more than in
HS/SR as well?

Or are you worried about stuff like ALTER TABLE ... USING()? I think
that's the replication solution's job to take care of/prevent.

For multimaster the situation is more complex, I agree, but I don't
think in core stuff needs to solve that for now?

We are thinking about extending 2PC to be usable across logical
decoding. That's a relatively simple patch. Then it's possible to do the
DDL on the primary, ship it to the standby, apply it there, and only
afterwards commit the prepared xact if that was successfull.  That's
quite cool - but somewhat in the remit of the replication solution.

> I think the approach to DDL
> replication that Alvaro, Andres, et al. are proposing here is
> absolutely fine - even praiseworthy - as an out-of-core solution that
> users can adopt if they are willing to accept the associated risks, as
> many users probably will be.  But you wouldn't convince me to run it
> on any production system for which I was responsible.

The solution here doesn't force you to do that, does it? It's something
that can be used by more than replication solution?

I just don't see the alternative you're proposing? I've so far not even
seen a credible *sketch* of an alternative design that also can handle
ALTER.  The only current alternatives are 1) the user inserts some
events into the queue manually. If they depend on any local state you're
screwed. If they have syntax errors they're often screwed. 2). The user
does all actions on the standby first. Then on the primary. That's hard
for ALTER ADD COLUMN and similar, and just about impossible for renaming
things.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Unintended restart after recovery error

2014-11-12 Thread Antonin Houska

Fujii Masao  wrote:

> On Wed, Nov 12, 2014 at 6:52 PM, Antonin Houska  wrote:
> > While looking at postmaster.c:reaper(), one problematic case occurred to me.
> >
> >
> > 1. Startup process signals PMSIGNAL_RECOVERY_STARTED.
> >
> > 2. Checkpointer process is forked and immediately dies.
> >
> > 3. reaper() catches this failure, calls HandleChildCrash() and thus sets
> > FatalError to true.
> >
> > 4. Startup process exits with non-zero status code too - either due to 
> > SIGQUIT
> > received from HandleChildCrash or due to some other failure of the startup
> > process itself. However, FatalError is already set, because of the previous
> > crash of the checkpointer. Thus reaper() does not set RecoveryError.
> >
> > 5. As RecoverError failed to be set to true, postmaster will try to restart
> > the cluster, although it apparently should not.
> 
> Why shouldn't postmaster restart the cluster in that case?
> 

At least for the behavior to be consistent with simpler cases of failed
recovery (e.g. any FATAL error in StartupXLOG), which end up not restarting
the cluster.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] On partitioning

On Mon, Nov 10, 2014 at 8:53 PM, Amit Langote
 wrote:
> Above commands are merely transformed into ALTER TABLE subcommands that 
> arrange
> partitioned table and partitions into inheritance hierarchy, but with extra
> information, that is, allowed values for the partition in a new anyarray 
> column
> called 'pg_inherits.values'. A special case of ATExecAddInherit() namely
> ATExecAttachPartitionI(), as part of its processing, also adds partition
> constraints in the form of appropriate CHECK constraints. So, a few of the
> manual steps are automated and additional (IMHO non-opaque) metadata (namely
> partition boundaries/list values) is added.

I thought putting the partition boundaries into pg_inherits was a
strange choice.  I'd put it in pg_class, or in pg_partition if we
decide to create that.  Maybe as anyarray, but I think pg_node_tree
might even be better.  That can also represent data of some arbitrary
type, but it doesn't enforce that everything is uniform.  So you could
have a list of objects of the form {RANGEPARTITION :lessthan {CONST
...} :partition 16982} or similar.  The relcache could load that up
and convert the list to a C array, which would then be easy to
binary-search.

As you say, you also need to store the relevant operator somewhere,
and the fact that it's a range partition rather than list or hash,
say.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Reverse Engineering - search constraints are not explicitly stated in the tables from the VIEW

nill  writes:
> I am analyzing query plans generated by the view in the database PostgreSQL
> 8.3, looking for missing information "constraints not explicitly registrants
> in the tables."

You realize of course that 8.3 is nearly 7 years old and has been out of
support for awhile.

> In the analysis of the query plan and its subplane, I can not understand
> what the parameter $0 represents, without looking the SQL query. My question
> is: looking only at the query plan product, you can understand what is the
> parameter $0?

It's a variable passed down from the outer query level.  It's true that
you can't tell which variable, in 8.3.  Less obsolete versions produce
more readable output though.  (I won't claim it's perfect; we still don't
try very hard to decompile ANY/ALL subplans.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On 2014-11-12 16:11:58 -0500, Robert Haas wrote:
> On Wed, Nov 12, 2014 at 4:10 PM, Robert Haas  wrote:
> > On Thu, Nov 6, 2014 at 9:50 AM, Peter Eisentraut  wrote:
> >> If REINDEX cannot work without an exclusive lock, we should invent some
> >> other qualifier, like WITH FEWER LOCKS.
> >
> > What he said.

I'm unconvinced. A *short* exclusive lock (just to update two pg_class
row), still gives most of the benefits of CONCURRENTLY. Also, I do think
we can get rid of that period in the not too far away future.

> But more to the point  why, precisely, can't this work without an
> AccessExclusiveLock?  And can't we fix that instead of setting for
> something clearly inferior?

It's nontrivial to fix, but I think we can fix it at some point. I just
think we should get the *major* part of the feature before investing
lots of time making it even better. There's *very* frequent questions
about having this. And people do really dangerous stuff (like manually
updating pg_class.relfilenode and such) to cope.

The problem is that it's very hard to avoid the wrong index's
relfilenode being used when swapping the relfilenodes between two
indexes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Add CREATE support to event triggers

On Mon, Nov 10, 2014 at 9:02 PM, Jim Nasby  wrote:
> +1. Adding columns is a PITA, you have to manually ensure you do it on all
> slaves first.
>
> Drop is somewhat worse, because you have to do it on the master first,
> opposite of the (more usual) case of adding a column.
>
> RENAME is a complete disaster.
>
> Handing scripts to your replication system to execute isn't a very good
> alternative either; it assumes that you actually have a script (bad
> assumption with ORMs), and that you have a reasonable way to get that script
> to wherever you run your replication system.

I don't disagree with any of that, but running the command on the
master and then propagating it to the slaves where it may succeed or
fail - and if it fails, you won't know unless you're watching the logs
on those machines, and, oh by the way, replication will also be broken
- is not good either.  We would never have shipped physical
replication solution with that kind of limitation.  What has made
streaming replication so popular and successful with PostgreSQL users
over the last five years is that, while it's a bit of a pain to get
set up, once you have it set up, it is rock-solid.  If there were a
series of legal SQL commands that you could execute without error on a
cluster of servers connected by streaming replication such that, when
you got done, replication was broken, our users would scream bloody
murder, or just stop using PostgreSQL.  I think the approach to DDL
replication that Alvaro, Andres, et al. are proposing here is
absolutely fine - even praiseworthy - as an out-of-core solution that
users can adopt if they are willing to accept the associated risks, as
many users probably will be.  But you wouldn't convince me to run it
on any production system for which I was responsible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-11-12 Thread Peter Geoghegan

On Wed, Nov 12, 2014 at 12:59 PM, Robert Haas  wrote:
> I agree with your proposed approach to moving Levenshtein into core.
> However, I think this should be separated into two patches, one of
> them moving the Levenshtein functionality into core, and the other
> adding the new treatment for missing column errors.  If you can do
> that relatively soon, I'll make an effort to get the refactoring patch
> committed in the near future.  Once that's done, we can focus in on
> the interesting part of the patch, which is the actual machinery for
> suggesting alternatives.

Okay, thanks. I think I can do that fairly soon.

> On that topic, I think there's unanimous consensus against the design
> where equally-distant matches are treated differently based on whether
> they are in the same RTE or different RTEs.  I think you need to
> change that if you want to get anywhere with this.

Alright. It wasn't as if I felt very strongly about it either way.

> On a related note,
> the use of the additional parameter AttrNumber closest[2] to
> searchRangeTableForCol() and of the additional parameters AttrNumber
> *matchedatt and int *distance to scanRTEForColumn() is less than
> self-documenting.  I suggest creating a structure called something
> like FuzzyAttrMatchState and passing a pointer to it down to both
> functions.

Sure.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On Wed, Nov 12, 2014 at 4:10 PM, Robert Haas  wrote:
> On Thu, Nov 6, 2014 at 9:50 AM, Peter Eisentraut  wrote:
>> If REINDEX cannot work without an exclusive lock, we should invent some
>> other qualifier, like WITH FEWER LOCKS.
>
> What he said.

But more to the point  why, precisely, can't this work without an
AccessExclusiveLock?  And can't we fix that instead of setting for
something clearly inferior?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

On Thu, Nov 6, 2014 at 9:50 AM, Peter Eisentraut  wrote:
> If REINDEX cannot work without an exclusive lock, we should invent some
> other qualifier, like WITH FEWER LOCKS.

What he said.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Error building the EnterpriseDB mysql_fdw on OSX

On Wed, Nov 12, 2014 at 3:48 PM, Kirk Roybal  wrote:
> I'm running 10.9.5 of OSX.
>
> I got the MySQL and PostgreSQL dependencies installed (I think).
>
> Checked out the git repo for mysql_fdw from
> git://github.com/EnterpriseDB/mysql_fdw
>
> USE_PGXS=1 make
>
> and got the error:
>
>  mysql_fdw.c
> mysql_fdw.c:153:56: error: use of undeclared identifier 'RTLD_DEEPBIND'
> mysql_dll_handle = dlopen(_MYSQL_LIBNAME, RTLD_LAZY |
> RTLD_DEEPBIND);
>
>
> Is RTLD_DEEPBIND supported on OSX?
>
> Did I do something wrong getting the dependencies together?

I don't know whether the folks who work on that FDW read this list
regularly, but they do keep an eye on issues opened in GitHub.  So you
might want to open one there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

On Sun, Nov 9, 2014 at 11:48 PM, Peter Geoghegan  wrote:
> On Fri, Nov 7, 2014 at 12:57 PM, Robert Haas  wrote:
>> Based on this review from a month ago, I'm going to mark this Waiting
>> on Author.  If nobody updates the patch in a few days, I'll mark it
>> Returned with Feedback.  Thanks.
>
> Attached revision fixes the compiler warning that Heikki complained
> about. I maintain SQL-callable stub functions from within contrib,
> rather than follow Michael's approach. In other words, very little has
> changed from my revision from July last [1].

I agree with your proposed approach to moving Levenshtein into core.
However, I think this should be separated into two patches, one of
them moving the Levenshtein functionality into core, and the other
adding the new treatment for missing column errors.  If you can do
that relatively soon, I'll make an effort to get the refactoring patch
committed in the near future.  Once that's done, we can focus in on
the interesting part of the patch, which is the actual machinery for
suggesting alternatives.

On that topic, I think there's unanimous consensus against the design
where equally-distant matches are treated differently based on whether
they are in the same RTE or different RTEs.  I think you need to
change that if you want to get anywhere with this.  On a related note,
the use of the additional parameter AttrNumber closest[2] to
searchRangeTableForCol() and of the additional parameters AttrNumber
*matchedatt and int *distance to scanRTEForColumn() is less than
self-documenting.  I suggest creating a structure called something
like FuzzyAttrMatchState and passing a pointer to it down to both
functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Error building the EnterpriseDB mysql_fdw on OSX

2014-11-12 Thread Kirk Roybal

 

I'm running 10.9.5 of OSX. 

I got the MySQL and PostgreSQL dependencies installed (I think). 

Checked out the git repo for mysql_fdw from
git://github.com/EnterpriseDB/mysql_fdw 

USE_PGXS=1 make 

and got the error: 

 mysql_fdw.c
mysql_fdw.c:153:56: error: use of undeclared identifier 'RTLD_DEEPBIND'
 mysql_dll_handle = dlopen(_MYSQL_LIBNAME, RTLD_LAZY | RTLD_DEEPBIND); 

Is RTLD_DEEPBIND supported on OSX? 

Did I do something wrong getting the dependencies together? 

/Kirk 

-- 
--
The best virus protection for Windows is fdisk, format, insert linux
disk.

Re: [HACKERS] pg_prewarm really needs some CHECK_FOR_INTERRUPTS

On 2014-11-11 12:17:11 +0100, Andres Freund wrote:
> pg_prewarm() currently can't be cannot be interrupted - which seems odd
> given that it's intended to read large amounts of data from disk. A
> rather slow process.
> 
> Unless somebody protests I'm going to add a check to the top of each of
> the three loops.

Pushed to master and 9.4.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Reverse Engineering - search constraints are not explicitly stated in the tables from the VIEW

2014-11-12 Thread nill

I am analyzing query plans generated by the view in the database PostgreSQL
8.3, looking for missing information "constraints not explicitly registrants
in the tables."
In nested queries, (ex. IN clause, ...), the query plan consist in the
evaluation of the subplane derived from clause (SELECT * ) and external
queries.
In the present case:
HashAggregate  (cost=15.14..15.16 rows=1 width=247)
 ->  Nested Loop IN Join  (cost=3.46..15.12 rows=1 width=247)
   Join Filter: (o.c_doctype_id = c_doctype.c_doctype_id)
 ->  Hash Left Join  (cost=3.46..12.38 rows=1 width=247)
   Hash Cond: (bp.c_invoiceschedule_id = si.c_invoiceschedule_id)"
   Filter: ((o.invoicerule = 'I'::bpchar) OR ((o.invoicerule =
'O'::bpchar) AND (NOT (subplan))) OR ((o.invoicerule = 'D'::bpchar) AND
(l.qtyinvoiced <> l.qtydelivered)) OR ((o.invoicerule = 'S'::bpchar) AND
(bp.c_invoiceschedule_id IS NULL)) OR ((o.invoicerule = 'S'::bpchar) AND
(bp.c_invoiceschedule_id IS NOT NULL) AND ((si.invoicefrequency IS NULL) OR
(si.invoicefrequency = 'D'::bpchar) OR (si.invoicefrequency = 'W'::bpchar)
OR ((si.invoicefrequency = 'T'::bpchar) AND
(((adempiere.trunc((o.dateordered)::timestamp with time zone) <=
(((adempiere.firstof(adempiere.getdate(), 'MM'::character
varying))::timestamp with time zone OPERATOR(adempiere.+)
si.invoicedaycutoff) - 1)) AND (adempiere.trunc(adempiere.getdate()) >=
(((adempiere.firstof((o.dateordered)::timestamp with time zone,
'MM'::character varying))::timestamp with time zone OPERATOR(adempiere.+)
si.invoiceday) - 1))) OR ((adempiere.trunc((o.dateordered)::timestamp with
time zone) <= (((adempiere.firstof(adempiere.getdate(), 'MM'::character
varying))::timestamp with time zone OPERATOR(adempiere.+)
si.invoicedaycutoff) + 14)) AND (adempiere.trunc(adempiere.getdate()) >=
(((adempiere.firstof((o.dateordered)::timestamp with time zone,
'MM'::character varying))::timestamp with time zone OPERATOR(adempiere.+)
si.invoiceday) + 14) OR ((si.invoicefrequency = 'M'::bpchar) AND
(adempiere.trunc((o.dateordered)::timestamp with time zone) <=
(((adempiere.firstof(adempiere.getdate(), 'MM'::character
varying))::timestamp with time zone OPERATOR(adempiere.+)
si.invoicedaycutoff) - 1)) AND (adempiere.trunc(adempiere.getdate()) >=
(((adempiere.firstof((o.dateordered)::timestamp with time zone,
'MM'::character varying))::timestamp with time zone OPERATOR(adempiere.+)
si.invoiceday) - 1))
  ->  Hash Join  (cost=2.44..3.87 rows=3 width=300)
Hash Cond: (l.c_order_id = o.c_order_id)
->  Seq Scan on c_orderline l  (cost=0.00..1.31 rows=25
width=141)
  Filter: (qtyordered <> qtyinvoiced)
   ->  Hash  (cost=2.40..2.40 rows=3 width=172)"
  ->  Hash Join  (cost=1.13..2.40 rows=3 width=172)
Hash Cond: (bp.c_bpartner_id = o.c_bpartner_id)
->  Seq Scan on c_bpartner bp  (cost=0.00..1.17
rows=17 width=26)
->  Hash  (cost=1.10..1.10 rows=3 width=159)
  ->  Seq Scan on c_order o 
(cost=0.00..1.10 rows=3 width=159)
Filter: (docstatus = ANY
('{CO,CL,IP}'::bpchar[]))
  ->  Hash  (cost=1.01..1.01 rows=1 width=47)
->  Seq Scan on c_invoiceschedule si  (cost=0.00..1.01
rows=1 width=47)
  SubPlan
->  Seq Scan on c_orderline zz1  (cost=0.00..1.38 rows=1
width=0)
  Filter: ((qtyordered <> qtydelivered) AND (c_order_id =
$0))
->  Seq Scan on c_doctype  (cost=0.00..2.73 rows=1 width=13)
  Filter: ((c_doctype.docbasetype = 'SOO'::bpchar) AND
(c_doctype.docsubtypeso <> ALL ('{ON,OB,WR}'::bpchar[])))

In the analysis of the query plan and its subplane, I can not understand
what the parameter $0 represents, without looking the SQL query. My question
is: looking only at the query plan product, you can understand what is the
parameter $0?

SELECT o.ad_client_id, o.ad_org_id, o.c_bpartner_id, o.c_order_id,
o.documentno, o.dateordered, o.c_doctype_id, sum((l.qtyordered -
l.qtyinvoiced) * l.priceactual) AS totallines
FROM c_order o
JOIN c_orderline l ON o.c_order_id = l.c_order_id
JOIN c_bpartner bp ON o.c_bpartner_id = bp.c_bpartner_id
LEFT JOIN c_invoiceschedule si ON bp.c_invoiceschedule_id =
si.c_invoiceschedule_id
WHERE (o.docstatus = ANY (ARRAY['CO'::bpchar, 'CL'::bpchar, 'IP'::bpchar]))
AND 
(o.c_doctype_id IN ( SELECT c_doctype.c_doctype_id
 FROM c_doctype
   WHERE c_doctype.docbasetype = 'SOO'::bpchar AND
(c_doctype.docsubtypeso <> ALL (ARRAY['ON'::bpchar, 'OB'::bpchar,
'WR'::bpchar] AND l.qtyordered <> l.qtyinvoiced 
 AND (o.invoicerule = 'I'::bpchar OR o.invoicerule =
'O'::bpchar 
 AND NOT (EXISTS ( SELECT 1 FROM c_orderline zz1
   WHERE **zz1.c_order_id =
o.c_order_id** AND zz1.qtyordered <> zz1.qtydeliv

Re: [HACKERS] group locking: incomplete patch, just for discussion

On Wed, Nov 12, 2014 at 2:57 AM, Jeff Davis  wrote:
> Trying to catch up on this thread, please excuse me if these questions
> are already covered.

Welcome to the party.  The more, the merrier.  :-)

> You mention the possibility of undetected deadlocks, which is surely
> unacceptable. But why not improve the deadlock detector? How bad are
> _detected_ deadlocks? A lot of the concern was around catalog accesses,
> but those lock requests would rarely wait anyway.

Detected deadlocks are fine.  Improving the deadlock detector is the
heart of what needs to be done here.  As you say, the lock requests
we're talking about will rarely wait, so deadlocks won't be frequent.
The issue is making sure that, if they do happen, we get a better
behavior than "your parallel query hangs forever; good luck figuring
out why".

> I also wonder if group locking is generally the wrong approach to
> parallelism. Parallel scan/sort should work by assigning workers to
> chunks of data, and then merging the results. In principle the workers
> don't need to touch the same data at all, so why are we trying so hard
> to get them to all take the same locks?
>
> The reason, I assume, is that a full table is the lockable object, but
> we are imagining the segment files as the chunks. But maybe there's a
> way to address this more directly with an extra field in the lock tag,
> and perhaps some catalog knowledge?

In the common case, we're only taking AccessShareLock on some relation
to prevent it from being concurrently dropped or rewritten.  Locking
only part of the relation wouldn't lead to any improvement in
concurrency, because the full-relation lock isn't conflicting with
anything that the partial-relation lock wouldn't also need to conflict
with.

More generally, I think there's some misunderstanding about the
overall goal of the parallelism infrastructure that I'm trying to
create.  Many people have proposed interesting strategies for how we
could do various things (locking, planning, execution) better.  Some
of those ideas are, doubtless, good ones.  But my goal is in some ways
the opposite: I'm trying to make it possible to run as much existing
PostgreSQL backend code as possible inside a parallel worker without
any modification.  To do that, I'm trying to modify the subsystems
that are widely used throughout the backend - such as locking - in
such a way that the people using that infrastructure need not put
conditional logic into their code to make it do one thing when in
parallel mode and another thing when not in parallel mode.  And I'm
also trying to avoid fundamentally changing the way major subsystems
work today as a precondition of parallelism.

So I'm taking a very surgical approach to the patches I propose.  I'm
not proposing patches that do anything fundamentally new or different;
instead, I'm proposing patches that let the stuff we already do
continue to work.  In a single backend executing a query, we have lots
of code that assumes you can allocate memory, look data up in a
syscache, throw an error, lock a buffer, examine the value of a GUC,
test a tuple against the active snapshot, and so on.
For parallel mode to be useful, those same things need to work in a
parallel worker.  We of course do not need every crazy thing somebody
may want to do in a parallel worker, nor will it.  But anything that's
done in thousands of places throughout the code had better work, or
parallelism will be so restricted as to be useless.  So commit
2bd9e412f92bc6a68f3e8bcb18e04955cc35001d, for example, is about
letting "ereport" work in a parallel worker, and this patch is about
making things like "SearchSysCache1" and "PG_GETARG_TEXT_PP" work
*reliably* in a parallel worker.

Now, the time to do new and different things is coming, and is maybe
even now not so very far away.  I'm not an expert on parallel
execution, and I'm sure there are a number of people far more skilled
than I at figuring out how to do a parallel scan, join, sort, or
whatever quickly.  What I'm trying to do is get the infrastructure to
a point where those people don't have to worry about whether the basic
facilities that we rely on throughout the backend are gonna work.

By way of comparison, think about the periodic discussions about
making the backend multi-threaded.  Since the backend relies on global
variables in an enormous number of places, it isn't thread-safe.  If
you spawn a thread inside a PostgreSQL backend, it can't safely
ereport(), palloc(), LWLockAcquire(), or just about anything else,
which means that, although anybody can write code that calls
pthread_create(), it's not useful to do so because there's practically
no existing backend code that you could safely call from the new
thread.  Using background worker processes eliminates a lot of those
problems - palloc and lwlocks just work, for example - but not all of
them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing l

Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

Fujii Masao wrote:

> --- 127,152 
>When this option is used, pg_receivexlog will 
> report
>a flush position to the server, indicating when each segment has 
> been
>synchronized to disk so that the server can remove that segment if 
> it
> !  is not otherwise needed. --synchronous option 
> must
> ! be specified when making pg_receivexlog run as
> ! synchronous standby by using replication slot. Otherwise WAL data
> ! cannot be flushed frequently enough for this to work correctly.
>   
> 
>

Whitespace damage here.

> + printf(_("  --synchronous  flush transaction log in real 
> time\n"));

"in real time" sounds odd.  How about "flush transaction log
immediately after writing", or maybe "have transaction log writes be
synchronous".

> --- 781,791 
>   now = feGetCurrentTimestamp();
>   
>   /*
> !  * Issue sync command as soon as there are WAL data which
> !  * has not been flushed yet if synchronous option is true.
>*/
>   if (lastFlushPosition < blockpos &&
> ! walfile != -1 && synchronous)

I'd put the "synchronous" condition first in the if(), and start the
comment with it rather than putting it at the end.  Both seem weird.

> --- 793,807 
>   progname, current_walfile_name, 
> strerror(errno));
>   goto error;
>   }
>   lastFlushPosition = blockpos;
> ! 
> ! /*
> !  * Send feedback so that the server sees the latest WAL 
> locations
> !  * immediately if synchronous option is true.
> !  */
> ! if (!sendFeedback(conn, blockpos, now, false))
> ! goto error;
> ! last_status = now;

I'm not clear about this comment .. why does it say "if synchronous
option is true" when it's not checking the condition?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] recovery_target_time and standby_mode

On Wed, Nov 12, 2014 at 12:12 PM, Josh Berkus  wrote:
> On 11/07/2014 02:03 PM, Josh Berkus wrote:
>> But, like I said, there's a serviceable workaround.
>
> Some update on this.  We've seen a problem in production with this setup
> which I can't reproduce as a test case, but which may jog Heikki's
> memory for something to fix.
>
> 1. Recover master to 2014-11-10 12:10:00
> 2. Recover replica to 2014-11-10 12:10:00,
>with pause_at_recovery_target
> 3. reconfigure recovery.conf for streaming replication
>and restart the replica
> 4. get a fatal error for replication, because
>the replica is ahead of the master on timeline1
>
> What *appears* to be happening is that the pause_at_recovery_target,
> followed by the restart, on the replica causes it to advance one commit
> on timeline 1.  But *not all the time*; this doesn't happen in my
> pgbench-based tests.
>
> There's a workaround for the user (they just restore the replica to 5
> minutes earlier), but I'm thinking this is a minor bug somewhere.

I'm not sure what's going on here, but keep in mind that when you
restart the replica, it's going to back up to the most recent
restartpoint and begin replication from there, not from the point it
was at when you shut down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] what does this mean: "running xacts with xcnt == 0"

On Wed, Nov 12, 2014 at 12:47 PM, Andres Freund  wrote:
>> >> > So maybe 'Encountered xl_running_xacts record with xcnt = 0.'?
>> >>
>> >> That's not very user-facing, is it -- I mean, why bother the user with
>> >> the names of structs and members thereof?  It seems better to describe
>> >> what the condition is; something like "found point in time with no
>> >> running transaction".  Maybe "point in time" should be "WAL record"
>> >> instead.
>> >
>> > Is that really a win in clarity? When analyzing a problem I'd much
>> > rather have a concrete hint than something fuzzy.
>>
>> You can't phrase error messages in terms of internal concepts that 99%
>> of users won't understand or care about.  Like Peter says, user-facing
>> error messages need to be written in English, not C.
>
> That's not the actual message, but an errdetail() - and lots of those
> refer to internals? And it's not an error, but a log message? E.g. we
> add error contexts for wal replay errors that print the internals
> literaly?  And it's *really* helpful?

Like what?  That's the only errdetail() currently in the tree that
contains ==, for example.

I see there are three different detail messages associated with this
error message.   You don't really need these messages to be phrased in
any; it's enough to have the messages be different from each other.

1. found initial snapshot in snapbuild file
2. Transaction ID %u finished; no more running transactions.
3. running xacts with xcnt == 0

The second one follows style guidelines, but the other two do not.

I suggest:

1. Logical decoding will begin using saved snapshot.
2. Transaction ID %u finished; no more running transactions.
3. There are no running transactions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] alter user set local_preload_libraries.

On Wed, Nov 5, 2014 at 1:22 AM, Peter Eisentraut  wrote:
> On 10/9/14 1:58 PM, Fujii Masao wrote:
>> Also I think that it's useful to allow ALTER ROLE/DATABASE SET to
>> set PGC_BACKEND and PGC_SU_BACKEND parameters. So, what
>> about applying the attached patch? This patch allows that and
>> changes the context of session_preload_libraries to PGC_SU_BACKEND.
>
> After looking through this again, I wonder whether there is any reason
> why ignore_system_indexes cannot be plain PGC_USERSET.  With this
> change, we'd allow setting it via ALTER ROLE, but the access to
> pg_db_role_setting happens before it.

Even without the patch, we can set ignore_system_indexes
at the startup of the connection because it's defined with
PGC_BACKEND context, but the access to system tables
can happen before that. Am I missing something?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

On Wed, Nov 12, 2014 at 12:37 PM, Andres Freund  wrote:
> Stop overdesigning this.
>
> Add it to the existing mesage and let us be done with this. This thread
> has already wasted far too much time.

That's a little harsh, but I agree.  Producing a warning here is just
going to be log-spam.  We've had this behavior for years and - to my
knowledge - we have not got one complaint that can be attributed to
this feature.  What used to happen is that VACUUM would get stuck for
minutes, hours, or days trying to vacuum a table because of an open
cursor, and people did complain about that.  It was a serious nuisance
that is now gone.  The entire argument in favor of some change here is
"even though a careful theoretical analysis indicates that this is
safe, even though it solved a real problem that was hurting our users,
and even though we have no evidence whatsoever that the change is
hurting anybody in any way whatsoever, the lack of instrumentation
means that it could possibly be hurting somebody and we wouldn't
know."

That is, of course, quite true, but you could apply the same argument
to lots of patches.  It would usually be a waste of time because most
of the patches we think are good actually ARE good - but of course,
every once in a while you would find a real problem that had been
overlooked.

Maybe the right thing here is not to make any change to the source
code *at all* but to patch his own copy and try to come up with a
reproducible test case where this is actually a problem.  Then, if
there is a problem, we could actually fix it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] what does this mean: "running xacts with xcnt == 0"

On 2014-11-12 12:40:41 -0500, Robert Haas wrote:
> On Wed, Nov 12, 2014 at 10:59 AM, Andres Freund  
> wrote:
> > On 2014-11-12 11:56:01 -0300, Alvaro Herrera wrote:
> >> Andres Freund wrote:
> >> > Hi,
> >> >
> >> > On 2014-11-12 09:03:40 -0500, Peter Eisentraut wrote:
> >> > > Could someone translate this detail message to English:
> >> > >
> >> > > ereport(LOG,
> >> > > (errmsg("logical decoding found consistent point at 
> >> > > %X/%X",
> >> > > (uint32) (lsn >> 32), (uint32) lsn),
> >> > >  errdetail("running xacts with xcnt == 0")));
> >> >
> >> > It means there a xl_running_xacts record was encountered that had xcnt =
> >> > 0 - allowing logical decoding to find a consistent start point
> >> >
> >> > > (or downgrade to debug message, if appropriate)?
> >> >
> >> > The message generally is quite relevant, as the process of finding a
> >> > consistent start point can take quite a while. we don't really have a
> >> > nice way to make errdetail() only be logged on a certain severity level
> >> > as far as I am aware off.
> >>
> >> Can we do just the errmsg() and remove with the errdetail?
> >
> > No, I really don't want to do that. When trying to see whether logical
> > replication started that's imo quite an importantdetail. Especially when
> > first seing
> > ereport(LOG,
> > (errmsg("logical decoding found initial starting 
> > point at %X/%X",
> > (uint32) (lsn >> 32), (uint32) lsn),
> >  errdetail_plural("%u transaction needs to finish.",
> >   "%u transactions 
> > need to finish.",
> >   
> > builder->running.xcnt,
> >   (uint32) 
> > builder->running.xcnt)));
> >
> > Btw, Peter, why did you add a (uint32) to one, but not both,
> > builder->running.xcnt references?
> >
> >> > So maybe 'Encountered xl_running_xacts record with xcnt = 0.'?
> >>
> >> That's not very user-facing, is it -- I mean, why bother the user with
> >> the names of structs and members thereof?  It seems better to describe
> >> what the condition is; something like "found point in time with no
> >> running transaction".  Maybe "point in time" should be "WAL record"
> >> instead.
> >
> > Is that really a win in clarity? When analyzing a problem I'd much
> > rather have a concrete hint than something fuzzy.
> 
> You can't phrase error messages in terms of internal concepts that 99%
> of users won't understand or care about.  Like Peter says, user-facing
> error messages need to be written in English, not C.

That's not the actual message, but an errdetail() - and lots of those
refer to internals? And it's not an error, but a log message? E.g. we
add error contexts for wal replay errors that print the internals
literaly?  And it's *really* helpful?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] what does this mean: "running xacts with xcnt == 0"

On Wed, Nov 12, 2014 at 10:59 AM, Andres Freund  wrote:
> On 2014-11-12 11:56:01 -0300, Alvaro Herrera wrote:
>> Andres Freund wrote:
>> > Hi,
>> >
>> > On 2014-11-12 09:03:40 -0500, Peter Eisentraut wrote:
>> > > Could someone translate this detail message to English:
>> > >
>> > > ereport(LOG,
>> > > (errmsg("logical decoding found consistent point at 
>> > > %X/%X",
>> > > (uint32) (lsn >> 32), (uint32) lsn),
>> > >  errdetail("running xacts with xcnt == 0")));
>> >
>> > It means there a xl_running_xacts record was encountered that had xcnt =
>> > 0 - allowing logical decoding to find a consistent start point
>> >
>> > > (or downgrade to debug message, if appropriate)?
>> >
>> > The message generally is quite relevant, as the process of finding a
>> > consistent start point can take quite a while. we don't really have a
>> > nice way to make errdetail() only be logged on a certain severity level
>> > as far as I am aware off.
>>
>> Can we do just the errmsg() and remove with the errdetail?
>
> No, I really don't want to do that. When trying to see whether logical
> replication started that's imo quite an importantdetail. Especially when
> first seing
> ereport(LOG,
> (errmsg("logical decoding found initial starting 
> point at %X/%X",
> (uint32) (lsn >> 32), (uint32) lsn),
>  errdetail_plural("%u transaction needs to finish.",
>   "%u transactions 
> need to finish.",
>   
> builder->running.xcnt,
>   (uint32) 
> builder->running.xcnt)));
>
> Btw, Peter, why did you add a (uint32) to one, but not both,
> builder->running.xcnt references?
>
>> > So maybe 'Encountered xl_running_xacts record with xcnt = 0.'?
>>
>> That's not very user-facing, is it -- I mean, why bother the user with
>> the names of structs and members thereof?  It seems better to describe
>> what the condition is; something like "found point in time with no
>> running transaction".  Maybe "point in time" should be "WAL record"
>> instead.
>
> Is that really a win in clarity? When analyzing a problem I'd much
> rather have a concrete hint than something fuzzy.

You can't phrase error messages in terms of internal concepts that 99%
of users won't understand or care about.  Like Peter says, user-facing
error messages need to be written in English, not C.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

On Wed, Nov 12, 2014 at 11:36 AM, Robert Haas  wrote:
> On Wed, Nov 12, 2014 at 11:19 AM, Andres Freund  
> wrote:
>> The question is whether the library is actually loaded in that case?
>> Because that normally only happens early during startup - which is why
>> it's a PGC_BACKEND guc.
>
> It looks like that does not work.
>
> [rhaas pgsql]$ PGOPTIONS='-c local_preload_libraries=auto_explain' psql
> psql (9.5devel)
> Type "help" for help.
>
> rhaas=# select * from pg_background_result(pg_background_launch('show
> auto_explain.log_min_duration')) as (x text);
> ERROR:  unrecognized configuration parameter "auto_explain.log_min_duration"
> CONTEXT:  background worker, pid 31316
>
> So, there's more to be done here.  Rats.

It turned out to be quite simple to fix both problems.  This
particular case fails because the call that loads the libraries
specified by session_preload_libraries and local_preload_libraries is
in PostgresMain() and thus never gets called by pg_background.  I
fixed that by adding that call to pg_background in the appropriate
place.  While I was at it, I added the REVOKE statements we discussed
earlier to pg_background's .sql file.

The other problem was due to this code in set_config_option:

/*
 * If a PGC_BACKEND or PGC_SU_BACKEND
parameter is changed in
 * the config file, we want to accept
the new value in the
 * postmaster (whence it will propagate to
 * subsequently-started backends), but
ignore it in existing
 * backends.  This is a tad klugy, but
necessary because we
 * don't re-read the config file
during backend start.
 *
 * In EXEC_BACKEND builds, this works
differently: we load all
 * nondefault settings from the
CONFIG_EXEC_PARAMS file during
 * backend start.  In that case we
must accept PGC_SIGHUP
 * settings, so as to have the same
value as if we'd forked
 * from the postmaster.  We detect
this situation by checking
 * IsInitProcessingMode, which is a
bit ugly, but it doesn't
 * seem worth passing down an explicit
flag saying we're doing
 * read_nondefault_variables().
 */
#ifdef EXEC_BACKEND
if (IsUnderPostmaster &&
!IsInitProcessingMode())
return -1;
#else
if (IsUnderPostmaster)
return -1;
#endif

When restoring variables via RestoreGUCState(), we need the same kind
of special-handling that we do when running in EXEC_BACKEND mode and
restoring variables via read_nondefault_variables().  Extending the
IsInitProcessingMode kludge() doesn't look appealing, so I instead
added the flag contemplated by the comment.

Updated patches attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit dc8980bdc9be89acf6d61bbfa5b1d7391992e8e8
Author: Robert Haas 
Date:   Thu Jul 17 07:58:32 2014 -0400

Add infrastructure to save and restore GUC values.

Amit Khandekar, Noah Misch, Robert Haas

V2: Fix misuse of NULL vs. NUL/zero-byte.
V2: Check return value of set_config_option.
V2: Add an explicit is_reload flag to set_config_option.

diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c
index 9a0afa4..6692bb5 100644
--- a/src/backend/commands/extension.c
+++ b/src/backend/commands/extension.c
@@ -814,11 +814,11 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,
 	if (client_min_messages < WARNING)
 		(void) set_config_option("client_min_messages", "warning",
  PGC_USERSET, PGC_S_SESSION,
- GUC_ACTION_SAVE, true, 0);
+ GUC_ACTION_SAVE, true, 0, false);
 	if (log_min_messages < WARNING)
 		(void) set_config_option("log_min_messages", "warning",
  PGC_SUSET, PGC_S_SESSION,
- GUC_ACTION_SAVE, true, 0);
+ GUC_ACTION_SAVE, true, 0, false);

 	/*
 	 * Set up the search path to contain the target schema, then the schemas
@@ -843,7 +843,7 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control,

 	(void) set_config_option("search_path", pathbuf.data,
 			 PGC_USERSET, PGC_S_SESSION,
-			 GUC_ACTION_SAVE, true, 0);
+			 GUC_ACTION_SAVE, true, 0, false);

 	/*
 	 * Set creating_extension and related variables so that
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index c0156fa..2f02303 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers

Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

On 2014-11-12 11:34:04 -0600, Jim Nasby wrote:
> On 11/11/14, 2:01 AM, Andres Freund wrote:
> >On 2014-11-10 19:36:18 -0600, Jim Nasby wrote:
> >>Towards that simple end, I'm a bit torn. My preference would be to
> >>simply log, and throw a warning if it's over some threshold. I believe
> >>that would give the best odds of getting feedback from users if this
> >>isn't as uncommon as we think.
> >
> >I'm strongly against a warning. We have absolutely no sane way of tuning
> >that. We'll just create a pointless warning that people will get
> >confused about and that they'll have to live with till the next release.
> 
> To clarify: I'm only suggesting we issue a warning if we have to skip some 
> significant number of pages; say 5 or 0.01% of the table, whichever is 
> greater. That's aimed directly at the goal of letting us know if this is 
> actually a problem or not.

Meh. You have a 5 page relation and it'll trigger quite easily. And it's
absolutely harmless.

> The reason I'm inclined to do the warning is because I don't think people 
> will notice this otherwise. If this really isn't a problem then it won't 
> matter; if it's a *big* problem then we'll at least know about it.
> 
> I'm thinking of an undocumented GUC to control the threshold, but I assume no 
> one else would be on board with that?

Stop overdesigning this.

Add it to the existing mesage and let us be done with this. This thread
has already wasted far too much time.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

On 11/11/14, 2:01 AM, Andres Freund wrote:

On 2014-11-10 19:36:18 -0600, Jim Nasby wrote:

On 11/10/14, 12:56 PM, Andres Freund wrote:

On 2014-11-10 12:37:29 -0600, Jim Nasby wrote:

On 11/10/14, 12:15 PM, Andres Freund wrote:

If what we want is to quantify the extent of the issue, would it be more
convenient to save counters to pgstat? Vacuum already sends pgstat
messages, so there's no additional traffic there.

I'm pretty strongly against that one in isolation. They'd need to be
stored somewhere and they'd need to be queryable somewhere with enough
context to make sense. To actually make sense of the numbers we'd also
need to report all the other datapoints of vacuum in some form. That's
quite a worthwile project imo - but*much* *much* more work than this.

We already report statistics on vacuums
(lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding
1 or 2 counters to that. Should we add the other counters from vacuum?
That would be significantly more data.

At the very least it'd require:
* The number of buffers skipped due to the vm
* The number of buffers actually scanned
* The number of full table in contrast to partial vacuums

If we're going to track full scan vacuums separately, I think we'd
need two separate scan counters.

Well, we already have the entire number of vacuums, so we'd have that.

I mean number of pages scanned, but as I said below I don't think that's really
necessary.

I think (for pgstats) it'd make more sense to just count initial
failure to acquire the lock in a full scan in the 'skipped page'
counter. In terms of answering the question "how common is it not to
get the lock", it's really the same event.

It's absolutely not. You need to correlate the number of skipped pages
to the number of vacuumed pages. If you have 100k skipped in 10 billion
total scanned pages it's something entirely different than 100k in 200k
pages.

If the goal here is to find out if this even is a problem then I think the critical question is not
"did we vacuum", but "were we able to acquire the lock on the first try".

Obviously users will care much more about the vacuuming and not so much about
the lock; but if this really is a non-issue as most tend to believe I don't
think it's worth worrying about any of this (except perhaps putting
dtrace/system tap probes in).

Honestly, my desire at this point is just to see if there's actually a
problem. Many people are asserting that this should be a very rare
occurrence, but there's no way to know.

Ok.

Towards that simple end, I'm a bit torn. My preference would be to
simply log, and throw a warning if it's over some threshold. I believe
that would give the best odds of getting feedback from users if this
isn't as uncommon as we think.

I'm strongly against a warning. We have absolutely no sane way of tuning
that. We'll just create a pointless warning that people will get
confused about and that they'll have to live with till the next release.

To clarify: I'm only suggesting we issue a warning if we have to skip some
significant number of pages; say 5 or 0.01% of the table, whichever is greater.
That's aimed directly at the goal of letting us know if this is actually a
problem or not.

The reason I'm inclined to do the warning is because I don't think people will
notice this otherwise. If this really isn't a problem then it won't matter; if
it's a *big* problem then we'll at least know about it.

I'm thinking of an undocumented GUC to control the threshold, but I assume no
one else would be on board with that?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

On 2014-11-12 11:36:14 -0500, Robert Haas wrote:
> On Wed, Nov 12, 2014 at 11:19 AM, Andres Freund  
> wrote:
> > The question is whether the library is actually loaded in that case?
> > Because that normally only happens early during startup - which is why
> > it's a PGC_BACKEND guc.
> 
> It looks like that does not work.
> 
> [rhaas pgsql]$ PGOPTIONS='-c local_preload_libraries=auto_explain' psql
> psql (9.5devel)
> Type "help" for help.
> 
> rhaas=# select * from pg_background_result(pg_background_launch('show
> auto_explain.log_min_duration')) as (x text);
> ERROR:  unrecognized configuration parameter "auto_explain.log_min_duration"
> CONTEXT:  background worker, pid 31316
> 
> So, there's more to be done here.  Rats.

We could just say having PGC_BACKEND guc's that aren't set to their
config files aren't supported for anything parallel. I find that a
reasonable thing - otherwise pooling of bgworkers and all that will
likely be out. And it's not that there are that many important
PGC_BACKEND gucs.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] recovery_target_time and standby_mode

2014-11-12 Thread Josh Berkus

On 11/07/2014 02:03 PM, Josh Berkus wrote:
> But, like I said, there's a serviceable workaround.

Some update on this.  We've seen a problem in production with this setup
which I can't reproduce as a test case, but which may jog Heikki's
memory for something to fix.

1. Recover master to 2014-11-10 12:10:00
2. Recover replica to 2014-11-10 12:10:00,
   with pause_at_recovery_target
3. reconfigure recovery.conf for streaming replication
   and restart the replica
4. get a fatal error for replication, because
   the replica is ahead of the master on timeline1

What *appears* to be happening is that the pause_at_recovery_target,
followed by the restart, on the replica causes it to advance one commit
on timeline 1.  But *not all the time*; this doesn't happen in my
pgbench-based tests.

There's a workaround for the user (they just restore the replica to 5
minutes earlier), but I'm thinking this is a minor bug somewhere.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

On Mon, Nov 10, 2014 at 7:19 PM,   wrote:
>> On Fri, Oct 31, 2014 at 5:46 PM,   wrote:
>> >> > We seem to be going in circles. You suggested having two options,
>> >> > --feedback, and --fsync, which is almost exactly what Furuya posted
>> >> > originally. I objected to that, because I think that user interface
>> >> is
>> >> > too complicated. Instead, I suggested having just a single option
>> >> > called --synchronous, or even better, have no option at all and
>> >> > have the server tell the client if it's participating in
>> >> > synchronous replication, and have pg_receivexlog automatically
>> >> > fsync when it is, and not otherwise [1]. That way you don't need
>> to
>> >> > expose any new options to the user. What did you think of that idea?
>> >>
>> >> I think it's pretty weird to make the fsync behavior of the client
>> is
>> >> controlled by the server.
>> >>
>> >> I also don't think that fsync() on the client side is useless in
>> >> asynchronous replication.  Yeah, it's true that there are no
>> >> *guarantees* with asynchronous replication, but the bound on how long
>> >> the data can take to get out to disk is a heck of a lot shorter if
>> >> you fsync frequently than if you don't.  And on the flip side, that
>> >> has a performance impact.
>> >>
>> >> So I actually think the design you proposed is not as good as what
>> >> was proposed by Furuya and Simon.  But I don't feel incredibly
>> >> strongly about it.
>> >
>> > Thanks for lots of comments!!
>> >
>> > I fixed the patch.
>> > As a default, it behave like a walreceiver.
>> > Same as walreceiver, it fsync and send a feedback after fsync.
>>
>> On second thought, flipping the default behavior seems not worthwhile
>> here.
>> Which might surprise the existing users and cause some troubles to them.
>> I'd like to back to the Heikki's original suggestion like just adding
>> --synchronous option. That is, only when --synchronous is specified, WAL
>> is flushed and feedback is sent back as soon as WAL is received.
>>
>> I changed the patch heavily in that way. Please find the attached patch.
>> By default, pg_receivexlog flushes WAL data only when WAL file is closed.
>> If --synchronous is specified, like the standby's WAL receiver, sync
>> commands are issued as soon as there is WAL data which has not been flushed
>> yet. Also status packets are sent back to the server just after WAL data
>> is flushed whatever --status-interval is set to. I added the description
>> of this behavior to the doc.
>>
>> Thought?
>
> I think it's as you pointed out.
> Thank you for the patch.
> I did a review of the patch.
> There was no problem.

So I'm thinking to push the patch. Does anyone object to that?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

On Wed, Nov 12, 2014 at 11:19 AM, Andres Freund  wrote:
> The question is whether the library is actually loaded in that case?
> Because that normally only happens early during startup - which is why
> it's a PGC_BACKEND guc.

It looks like that does not work.

[rhaas pgsql]$ PGOPTIONS='-c local_preload_libraries=auto_explain' psql
psql (9.5devel)
Type "help" for help.

rhaas=# select * from pg_background_result(pg_background_launch('show
auto_explain.log_min_duration')) as (x text);
ERROR:  unrecognized configuration parameter "auto_explain.log_min_duration"
CONTEXT:  background worker, pid 31316

So, there's more to be done here.  Rats.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

On 2014-11-12 11:15:26 -0500, Robert Haas wrote:
> On Mon, Nov 10, 2014 at 4:58 PM, Andres Freund  wrote:
> > No, that's not what I was thinking of. Consider:
> >
> > export pg_libdir=$(pg_config --libdir)
> > mkdir $pg_libdir/plugins
> > ln -s $pg_libdir/auto_explain.so $pg_libdir/plugins/
> > PG_OPTIONS='-c local_preload_libraries=auto_explain' psql
> >
> > and use pg_background() (or something else using this infrastructure) in
> > that context.
> > Without having reread your relevant code I'd expect a:
> > ERROR:  55P02: parameter "local_preload_libraries" cannot be set after 
> > connection start
> > because the code would try to import the "user session's"
> > local_preload_libraries values into the background process - after that
> > actually already has done its initialization.
> 
> Thanks for these pointers.  Your directions turn out to be wrong in
> detail, because it's PGOPTIONS not PG_OPTIONS, and the plugins
> directory needs to be under $(pg_config --libdir)/postgresql, not just
> $(pg_config --libdir).  But it was enough to get me on the right
> track.  Long story short, it seems to work:

Sorry, was extracting it from a larger local script...

> [rhaas ~]$ PGOPTIONS='-c local_preload_libraries=auto_explain' psql
> psql (9.5devel)
> Type "help" for help.
> 
> rhaas=# show local_preload_libraries;
>  local_preload_libraries
> -
>  auto_explain
> (1 row)
> 
> rhaas=# select * from pg_background_result(pg_background_launch('show
> local_preload_libraries')) as (x text);
>   x
> --
>  auto_explain
> (1 row)

The question is whether the library is actually loaded in that case?
Because that normally only happens early during startup - which is why
it's a PGC_BACKEND guc.

> > How do you define 'SAFE' here? Safe against concurrent SQL level
> > activity? Safe against receiving arbitrary data?
> 
> The first one.  The code powering the background worker is just as
> trustworthy as any other part of the backend, so it can be subverted
> if you load malicious C code into the backend, but not otherwise.

Ah, good. Because the latter sounds quite hard, if not impossible lest
we recreate send/recv ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

On Mon, Nov 10, 2014 at 4:58 PM, Andres Freund  wrote:
> No, that's not what I was thinking of. Consider:
>
> export pg_libdir=$(pg_config --libdir)
> mkdir $pg_libdir/plugins
> ln -s $pg_libdir/auto_explain.so $pg_libdir/plugins/
> PG_OPTIONS='-c local_preload_libraries=auto_explain' psql
>
> and use pg_background() (or something else using this infrastructure) in
> that context.
> Without having reread your relevant code I'd expect a:
> ERROR:  55P02: parameter "local_preload_libraries" cannot be set after 
> connection start
> because the code would try to import the "user session's"
> local_preload_libraries values into the background process - after that
> actually already has done its initialization.

Thanks for these pointers.  Your directions turn out to be wrong in
detail, because it's PGOPTIONS not PG_OPTIONS, and the plugins
directory needs to be under $(pg_config --libdir)/postgresql, not just
$(pg_config --libdir).  But it was enough to get me on the right
track.  Long story short, it seems to work:

[rhaas ~]$ PGOPTIONS='-c local_preload_libraries=auto_explain' psql
psql (9.5devel)
Type "help" for help.

rhaas=# show local_preload_libraries;
 local_preload_libraries
-
 auto_explain
(1 row)

rhaas=# select * from pg_background_result(pg_background_launch('show
local_preload_libraries')) as (x text);
  x
--
 auto_explain
(1 row)

This is what I expected, and I think what we want.  Background workers
enlisted for parallelism (or pg_background) need to be able to have
the same values for GUCs as the user backend that started them, and
our definition of setting things "at backend start" needs to be
expansive enough to include stuff we pull out of a dynamic shared
memory segment shortly thereafter.

>> It should end up with the same values there as the active session, not
>> the current one from postgresql.conf.  But we want to verify that's
>> the behavior we actually get.   Right?
>
> But that's also something worthwile to check.

Mmph.  There's a problem here.  I tried changing
local_preload_libraries='auto_explain' in postgresql.conf and then
sending PGC_SIGHUP.  A session started before that change does indeed
create a worker with that value still empty, but a session started
after that change also creates a worker with that value still empty.
Oops.  Not sure why that's happening yet, will investigate.

>> What would be even better is to find some way to MAKE IT SAFE to send
>> the undecoded tuple.  I'm not sure what that would look like.
>
> How do you define 'SAFE' here? Safe against concurrent SQL level
> activity? Safe against receiving arbitrary data?

The first one.  The code powering the background worker is just as
trustworthy as any other part of the backend, so it can be subverted
if you load malicious C code into the backend, but not otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Unintended restart after recovery error

On Wed, Nov 12, 2014 at 6:52 PM, Antonin Houska  wrote:
> While looking at postmaster.c:reaper(), one problematic case occurred to me.
>
>
> 1. Startup process signals PMSIGNAL_RECOVERY_STARTED.
>
> 2. Checkpointer process is forked and immediately dies.
>
> 3. reaper() catches this failure, calls HandleChildCrash() and thus sets
> FatalError to true.
>
> 4. Startup process exits with non-zero status code too - either due to SIGQUIT
> received from HandleChildCrash or due to some other failure of the startup
> process itself. However, FatalError is already set, because of the previous
> crash of the checkpointer. Thus reaper() does not set RecoveryError.
>
> 5. As RecoverError failed to be set to true, postmaster will try to restart
> the cluster, although it apparently should not.

Why shouldn't postmaster restart the cluster in that case?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] what does this mean: "running xacts with xcnt == 0"

On 2014-11-12 11:56:01 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Hi,
> > 
> > On 2014-11-12 09:03:40 -0500, Peter Eisentraut wrote:
> > > Could someone translate this detail message to English:
> > > 
> > > ereport(LOG,
> > > (errmsg("logical decoding found consistent point at 
> > > %X/%X",
> > > (uint32) (lsn >> 32), (uint32) lsn),
> > >  errdetail("running xacts with xcnt == 0")));
> > 
> > It means there a xl_running_xacts record was encountered that had xcnt =
> > 0 - allowing logical decoding to find a consistent start point
> > 
> > > (or downgrade to debug message, if appropriate)?
> > 
> > The message generally is quite relevant, as the process of finding a
> > consistent start point can take quite a while. we don't really have a
> > nice way to make errdetail() only be logged on a certain severity level
> > as far as I am aware off.
> 
> Can we do just the errmsg() and remove with the errdetail?

No, I really don't want to do that. When trying to see whether logical
replication started that's imo quite an importantdetail. Especially when
first seing
ereport(LOG,
(errmsg("logical decoding found initial starting point 
at %X/%X",
(uint32) (lsn >> 32), (uint32) lsn),
 errdetail_plural("%u transaction needs to finish.",
  "%u transactions need 
to finish.",
  builder->running.xcnt,
  (uint32) 
builder->running.xcnt)));

Btw, Peter, why did you add a (uint32) to one, but not both,
builder->running.xcnt references?

> > So maybe 'Encountered xl_running_xacts record with xcnt = 0.'?
> 
> That's not very user-facing, is it -- I mean, why bother the user with
> the names of structs and members thereof?  It seems better to describe
> what the condition is; something like "found point in time with no
> running transaction".  Maybe "point in time" should be "WAL record"
> instead.

Is that really a win in clarity? When analyzing a problem I'd much
rather have a concrete hint than something fuzzy.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW

Heikki Linnakangas  writes:
> On 11/12/2014 05:20 PM, Tom Lane wrote:
>> On reconsideration I think the "RBM_ZERO returns page already locked"
>> alternative may be the less ugly.   That has the advantage that any code
>> that doesn't get updated will fail clearly and reliably.

> Yeah, I'm leaning to that approach as well. It's made more ugly by the 
> fact that you sometimes need a cleanup lock on the buffer, so the caller 
> will somehow need to specify whether to get a cleanup lock or a normal 
> exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. 
> Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code 
> that's not updated to break even more obviously, at compile-time.

Yeah, I was considering suggesting changing the name of the mode too.
+1 for solving the lock-type problem with 2 modes.

(You could also consider leaving RBM_ZERO in place with its current
semantics, but I think what you've shown here is that there is no
safe way to use it, so probably that's not what we should do.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW


On 11/12/2014 05:20 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

On 11/12/2014 04:56 PM, Tom Lane wrote:

Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?



Anyway, you don't want to read the page from disk, just to check if it's
already valid.


Oh, good point.


(Note that when the page is already in the buffer-cache, RBM_ZERO
already doesn't zero the page. So this race condition only happens when
the page isn't in the buffer cache yet).


Right.

On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly.   That has the advantage that any code
that doesn't get updated will fail clearly and reliably.


Yeah, I'm leaning to that approach as well. It's made more ugly by the 
fact that you sometimes need a cleanup lock on the buffer, so the caller 
will somehow need to specify whether to get a cleanup lock or a normal 
exclusive lock. Maybe add yet another mode, RBM_ZERO_WITH_CLEANUP_LOCK. 
Could also rename RBM_ZERO to e.g. RBM_ZERO_AND_LOCK, to make any code 
that's not updated to break even more obviously, at compile-time.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW


On 11/12/2014 05:08 PM, Robert Haas wrote:

On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas
 wrote:

2. When ReadBufferExtended doesn't find the page in cache, it returns the
buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the
caller to call a second function, after locking the page, to finish the I/O.


This seems like a reasonable approach.

If you tilt your head the right way, zeroing a page and restoring a
backup block are the same thing: either way, you want to "read" the
block into shared buffers without actually reading it, so that you can
overwrite the prior contents with something else.  So, you could fix
this by adding a new mode, RBM_OVERWRITE, and passing the new page
contents as an additional argument to ReadBufferExtended, which would
then memcpy() that data into place where RBM_ZERO calls MemSet() to
zero it.


Yes, that would be quite a clean API. However, there's a problem with 
locking, when the redo routine modifies multiple pages. Currently, you 
lock the page first, and replace the page with the new contents while 
holding the lock. With RBM_OVERWRITE, the new page contents would sneak 
into the buffer before RestoreBackupBlock has acquired the lock on the 
page, and another backend might pin and lock the page before 
RestoreBackupBlock does. The page contents would be valid, but they 
might not be consistent with other buffers yet. The redo routine might 
be doing an atomic operation that spans multiple pages, by holding the 
locks on all the pages until it's finished with all the changes, but the 
backend would see a partial result.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW

Heikki Linnakangas  writes:
> On 11/12/2014 04:56 PM, Tom Lane wrote:
>> Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
>> in terms of handling error conditions, but does not forcibly zero the page
>> if it's already valid?

> Anyway, you don't want to read the page from disk, just to check if it's 
> already valid.

Oh, good point.

> (Note that when the page is already in the buffer-cache, RBM_ZERO 
> already doesn't zero the page. So this race condition only happens when 
> the page isn't in the buffer cache yet).

Right.

On reconsideration I think the "RBM_ZERO returns page already locked"
alternative may be the less ugly.  That has the advantage that any code
that doesn't get updated will fail clearly and reliably.  With the other
thing, you need some additional back channel for the caller to know
whether it must complete the I/O, and it's not obvious what will happen
if it fails to do that (except that it will be bad).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On 2014-11-12 10:13:18 -0500, Robert Haas wrote:
> On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund  wrote:
> > The more important thing here is that I see little chance of this
> > getting in before Heikki's larger rework of the wal format gets
> > in. Since that'll change everything around anyay I'm unsure how much
> > point there is to iterate till that's done. I know that sucks, but I
> > don't see much of an alternative.
> 
> Why not do this first?  Heikki's patch seems quite far from being
> ready to commit at this point - it significantly increases WAL volume
> and reduces performance.  Heikki may well be able to fix that, but I
> don't know that it's a good idea to make everyone else wait while he
> does.

Because it imo builds the infrastructure to do the compression more
sanely. I.e. provide proper space to store information about the
compressedness of the blocks and such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW


On 11/12/2014 04:56 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

There's a race condition between a backend running queries in hot
standby mode, and restoring a full-page image from a WAL record. It's
present in all supported versions.



I can think of two ways to fix this:



1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before
returning it. That makes the API inconsistent, as the function would
sometimes lock the page, and sometimes not.


Ugh.


2. When ReadBufferExtended doesn't find the page in cache, it returns
the buffer in !BM_VALID state (i.e. still in I/O in-progress state).
Require the caller to call a second function, after locking the page, to
finish the I/O.


Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?


Isn't that exactly what RBM_ZERO_ONERROR does?

Anyway, you don't want to read the page from disk, just to check if it's 
already valid. We stopped doing that in 8.2 (commit 
8c3cc86e7b688b0efe5ec6ce4f4342c2883b1db5), and it gave a big speedup to 
recovery.


(Note that when the page is already in the buffer-cache, RBM_ZERO 
already doesn't zero the page. So this race condition only happens when 
the page isn't in the buffer cache yet).


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund  wrote:
> The more important thing here is that I see little chance of this
> getting in before Heikki's larger rework of the wal format gets
> in. Since that'll change everything around anyay I'm unsure how much
> point there is to iterate till that's done. I know that sucks, but I
> don't see much of an alternative.

Why not do this first?  Heikki's patch seems quite far from being
ready to commit at this point - it significantly increases WAL volume
and reduces performance.  Heikki may well be able to fix that, but I
don't know that it's a good idea to make everyone else wait while he
does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW

On Wed, Nov 12, 2014 at 7:39 AM, Heikki Linnakangas
 wrote:
> 2. When ReadBufferExtended doesn't find the page in cache, it returns the
> buffer in !BM_VALID state (i.e. still in I/O in-progress state). Require the
> caller to call a second function, after locking the page, to finish the I/O.

This seems like a reasonable approach.

If you tilt your head the right way, zeroing a page and restoring a
backup block are the same thing: either way, you want to "read" the
block into shared buffers without actually reading it, so that you can
overwrite the prior contents with something else.  So, you could fix
this by adding a new mode, RBM_OVERWRITE, and passing the new page
contents as an additional argument to ReadBufferExtended, which would
then memcpy() that data into place where RBM_ZERO calls MemSet() to
zero it.

I'm not sure whether we want to complicate the API that way, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race condition between hot standby and restoring a FPW

Heikki Linnakangas  writes:
> There's a race condition between a backend running queries in hot 
> standby mode, and restoring a full-page image from a WAL record. It's 
> present in all supported versions.

> I can think of two ways to fix this:

> 1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before 
> returning it. That makes the API inconsistent, as the function would 
> sometimes lock the page, and sometimes not.

Ugh.

> 2. When ReadBufferExtended doesn't find the page in cache, it returns 
> the buffer in !BM_VALID state (i.e. still in I/O in-progress state). 
> Require the caller to call a second function, after locking the page, to 
> finish the I/O.

Not great either.  What about an RBM_NOERROR mode that is like RBM_ZERO
in terms of handling error conditions, but does not forcibly zero the page
if it's already valid?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] what does this mean: "running xacts with xcnt == 0"

Andres Freund wrote:
> Hi,
> 
> On 2014-11-12 09:03:40 -0500, Peter Eisentraut wrote:
> > Could someone translate this detail message to English:
> > 
> > ereport(LOG,
> > (errmsg("logical decoding found consistent point at %X/%X",
> > (uint32) (lsn >> 32), (uint32) lsn),
> >  errdetail("running xacts with xcnt == 0")));
> 
> It means there a xl_running_xacts record was encountered that had xcnt =
> 0 - allowing logical decoding to find a consistent start point
> 
> > (or downgrade to debug message, if appropriate)?
> 
> The message generally is quite relevant, as the process of finding a
> consistent start point can take quite a while. we don't really have a
> nice way to make errdetail() only be logged on a certain severity level
> as far as I am aware off.

Can we do just the errmsg() and remove with the errdetail?

> So maybe 'Encountered xl_running_xacts record with xcnt = 0.'?

That's not very user-facing, is it -- I mean, why bother the user with
the names of structs and members thereof?  It seems better to describe
what the condition is; something like "found point in time with no
running transaction".  Maybe "point in time" should be "WAL record"
instead.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] what does this mean: "running xacts with xcnt == 0"

Hi,

On 2014-11-12 09:03:40 -0500, Peter Eisentraut wrote:
> Could someone translate this detail message to English:
> 
> ereport(LOG,
> (errmsg("logical decoding found consistent point at %X/%X",
> (uint32) (lsn >> 32), (uint32) lsn),
>  errdetail("running xacts with xcnt == 0")));

It means there a xl_running_xacts record was encountered that had xcnt =
0 - allowing logical decoding to find a consistent start point

> (or downgrade to debug message, if appropriate)?

The message generally is quite relevant, as the process of finding a
consistent start point can take quite a while. we don't really have a
nice way to make errdetail() only be logged on a certain severity level
as far as I am aware off.

So maybe 'Encountered xl_running_xacts record with xcnt = 0.'?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] what does this mean: "running xacts with xcnt == 0"

2014-11-12 Thread Peter Eisentraut

Could someone translate this detail message to English:

ereport(LOG,
(errmsg("logical decoding found consistent point at %X/%X",
(uint32) (lsn >> 32), (uint32) lsn),
 errdetail("running xacts with xcnt == 0")));

(or downgrade to debug message, if appropriate)?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Failback to old master

2014-11-12 Thread Ants Aasma

On Tue, Nov 11, 2014 at 11:52 PM, Maeldron T.  wrote:
> As far as I remember (I can’t test it right now but I am 99% sure) promoting 
> the slave makes it impossible to connect the old master to the new one 
> without making a base_backup. The reason is the timeline change. It complains.

A safely shut down master (-m fast is safe) can be safely restarted as
a slave to the newly promoted master. Fast shutdown shuts down all
normal connections, does a shutdown checkpoint and then waits for this
checkpoint to be replicated to all active streaming clients. Promoting
slave to master creates a timeline switch, that prior to version 9.3
was only possible to replicate using the archive mechanism. As of
version 9.3 you don't need to configure archiving to follow timeline
switches, just add a recovery.conf to the old master to start it up as
a slave and it will fetch everything it needs from the new master.

In case of a unsafe shut down (crash) it is possible that you have WAL
lying around that was not streamed out to the slave. In this case the
old master will request recovery from a point after the timeline
switch and the new master will reply with an error. So it is safe to
try re-adding a crashed master as a slave, but this might fail.
Success is more likely when the whole operating system went down, as
then it's somewhat likely that any WAL got streamed out before it made
it to disk.

In general my suggestion is to avoid slave promotion by removal of
recovery.conf, it's too easy to get confused and end up with hard to
diagnose data corruption.

In your example, if for example B happens to disconnect at WAL
position x1 and remains disconnected while shutdown on A occurred at
WAL position x2 it will be missing the WAL interval A(x1..x2). Now B
is restarted as master from position x1, generates some new WAL past
x2, then A is restarted as slave and starts streaming at x2 as to the
best of it's knowledge that was where things left off. At this point
the slave A is corrupted, you have x1..x2 changes from A that are not
on the master and are also missing some changes that are on the
master. Wrong data and/or crashes ensue.

Always use the promotion mechanism because then you are likely to get
errors when something is screwy. Unfortunately it's still possible to
end up in a corrupted state with no errors, as timeline identifiers
are sequential integers, not GUID's, but at least it's significantly
harder.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] tracking commit timestamps

2014-11-12 Thread Petr Jelinek


On 10/11/14 14:53, Robert Haas wrote:

On Mon, Nov 10, 2014 at 8:39 AM, Petr Jelinek  wrote:

I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes
per record, I am inclined to just say we can live with that.


If you do it as 20 bytes, you'll have to do some work to squeeze out
the alignment padding.  I'm inclined to think it's fine to have a few
extra padding bytes here; someone might want to use those for
something in the future, and they probably don't cost much.



I did get around the alignment via memcpy, so it is still 20bytes.


Since we agreed that the (B) case is not really feasible and we are doing
the (C), I also wonder if extradata should be renamed to nodeid (even if
it's not used at this point as nodeid). And then there is question about the
size of it, since the nodeid itself can live with 2 bytes probably ("64k of
nodes ought to be enough for everybody" ;) ).
Or leave the extradata as is but use as reserved space for future use and
not expose it at this time on SQL level at all?


I vote for calling it node-ID, and for allowing at least 4 bytes for
it.  Penny wise, pound foolish.



Ok, I went this way.

Anyway here is v8 version of the patch, I think I addressed all the 
concerns mentioned, it's also rebased against current master (BRIN 
commit added some conflicts).


Brief list of changes:
 - the commit timestamp record now stores timestamp, lsn and nodeid
 - added code to disallow turning track_commit_timestamp on with too 
small pagesize

 - the get interfaces error out when track_commit_timestamp is off
 - if the xid passed to get interface is out of range -infinity 
timestamp is returned (I think it's bad idea to throw errors here as the 
valid range is not static and same ID can start throwing errors between 
calls theoretically)
 - renamed the sql interfaces to pg_xact_commit_timestamp, 
pg_xact_commit_timestamp_data and pg_last_committed_xact, they don't 
expose the nodeid atm, I personally am not big fan of the "xact" but it 
seems more consistent with existing naming
 - documented pg_resetxlog changes and make all the pg_resetxlog 
options alphabetically ordered
 - committs is not used anymore, it's commit_ts (and CommitTs in 
camelcase), I think it's not really good idea to spell the timestamp 
everywhere as some interface then get 30+ chars long names...

 - added WAL logging of the track_commit_timestamp GUC
 - added alternative expected output of the regression test so that it 
works with make installcheck when track_commit_timestamp is on

 - added C interface to set default nodeid for current backend
 - several minor comment and naming adjustments mostly suggested by Michael


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..e331297 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -50,6 +50,7 @@ SUBDIRS = \
 		spi		\
 		tablefunc	\
 		tcn		\
+		test_committs	\
 		test_decoding	\
 		test_parser	\
 		test_shm_mq	\
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..e0af3cf 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/committs.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/contrib/test_committs/.gitignore b/contrib/test_committs/.gitignore
new file mode 100644
index 000..1f95503
--- /dev/null
+++ b/contrib/test_committs/.gitignore
@@ -0,0 +1,5 @@
+# Generated subdirectories
+/log/
+/isolation_output/
+/regression_output/
+/tmp_check/
diff --git a/contrib/test_committs/Makefile b/contrib/test_committs/Makefile
new file mode 100644
index 000..2240749
--- /dev/null
+++ b/contrib/test_committs/Makefile
@@ -0,0 +1,45 @@
+# Note: because we don't tell the Makefile there are any regression tests,
+# we have to clean those result files explicitly
+EXTRA_CLEAN = $(pg_regress_clean_files) ./regression_output ./isolation_output
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+in

Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai  wrote:
> Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?

There are certainly things that are parallel-safe that are not
immutable.  It might be the case that everything immutable is
parallel-safe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Race condition between hot standby and restoring a FPW

There's a race condition between a backend running queries in hot 
standby mode, and restoring a full-page image from a WAL record. It's 
present in all supported versions.


RestoreBackupBlockContents does this:


buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,

RBM_ZERO);
Assert(BufferIsValid(buffer));
if (get_cleanup_lock)
LockBufferForCleanup(buffer);
else
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);


If the page is not in buffer cache yet, and a backend reads and locks 
the buffer after the above XLogReadBufferExtended call has zeroed it, 
but before it has locked it, the backend sees an empty page.


The principle of fixing that is straightforward: the zeroed page should 
not be visible to others, even momentarily. Unfortunately, I think 
that's going to require an API change to ReadBufferExtended(RBM_ZERO) :-(.


I can think of two ways to fix this:

1. Have ReadBufferExtended lock the page in RBM_ZERO mode, before 
returning it. That makes the API inconsistent, as the function would 
sometimes lock the page, and sometimes not.


2. When ReadBufferExtended doesn't find the page in cache, it returns 
the buffer in !BM_VALID state (i.e. still in I/O in-progress state). 
Require the caller to call a second function, after locking the page, to 
finish the I/O.


Anyone have a better idea?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index