Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Kyotaro HORIGUCHI
At Thu, 30 Aug 2018 18:48:55 -0700, Michael Paquier  wrote 
in <20180831014855.gj15...@paquier.xyz>
> On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> > Please wait a bit.. I have a concern about this.
> 
> Sure, please feel free.

Thanks.

I looked through the patch and related code and have a concern.

The patch prevents updateMinRecoveryPoint from being turned off in any
process other than the startup process running crash recovery, except
at the end of XLogNeedsFlush.

> /*
>  * Check minRecoveryPoint for any other process than the startup
>  * process doing crash recovery, which should not update the control
>  * file value if crash recovery is still running.
>  */
> if (XLogRecPtrIsInvalid(minRecoveryPoint))
>   updateMinRecoveryPoint = false;

Even if a process other than the startup process calls the function
during crash recovery, it has no means to turn updateMinRecoveryPoint
back on. Actually, the only process that calls the function during
crash recovery is the startup process; bgwriter and checkpointer
don't, and hot-standby backends come in only after crash recovery
ends. So we just won't have an invalid minRecoveryPoint value there,
and updateMinRecoveryPoint must be true.

Other than that, I didn't find a problem. Please find the attached
patch. I also have not come up with a way to check this case
automatically.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..74692a128d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery.  We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too. The
 	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-	 * updated until crash recovery finishes.
+	 * updated until crash recovery finishes.  We only do this for the startup
+	 * process as it should not update its own reference of minRecoveryPoint
+	 * until it has finished crash recovery to make sure that all WAL
+	 * available is replayed in this case.  This also saves from extra locks
+	 * taken on the control file from the startup process.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 	{
 		updateMinRecoveryPoint = false;
 		return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	minRecoveryPoint = ControlFile->minRecoveryPoint;
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-	if (force || minRecoveryPoint < lsn)
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+	else if (force || minRecoveryPoint < lsn)
 	{
 		XLogRecPtr	newMinRecoveryPoint;
 		TimeLineID	newMinRecoveryPointTLI;
@@ -3124,12 +3130,15 @@ XLogNeedsFlush(XLogRecPtr record)
 	if (RecoveryInProgress())
 	{
 		/*
-		 * An invalid minRecoveryPoint means that we need to recover all the
-		 * WAL, i.e., we're doing crash recovery.  We never modify the control
-		 * file's value in that case, so we can short-circuit future checks
-		 * here too.
+		 * An invalid minRecoveryPoint on the startup process means that we
+		 * need to recover all the WAL, i.e., we're doing crash recovery.  We
+		 * never modify the control file's value in that case, so we can
+		 * short-circuit future checks here too.  This triggers a quick exit
+		 * path for the startup process, which cannot update its local copy of
+		 * minRecoveryPoint as long as it has not replayed all WAL available
+		 * when doing crash recovery.
 		 */
-		if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 			updateMinRecoveryPoint = false;
 
 		/* Quick exit if already known to be updated or cannot be updated */
@@ -3146,6 +3155,12 @@ XLogNeedsFlush(XLogRecPtr record)
 		minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 		LWLockRelease(ControlFileLock);
 
+		/*
+	 * No process other than the startup process reaches here before crash
+	 * recovery ends, so minRecoveryPoint must have a valid value here.
+		 */
+		Assert(minRecoveryPoint != InvalidXLogRecPtr);
+
 		/* check again */
 		return record > minRecoveryPoint;
 	}


Re: Copy function for logical replication slots

2018-08-30 Thread Masahiko Sawada
On Wed, Aug 29, 2018 at 9:39 AM, Masahiko Sawada  wrote:
> On Tue, Aug 28, 2018 at 10:34 PM, Michael Paquier  wrote:
>> On Tue, Aug 28, 2018 at 04:14:04PM +0900, Masahiko Sawada wrote:
>>> I think copying from a slot that has already reserved WAL would be
>>> helpful for backup cases (maybe you suggested that?). Also, either way
>>> we need safe logic for acquiring and releasing the source slot in the
>>> logical slot case. Or did you mean to restrict copying of a slot that
>>> doesn't reserve WAL?
>>
>> I mean the latter, as there is no actual point in being able to copy a
>> slot which does *not* reserve WAL.
>
> Agreed. I'll restrict that case in the next version
>
>>
 Does it actually make sense to allow copy of temporary slots or change
 their persistence?  Those don't live across sessions so they'd need to
 be copied in the same session which created them.
>>>
>>> I think the copying of temporary slots would be an impracticable
>>> feature but the changing their persistence might be helpful for some
>>> cases, especially copying from persistent to temporary.
>>
>> The session doing the copy of a permanent slot to the temporary slot has
>> to be the one also consuming it as the session which created the slot
>> owns it, and the slot would be dropped when the session ends.  For
>> logical slots perhaps you have something in mind?  Like copying a slot
>> which is not active to check where it is currently replaying, and using
>> the copy for sanity checks?
>
> Yeah, I imagined such case. If we want to do backup/logical decoding
> from the same point as the source permanent slot we might want to use
> temporary slots so that it will be dropped surely after that.
>

Attached is a new version of the patch incorporating all the comments I got.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v6-0001-Copy-function-for-logical-and-physical-replicatio.patch
Description: Binary data


Re: TupleTableSlot abstraction

2018-08-30 Thread Amit Khandekar
On 28 August 2018 at 22:43, Ashutosh Bapat
 wrote:
> On Fri, Aug 24, 2018 at 6:46 AM, Andres Freund  wrote:
>>
>>> -/*
>>> - * slot_getsysattr
>>> - *   This function fetches a system attribute of the slot's 
>>> current tuple.
>>> - *   Unlike slot_getattr, if the slot does not contain system 
>>> attributes,
>>> - *   this will return false (with a NULL attribute value) instead 
>>> of
>>> - *   throwing an error.
>>> - */
>>> -bool
>>> -slot_getsysattr(TupleTableSlot *slot, int attnum,
>>> - Datum *value, bool *isnull)
>>> -{
>>> - HeapTuple   tuple = slot->tts_tuple;
>>> -
>>> - Assert(attnum < 0); /* else caller error */
>>> - if (tuple == NULL ||
>>> - tuple == &(slot->tts_minhdr))
>>> - {
>>> - /* No physical tuple, or minimal tuple, so fail */
>>> - *value = (Datum) 0;
>>> - *isnull = true;
>>> - return false;
>>> - }
>>> - *value = heap_getsysattr(tuple, attnum, slot->tts_tupleDescriptor, 
>>> isnull);
>>> - return true;
>>> -}
>>
>> I think I was wrong at saying that we should remove this. I think you
>> were right that it should become a callback...
>
> We have replaced all slot_getsysattrs() with heap_getsysattr(). Do you
> want to reinstantiate those as well? If so, slot_getsysattr() becomes
> a wrapper around the getsysattr() callback.

One option is that the getsysattr() callback function returns false if
system attributes are not supported for that slot type. The other option
is that in the not-supported case the function errors out, meaning that
the caller must be aware whether the slot type is one that supports
system attributes.

I had prepared changes for the first option, but Ashutosh Bapat made me
realize offlist that it's worth considering the second option (i.e.
erroring out).

The only use case where slot_getsysattr() is called right now is
execCurrentOf(), and it currently checks the bool return value of
slot_getsysattr() and prints a user-friendly message if false is
returned. Looking at the code (commit 8f5ac440430ab), it seems that we
want to continue to have a user-friendly message for some unhandled
cases like custom scans. So perhaps for now it's ok to go with the
first option where getsysattr callback returns false for slot types
that don't have system attributes.
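
To make the first option concrete, here is a hypothetical sketch of such a
callback for a slot type that has no system attributes (the callback name and
signature are assumptions modelled on the removed slot_getsysattr() quoted
above, not something taken from the patch):

static bool
tts_virtual_getsysattr(TupleTableSlot *slot, int attnum,
					   Datum *value, bool *isnull)
{
	Assert(attnum < 0);			/* system attributes have negative attnums */

	/* This slot type carries no physical tuple, hence no system attributes. */
	*value = (Datum) 0;
	*isnull = true;
	return false;				/* caller may then emit a friendly error */
}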

>
>>
>>
>>> +/*
>>> + * This is a function used by all getattr() callbacks which deal with a 
>>> heap
>>> + * tuple or some tuple format which can be represented as a heap tuple 
>>> e.g. a
>>> + * minimal tuple.
>>> + *
>>> + * heap_getattr considers any attnum beyond the attributes available in the
>>> + * tuple as NULL. This function however returns the values of missing
>>> + * attributes from the tuple descriptor in that case. Also this function 
>>> does
>>> + * not support extracting system attributes.
>>> + *
>>> + * If the attribute needs to be fetched from the tuple, the function fills 
>>> in
>>> + * tts_values and tts_isnull arrays upto the required attnum.
>>> + */
>>> +Datum
>>> +tts_heap_getattr_common(TupleTableSlot *slot, HeapTuple tuple, uint32 
>>> *offp,
>>> + int attnum, bool *isnull)
>>> +{
>>> + HeapTupleHeader tup = tuple->t_data;
>>> + Assert(slot->tts_nvalid < attnum);
>>> +
>>> + Assert(attnum > 0);
>>> +
>>> + if (attnum > HeapTupleHeaderGetNatts(tup))
>>> + return getmissingattr(slot->tts_tupleDescriptor, attnum, 
>>> isnull);
>>> +
>>> + /*
>>> +  * check if target attribute is null: no point in groveling through 
>>> tuple
>>> +  */
>>> + if (HeapTupleHasNulls(tuple) && att_isnull(attnum - 1, tup->t_bits))
>>> + {
>>> + *isnull = true;
>>> + return (Datum) 0;
>>> + }
>>
>> I still think this is an optimization with a negative benefit,
>> especially as it requires an extra callback. We should just rely on
>> slot_deform_tuple and then access that. That'll also just access the
>> null bitmap for the relevant column, and it'll make successive accesses
>> cheaper.
>
> I don't understand why we have differing implementations for
> slot_attisnull(), slot_getsomeattrs(), slot_getattr() in HEAD. If what
> you are saying is true, we should have implemented the first and the
> last as a call to slot_getsomeattrs() followed by returning values from
> tts_values and tts_isnull. Since this is refactoring work, I am trying
> not to change the existing functionality of those functions.

I agree that we should not change the way in which slot_getattr()
finds the attr, i.e. first call att_isnull(), and only then try to
deform the tuple. I mean, there must be some good reason it is done
that way on HEAD. Maybe we can change this separately after
investigation, but not as part of the tuple abstraction patch.

--

BTW, on HEAD, slot_getattr() returns a null datum for a dropped attribute,
which hasn't been done in the patch.
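
For reference, the dropped-attribute handling on HEAD looks roughly like the
following (quoted from memory, so treat the exact comment wording as an
assumption):

	/*
	 * If the attribute's column has been dropped, force a NULL result.
	 * This shouldn't happen in normal use, but it can if a plan was
	 * cached before the column was dropped.
	 */
	if (TupleDescAttr(tupleDesc, attnum - 1)->attisdropped)
	{
		*isnull = true;
		return (Datum) 0;
	}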

Re: Add a semicolon to query related to search_path

2018-08-30 Thread Tatsuro Yamada

On 2018/08/31 2:28, Peter Eisentraut wrote:

On 17/08/2018 05:32, Tatsuro Yamada wrote:

Hi Robert,

On 2018/08/17 4:32, Robert Haas wrote:

On Thu, Aug 16, 2018 at 1:20 AM, Tatsuro Yamada
 wrote:

As you can see, queries with and without a semicolon are mixed, which makes
it hard to tell where each query ends. This is neither beautiful nor
user-friendly, don't you think?


I agree with you.


Thanks!
This fix is not a noteworthy new feature, but I believe it will improve
readability for users.  I'll add the patch to the commitfest.


committed


Hi Peter,

Thank you!


Regards,
Tatsuro Yamada





Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Michael Paquier
On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
> Please wait a bit.. I have a concern about this.

Sure, please feel free.
--
Michael


signature.asc
Description: PGP signature


Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Kyotaro HORIGUCHI
At Thu, 30 Aug 2018 11:57:05 -0700, Michael Paquier  wrote 
in <20180830185705.gf15...@paquier.xyz>
> On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> > 2018-08-30 19:34 GMT+02:00 Michael Paquier :
> >> I have been struggling for a couple of hours to get a deterministic test
> >> case out of my pocket, and I did not get one as you would need to get
> >> the bgwriter to flush a page before crash recovery finishes, we could do
> > 
> > In my case the active standby server had crashed; it wasn't in
> > crash recovery mode.
> 
> That's what I meant, a standby crashed and then restarted, doing crash
> recovery before moving on with archive recovery once it was done with
> all its local WAL.
> 
> > Minimum recovery ending location is AB3/4A1B3118, but at the same time
> > I managed to find pages from 00050AB30053 on disk (at
> > least in the index files). That could only mean that bgwriter was
> > flushing dirty pages, but pg_control wasn't properly updated and it
> > happened not during recovery after hardware crash, but while the
> > postgres was running before the hardware crash.
> 
> Exactly, that would explain the incorrect reference.
> 
> > The only possible way to recover such a standby is to cut off all
> > possible connections and let it replay all the WAL files it managed to
> > write to disk before the first crash.
> 
> Yeah...  I am going to apply the patch after another lookup; that will
> fix the problem moving forward.  Thanks for checking, by the way.

Please wait a bit.. I have a concern about this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 07:37:37PM -0400, Tom Lane wrote:
> Some of these are safe, I think, because the buffers are only used as
> targets for read() and write().  But some are definitely broken.

Yes, I have not spent more than a couple of minutes on this issue.  I
noticed some of them easily though.

> My own list of files that seem to have issues is
> 
> blinsert.c
> generic_xlog.c
> ginentrypage.c
> hashpage.c
> pg_verify_checksums.c
> pg_waldump.c
> xloginsert.c
> 
> The fact that some of these are pretty old and we've not noticed is
> not good.  It suggests that we don't currently have any compilers in the
> buildfarm that under-align char[] arrays on the stack, which seems like
> a gotcha waiting to bite us.  I wonder if there is any way to persuade
> some compiler on a non-Intel box to do that.

Agreed, that's annoying.

> Anyway, I'll work on a patch for that, unless you were on it already?

I have begun working on that and am less than halfway through it, as I
needed a fresh problem; however, I am not sure I will be able to finish
it today, perhaps tomorrow...  If you have time and want to press on, as
11 will release soon, of course feel free to wrap things up more quickly
than I can.
--
Michael


signature.asc
Description: PGP signature


Re: Stored procedures and out parameters

2018-08-30 Thread Chapman Flack
On 08/30/18 15:35, Robert Haas wrote:
> On Tue, Aug 28, 2018 at 6:30 AM, Peter Eisentraut
>  wrote:
>> CALL compatible with the SQL standard.  For example, if you have a
>> function f1(IN a int, OUT b int), you would call it as SELECT f1(x)
>> and the "b" would somehow be the return value.  But a procedure call
>> would be CALL p1(x, y), where x and y could be, say, PL/pgSQL
>> variables.

I suppose the key question for most driver writers is going to be,
what does that difference look like at the fe-be protocol level?
PL/pgSQL might be an unrepresentative example for that question,
as it lives in the backend and could have some other way of retrieving
b to store in y. For any remote client, the result still needs to get
back there before the client can apply any "this result gets assigned
to my y variable" semantics, and is there any material difference between
the protocol message sequences that return these results

  select foo(1,2);
  select * from foo(1,2);
  call bar(1,2);

to the client? And, in the parallel universe where functions got
implemented according to the standard, what in that picture would
be different?

-Chap



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Aug 30, 2018 at 10:39:26AM -0400, Tom Lane wrote:
>> (The right fix, of course, is to malloc the work buffer rather than
>> put it on the stack.)

> pg_upgrade/file.c is careful about that (5afcd2a), and has a comment on
> the matter, as does pg_standby.c.

> Now, grepping around for "BLCKSZ]", some garbage may have accumulated?

Ah, that's a cute idea for finding trouble spots.

> - entrySplitPage in ginentrypage.c
> - rewind_copy_file_range in file.c
> - _hash_alloc_buckets in hashpage.c
> - pg_prewarm.c
> - blinsert.c
> - pg_waldump.c
> - logtape.c

Some of these are safe, I think, because the buffers are only used as
targets for read() and write().  But some are definitely broken.
My own list of files that seem to have issues is

blinsert.c
generic_xlog.c
ginentrypage.c
hashpage.c
pg_verify_checksums.c
pg_waldump.c
xloginsert.c

The fact that some of these are pretty old and we've not noticed is
not good.  It suggests that we don't currently have any compilers in the
buildfarm that under-align char[] arrays on the stack, which seems like
a gotcha waiting to bite us.  I wonder if there is any way to persuade
some compiler on a non-Intel box to do that.

Anyway, I'll work on a patch for that, unless you were on it already?

regards, tom lane



Re: some pg_dump query code simplification

2018-08-30 Thread Stephen Frost
Greetings,

* Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote:
> On 08/28/2018 06:05 PM, Tom Lane wrote:
> >Dunno about the idea of running the pg_dump TAP tests against back
> >branches.  I find that code sufficiently unreadable that maintaining
> >several more copies of it doesn't sound like fun at all.
> 
> Agreed. The code could do with a lot of comments. I recently looked at
> adding something to it and decided I had more pressing things to do.

I'm happy to add more comments to it..  There's a pretty good block that
tries to describe how the tests work above where the tests are actually
defined.  What would help?  Should I include an example in that code
block?  Or are you looking for more comments in the actual test
definitions about what they're covering or why they're included?  Or are
you interested in comments about the actual code down at the bottom
which runs the tests..?

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: some pg_dump query code simplification

2018-08-30 Thread Stephen Frost
Greetings,

* Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote:
> On 08/28/2018 06:10 PM, Stephen Frost wrote:
> >>Andrew has a buildfarm module that does precisely that, although
> >>I'm not sure what its test dataset is --- probably the regression
> >>database from each branch.  I also have a habit of doing such testing
> >>manually whenever I touch version-sensitive parts of pg_dump.
> >I've gotten better about doing that back-branch testing myself and
> >certainly prefer it, but I think we should also have buildfarm coverage.
> >I don't think we have the full matrix covered, or, really, anything
> >anywhere near it, so I'm looking for other options to at least get that
> >code exercised.
> 
> If by matrix you mean the version matrix, then the module in question covers
> every live branch as the source and every same or later live branch as the
> destination.

That's certainly better than I had thought it did.

> It could conceivably cover older branches as well - the branch names aren't
> actually hardcoded - I'd just have to be able to build them and do a
> standard buildfarm run once.

Having it cover older branches really is pretty important given how much
code there is for those older branches in pg_dump.  I've had pretty good
success compiling back down to about 7.2, as I recall, using patches
that Greg Stark made available.  Older is probably possible too.
Getting everything we claim to support covered would be fantastic (which
I believe is basically 9.6/9.5/9.4/9.3 against 7.0 and up, and
master/11/10 against 8.0 and up).

Thanks!

Stephen


signature.asc
Description: PGP signature


psql \dC incorrectly shows casts "with inout" as "binary coercible" on 9.5.14 and 11beta3

2018-08-30 Thread jean.pierre.pelletier0

To reproduce, compare the output of \dC on two built-in casts, (json to jsonb)
and (xml to text), where only the first is really "with inout".

I've been using the following query, which (I believe) correctly shows
the (json to jsonb) cast as "with inout":

SELECT
   CONCAT('CREATE CAST (',
  C.castSource::regType, ' AS ',
  C.castTarget::regType,') ',
  CASE c.castMethod
 WHEN 'b' THEN 'WITHOUT FUNCTION'
 WHEN 'f' THEN 'WITH FUNCTION ' || C.castFunc::regProc || '('
|| pg_get_function_identity_arguments(C.castFunc) || ')'
 WHEN 'i' THEN 'WITH INOUT'
  END,
  CASE c.castContext
 WHEN 'a' THEN ' AS ASSIGNMENT '
 WHEN 'e' THEN ''
 WHEN 'i' THEN ' AS IMPLICIT'
  END)
FROM
   pg_cast C

   INNER JOIN pg_type TS
   ON C.castSource = TS.oid

   INNER JOIN pg_type TT
   ON C.castTarget = TT.oid
WHERE
   (C.castSource::regType::text, C.castTarget::regType::text) IN
(('json','jsonb'), ('xml','text'));


I've also noticed that pgAdmin III 1.22.2 has the same bug,
while pgAdmin 4 3.2 displays "with inout" casts properly.

Thanks,
Jean-Pierre Pelletier


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Andres Freund
On 2018-08-30 19:02:15 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It certainly should be warned about.  Practically I don't think it's a
> > problem, because we pretty much always operate on a copy of the page
> > when writing out, as otherwise concurrently set hint bits would be
> > troublesome.
> 
> The write end of things is not a problem IMO, since presumably the caller
> would be about to overwrite the checksum field anyway.  The issue is for
> reading and/or verifying, where it's much less obvious that clobbering
> the page image is safe.

With "reading" you mean putting the page into the buffercache? If so,
that should be ok, the page isn't yet accessible (BM_VALID isn't set).

I don't think it's possible to do verification from the page in s_b,
because we don't keep the checksum current, so there should never be
calls to do so.

But we probably should still add a comment making clear that the
function modifies the buffer temporarily.

Greetings,

Andres Freund



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-30 18:11:40 -0400, Tom Lane wrote:
>> I suspect people will complain about the added cost of doing that.

> I think the compiler will just optimize it away.

Maybe.  In any case, the attached version avoids any need for memcpy
and is, I think, cleaner code anyhow.  It fixes the problem for me
with Fedora's gcc, though I'm not sure that it'd be enough to get rid
of the warning Michael sees (I wonder what compiler he's using).

>> BTW, not to mention the elephant in the room, but: is it *really* OK
>> that pg_checksum_page scribbles on the page buffer, even temporarily?
>> It's certainly quite horrid that there aren't large warnings about
>> that in the function's API comment.

> It certainly should be warned about.  Practically I don't think it's a
> problem, because we pretty much always operate on a copy of the page
> when writing out, as otherwise concurrently set hint bits would be
> troublesome.

The write end of things is not a problem IMO, since presumably the caller
would be about to overwrite the checksum field anyway.  The issue is for
reading and/or verifying, where it's much less obvious that clobbering
the page image is safe.

regards, tom lane

diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h
index 64d7622..a49d27f 100644
--- a/src/include/storage/checksum_impl.h
+++ b/src/include/storage/checksum_impl.h
@@ -107,6 +107,13 @@
 /* prime multiplier of FNV-1a hash */
 #define FNV_PRIME 16777619
 
+/* Use a union so that this code is valid under strict aliasing */
+typedef union
+{
+	PageHeaderData phdr;
+	uint32		data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
+} PGChecksummablePage;
+
 /*
  * Base offsets to initialize each of the parallel FNV hashes into a
  * different initial state.
@@ -132,28 +139,27 @@ do { \
 } while (0)
 
 /*
- * Block checksum algorithm.  The data argument must be aligned on a 4-byte
- * boundary.
+ * Block checksum algorithm.  The page must be adequately aligned
+ * (at least on 4-byte boundary).
  */
 static uint32
-pg_checksum_block(char *data, uint32 size)
+pg_checksum_block(const PGChecksummablePage *page)
 {
 	uint32		sums[N_SUMS];
-	uint32		(*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;
 	uint32		result = 0;
 	uint32		i,
 j;
 
 	/* ensure that the size is compatible with the algorithm */
-	Assert((size % (sizeof(uint32) * N_SUMS)) == 0);
+	Assert(sizeof(PGChecksummablePage) == BLCKSZ);
 
 	/* initialize partial checksums to their corresponding offsets */
 	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
 
 	/* main checksum calculation */
-	for (i = 0; i < size / sizeof(uint32) / N_SUMS; i++)
+	for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
 		for (j = 0; j < N_SUMS; j++)
-			CHECKSUM_COMP(sums[j], dataArr[i][j]);
+			CHECKSUM_COMP(sums[j], page->data[i][j]);
 
 	/* finally add in two rounds of zeroes for additional mixing */
 	for (i = 0; i < 2; i++)
@@ -168,8 +174,10 @@ pg_checksum_block(char *data, uint32 size)
 }
 
 /*
- * Compute the checksum for a Postgres page.  The page must be aligned on a
- * 4-byte boundary.
+ * Compute the checksum for a Postgres page.
+ *
+ * The page must be adequately aligned (at least on a 4-byte boundary).
+ * Beware also that the checksum field of the page is transiently zeroed.
  *
  * The checksum includes the block number (to detect the case where a page is
  * somehow moved to a different location), the page header (excluding the
@@ -178,12 +186,12 @@ pg_checksum_block(char *data, uint32 size)
 uint16
 pg_checksum_page(char *page, BlockNumber blkno)
 {
-	PageHeader	phdr = (PageHeader) page;
+	PGChecksummablePage *cpage = (PGChecksummablePage *) page;
 	uint16		save_checksum;
 	uint32		checksum;
 
 	/* We only calculate the checksum for properly-initialized pages */
-	Assert(!PageIsNew(page));
+	Assert(!PageIsNew(&cpage->phdr));
 
 	/*
 	 * Save pd_checksum and temporarily set it to zero, so that the checksum
@@ -191,10 +199,10 @@ pg_checksum_page(char *page, BlockNumber blkno)
 	 * Restore it after, because actually updating the checksum is NOT part of
 	 * the API of this function.
 	 */
-	save_checksum = phdr->pd_checksum;
-	phdr->pd_checksum = 0;
-	checksum = pg_checksum_block(page, BLCKSZ);
-	phdr->pd_checksum = save_checksum;
+	save_checksum = cpage->phdr.pd_checksum;
+	cpage->phdr.pd_checksum = 0;
+	checksum = pg_checksum_block(cpage);
+	cpage->phdr.pd_checksum = save_checksum;
 
 	/* Mix in the block number to detect transposed pages */
 	checksum ^= blkno;


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 10:07:38PM +0200, Magnus Hagander wrote:
> I wonder if your tests that pg_control has picked things up belong more in
> the tests of initdb itself?

For the case where checksums are disabled, moving the check on control
data there makes sense.

> Do you think there is value in testing against a non-checksum cluster? I
> guess there's some point to it. I think testing actual corruption (like my
> version of the tests) is more valuable, but perhaps we should just do both?

Yeah, let's do everything on a single cluster which has checksums enabled,
as initializing a node is one of the most costly operations in TAP
tests.  Checking that the server is stopped is definitely a must in my
opinion, and your addition about emulating corrupted blocks is a good
idea.  I would personally vote for keeping a control file check within
the tests of pg_verify_checksums as that's cheap.
--
Michael


signature.asc
Description: PGP signature


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Andres Freund
On 2018-08-30 18:11:40 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-08-30 14:46:06 -0700, Andres Freund wrote:
> >> One way to fix it would be to memcpy in/out the modified PageHeader, or
> >> just do offset math and memcpy to that offset.
> 
> > It took me a bit to reproduce the issue (due to sheer stupidity on my
> > part: no, changing the flags passed to gcc to link pg_verify_checksums
> > doesn't do the trick), but the above indeed fixes the issue for me.
> 
> I suspect people will complain about the added cost of doing that.

I think the compiler will just optimize it away.  But if we're concerned
we could write it as

memcpy(&save_checksum, page + offsetof(PageHeaderData, pd_checksum),
	   sizeof(save_checksum));
memset(page + offsetof(PageHeaderData, pd_checksum), 0,
	   sizeof(save_checksum));

checksum = pg_checksum_block(page, BLCKSZ);

memcpy(page + offsetof(PageHeaderData, pd_checksum), &save_checksum,
	   sizeof(save_checksum));

works, but still not exceedingly pretty :/.  The code generated looks
reasonable:

194 memcpy(&save_checksum, page + offsetof(PageHeaderData, pd_checksum), sizeof(save_checksum));
   0x35d0 <+0>: push   %r12
   0x35d2 <+2>: xor    %eax,%eax
   0x35d4 <+4>: movzwl 0x8(%rdi),%r12d

195 memset(page + offsetof(PageHeaderData, pd_checksum), 0, sizeof(save_checksum));
   0x35d9 <+9>: push   %rbp
   0x35da <+10>: mov    %ax,0x8(%rdi)

(the pushes are just interspersed stuff, yay latency aware instruction
scheduling)


> I've been AFK all afternoon, but what I was intending to try next was
> the union approach, specifically union'ing PageHeaderData with the uint32
> array representation needed by pg_checksum_block().  That might also
> end up giving us code less unreadable than this:
> 
>   uint32  (*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;

Hm.


> BTW, not to mention the elephant in the room, but: is it *really* OK
> that pg_checksum_page scribbles on the page buffer, even temporarily?
> It's certainly quite horrid that there aren't large warnings about
> that in the function's API comment.

It certainly should be warned about.  Practically I don't think it's a
problem, because we pretty much always operate on a copy of the page
when writing out, as otherwise concurrently set hint bits would be
troublesome.

Greetings,

Andres Freund



Re: B-tree cache prefetches

2018-08-30 Thread Peter Geoghegan
On Thu, Aug 30, 2018 at 2:40 PM, Thomas Munro
 wrote:
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data.  Check out "Array Layouts for
> Comparison-Based Searching"[1] if you haven't already.  It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed about when I was (casually) investigating new layouts
> based on some comments from a fellow hacker (I can't remember if it
> was Andres or Peter G who mentioned this topic to me).

It might have been both of us, at different times.

> However, the
> paper is talking about "branch-free binary search with explicit
> prefetch", which apparently eats plain old branchy binary search for
> breakfast, and I gather they tested on a lot of hardware.  That sounds
> interesting to look into.

I think that the main win will come from stacking a bunch of different
optimizations on top of each other, including key normalization,
automatic prefix compression on internal pages (something that's a lot
easier than on leaf pages [1]), and abbreviated keys in the ItemId
array of nbtree pages [2] (even 15-bit abbreviated keys can have a lot
of entropy when combined with these other techniques). Once we have
all that in place, I expect drastically fewer iCache misses, making
other optimizations interesting.

I'm bullish on the idea of using something like an unrolled binary
search, or an interpolation search in the long term. In the meantime,
we have too many iCache misses. As a datapoint, iCache misses
reportedly dominate the overhead of TPC-C and TPC-E [3]. Nestloop
joins with inner index scans are dominant within those benchmarks.

[1] 
https://wiki.postgresql.org/wiki/Key_normalization#Using_existing_.22minus_infinity.22_index_tuple_as_a_low_key
[2] 
https://www.postgresql.org/message-id/CAH2-Wz=mv4dmoapficrsyntv2kinxeotbwuy5r7fxxoc-oe...@mail.gmail.com
[3] https://openproceedings.org/2013/conf/edbt/TozunPKJA13.pdf --
"6.1.2 Core stalls"
-- 
Peter Geoghegan



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-30 14:46:06 -0700, Andres Freund wrote:
>> One way to fix it would be to memcpy in/out the modified PageHeader, or
>> just do offset math and memcpy to that offset.

> It took me a bit to reproduce the issue (due to sheer stupidity on my
> part: no, changing the flags passed to gcc to link pg_verify_checksums
> doesn't do the trick), but the above indeed fixes the issue for me.

I suspect people will complain about the added cost of doing that.

I've been AFK all afternoon, but what I was intending to try next was
the union approach, specifically union'ing PageHeaderData with the uint32
array representation needed by pg_checksum_block().  That might also
end up giving us code less unreadable than this:

uint32  (*dataArr)[N_SUMS] = (uint32 (*)[N_SUMS]) data;


BTW, not to mention the elephant in the room, but: is it *really* OK
that pg_checksum_page scribbles on the page buffer, even temporarily?
It's certainly quite horrid that there aren't large warnings about
that in the function's API comment.

regards, tom lane



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Andres Freund
On 2018-08-30 14:46:06 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-08-30 17:19:28 -0400, Tom Lane wrote:
> > So, I've been fooling around trying to get it to work without
> > -fno-strict-aliasing, but with little luck so far.
> 
> The problem presumably is that pg_checksum_block() accesses the relevant
> fields as an uint32, whereas pg_checksum_page() accesses it as a
> PageHeader. That's an aliasing violation.  *One* cast from char* to
> either type is fine, it's accessing under both those types that's
> problematic.
> 
> One way to fix it would be to memcpy in/out the modified PageHeader, or
> just do offset math and memcpy to that offset.

It took me a bit to reproduce the issue (due to sheer stupidity on my
part: no, changing the flags passed to gcc to link pg_verify_checksums
doesn't do the trick), but the above indeed fixes the issue for me.

The attached is just for demonstration that the approach works.

Greetings,

Andres Freund
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 7cf3cf35c7a..c23b0dd5c37 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -78,7 +78,7 @@ skipfile(char *fn)
 static void
 scan_file(char *fn, int segmentno)
 {
-	char		buf[BLCKSZ];
+	char	   *buf = palloc(BLCKSZ);
 	PageHeader	header = (PageHeader) buf;
 	int			f;
 	int			blockno;
@@ -126,6 +126,8 @@ scan_file(char *fn, int segmentno)
 	}
 
 	close(f);
+
+	pfree(buf);
 }
 
 static void
diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h
index 64d76229add..a258ea8d117 100644
--- a/src/include/storage/checksum_impl.h
+++ b/src/include/storage/checksum_impl.h
@@ -178,12 +178,14 @@ pg_checksum_block(char *data, uint32 size)
 uint16
 pg_checksum_page(char *page, BlockNumber blkno)
 {
-	PageHeader	phdr = (PageHeader) page;
+	PageHeaderData	phdr;
 	uint16		save_checksum;
 	uint32		checksum;
 
+	memcpy(&phdr, page, sizeof(PageHeaderData));
+
 	/* We only calculate the checksum for properly-initialized pages */
-	Assert(!PageIsNew(page));
+	Assert(!PageIsNew(&phdr));
 
 	/*
 	 * Save pd_checksum and temporarily set it to zero, so that the checksum
@@ -191,10 +193,14 @@ pg_checksum_page(char *page, BlockNumber blkno)
 	 * Restore it after, because actually updating the checksum is NOT part of
 	 * the API of this function.
 	 */
-	save_checksum = phdr->pd_checksum;
-	phdr->pd_checksum = 0;
+	save_checksum = phdr.pd_checksum;
+	phdr.pd_checksum = 0;
+	memcpy(page, &phdr, sizeof(PageHeaderData));
+
 	checksum = pg_checksum_block(page, BLCKSZ);
-	phdr->pd_checksum = save_checksum;
+
+	phdr.pd_checksum = save_checksum;
+	memcpy(page, &phdr, sizeof(PageHeaderData));
 
 	/* Mix in the block number to detect transposed pages */
 	checksum ^= blkno;


Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Tom Lane
Justin Pryzby  writes:
> Just curious, is there really any difficulty in reproducing this?

Once you have the right test case, it's not hard.  But it took us two
months to find one ...

regards, tom lane



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Justin Pryzby
On Thu, Aug 30, 2018 at 05:30:30PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> >> As far as we can tell, that bug is a dozen years old, so it's not clear
> >> why you find that you can reproduce it only in 10.5.  But there might be
> >> some subtle timing change accounting for that.
> 
> > It seems to me there's one root problem occurring in (at least) two slightly
> > different ways.  The issue/symptom that I've been seeing occurs in 10.5 but 
> > not
> > 10.4, and specifically at commit 2ce64ca, but not before. 
> 
> Yeah, as you probably saw in the other thread, we later realized that
> 2ce64ca created an additional pathway for ScanPgRelation to recurse;
> a pathway that's evidently easier to hit than the pre-existing ones.
> I note that both of your stack traces display ScanPgRelation recursion,
> so I'm feeling pretty confident that what you're seeing is the same
> thing.
> 
> But, as Andres says, it'd be great if you could confirm whether the
> draft patches fix it for you.

I tested with relcache-rebuild.diff, which hasn't broken in 15 minutes, so I'm
confident it doesn't hit the additional recursive pathway, but I have to wait
a while and see if autovacuum survives, too.

I tried to apply fix-missed-inval-msg-accepts-1.patch on top of PG10.5 but
patch didn't apply, so I can test HEAD after the first patch soaks awhile.

Just curious, is there really any difficulty in reproducing this?  Once I
realized this was a continuing issue and started to suspect pg10.5, it took
just about nothing to reproduce it anywhere I've tried.  I just tested 5
servers, and only one took more than a handful of seconds to fail.  I gave up
waiting for a 6th server, because I found it was waiting on a pre-existing lock.

[pryzbyj@database ~]$ while :; do for a in pg_class_oid_index 
pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index; do psql ts -qc 
"REINDEX INDEX $a"; done; done&
[pryzbyj@database ~]$ a=0; time while psql ts -qc ''; do a=$((1+a)); done ; 
echo "$a"
psql: FATAL:  could not read block 0 in file "base/16400/313581263": read only 
0 of 8192 bytes

real0m1.772s
user0m0.076s
sys 0m0.116s
47

Justin



Re: Use C99 designated initializers for some structs

2018-08-30 Thread Tom Lane
Mark Dilger  writes:
> I tried doing this perhaps a year ago, and there are a few files with arrays
> of structs whose representations get much larger when you change the format
> in this way.  For instance, in guc.c:
> ...
> What should the general rule be for initializing arrays of structs such as 
> these?

I'd argue that there is no reason to convert initializers except where
it's a clear notational win.  If it isn't, leaving things alone will
be less of a maintenance problem.

regards, tom lane



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Tom Lane
Magnus Hagander  writes:
> On Thu, Aug 30, 2018 at 4:39 PM, Tom Lane  wrote:
>> (The right fix, of course, is to malloc the work buffer rather than
>> put it on the stack.)

> So if I get you  right, you're saying the attached patch should be all
> that's needed?

Well, that's some of what's needed.  I notice this code is also being
sloppy about whether block numbers are signed or unsigned, which means
it'd probably misbehave on relations exceeding 2^31 blocks.  I have
a patch in progress to clean all that up, though I'm still working
on the strict-aliasing part.

regards, tom lane



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Andres Freund
Hi,

On 2018-08-30 17:19:28 -0400, Tom Lane wrote:
> So, I've been fooling around trying to get it to work without
> -fno-strict-aliasing, but with little luck so far.

The problem presumably is that pg_checksum_block() accesses the relevant
fields as an uint32, whereas pg_checksum_page() accesses it as a
PageHeader. That's an aliasing violation.  *One* cast from char* to
either type is fine, it's accessing under both those types that's
problematic.

One way to fix it would be to memcpy in/out the modified PageHeader, or
just do offset math and memcpy to that offset.
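
For anyone wanting to see the hazard outside the tree, here is a
self-contained toy -- not PostgreSQL code, just an assumed-equivalent pattern
-- that accesses one char buffer through two unrelated pointer types the same
way pg_checksum_page() and pg_checksum_block() do:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
	uint16_t	pd_checksum;
	uint16_t	pd_flags;
	uint32_t	filler[15];
} ToyHeader;

static uint32_t
sum_words(const char *buf, size_t nbytes)
{
	const uint32_t *words = (const uint32_t *) buf;	/* access as uint32 */
	uint32_t	sum = 0;

	for (size_t i = 0; i < nbytes / sizeof(uint32_t); i++)
		sum += words[i];
	return sum;
}

int
main(void)
{
	char		page[64];		/* may also be under-aligned, as discussed */
	ToyHeader  *hdr = (ToyHeader *) page;	/* access as struct */

	memset(page, 0x11, sizeof(page));
	hdr->pd_checksum = 0;		/* "temporarily zero the checksum field" */

	/*
	 * Under -O2 with strict aliasing the compiler may assume the struct
	 * store and the uint32 loads don't overlap, so the zeroing above can
	 * be reordered past or dropped from the sum.
	 */
	printf("sum = %u\n", (unsigned) sum_words(page, sizeof(page)));
	return 0;
}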

Greetings,

Andres Freund



Re: B-tree cache prefetches

2018-08-30 Thread Thomas Munro
On Fri, Aug 31, 2018 at 5:53 AM Andrey Borodin  wrote:
> Hi hackers!
>
> I've been at the database conference and here everyone is talking about cache 
> prefetches.
>
> I've tried simple hack
>
> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index d3700bd082..ffddf553aa 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,
>
> /* We have low <= mid < high, so mid points at a real slot */
>
> +		OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
> +		if (x < high)
> +			__builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
> +		x = low + ((mid - low) / 2);
> +		if (x > low)
> +			__builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
> +
> +
> result = _bt_compare(rel, keysz, scankey, page, mid);
>
> if (result >= cmpval)
>
> The idea is pretty simple - our searches are cache-erasing anyway, so let's
> try to get at least some of that back by prefetching possible ways of the
> binary search. And it seems to me that on a simple query
> > insert into x select (random()*100)::int from generate_series(1,1e7);
> it brings something like a 2-4% performance improvement on my laptop.
>
> Is there a reason why we do not use __builtin_prefetch? Has anyone tried to
> use cache prefetching?

A related topic is the cache-unfriendliness of traditional binary
searches of sorted data.  Check out "Array Layouts for
Comparison-Based Searching"[1] if you haven't already.  It says that
if your array fits in L2 cache, as our btree pages do, then binary
search still wins against Eytzinger et al, which I was a bit
disappointed about when I was (casually) investigating new layouts
based on some comments from a fellow hacker (I can't remember if it
was Andres or Peter G who mentioned this topic to me).  However, the
paper is talking about "branch-free binary search with explicit
prefetch", which apparently eats plain old branchy binary search for
breakfast, and I gather they tested on a lot of hardware.  That sounds
interesting to look into.

[1] https://arxiv.org/pdf/1509.05053.pdf
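
As a concrete illustration of what the paper means, here is a toy sketch --
not PostgreSQL code, and any speedup is an assumption -- of a branch-free
lower-bound search that prefetches the two possible probe points of the next
step (__builtin_prefetch is a GCC/Clang builtin):

#include <stddef.h>
#include <stdint.h>

static size_t
lower_bound_prefetch(const uint32_t *arr, size_t n, uint32_t key)
{
	const uint32_t *base = arr;

	if (n == 0)
		return 0;

	while (n > 1)
	{
		size_t		half = n / 2;

		/* Prefetch the midpoints of both halves we might descend into. */
		__builtin_prefetch(&base[half / 2], 0, 1);
		__builtin_prefetch(&base[half + half / 2], 0, 1);

		/* Compilers typically emit a conditional move here, not a branch. */
		base = (base[half] < key) ? base + half : base;
		n -= half;
	}
	return (size_t) (base - arr) + (base[0] < key);
}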

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Startup cost of sequential scan

2018-08-30 Thread Tom Lane
Robert Haas  writes:
> Whose mental model?  I guess the Tom Lane mind is the canonical one
> for this project, but I'm not sure that it entirely agrees with mine.

Since the fact that we have a notion of startup cost at all is entirely my
fault, I don't feel shy about claiming to have the authoritative view of
what it means.

(Whether that's adequately documented is another question :-()

> IIRC, it was previously proposed that we ought to charge
> random_page_cost for the first block of a sequential scan, because at
> present the cost of fetching 1 block differs depending on whether we
> are fetching it from a heap or an index, which seems unprincipled.

That might well be a sane thing to do ... but it'd still be part of run
cost not startup cost.

regards, tom lane



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
>> As far as we can tell, that bug is a dozen years old, so it's not clear
>> why you find that you can reproduce it only in 10.5.  But there might be
>> some subtle timing change accounting for that.

> It seems to me there's one root problem occurring in (at least) two slightly
> different ways.  The issue/symptom that I've been seeing occurs in 10.5 but 
> not
> 10.4, and specifically at commit 2ce64ca, but not before. 

Yeah, as you probably saw in the other thread, we later realized that
2ce64ca created an additional pathway for ScanPgRelation to recurse;
a pathway that's evidently easier to hit than the pre-existing ones.
I note that both of your stack traces display ScanPgRelation recursion,
so I'm feeling pretty confident that what you're seeing is the same
thing.

But, as Andres says, it'd be great if you could confirm whether the
draft patches fix it for you.

regards, tom lane



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Tom Lane
Michael Banck  writes:
> Could well be I'm doing something wrong, so it would be cool if somebody
> could reproduce this first. In principle, it should be enough to run
> 'make clean && make CLFAGS=-O2' in the src/bin/pg_verify_checksums
> subdirectory in order to get a broken executable.

I can reproduce it on my Fedora 28 machine (gcc 8.1.1) by compiling
pg_verify_checksums with all the normal flags except -fno-strict-aliasing.
I don't see any warning, not even with -Wstrict-aliasing=2 :-(, but the
generated code malfunctions as Michael describes.  As best I can tell,
the problem is in the inlined code from checksum_impl.h rather than
in pg_verify_checksums.c itself.  I think what is happening is gcc
is concluding that it can optimize away the code that temporarily
sets the page's checksum field to zero.

Although, as Alvaro mentioned upthread, there's basically no hope of
getting away from -fno-strict-aliasing for Postgres-at-large, I think it'd
be a good thing if we could get checksum_impl.h to not depend on it.  The
whole point of that header, according to its own self-description, is to
export the checksum calculation for use by external programs --- and we
can't really control what compiler flags will be used by those.  The fact
that Michael even ran into this problem seems to be an instance of that.

So, I've been fooling around trying to get it to work without
-fno-strict-aliasing, but with little luck so far.

regards, tom lane



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Michael Banck
Hi,

On Thursday, 30.08.2018 at 21:35 +0200, Magnus Hagander wrote:
> So if I get you  right, you're saying the attached patch should be all
> that's needed? 

I tried to do some similar changes but neither what you proposed nor
what I came up with actually fixes the checksum failures, though they
silence the warning at -Wall level.

Could well be I'm doing something wrong, so it would be cool if somebody
could reproduce this first. In principle, it should be enough to run
'make clean && make CLFAGS=-O2' in the src/bin/pg_verify_checksums
subdirectory in order to get a broken executable.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Michael Banck
Hi,

On Thursday, 30.08.2018 at 22:02 +0200, Magnus Hagander wrote:
> PFA some *very* basic tests for pg_verify_checksums, which should at
> least be enough to catch the kind of errors we had now in the tool
> itself.

I proposed something similar for pg_basebackup back then and IIRC Peter
(rightfully) complained that using a system catalog for that wasn't so
great - also, are you sure the relfilenode will always be stable? Then
again, this is obviously a throw-away data directory so it should be no
general issue, just a matter of taste.

https://github.com/credativ/pg_checksums/blob/master/t/001_checksums.pl
has some similar tests at the end but creates a table for corruption,
adds some safeguards for block size and also uses  command_checks_all()
to parse the output. It's PGDG licensed so you're welcome to take some
or all of that.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz



Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Andres Freund
Hi,

On 2018-08-30 15:00:28 -0500, Justin Pryzby wrote:
> On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> > This looks suspiciously like the issue under discussion in
> > 
> > https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> > 
> > As far as we can tell, that bug is a dozen years old, so it's not clear
> > why you find that you can reproduce it only in 10.5.  But there might be
> > some subtle timing change accounting for that.
> 
> I confirmed the stack appears to be same as here, modulo differences between
> 9.6 to 10.

Could you check if both of the proposed attempts at fixing the issue
also solves your problem?

Thanks,

Andres Freund



Re: Stored procedures and out parameters

2018-08-30 Thread Dave Cramer
>
>
> In other words, being more like the SQL standard is probably good, but
> breaking compatibility is bad.  You've technically avoided a
> *backward* compatibility break by deciding that functions and
> procedures can work differently from each other, but that just moves
> the problem around.  Now instead of being unhappy that existing code
> is broken, people are unhappy that the new thing doesn't work like the
> existing thing.  That may be the lesser of evils, but it's still
> pretty evil.  People are not being unreasonable to want to call some
> code stored on the server without having to worry about whether that
> code is in a box labelled PROCEDURE or a box labelled FUNCTION.
>
>
Reading this from the (JDBC) driver's perspective, which is probably a
fairly popular one: we now have a standard that we can't really support.
Either the driver will have to support the new PROCEDURE with the {call}
mechanism or stay with the existing FUNCTIONS.  This puts the drivers in
a no-win situation.

> This probably should have been discussed in more detail before this
> got committed, but I guess that's water under the bridge at this
> point.  Nevertheless, I predict that this is going to be an ongoing
> source of pain for a long time to come.

Undoubtedly... surely the opportunity to do something about this has not
passed, as this has not been officially released?

Dave Cramer

da...@postgresintl.com
www.postgresintl.com


Re: Use C99 designated initializers for some structs

2018-08-30 Thread Andres Freund
Hi,

On 2018-08-30 13:54:41 -0300, Alvaro Herrera wrote:
> On 2018-Aug-30, Mark Dilger wrote:
>
> > static struct config_bool ConfigureNamesBool[] =
> > {
> > {
> > {"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
> > gettext_noop("Enables the planner's use of sequential-scan plans."),
> > NULL
> > },
> > &enable_seqscan,
> > true,
> > NULL, NULL, NULL
> > },
>
> Personally, I dislike this form -- it's very opaque and I have to refer
> to the struct definition each time I want to add a new member, to make
> sure I'm assigning the right thing.

Dito. Especially because it looks different for the different types of
GUCs.


> I welcome designated initializers in this case even though it becomes
> more verbose.  I don't think explicitly initializing to NULLs is
> sensible in this case; let's just omit those fields.

Yea - I mean a large portion of them previously weren't initialized
either, so there's really not a good reason to change that now.


> > What should the general rule be for initializing arrays of structs such as 
> > these?
>
> I don't know what a general rule would be.  Maybe we can try hand-
> inspecting a few cases, and come up with a general rule once we acquire
> sufficient experience.

I think we should have as rules:

1) Members should be defined in the same order as in the struct, that's
   the requirement C++ standard is going to impose. Think it's also
   reasonable stylistically.
2) It's OK to omit setting members if zero-initialization obviously is
   correct.

We probably should also check how well pgindent copes, and whether that
dictates some formatting choices.
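
For illustration (a sketch, not a proposed patch), the enable_seqscan entry
from the guc.c excerpt above could look like this under rules 1 and 2,
assuming the field names match the struct config_bool and config_generic
definitions in guc_tables.h:

static struct config_bool ConfigureNamesBool[] =
{
	{
		/* Rule 1: members appear in declaration order. */
		.gen = {
			.name = "enable_seqscan",
			.context = PGC_USERSET,
			.group = QUERY_TUNING_METHOD,
			.short_desc = gettext_noop("Enables the planner's use of sequential-scan plans."),
		},
		.variable = &enable_seqscan,
		.boot_val = true,
		/* Rule 2: check/assign/show hooks are zero-initialized, so omitted. */
	},
};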


Greetings,

Andres Freund



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Andres Freund
On 2018-08-30 10:39:26 -0400, Tom Lane wrote:
>   char        buf[BLCKSZ];
>   PageHeader  header = (PageHeader) buf;

> (The right fix, of course, is to malloc the work buffer rather than
> put it on the stack.)

Or alternatively, for places where such allocations could be a problem
for performance, a union can be used to force the required alignment.
I'm fairly certain that just allocating is more than OK here, to be
clear.
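
A minimal sketch of that union trick (illustrative only; this type does not
exist in the tree):

#include <stdint.h>

#define BLCKSZ 8192

typedef union
{
	char		data[BLCKSZ];	/* pass buf.data where a char *page is expected */
	double		force_align_d;	/* forces maximal scalar alignment */
	int64_t		force_align_i64;
} AlignedBlockSketch;

Declaring "AlignedBlockSketch buf;" keeps the block on the stack while
guaranteeing the alignment that the casts to PageHeader (or to uint32 in the
checksum loop) require.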

Greetings,

Andres Freund



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Magnus Hagander
On Thu, Aug 30, 2018 at 10:02 PM, Michael Paquier 
wrote:

> On Thu, Aug 30, 2018 at 09:35:33PM +0200, Magnus Hagander wrote:
> > Should we make it a separate test in pg_verify_checksums, or should we
> > piggyback on the pg_basebackup tests (which AFAICT is the only ones that
> > create a cluster with checksums enabled at all, and thus is the only
> > codepath that actually uses the backend checksum code at all, which I
> think
> > is an even worse thing to have no tests of)
>
> This should be a separate suite.  And FWIW, we only use pg_regress to
> make sure that an already-initialized data folder has the correct,
> secure authentication set.  So creating a node with checksums enabled is
> just that:
> $node->init(extra => ['--data-checksums']);
>
> [... 20 minutes later ...]
> Attached is a basic test suite ;)
>

Haha, nice timing :)

I wonder if your tests that check pg_control has picked things up belong
more in the tests of initdb itself?

Do you think there is value in testing against a non-checksum cluster? I
guess there's some point to it. I think testing actual corruption (like my
version of the tests) is more valuable, but perhaps we should just do both?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 09:35:33PM +0200, Magnus Hagander wrote:
> Should we make it a separate test in pg_verify_checksums, or should we
> piggyback on the pg_basebackup tests (which AFAICT is the only ones that
> create a cluster with checksums enabled at all, and thus is the only
> codepath that actually uses the backend checksum code at all, which I think
> is an even worse thing to have no tests of)

This should be a separate suite.  And FWIW, we only use pg_regress to
make sure that an already-initialized data folder has the correct,
secure authentication set.  So creating a node with checksums enabled is
just that:
$node->init(extra => ['--data-checksums']);

[... 20 minutes later ...]
Attached is a basic test suite ;)
--
Michael
diff --git a/src/bin/pg_verify_checksums/.gitignore b/src/bin/pg_verify_checksums/.gitignore
index d1dcdaf0dd..0e5e569a54 100644
--- a/src/bin/pg_verify_checksums/.gitignore
+++ b/src/bin/pg_verify_checksums/.gitignore
@@ -1 +1,3 @@
 /pg_verify_checksums
+
+/tmp_check/
diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile
index d16261571f..cfe4ab1b8b 100644
--- a/src/bin/pg_verify_checksums/Makefile
+++ b/src/bin/pg_verify_checksums/Makefile
@@ -34,3 +34,9 @@ uninstall:
 clean distclean maintainer-clean:
 	rm -f pg_verify_checksums$(X) $(OBJS)
 	rm -rf tmp_check
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pg_verify_checksums/t/001_basic.pl b/src/bin/pg_verify_checksums/t/001_basic.pl
new file mode 100644
index 00..1fa2e12db2
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/001_basic.pl
@@ -0,0 +1,8 @@
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_verify_checksums');
+program_version_ok('pg_verify_checksums');
+program_options_handling_ok('pg_verify_checksums');
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
new file mode 100644
index 00..20d19f0291
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -0,0 +1,42 @@
+# Do basic sanity checks supported by pg_verify_checksums using
+# an initialized cluster.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Initialize node with checksums disabled.
+my $node_no_checksum = get_new_node('node_no_checksum');
+$node_no_checksum->init;
+$node_no_checksum->start;
+my $pgdata = $node_no_checksum->data_dir;
+
+# Error, server must be shut down cleanly.
+command_fails(['pg_verify_checksums', '-d', $pgdata],
+			  "pg_verify_checksums requires server to be cleanly stopped");
+$node_no_checksum->stop;
+
+# Control file should know that checksums are disabled.
+command_like(['pg_controldata', $pgdata],
+	 qr/Data page checksum version:.*0/,
+	 'checksums are disabled in control file');
+
+# Run verification, which should fail as checksums are disabled.
+command_fails(['pg_verify_checksums', '-d', $pgdata],
+			  "checksums disabled so no verification");
+
+# Initialize node with checksums enabled.
+my $node_checksum = get_new_node('node_checksum');
+$node_checksum->init(extra => ['--data-checksums']);
+$pgdata = $node_checksum->data_dir;
+
+# Control file should know that checksums are enabled.
+command_like(['pg_controldata', $pgdata],
+	 qr/Data page checksum version:.*1/,
+		 'checksums are enabled in control file');
+
+# This time verification is able to work.
+command_ok(['pg_verify_checksums',  '-d', $pgdata],
+	   "checksums enabled so verification happens");


signature.asc
Description: PGP signature


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Magnus Hagander
On Thu, Aug 30, 2018 at 9:35 PM, Magnus Hagander 
wrote:

> On Thu, Aug 30, 2018 at 4:39 PM, Tom Lane  wrote:
>
>> Fabien COELHO  writes:
>> >> If I add -fno-strict-aliasing to $CFLAGS, the problem goes away.
>> >> Is this something to worry about, or just pilot error cause I am not
>> >> using the same $CFLAGS as for the rest of the build? I originally
>> >> noticed this problem with my external fork of pg_verify_checksums.
>>
>> > It looks like the code is using aliasing where the standard says it
>> should
>> > not, which breaks compiler optimization, and the added options tells
>> the
>> > compiler to not assume that the code conforms to the standard...
>>
>> Actually, this code is just plain broken:
>>
>> char        buf[BLCKSZ];
>> PageHeader  header = (PageHeader) buf;
>>
>> There is no guarantee that a local char[] array is aligned on anything
>> stronger than a byte boundary, and if it isn't, word-sized accesses to
>> *header are going to fail on machines with strict alignment rules.
>> I suppose that that is really what Michael's compiler is moaning about.
>>
>> I rather suspect that this hasn't been tested on anything but Intel
>> hardware, which is famously misalignment-tolerant.  The lack of any
>> apparent regression test infrastructure for it isn't leaving a warm
>> feeling about how much the buildfarm is testing it.
>>
>
> I know I certainly didn't test it on non-intel.
>
> We did have that in the online verify checksum patch, but it came out as
> part of the revert of that part.
>
> Given that we also recently found bugs in the actual hash index
> implementation that would've gotten caught by this, perhaps we should add a
> TAP test for this.
>
> Should we make it a separate test in pg_verify_checksums, or should we
> piggyback on the pg_basebackup tests (which AFAICT is the only ones that
> create a cluster with checksums enabled at all, and thus is the only
> codepath that actually uses the backend checksum code at all, which I think
> is an even worse thing to have no tests of)
>

PFA some *very* basic tests for pg_verify_checksums, which should at least
be enough to catch the kind of errors we had now in the tool itself.

Does not at this point try to solve the fact that the backend checksum code
is not tested.


>> (The right fix, of course, is to malloc the work buffer rather than
>> put it on the stack.)
>
>
> So if I get you  right, you're saying the attached patch should be all
> that's needed?
>
>
-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile
index d16261571f..cfe4ab1b8b 100644
--- a/src/bin/pg_verify_checksums/Makefile
+++ b/src/bin/pg_verify_checksums/Makefile
@@ -34,3 +34,9 @@ uninstall:
 clean distclean maintainer-clean:
 	rm -f pg_verify_checksums$(X) $(OBJS)
 	rm -rf tmp_check
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pg_verify_checksums/t/010_pg_verify_checksums.pl b/src/bin/pg_verify_checksums/t/010_pg_verify_checksums.pl
new file mode 100644
index 00..b6d535d76b
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/010_pg_verify_checksums.pl
@@ -0,0 +1,33 @@
+use strict;
+use warnings;
+use Cwd;
+use Config;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 10;
+
+program_help_ok('pg_verify_checksums');
+program_version_ok('pg_verify_checksums');
+program_options_handling_ok('pg_verify_checksums');
+
+my $node = get_new_node('main');
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Initialize node without replication settings
+$node->init(extra => ['--data-checksums']);
+
+# Verify checksums of fresh node
+$node->command_ok(['pg_verify_checksums', '-D', $node->data_dir()]);
+
+# Cause some corruption in pg_proc
+open my $file, '+<', $node->data_dir() . "/base/1/1255";
+seek($file, 131, 0);
+syswrite($file, '\0\0\0\0\0\0\0\0\0');
+seek($file, 15721, 0);
+syswrite($file, '\0\0\0\0\0\0\0\0\0');
+close $file;
+
+# Verify that checksums are now broken
+$node->command_fails(['pg_verify_checksums', '-D', $node->data_dir()]);


Re: 10.5 but not 10.4: backend startup during reindex system: could not read block 0 in file "base/16400/..": read only 0 of 8192 bytes

2018-08-30 Thread Justin Pryzby
On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > I've seen this message now a handful of times recently.  It seems to happen
> > overnight, during a maintenance job which reindexes things, including system
> > catalog indices.
> > It's easy to reproduce error under 10.5, but not under 10.3 nor 10.4.
> > while :; do for a in pg_class_oid_index pg_class_relname_nsp_index 
> > pg_class_tblspc_relfilenode_index; do psql ts -qc "REINDEX INDEX $a"; done; 
> > done&
> > time while psql ts -qc ''; do :; done
> 
> This looks suspiciously like the issue under discussion in
> 
> https://www.postgresql.org/message-id/12259.1532117714%40sss.pgh.pa.us
> 
> As far as we can tell, that bug is a dozen years old, so it's not clear
> why you find that you can reproduce it only in 10.5.  But there might be
> some subtle timing change accounting for that.

I confirmed the stack appears to be same as here, modulo differences between
9.6 to 10.
https://www.postgresql.org/message-id/fcab55ec-660a-bc43-7fe1-09af3665e89a%402ndquadrant.com

It seems to me there's one root problem occurring in (at least) two slightly
different ways.  The issue/symptom that I've been seeing occurs in 10.5 but not
10.4, and specifically at commit 2ce64ca, but not before. 

|commit 2ce64caaf793615d0f7a7ce4380e7255077d130c
|Author: Andres Freund 
|Date:   Tue Jun 12 11:13:21 2018 -0700
|
|Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

However, the same message CAN happen under 10.4, though less easily, due to
autovacuum but (afaict) not user queries.  Maybe that's a coincidence, but my
experience mirrors the build logs: the problem since 10.5 is easily tickled in
a few seconds rather than an hour.

The issue I see frequently under 10.5 has this stack:

Core was generated by `postgres: pryzbyj postgres [local] startup   
 '.
#2  0x00709425 in mdread ()
#3  0x006e2bc8 in ReadBuffer_common ()
#4  0x006e3511 in ReadBufferExtended ()
#5  0x004c2da9 in _bt_getbuf ()
#6  0x004c32b2 in _bt_getroot ()
#7  0x004c799b in _bt_search ()
#8  0x004c8568 in _bt_first ()
#9  0x004c52c7 in btgettuple ()
#10 0x004bf936 in index_getnext_tid ()
#11 0x004bfb43 in index_getnext ()
#12 0x004beeae in systable_getnext ()
#13 0x007fa074 in ScanPgRelation ()
#14 0x007fe371 in RelationClearRelation ()
#15 0x007fe84b in RelationCacheInvalidateEntry ()
#16 0x007f7e15 in LocalExecuteInvalidationMessage ()
#17 0x006f3b38 in ReceiveSharedInvalidMessages ()
#18 0x006f6eb5 in LockRelationOid ()
#19 0x004ae495 in relation_open ()
#20 0x004bf336 in index_open ()
#21 0x004bee3c in systable_beginscan ()
#22 0x007fa069 in ScanPgRelation ()
#23 0x007fce84 in RelationBuildDesc ()
#24 0x007fe6f8 in RelationIdGetRelation ()
#25 0x004ae463 in relation_open ()
#26 0x004ae6a6 in heap_open ()
#27 0x00819237 in InitPostgres ()
#28 0x0070e2e9 in PostgresMain ()
#29 0x00477f36 in ServerLoop ()
#30 0x006a5e3e in PostmasterMain ()
#31 0x00478a08 in main ()

From autovacuum running REL_10_4 I've now seen (since trying to tickle the
problem yesterday):

#1  0x7fbfb5b9d01a in __GI_abort () at abort.c:89
#2  0x00708489 in mdread ()
#3  0x006e1c48 in ReadBuffer_common ()
#4  0x006e2591 in ReadBufferExtended ()
#5  0x004c2369 in _bt_getbuf ()
#6  0x004c2810 in _bt_getroot ()
#7  0x004c6f5b in _bt_search ()
#8  0x004c7b28 in _bt_first ()
#9  0x004c4887 in btgettuple ()
#10 0x004bf00e in index_getnext_tid ()
#11 0x004bf1f3 in index_getnext ()
#12 0x004be679 in systable_getnext ()
#13 0x007f9084 in ScanPgRelation ()
#14 0x007fbe54 in RelationBuildDesc ()
#15 0x007fcc86 in RelationClearRelation ()
#16 0x007fd7bb in RelationCacheInvalidateEntry ()
#17 0x007f6e25 in LocalExecuteInvalidationMessage ()
#18 0x006f2bb8 in ReceiveSharedInvalidMessages ()
#19 0x006f5f35 in LockRelationOid ()
#20 0x004ade05 in relation_open ()
#21 0x004bea96 in index_open ()
#22 0x004be606 in systable_beginscan ()
#23 0x007f9079 in ScanPgRelation ()
#24 0x007fbe54 in RelationBuildDesc ()
#25 0x007fd668 in RelationIdGetRelation ()
#26 0x004addd3 in relation_open ()
#27 0x004bea96 in index_open ()
#28 0x004be606 in systable_beginscan ()
#29 0x007fef60 in RelationGetIndexList ()
#30 0x005dba6c in vac_open_indexes ()
#31 0x005dc652 in lazy_vacuum_rel ()
#32 0x005da7b4 in vacuum_rel ()
#33 0x005db661 in vacuum ()
#34 0x004770db in do_autovacuum ()
#35 0x00696242 in AutoVacWorkerMain.isra.6 ()
#36 0x00696a49 in StartAutoVacWorker ()

Re: Use C99 designated initializers for some structs

2018-08-30 Thread Robert Haas
On Wed, Aug 29, 2018 at 6:51 PM, Tom Lane  wrote:
> I agree that assuming that they're physically zeroes is OK from a
> portability standpoint, because we'd have a whole lot of other issues
> if they weren't.  But I have a different point to make, which is that
> it's fairly standard practice for us to initialize all fields of a struct
> explicitly, even when setting them to zero/false/NULL.  I don't think we
> should walk away from that practice just because C99 offers a shiny new
> syntax for doing so.

I hate that style with a fiery passion that cannot be quenched.  I
think you're basically the only one who has routinely done it like
this, and so it results in uselessly wasting a lot of mental energy
trying to decide whether a new member ought to be handled like the
ones Tom added or the ones somebody else added.  It makes initializers
that could be quite short and compact long and hard to read.  In
InitProcess(), it's entirely unclear which of those initializations
are merely insurance and which ones are actually doing something
useful, and there and in other places it's quite hard to tell whether
we might be wasting cycles (or instruction cache space) in sufficient
quantities to care about.  If it were up to me, which it isn't, we'd
remove every useless initialize-to-zero statement in the entire code
base and then have a party to celebrate their demise.  The party would
include burning the removed code in effigy.  :-)

All that being said, this is MOSTLY a coding style issue and it comes
down to a matter of preference, so in my opinion, it doesn't really
matter that much in the end what we decide to do.  Still, my
preference would definitely be for nuking it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Startup cost of sequential scan

2018-08-30 Thread Robert Haas
On Thu, Aug 30, 2018 at 10:04 AM, Tom Lane  wrote:
> Alexander Korotkov  writes:
>> But I think there is another issue in sequential scan cost.  We have
>> zero startup cost for sequential scan.  But why?
>
> Because it's what the mental model of startup cost says it should be.

Whose mental model?  I guess the Tom Lane mind is the canonical one
for this project, but I'm not sure that it entirely agrees with mine.
IIRC, it was previously proposed that we ought to charge
random_page_cost for the first block of a sequential scan, because at
present the cost of fetching 1 block differs depending on whether we
are fetching it from a heap or an index, which seems unprincipled.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Extra word in src/backend/optimizer/README

2018-08-30 Thread Magnus Hagander
On Thu, Aug 30, 2018 at 1:27 PM, Etsuro Fujita 
wrote:

> Hi,
>
> Here is a small patch to remove $SUBJECT: s/has contains/contains/
>

Definitely looks correct. A good first test to verify your own commit/push
privileges, perhaps?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Stored procedures and out parameters

2018-08-30 Thread Robert Haas
On Tue, Aug 28, 2018 at 6:30 AM, Peter Eisentraut
 wrote:
> Output parameter handling is not compatible between function calls and
> procedure calls.  Our implementation of output parameters in functions
> is an extension of the SQL standard, and while it's been useful, it's
> nonstandard, and I would like to make the output parameter handling in
> CALL compatible with the SQL standard.  For example, if you have a
> function f1(IN a int, OUT b int), you would call it as SELECT f1(x)
> and the "b" would somehow be the return value.  But a procedure call
> would be CALL p1(x, y), where x and y could be, say, PL/pgSQL
> variables.  So if you want to allow invoking functions using the CALL
> statement, you're going to have a hard time defining semantics that
> are not wildly confusing.  Moreover, if the intention is to switch the
> JDBC driver or similar drivers to use the CALL command always from
> PG11 on, then the meaning of {call f1(a, b)} will have changed and a
> lot of things will break in dangerous ways.

The semantics you've chosen for procedures are more like Oracle than
the existing function semantics, which, as I can attest from my work
experience, can be very useful for users looking to migrate.  Worth
noting, however, is Oracle also has those semantics for function
calls.  So what you've ended up creating here is a situation where
procedures behave more or less like they do in Oracle and the SQL
standard, but functions behave the way they historically have in
PostgreSQL.  That's kind of a weird incompatibility, and I think that
incompatibility is a significant part of what people are complaining
about.

In other words, being more like the SQL standard is probably good, but
breaking compatibility is bad.  You've technically avoided a
*backward* compatibility break by deciding that functions and
procedures can work differently from each other, but that just moves
the problem around.  Now instead of being unhappy that existing code
is broken, people are unhappy that the new thing doesn't work like the
existing thing.  That may be the lesser of evils, but it's still
pretty evil.  People are not being unreasonable to want to call some
code stored on the server without having to worry about whether that
code is in a box labelled PROCEDURE or a box labelled FUNCTION.

This probably should have been discussed in more detail before this
got committed, but I guess that's water under the bridge at this
point.  Nevertheless, I predict that this is going to be an ongoing
source of pain for a long time to come.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Magnus Hagander
On Thu, Aug 30, 2018 at 4:39 PM, Tom Lane  wrote:

> Fabien COELHO  writes:
> >> If I add -fno-strict-aliasing to $CFLAGS, the problem goes away.
> >> Is this something to worry about, or just pilot error cause I am not
> >> using the same $CFLAGS as for the rest of the build? I originally
> >> noticed this problem with my external fork of pg_verify_checksums.
>
> > It looks like the code is using aliasing where the standard says it
> should
> > not, which breaks compiler optimization, and the added options tells the
> > compiler to not assume that the code conforms to the standard...
>
> Actually, this code is just plain broken:
>
> char        buf[BLCKSZ];
> PageHeader  header = (PageHeader) buf;
>
> There is no guarantee that a local char[] array is aligned on anything
> stronger than a byte boundary, and if it isn't, word-sized accesses to
> *header are going to fail on machines with strict alignment rules.
> I suppose that that is really what Michael's compiler is moaning about.
>
> I rather suspect that this hasn't been tested on anything but Intel
> hardware, which is famously misalignment-tolerant.  The lack of any
> apparent regression test infrastructure for it isn't leaving a warm
> feeling about how much the buildfarm is testing it.
>

I know I certainly didn't test it on non-intel.

We did have that in the online verify checksum patch, but it came out as
part of the revert of that part.

Given that we also recently found bugs in the actual hash index
implementation that would've gotten caught by this, perhaps we should add a
TAP test for this.

Should we make it a separate test in pg_verify_checksums, or should we
piggyback on the pg_basebackup tests (which AFAICT is the only ones that
create a cluster with checksums enabled at all, and thus is the only
codepath that actually uses the backend checksum code at all, which I think
is an even worse thing to have no tests of)


> (The right fix, of course, is to malloc the work buffer rather than
> put it on the stack.)


So if I get you  right, you're saying the attached patch should be all
that's needed?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index a941236563..965ec70557 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -78,7 +78,7 @@ skipfile(char *fn)
 static void
 scan_file(char *fn, int segmentno)
 {
-	char		buf[BLCKSZ];
+	char	   *buf = palloc(BLCKSZ);
 	PageHeader	header = (PageHeader) buf;
 	int			f;
 	int			blockno;
@@ -127,6 +127,8 @@ scan_file(char *fn, int segmentno)
 _("%s: checksums verified in file \"%s\"\n"), progname, fn);
 
 	close(f);
+
+	pfree(buf);
 }
 
 static void


Re: Proposal for disk quota feature

2018-08-30 Thread Pavel Stehule
2018-08-30 16:22 GMT+02:00 Chapman Flack :

> On 08/30/2018 09:57 AM, Hubert Zhang wrote:
>
> > 2. Keep one worker process for each database, but use a parent/global
> > quota worker process to manage the lifecycle of the database-level worker
> > processes.  It could handle newly created databases (avoiding a restart)
> > and save resources when a database is not used.  But this needs the worker
> > processes to become hierarchical: the postmaster becomes the grandfather
> > of the database-level worker processes in this case.
>
> I am using background workers this way in 9.5 at $work.
>
> In my case, one worker lives forever, wakes up on a set period, and
> starts a short-lived worker for every database, waiting for each
> one before starting the next.
>
> It was straightforward to implement. Looking back over the code,
> I see the global worker assigns its own PID to worker.bgw_notify_pid
> of each of its children, and also obtains a handle for each child
> from RegisterDynamicBackgroundWorker().
>
> I imagine the global quota worker would prefer to start workers
> for every database and then just wait for notifications from any
> of them, but that seems equally straightforward at first glance.
>
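(For reference, a minimal sketch of the launch-and-wait pattern described
above; the library name "my_quota_worker" and entry point
"my_quota_worker_main" are made up for illustration, and this is not the code
actually running at $work.)

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* sketch: launch one short-lived worker for a database and wait for it */
static void
launch_and_wait_for_db_worker(Oid dboid)
{
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_name, BGW_MAXLEN, "quota worker for db %u", dboid);
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_quota_worker");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_quota_worker_main");
	worker.bgw_main_arg = ObjectIdGetDatum(dboid);
	worker.bgw_notify_pid = MyProcPid;	/* lets this process wait on the child */

	if (RegisterDynamicBackgroundWorker(&worker, &handle))
		(void) WaitForBackgroundWorkerShutdown(handle);
}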

There are servers with thousands of databases.  One worker per database is not
a good idea.

It should share ideas, and code, with the autovacuum process.

I am not sure how effective an implementation based on bgworkers can be.  On
servers with a large set of databases and a large set of tables it can identify
a too-big table too late.

Isn't it better to implement some quotas at the storage level?

Regards

Pavel



> -Chap
>
>


Re: Online verification of checksums

2018-08-30 Thread Magnus Hagander
On Thu, Aug 30, 2018 at 8:06 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 26/07/2018 13:59, Michael Banck wrote:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
>
> Why not provide this functionality as a server function or command.
> Then you can access blocks with proper locks and don't have to do this
> rather ad hoc retry logic on concurrent access.
>

I think it would make sense to provide this functionality in the "checksum
worker" infrastruture suggested in the online checksum enabling patch. But
I think being able to run it from the outside would also be useful,
particularly when it's this simple.

But why do we need a sleep in it? AFAICT this is basically the same code
that we have in basebackup.c, and that one does not need the sleep?
Certainly 500ms would be very long since we're just protecting against a
torn page, but the comment is wrong I think, and we're actually sleeping
0.5ms?
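(On the units point: pg_usleep() takes its argument in microseconds, so if the
retry loop passes a literal 500 it sleeps for half a millisecond.  A trivial
stand-alone illustration, using plain usleep() as a stand-in since this toy is
not linked with the server:)

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	long		delay_us = 500;	/* the literal value in question */

	/* 500 microseconds is 0.5 ms, not 500 ms */
	printf("sleeping for %.3f ms\n", delay_us / 1000.0);
	usleep(delay_us);
	return 0;
}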

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 19:34 GMT+02:00 Michael Paquier :
>> I have been struggling for a couple of hours to get a deterministic test
>> case out of my pocket, and I did not get one as you would need to get
>> the bgwriter to flush a page before crash recovery finishes, we could do
> 
> In my case the active standby server has crashed, it wasn't in the
> crash recovery mode.

That's what I meant, a standby crashed and then restarted, doing crash
recovery before moving on with archive recovery once it was done with
all its local WAL.

> Minimum recovery ending location is AB3/4A1B3118, but at the same time
> I managed to find pages from 00050AB30053 on disk (at
> least in the index files). That could only mean that bgwriter was
> flushing dirty pages, but pg_control wasn't properly updated and it
> happened not during recovery after hardware crash, but while the
> postgres was running before the hardware crash.

Exactly, that would explain the incorrect reference.

> The only possible way to recover such a standby is to cut off all possible
> connections and let it replay all WAL files it managed to write to
> disk before the first crash.

Yeah...  I am going to apply the patch after another lookup, that will
fix the problem moving forward.  Thanks for checking by the way.
--
Michael


signature.asc
Description: PGP signature


Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Alexander Kukushkin
2018-08-30 19:34 GMT+02:00 Michael Paquier :
> I have been struggling for a couple of hours to get a deterministic test
> case out of my pocket, and I did not get one as you would need to get
> the bgwriter to flush a page before crash recovery finishes, we could do

In my case the active standby server has crashed, it wasn't in the
crash recovery mode.

> the time, particularly for slow machines.  Anyway, I did more code
> review and I think that I found another issue with XLogNeedsFlush(),
> which could enforce updateMinRecoveryPoint to false if called before
> XLogFlush during crash recovery from another process than the startup
> process, so if it got called before XLogFlush() we'd still have the same
> issue for a process doing both operations.  Hence, I have come up with

At least XLogNeedsFlush() is called just from a couple of places and
doesn't break bgwriter, but anyway, thanks for finding it.

> the attached, which actually brings back the code to what it was before
> 8d68ee6 for those routines, except that we have fast-exit paths for the
> startup process so that it is still able to replay all WAL available and
> avoid page reference issues post-promotion, deciding when to update its
> own copy of minRecoveryPoint when it finishes crash recovery.  This also
> saves from a couple of locks on the control file from the startup
> process.

Sound good.

>
> If you apply the patch and try it on your standby, are you able to get
> things up and working?

Nope, minRecoveryPoint in pg_control is still wrong, and therefore
startup still aborts at the same place if there are connections open.
I think there is no way to fix it other than to let it replay a sufficient
amount of WAL without open connections.
Just judging from the timestamps of the WAL files in pg_xlog it is
obvious that a few moments before the hardware crashed postgres was
replaying 00050AB30057, because the next file has smaller
mtime (it was recycled).
-rw---  1 akukushkin akukushkin 16777216 Aug 22 07:22
00050AB30057
-rw---  1 akukushkin akukushkin 16777216 Aug 22 07:01
00050AB30058

Minimum recovery ending location is AB3/4A1B3118, but at the same time
I managed to find pages from 00050AB30053 on disk (at
least in the index files). That could only mean that bgwriter was
flushing dirty pages, but pg_control wasn't properly updated and it
happened not during recovery after hardware crash, but while the
postgres was running before the hardware crash.

The only possible way to recover such a standby is to cut off all possible
connections and let it replay all WAL files it managed to write to
disk before the first crash.

Regards,
--
Alexander Kukushkin



Re: Bug in slot.c and are replication slots ever used at Window?

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 11:00:43AM +0300, Konstantin Knizhnik wrote:
> So if "isdir" is true (and it is true in this case), it sets O_RDONLY
> flag.  Then fsync_fname successfully opens slot file in readonly mode
> and calls fsync() which at windows is substituted with _commit() which
> in turn is wrapper for FlushFileBuffers. 

It seems to me that you are right here, "path" points to
pg_replslot/$SLOTNAME/state which is a file so the fsync is incorrect.
I am not sure that we'd want to publish fsync_parent_path out of fd.c
though, so perhaps we could just save the slot path in a different
variable and use it?
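For what it's worth, one possible shape of that suggestion, as a rough sketch
only (the helper name is invented and this is not the committed fix): keep the
directory path in its own variable, fsync the state file as a plain file, and
then fsync its parent directory.

#include "postgres.h"
#include "replication/slot.h"
#include "storage/fd.h"

/* sketch: fsync a slot's state file, then the directory containing it */
static void
fsync_slot_state(ReplicationSlot *slot)
{
	char		dir[MAXPGPATH];
	char		path[MAXPGPATH];

	snprintf(dir, sizeof(dir), "pg_replslot/%s", NameStr(slot->data.name));
	snprintf(path, sizeof(path), "%s/state", dir);

	fsync_fname(path, false);	/* the state file itself is not a directory */
	fsync_fname(dir, true);		/* then its parent directory */
}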
--
Michael


signature.asc
Description: PGP signature


Re: Online verification of checksums

2018-08-30 Thread Peter Eisentraut
On 26/07/2018 13:59, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command.
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: B-tree cache prefetches

2018-08-30 Thread Peter Geoghegan
On Thu, Aug 30, 2018 at 10:53 AM, Andrey Borodin  wrote:
> The idea is pretty simple - our searches are cache-erasing anyway, so let's
> try to get at least some of that back by prefetching the possible ways of the
> binary search.  And it seems to me that on a simple query
>> insert into x select (random()*100)::int from generate_series(1,1e7);
> it brings something like a 2-4% performance improvement on my laptop.
>
> Is there a reason why we do not use __builtin_prefetch?  Has anyone tried to
> use cache prefetching?

I once wrote a patch that used __builtin_prefetch() when fetching
tuples from a tuplesort. It worked reasonably well on my laptop, but
didn't seem to do much on another machine with another
microarchitecture (presumably the server with the alternative
microarchitecture had superior hardware prefetching). The conclusion
was that it wasn't really worth pursuing.

I'm not dismissing your idea. I'm just pointing out that the burden of
proving that explicit prefetching is a good idea is rather
significant. We especially want to avoid something that needs to be
reassessed every couple of years.

-- 
Peter Geoghegan



Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 02:28:20PM +0300, Alexander Korotkov wrote:
> In general this looks good to me.  Personally I am getting tired of cube.out
> and cube_2.out.  They differ in only a few checks involving
> scientific notation.  But all the patches touching cube regression
> tests should update both cube.out and cube_2.out.  I propose to split
> the scientific notation checks into a separate test.

+1.
--
Michael


signature.asc
Description: PGP signature


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 10:39:26AM -0400, Tom Lane wrote:
> I rather suspect that this hasn't been tested on anything but Intel
> hardware, which is famously misalignment-tolerant.  The lack of any
> apparent regression test infrastructure for it isn't leaving a warm
> feeling about how much the buildfarm is testing it.
> 
> (The right fix, of course, is to malloc the work buffer rather than
> put it on the stack.)

pg_upgrade/file.c is careful about that (5afcd2a), and has a comment on
the matter, as does pg_standby.c.

Now, grepping around for "BLCKSZ]", some garbage may have accumulated?
- entrySplitPage in ginentrypage.c
- rewind_copy_file_range in file.c
- _hash_alloc_buckets in hashpage.c
- pg_prewarm.c
- blinsert.c
- pg_waldump.c
- logtape.c
--
Michael


signature.asc
Description: PGP signature


B-tree cache prefetches

2018-08-30 Thread Andrey Borodin
Hi hackers!

I've been at the database conference and here everyone is talking about cache 
prefetches.

I've tried simple hack

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..ffddf553aa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,
 
/* We have low <= mid < high, so mid points at a real slot */
 
+   OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+   if (x < high)
+   __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+   x = low + ((mid - low) / 2);
+   if (x > low)
+   __builtin_prefetch (PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+
result = _bt_compare(rel, keysz, scankey, page, mid);
 
if (result >= cmpval)

The idea is pretty simple - our searches are cache-erasing anyway, so let's
try to get at least some of that back by prefetching the possible ways of the
binary search.  And it seems to me that on a simple query
> insert into x select (random()*100)::int from generate_series(1,1e7);
it brings something like a 2-4% performance improvement on my laptop.

Is there a reason why we do not use __builtin_prefetch?  Has anyone tried to
use cache prefetching?
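For what it's worth, a made-up sketch of what a minimal portability wrapper
around __builtin_prefetch could look like; the macro name is invented and no
such macro exists in the tree today:

#include <stdio.h>

#if defined(__GNUC__) || defined(__clang__)
#define my_prefetch_read(addr)	__builtin_prefetch((addr), 0, 2)	/* read, moderate locality */
#else
#define my_prefetch_read(addr)	((void) (addr))	/* no-op elsewhere */
#endif

int
main(void)
{
	int			values[1024];

	my_prefetch_read(&values[512]);	/* a hint only, no visible effect */
	values[512] = 42;
	printf("%d\n", values[512]);
	return 0;
}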

Best regards, Andrey Borodin.


Re: [HACKERS] Proposal to add work_mem option to postgres_fdw module

2018-08-30 Thread Peter Eisentraut
On 28/08/2018 05:55, Shinoda, Noriyoshi (PN Japan GCS Delivery) wrote:
>>> I like the direction of your thinking, but it seems to me that this
>>> would cause a problem if you want to set search_path=foo,bar.
>> ... OPTIONS ( host 'remhost1', port '5433', dbname 'demodb', 
>> option='option1=foo', option='option2=bar' );
> I do not want to allow postgres_fdw to set "all" GUCs, because I think it is 
> very difficult to check the validity of all input values.
> So, I attached a modified patch so that we can easily add GUCs that can be 
> set through the postgres_fdw module.
> If you wish to add more GUC options to the module, add values to the 
> non_libpq_options[] array of the InitPgFdwOptions function, and add the 
> validator code for the GUC in the postgres_fdw_validator function.

We already have a method for libpq applications to pass GUC settings via
connection parameters.  And postgres_fdw supports passing libpq
connection parameters as server options.  So why doesn't this work here?

The reason is that postgres_fdw filters out connection settings that are
marked debug ("D"), and the "options" keyword is marked as such.  I
think this is wrong.  Just remove that designation and then this will
work.  (Perhaps filtering out the debug options is also wrong, but I can
see an argument for it.)
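As a small aside, that "D" classification is visible from plain libpq; the
sketch below (compile with -lpq) just lists which connection keywords are
currently flagged as debug options.  It is only an illustration, not part of
any patch in this thread.

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
	PQconninfoOption *opts = PQconndefaults();
	PQconninfoOption *opt;

	for (opt = opts; opt && opt->keyword; opt++)
	{
		/* dispchar "D" marks a debug option, e.g. the "options" keyword */
		if (opt->dispchar && strcmp(opt->dispchar, "D") == 0)
			printf("debug option: %s\n", opt->keyword);
	}

	PQconninfoFree(opts);
	return 0;
}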

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 04:03:43PM +0200, Alexander Kukushkin wrote:
> 2018-08-30 15:39 GMT+02:00 Michael Paquier :
>> Does it take care of the problem?
> 
> Yep, with the patch applied bgwriter acts as expected!

Thanks for double-checking.

I have been struggling for a couple of hours to get a deterministic test
case out of my pocket, and I did not get one as you would need to get
the bgwriter to flush a page before crash recovery finishes, we could do
that easily with a dedicated bgworker, but pg_ctl start now expects the
standby to finish recovery before returning, which makes it harder to trigger all
the time, particularly for slow machines.  Anyway, I did more code
review and I think that I found another issue with XLogNeedsFlush(),
which could enforce updateMinRecoveryPoint to false if called before
XLogFlush during crash recovery from another process than the startup
process, so if it got called before XLogFlush() we'd still have the same
issue for a process doing both operations.  Hence, I have come up with
the attached, which actually brings back the code to what it was before
8d68ee6 for those routines, except that we have fast-exit paths for the
startup process so that it is still able to replay all WAL available and
avoid page reference issues post-promotion, deciding when to update its
own copy of minRecoveryPoint when it finishes crash recovery.  This also
saves from a couple of locks on the control file from the startup
process.

If you apply the patch and try it on your standby, are you able to get
things up and working?
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..65db2e48d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery.  We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too. The
 	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-	 * updated until crash recovery finishes.
+	 * updated until crash recovery finishes.  We only do this for the startup
+	 * process as it should not update its own reference of minRecoveryPoint
+	 * until it has finished crash recovery to make sure that all WAL
+	 * available is replayed in this case.  This also saves from extra locks
+	 * taken on the control file from the startup process.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 	{
 		updateMinRecoveryPoint = false;
 		return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	minRecoveryPoint = ControlFile->minRecoveryPoint;
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-	if (force || minRecoveryPoint < lsn)
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+	else if (force || minRecoveryPoint < lsn)
 	{
 		XLogRecPtr	newMinRecoveryPoint;
 		TimeLineID	newMinRecoveryPointTLI;
@@ -3127,9 +3133,11 @@ XLogNeedsFlush(XLogRecPtr record)
 		 * An invalid minRecoveryPoint means that we need to recover all the
 		 * WAL, i.e., we're doing crash recovery.  We never modify the control
 		 * file's value in that case, so we can short-circuit future checks
-		 * here too.
+		 * here too.  This triggers a quick exit path for the startup process,
+		 * which cannot update its local copy of minRecoveryPoint as long as
+		 * it has not replayed all WAL available when doing crash recovery.
 		 */
-		if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 			updateMinRecoveryPoint = false;
 
 		/* Quick exit if already known to be updated or cannot be updated */
@@ -3146,8 +3154,19 @@ XLogNeedsFlush(XLogRecPtr record)
 		minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 		LWLockRelease(ControlFileLock);
 
+		/*
+		 * Check minRecoveryPoint for any other process than the startup
+		 * process doing crash recovery, which should not update the control
+		 * file value if crash recovery is still running.
+		 */
+		if (XLogRecPtrIsInvalid(minRecoveryPoint))
+			updateMinRecoveryPoint = false;
+
 		/* check again */
-		return record > minRecoveryPoint;
+		if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
+			return false;
+		else
+			return true;
 	}
 
 	/* Quick exit if already known flushed */


signature.asc
Description: PGP signature


Re: TupleTableSlot abstraction

2018-08-30 Thread Alvaro Herrera
Man, how I dislike patches in tarballs.

0002 says:

+ * shouldFree is set 'true' since a tuple stored on a disk page should not be
+ * pfree'd.

Surely you mean 'false' :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add a semicolon to query related to search_path

2018-08-30 Thread Peter Eisentraut
On 17/08/2018 05:32, Tatsuro Yamada wrote:
> Hi Robert,
> 
> On 2018/08/17 4:32, Robert Haas wrote:
>> On Thu, Aug 16, 2018 at 1:20 AM, Tatsuro Yamada
>>  wrote:
>>> As you can see, queries with and without a semicolon are mixed, so it is
>>> hard to see where each query ends. This is not beautiful or user-friendly,
>>> don't you think?
>>
>> I agree with you.
> 
> Thanks!
> This fix is not a noteworthy new feature, but I believe it will be useful for
> users to improve readability.  I'll add the patch to commitfest.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Use C99 designated initializers for some structs

2018-08-30 Thread Alvaro Herrera
On 2018-Aug-30, Mark Dilger wrote:

> static struct config_bool ConfigureNamesBool[] =
> {
> {
> {"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
> gettext_noop("Enables the planner's use of sequential-scan 
> plans."),
> NULL
> },
> _seqscan,
> true,
> NULL, NULL, NULL
> },

Personally, I dislike this form -- it's very opaque and I have to refer
to the struct definition each time I want to add a new member, to make
sure I'm assigning the right thing.  I welcome designated initializers
in this case even though it becomes more verbose.  I don't think
explicitly initializing to NULLs is sensible in this case; let's just
omit those fields.

> What should the general rule be for initializing arrays of structs such as 
> these?

I don't know what a general rule would be.  Maybe we can try hand-
inspecting a few cases, and come up with a general rule once we acquire
sufficient experience.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Tab completion for ALTER DATABASE … SET TABLESPACE

2018-08-30 Thread Dagfinn Ilmari Mannsåker
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:

> Hi hackers,
>
> I just noticed that psql's tab completion for ALTER TABLE … SET
> TABLESPACE was treating it as any other configuration parameter and
> completing with FROM DEFAULT or TO after it, instead of a list of
> tablespaces.

And just after hitting send, I noticed I'd typed ALTER TABLE instead of
ALTER DATABASE, including in the commit message :(

Fixed patch attached.

- ilmari 
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen

>From e5ff67c3284dd456fe80ed8a8927e12665d68fea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Thu, 30 Aug 2018 17:36:16 +0100
Subject: [PATCH] =?UTF-8?q?Fix=20tab=20completion=20for=20ALTER=20DATABASE?=
 =?UTF-8?q?=20=E2=80=A6=20SET=20TABLESPACE?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

It was being treated as any other configuration parameter.
---
 src/bin/psql/tab-complete.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bb696f8ee9..f107ffb027 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1783,6 +1783,10 @@ psql_completion(const char *text, int start, int end)
 			"IS_TEMPLATE", "ALLOW_CONNECTIONS",
 			"CONNECTION LIMIT");
 
+	/* ALTER DATABASE <name> SET TABLESPACE */
+	else if (Matches5("ALTER", "DATABASE", MatchAny, "SET", "TABLESPACE"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
+
 	/* ALTER EVENT TRIGGER */
 	else if (Matches3("ALTER", "EVENT", "TRIGGER"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_event_triggers);
-- 
2.18.0



Re: Hint to set owner for tablespace directory

2018-08-30 Thread Peter Eisentraut
On 23/08/2018 13:24, Maksim Milyutin wrote:
> I have noticed that novice users get stuck trying to create a tablespace 
> over a directory whose owner is not the system postgres user. They 
> observe the message "could not set permissions on directory ...: 
> permission denied".
> 
> I want to add a patch that prints a hint to set the required owner for the 
> tablespace directory if this is the cause of the problem (*errno == 
> EPERM* after calling *chmod*).

I think the hint is backwards.  When you don't have permission to chmod
the tablespace directory, you probably want to fix the permissions on
the tablespace directory or its parent.  But the hint says to run the
database server as a different user.  That's a bit unusual.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)

2018-08-30 Thread Peter Geoghegan
On Wed, Aug 29, 2018 at 11:28 PM, Simon Riggs  wrote:
> If you include heap TID as a column the suffix will be unique and cannot
> benefit from suffix truncation.

Right. During a page split, we must generate a new high key that's
less than or equal to all items on the left side (where the new high
key goes), and strictly greater than all items on the right side
(where the new high key values will be used as a downlink to). We'll
only include heap TID explicitly in the new high key when there is no
other way to get a value that is strictly less than all tuples on the
right side, and only at the leaf level (we don't truncate during
internal page splits, and the rules automatically work at non-leaf
levels because values come from the leaf level).

The original L&Y invariant that I propose to restore requires that
downlinks be strictly less than all items on the child page they point
to. Truncated attributes logically retain the value minus infinity,
which is often what makes the true L&Y invariant hold.

> It would be better to avoid including the heap TID as a column, since it is
> already there. Or the other way around, don't include it as payload if it is
> has to be a column.

The high level principle to bear in mind is that pivot tuples (high
keys at the leaf level, and all internal page tuples) only need to
correctly guide searches. There is no obligation for pivot tuples to
be able to retrieve the original values, for something like an
index-only scan. We could even make their representation into a flat
binary string someday -- this is called "key normalization" [1].
Likewise, pivot tuples are good candidates for automatic prefix
compression because there is scarcely any need to decompress them (not
working on that for v12, though).

Nothing changes about the direct representation of 99%+ of tuples
(those at the leaf level that point to the heap -- non-pivot tuples).
We need an alternative representation for heap TID in pivot tuples for
historical reasons (actually, I think what we have here would even
work best in a green field situation).

Not including a special heap TID representation in pivots, but
including everything else is what I've called "logical truncation" on
this thread. If you compare a serial primary key case with my patch
and without, you typically see no difference in the size of the index.
It is possible to end up with a larger index overall, but that should
be extremely rare, and only affect very small indexes.

> Alternatively, include only the heap block number. That would make it
> non-unique, but never more than 240 duplicates. So it would allow suffix
> truncation, and yet still avoid the multi-page split effect.

I'm not sure what you mean by the multi-page split effect -- all
internal pages are split recursively, during a leaf page split. In
general, free space management needs to remain as simple as possible,
and splitting the same page twice in one go is a big no-no.

I conservatively enforce that the 1/3 of a page restriction has a bit
of extra slop for an extra heap TID. There is one choke point that
everything takes place -- during leaf page splits. Nothing else
changes, apart from index scan logic, which may have to search with an
extra heap TID column, plus one or two other small things along those
lines.

[1] https://wiki.postgresql.org/wiki/Key_normalization
-- 
Peter Geoghegan



Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-30 Thread Alexander Korotkov
On Thu, Aug 30, 2018 at 4:59 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > I'm going to check this patchset on Windows and commit if no objections.
>
> These error messages do not conform to our message style guidelines:
> you've copied an errdetail message as primary error message, but the
> rules are different for that (no complete sentences, no initial cap,
> no period).
>
> Using ERRCODE_ARRAY_ELEMENT_ERROR seems pretty random as well --- so far
> as I can see, that's generally used for cases like "this array has the
> wrong type of data elements".  Perhaps ERRCODE_PROGRAM_LIMIT_EXCEEDED
> would be the best choice.

Thank you for catching this!  I'll be more careful about error messages.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
commit 173eb698ffc0dcc69cb63a7a8f5fa9965acc0e8e
Author: Alexander Korotkov 
Date:   Thu Aug 30 14:09:25 2018 +0300

Split contrib/cube platform-depended checks into separate test

    We're currently maintaining two outputs for the cube regression test.  But that
    appears to be unsuitable, because these outputs differ in only a few checks
    involving scientific notation.  So, split the checks involving scientific
    notation into a separate test, making contrib/cube easier to maintain.

diff --git a/contrib/cube/Makefile b/contrib/cube/Makefile
index a679ac626ee..5e7b524dc22 100644
--- a/contrib/cube/Makefile
+++ b/contrib/cube/Makefile
@@ -11,7 +11,7 @@ PGFILEDESC = "cube - multidimensional cube data type"
 
 HEADERS = cubedata.h
 
-REGRESS = cube
+REGRESS = cube cube_sci
 
 EXTRA_CLEAN = y.tab.c y.tab.h
 
diff --git a/contrib/cube/expected/cube.out b/contrib/cube/expected/cube.out
index 6378db3004e..ac5f0bf7a8d 100644
--- a/contrib/cube/expected/cube.out
+++ b/contrib/cube/expected/cube.out
@@ -62,90 +62,6 @@ SELECT '-1.0'::cube AS cube;
  (-1)
 (1 row)
 
-SELECT '1e27'::cube AS cube;
-  cube   
--
- (1e+27)
-(1 row)
-
-SELECT '-1e27'::cube AS cube;
-   cube   
---
- (-1e+27)
-(1 row)
-
-SELECT '1.0e27'::cube AS cube;
-  cube   
--
- (1e+27)
-(1 row)
-
-SELECT '-1.0e27'::cube AS cube;
-   cube   
---
- (-1e+27)
-(1 row)
-
-SELECT '1e+27'::cube AS cube;
-  cube   
--
- (1e+27)
-(1 row)
-
-SELECT '-1e+27'::cube AS cube;
-   cube   
---
- (-1e+27)
-(1 row)
-
-SELECT '1.0e+27'::cube AS cube;
-  cube   
--
- (1e+27)
-(1 row)
-
-SELECT '-1.0e+27'::cube AS cube;
-   cube   
---
- (-1e+27)
-(1 row)
-
-SELECT '1e-7'::cube AS cube;
-  cube   
--
- (1e-07)
-(1 row)
-
-SELECT '-1e-7'::cube AS cube;
-   cube   
---
- (-1e-07)
-(1 row)
-
-SELECT '1.0e-7'::cube AS cube;
-  cube   
--
- (1e-07)
-(1 row)
-
-SELECT '-1.0e-7'::cube AS cube;
-   cube   
---
- (-1e-07)
-(1 row)
-
-SELECT '1e-300'::cube AS cube;
-   cube   
---
- (1e-300)
-(1 row)
-
-SELECT '-1e-300'::cube AS cube;
-   cube

- (-1e-300)
-(1 row)
-
 SELECT 'infinity'::cube AS cube;
 cube
 
@@ -164,24 +80,6 @@ SELECT 'NaN'::cube AS cube;
  (NaN)
 (1 row)
 
-SELECT '1234567890123456'::cube AS cube;
-  cube  
-
- (1.23456789012346e+15)
-(1 row)
-
-SELECT '+1234567890123456'::cube AS cube;
-  cube  
-
- (1.23456789012346e+15)
-(1 row)
-
-SELECT '-1234567890123456'::cube AS cube;
-  cube   
--
- (-1.23456789012346e+15)
-(1 row)
-
 SELECT '.1234567890123456'::cube AS cube;
 cube 
 -
diff --git a/contrib/cube/expected/cube_2.out b/contrib/cube/expected/cube_2.out
deleted file mode 100644
index 75fe405c497..000
--- a/contrib/cube/expected/cube_2.out
+++ /dev/null
@@ -1,2006 +0,0 @@
---
---  Test cube datatype
---
-CREATE EXTENSION cube;
--- Check whether any of our opclasses fail amvalidate
-SELECT amname, opcname
-FROM pg_opclass opc LEFT JOIN pg_am am ON am.oid = opcmethod
-WHERE opc.oid >= 16384 AND NOT amvalidate(opc.oid);
- amname | opcname 
-+-
-(0 rows)
-
---
--- testing the input and output functions
---
--- Any number (a one-dimensional point)
-SELECT '1'::cube AS cube;
- cube 
---
- (1)
-(1 row)
-
-SELECT '-1'::cube AS cube;
- cube 
---
- (-1)
-(1 row)
-
-SELECT '1.'::cube AS cube;
- cube 
---
- (1)
-(1 row)
-
-SELECT '-1.'::cube AS cube;
- cube 
---
- (-1)
-(1 row)
-
-SELECT '.1'::cube AS cube;
- cube  

- (0.1)
-(1 row)
-
-SELECT '-.1'::cube AS cube;
-  cube  
-
- (-0.1)
-(1 row)
-
-SELECT '1.0'::cube AS cube;
- cube 
---
- (1)
-(1 row)
-
-SELECT '-1.0'::cube AS cube;
- cube 
---
- (-1)
-(1 row)
-
-SELECT '1e27'::cube AS cube;
-   cube   
---
- (1e+027)
-(1 row)
-
-SELECT '-1e27'::cube AS cube;
-   cube

- (-1e+027)
-(1 row)
-
-SELECT '1.0e27'::cube AS cube;
-   cube   
---
- (1e+027)
-(1 row)
-
-SELECT '-1.0e27'::cube AS cube;
-   cube

- 

Re: Use C99 designated initializers for some structs

2018-08-30 Thread Mark Dilger



> On Aug 29, 2018, at 1:51 PM, David Steele  wrote:
> 
> On 8/29/18 5:14 AM, Peter Eisentraut wrote:
>> On 29/08/2018 12:13, Peter Eisentraut wrote:
>>> Here is a patch to change some struct initializations to use C99-style
>>> designated initializers.  These are just a few particularly egregious
>>> cases that were hard to read and write, and error prone because of many
>>> similar adjacent types.
>>> 
>>> (The PL/Python changes currently don't compile with Python 3 because of
>>> the situation described in the parallel thread "PL/Python: Remove use of
>>> simple slicing API".)
>>> 
>>> Thoughts?
> 
> +1.  This is an incredible win for readability/maintainability.

I tried doing this perhaps a year ago, and there are a few files with arrays
of structs whose representations get much larger when you change the format
in this way.  For instance, in guc.c:

static struct config_bool ConfigureNamesBool[] =
{
	{
		{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
			gettext_noop("Enables the planner's use of sequential-scan plans."),
			NULL
		},
		&enable_seqscan,
		true,
		NULL, NULL, NULL
	},

becomes:

static struct config_bool ConfigureNamesBool[] =
{
	{
		.gen = {
			.name = "enable_seqscan",
			.context = PGC_USERSET,
			.group = QUERY_TUNING_METHOD,
			.short_desc = gettext_noop("Enables the planner's use of sequential-scan plans."),
			.long_desc = NULL
		},
		.variable = &enable_seqscan,
		.boot_val = true,
		.check_hook = NULL,
		.assign_hook = NULL,
		.show_hook = NULL
	},

and then gets much longer if you do as per Tom's general suggestion and make
each field explicit (though Tom might not apply that rule to this case):

static struct config_bool ConfigureNamesBool[] =
{
	{
		.gen = {
			.name = "enable_seqscan",
			.context = PGC_USERSET,
			.group = QUERY_TUNING_METHOD,
			.short_desc = gettext_noop("Enables the planner's use of sequential-scan plans."),
			.long_desc = NULL,
			.flags = 0,
			.vartype = 0,
			.status = 0,
			.source = 0,
			.reset_source = 0,
			.scontext = 0,
			.reset_scontext = 0,
			.stack = NULL,
			.extra = NULL,
			.sourcefile = NULL,
			.sourceline = 0
		},
		.variable = &enable_seqscan,
		.boot_val = true,
		.check_hook = NULL,
		.assign_hook = NULL,
		.show_hook = NULL,
		.reset_val = false,
		.reset_extra = NULL
	},

This sort of thing happens an awful lot in guc.c, but it comes up in syscache.c
also, and perhaps other places, though I don't recall any longer which other
files were like this.

What should the general rule be for initializing arrays of structs such as 
these?

mark




Re: Startup cost of sequential scan

2018-08-30 Thread Tom Lane
Alexander Korotkov  writes:
> I understand that startup cost is not "time to find the first row".
> But I think this example highlight not one but two issues.
> 1) Row count estimates for joins are wrong.

Yup.

> 2) Rows are assumed to be continuous while in reality they are
> discrete.

Where do you get that from?

The problem this example is highlighting is mostly just the bad
join size estimate, imo.  It's not at all clear that the planner
would still do the wrong thing if that were fixed.  In fact,
if I replace the contents of t2 with some other distribution,
such as
insert into t2 (select i from generate_series(1,100)i);
I get a much better join size estimate *and* a sane plan, even
in the LIMIT case.  So the problem here is just with the join
size estimate.

regards, tom lane



Re: Startup cost of sequential scan

2018-08-30 Thread Alexander Korotkov
On Thu, Aug 30, 2018 at 5:58 PM Tom Lane  wrote:
>
> Alexander Korotkov  writes:
> > On Thu, Aug 30, 2018 at 5:05 PM Tom Lane  wrote:
> >> Because it's what the mental model of startup cost says it should be.
>
> > From this model we make a conclusion that we're starting getting rows
> > from sequential scan sooner than from index scan.  And this conclusion
> > doesn't reflect reality.
>
> No, startup cost is not the "time to find the first row".  It's overhead
> paid before you even get to start examining rows.
>
> I'm disinclined to consider fundamental changes to our costing model
> on the basis of this example.  The fact that the rowcount estimates are
> so far off reality means that you're basically looking at "garbage in,
> garbage out" for the cost calculations --- and applying a small LIMIT
> just magnifies that.
>
> It'd be more useful to think first about how to make the selectivity
> estimates better; after that, we might or might not still think there's
> a costing issue.

I understand that startup cost is not "time to find the first row".
But I think this example highlights not one but two issues.
1) Row count estimates for joins are wrong.
2) Rows are assumed to be continuous while in reality they are
discrete.  So, if we reverse the assumptions made in the LIMIT clause
estimation, we may say that it's basically assuming that we need to
fetch only a fraction of a row from the sequential scan node.  And in
the case where we really fetch 101 rows in each join with t2, this
logic would still bring us to the bad plan.  And I'm not proposing to
rush into redesigning the planner to fix that.  I just think it's
probably something worth discussing.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Startup cost of sequential scan

2018-08-30 Thread Tom Lane
Andrew Gierth  writes:
> The model (assuming I understand it rightly) is that what we're actually
> tracking is a startup cost and a per-output-row cost, but for comparison
> purposes we actually store the rows and the computed total, rather than
> just the per-row cost:

> rows
> startup_cost
> total_cost = startup_cost + (rows * per_row_cost)

Right.  I tend to think of it differently: the cost to retrieve the
first K rows out of a total of N is
startup_cost + (total_cost - startup_cost) * K/N
but that comes out to the same thing as what Andrew said.

regards, tom lane



Re: Startup cost of sequential scan

2018-08-30 Thread Andrew Gierth
> "Konstantin" == Konstantin Knizhnik  writes:

 >> No, startup cost is not the "time to find the first row". It's
 >> overhead paid before you even get to start examining rows.

 Konstantin> But it seems to me that calculation of cost in LIMIT node
 Konstantin> contradicts this statement:

The model (assuming I understand it rightly) is that what we're actually
tracking is a startup cost and a per-output-row cost, but for comparison
purposes we actually store the rows and the computed total, rather than
just the per-row cost:

rows
startup_cost
total_cost = startup_cost + (rows * per_row_cost)

So what Limit is doing for the offset count is recovering the
subpath's per_row_cost from (total_cost - startup_cost)/rows, and then
scaling that by the number of rows in the offset (which are being
discarded), and adding that to the startup cost. So this is saying: the
startup cost for OFFSET N is the startup cost of the subplan, plus the
cost of fetching N rows from the subplan. (And after fetching those N
rows, we still haven't found the first row that we will actually
return.)

For LIMIT N, we instead replace the old total cost with a new one
calculated from the startup cost plus N times the subplan's per-row
cost.
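
To make that arithmetic concrete, here is a small sketch of the calculation
described above (illustrative only -- the struct and function names are made
up, not the actual planner code):

typedef struct
{
    double      rows;
    double      startup_cost;
    double      total_cost;
} PathCosts;                    /* hypothetical mirror of a path's cost fields */

static PathCosts
limit_costs(PathCosts sub, double offset_rows, double limit_rows)
{
    /* recover the subpath's per-output-row cost */
    double      per_row_cost = (sub.total_cost - sub.startup_cost) / sub.rows;
    PathCosts   result;

    /* OFFSET: rows skipped are overhead paid before the first returned row */
    result.startup_cost = sub.startup_cost + per_row_cost * offset_rows;

    /* LIMIT: the new total covers only the rows actually returned */
    result.rows = limit_rows;
    result.total_cost = result.startup_cost + per_row_cost * limit_rows;

    return result;
}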

-- 
Andrew (irc:RhodiumToad)



Re: Startup cost of sequential scan

2018-08-30 Thread Alexander Korotkov
On Thu, Aug 30, 2018 at 6:08 PM Konstantin Knizhnik
 wrote:
> On 30.08.2018 17:58, Tom Lane wrote:
> > Alexander Korotkov  writes:
> >> On Thu, Aug 30, 2018 at 5:05 PM Tom Lane  wrote:
> >>> Because it's what the mental model of startup cost says it should be.
> >>  From this model we make a conclusion that we're starting getting rows
> >> from sequential scan sooner than from index scan.  And this conclusion
> >> doesn't reflect reality.
> > No, startup cost is not the "time to find the first row".  It's overhead
> > paid before you even get to start examining rows.
> But it seems to me that calculation of cost in LIMIT node contradicts
> this statement:
>
>  pathnode->path.startup_cost +=
>  (subpath->total_cost - subpath->startup_cost)
>  * offset_rows / subpath->rows;

Why does it contradict?  It just assumes that skipping OFFSET rows is
preliminary work done before returning result rows...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Startup cost of sequential scan

2018-08-30 Thread Konstantin Knizhnik




On 30.08.2018 17:58, Tom Lane wrote:

Alexander Korotkov  writes:

On Thu, Aug 30, 2018 at 5:05 PM Tom Lane  wrote:

Because it's what the mental model of startup cost says it should be.

 From this model we make a conclusion that we're starting getting rows
from sequential scan sooner than from index scan.  And this conclusion
doesn't reflect reality.

No, startup cost is not the "time to find the first row".  It's overhead
paid before you even get to start examining rows.
But it seems to me that calculation of cost in LIMIT node contradicts
this statement:


            pathnode->path.startup_cost +=
                (subpath->total_cost - subpath->startup_cost)
                * offset_rows / subpath->rows;





I'm disinclined to consider fundamental changes to our costing model
on the basis of this example.  The fact that the rowcount estimates are
so far off reality means that you're basically looking at "garbage in,
garbage out" for the cost calculations --- and applying a small LIMIT
just magnifies that.

It'd be more useful to think first about how to make the selectivity
estimates better; after that, we might or might not still think there's
a costing issue.

regards, tom lane



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Startup cost of sequential scan

2018-08-30 Thread Tom Lane
Alexander Korotkov  writes:
> On Thu, Aug 30, 2018 at 5:05 PM Tom Lane  wrote:
>> Because it's what the mental model of startup cost says it should be.

> From this model we make a conclusion that we're starting getting rows
> from sequential scan sooner than from index scan.  And this conclusion
> doesn't reflect reality.

No, startup cost is not the "time to find the first row".  It's overhead
paid before you even get to start examining rows.

I'm disinclined to consider fundamental changes to our costing model
on the basis of this example.  The fact that the rowcount estimates are
so far off reality means that you're basically looking at "garbage in,
garbage out" for the cost calculations --- and applying a small LIMIT
just magnifies that.

It'd be more useful to think first about how to make the selectivity
estimates better; after that, we might or might not still think there's
a costing issue.

regards, tom lane



Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-30 Thread Erik Rijkers

On 2018-08-30 16:44, Alvaro Herrera wrote:

On 2018-Aug-30, Erik Rijkers wrote:


ok, is this any use?


Seems mostly good, but the Xids are not printed.  Could you please do
"bt full"?  Also:

frame 3
print *snap


See the attached.




# gdb --quiet -ex 'bt full' --batch 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres 
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core  &>  
gdb_bt_full.txt

[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in transaction  
 '.
Program terminated with signal SIGABRT, Aborted.
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
resultvar = 0
pid = 147484
selftid = 147484
#1  0x7f0fd20e8448 in __GI_abort () at abort.c:89
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 35419576, 139705945370615, 1, 0, 2, 139705925209384, 
773155, 35419576, 36381032, 139705945396501, 0, 139705926081536, 0, 
139705929013136, 139705929007200}}, sa_flags = -748812544, sa_restorer = 
0x7f0fd20fd99a <_IO_vfprintf_internal+22490>}
sigs = {__val = {32, 0 }}
#2  0x008880bf in ExceptionalCondition 
(conditionName=conditionName@entry=0xa417f8 
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
errorType=errorType@entry=0x8d365d "FailedAssertion", 
fileName=fileName@entry=0xa41223 "snapbuild.c", 
lineNumber=lineNumber@entry=580) at assert.c:54
No locals.
#3  0x0072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
safeXid = 773155
snap = 0x22b2168
xid = 0
newxip = 
newxcnt = 0
__func__ = "SnapBuildInitialSnapshot"
#4  0x0072ed54 in CreateReplicationSlot (cmd=0x223bae0) at 
walsender.c:951
snap = 
ctx = 0x21c75b8
need_full_snapshot = true
snapshot_name = 0x0
nulls = {false, false, false, false}
xloc = 
"\001\231\034\002\000\000\000\000\030\347\031\002\002\000\000\000\000\000\000\000\002\000\000\000Q\000\000\000\000\000\000\000\370\034\032\002\000\000\000\000\030\347\031\002\000\000\000\000È»\034\002",
 '\000' 
slot_name = 
tstate = 
tupdesc = 
values = {0, 103, 8192, 6683594}
reserve_wal = 
snapshot_action = 
dest = 
__func__ = "CreateReplicationSlot"
#5  exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL 
pgoutput USE_SNAPSHOT") at walsender.c:1527
parse_rc = 
cmd_node = 0x223bae0
cmd_context = 0x223b410
old_context = 0x21a1be0
__func__ = "exec_replication_command"
#6  0x0077e8ee in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x21cbbc8, dbname=<optimized out>, username=<optimized out>) at
postgres.c:4155
query_string = 0x21a1cf8 "CREATE_REPLICATION_SLOT 
\"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT"
firstchar = 
input_message = {data = 0x21a1cf8 "CREATE_REPLICATION_SLOT 
\"sub2_6517_6517_18748_sync_18728\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT", 
len = 98, maxlen = 1024, cursor = 98}
local_sigjmp_buf = {{__jmpbuf = {657920533, -5186006737813875216, 
35251992, 35437512, 1535629124, 35251992, 5188120396153509360, 
-5186006335271315984}, __mask_was_saved = 1, __saved_mask = {__val = {0, 
35437440, 10709497, 35251992, 35265504, 1024, 35437512, 35437512, 9122032, 0, 
9094142, 14351168, 140721267739056, 35437512, 905, 35270984
send_ready_for_query = false
disable_idle_in_transaction_timeout = false
__func__ = "PostgresMain"
#7  0x00704fde in BackendRun (port=0x21c4520) at postmaster.c:4361
ac = 1
secs = 588944346
usecs = 999759
i = 1
av = 0x21cbbc8
maxac = 2
__func__ = "BackendRun"
#8  BackendStartup (port=0x21c4520) at postmaster.c:4033
bn = 
pid = 
__func__ = "BackendStartup"
#9  ServerLoop () at postmaster.c:1706
port = 0x21c4520
rmask = {fds_bits = {32, 0 }}
selres = 
now = 
readmask = {fds_bits = {56, 0 }}
nSockets = 6
last_lockfile_recheck_time = 
last_touch_time = 
__func__ = "ServerLoop"
#10 0x00705e0f in PostmasterMain (argc=argc@entry=12, 
argv=argv@entry=0x219c470) at postmaster.c:1379
opt = 
status = 
userDoption = 
listen_addr_saved = 
i = 
output_config_variable = 
__func__ = "PostmasterMain"
#11 0x00478d80 in main (argc=12, argv=0x219c470) at main.c:228

Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Alvaro Herrera
On 2018-Aug-30, Fabien COELHO wrote:

> As the PostgreSQL source is expected to conform to some C standard (unsure
> which one right now, possibly c89 but maybe it is beginning to switch to c99,
> a young 19-year-old standard), I'd suggest that the right fix is rather to
> actually remove the aliasing issue.

Yeah, type aliasing is quite a rampant issue in our code and I don't
think there's an easy fix.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-30 Thread Alvaro Herrera
On 2018-Aug-30, Erik Rijkers wrote:

> ok, is this any use?

Seems mostly good, but the Xids are not printed.  Could you please do
"bt full"?  Also:

frame 3
print *snap

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Tom Lane
Fabien COELHO  writes:
>> If I add -fno-strict-aliasing to $CFLAGS, the problem goes away.
>> Is this something to worry about, or just pilot error cause I am not
>> using the same $CFLAGS as for the rest of the build? I originally
>> noticed this problem with my external fork of pg_verify_checksums.

> It looks like the code is using aliasing where the standard says it should
> not, which breaks compiler optimization, and the added option tells the
> compiler not to assume that the code conforms to the standard...

Actually, this code is just plain broken:

char        buf[BLCKSZ];
PageHeader  header = (PageHeader) buf;

There is no guarantee that a local char[] array is aligned on anything
stronger than a byte boundary, and if it isn't, word-sized accesses to
*header are going to fail on machines with strict alignment rules.
I suppose that that is really what Michael's compiler is moaning about.

I rather suspect that this hasn't been tested on anything but Intel
hardware, which is famously misalignment-tolerant.  The lack of any
apparent regression test infrastructure for it isn't leaving a warm
feeling about how much the buildfarm is testing it.

(The right fix, of course, is to malloc the work buffer rather than
put it on the stack.)

regards, tom lane



Re: Startup cost of sequential scan

2018-08-30 Thread Alexander Korotkov
On Thu, Aug 30, 2018 at 5:05 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > But I think there is another issue in sequential scan cost.  We have
> > zero startup cost for sequential scan.  But why?
>
> Because it's what the mental model of startup cost says it should be.

Right.  So as I understand it, the model is that we start the sequential
scan immediately, and then find one row uniformly distributed over the
whole table.

From this model we make a conclusion that we're starting getting rows
from sequential scan sooner than from index scan.  And this conclusion
doesn't reflect reality.  Even if the estimates for the join with t2
weren't wrong, our plan for the LIMIT query would still be bad, because
it would still be faster to get that one row using an index scan rather
than a sequential scan.

So, if I understand the mental model correctly, our cost estimate for
the LIMIT query is based on the idea that we need to fetch only a
*fraction* of a row from t1 in order to get 100 resulting rows.  This
is obviously wrong, because we're always dealing with whole rows :)
But I can't imagine how we can take care of it without redesigning our
whole costing model...

> Also, I do not think we can change that without a whole lot of unpleasant
> side effects on cases that work well today.  Have you even checked to
> see how much of the regression tests break with this change?

I thought it was too early to discuss this.  But yet, I've checked it.
There appear to be surprisingly few broken tests -- just some
reordering of tables in partition_join and select_parallel.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


regression.diffs
Description: Binary data


Re: Proposal for disk quota feature

2018-08-30 Thread Chapman Flack
On 08/30/2018 09:57 AM, Hubert Zhang wrote:

> 2. Keep one worker process for each database, but use a parent/global
> quota worker process to manage the lifecycle of the database-level worker
> processes. It could handle newly created databases (avoiding a database
> restart) and save resources when a database is not used. But this needs
> the worker processes to be hierarchical; the postmaster becomes the
> grandfather of the database-level worker processes in this case.

I am using background workers this way in 9.5 at $work.

In my case, one worker lives forever, wakes up on a set period, and
starts a short-lived worker for every database, waiting for each
one before starting the next.

It was straightforward to implement. Looking back over the code,
I see the global worker assigns its own PID to worker.bgw_notify_pid
of each of its children, and also obtains a handle for each child
from RegisterDynamicBackgroundWorker().

I imagine the global quota worker would prefer to start workers
for every database and then just wait for notifications from any
of them, but that seems equally straightforward at first glance.
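
For what it's worth, a rough sketch of that pattern (hypothetical extension
and function names, abbreviated error handling -- the bgw_* fields,
RegisterDynamicBackgroundWorker() and WaitForBackgroundWorkerShutdown() are
the real APIs, the rest is illustration):

#include "postgres.h"

#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* Launch a short-lived worker for one database and wait for it to finish. */
static void
launch_worker_for_database(Oid dboid)
{
    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_quota");            /* hypothetical */
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "per_db_worker_main"); /* hypothetical */
    snprintf(worker.bgw_name, BGW_MAXLEN, "quota worker for DB %u", dboid);
    worker.bgw_main_arg = ObjectIdGetDatum(dboid);
    worker.bgw_notify_pid = MyProcPid;      /* parent is notified of start/stop */

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        elog(LOG, "could not register worker for database %u", dboid);
    else
        (void) WaitForBackgroundWorkerShutdown(handle);
}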

-Chap



Re: Startup cost of sequential scan

2018-08-30 Thread Tom Lane
Alexander Korotkov  writes:
> But I think there is another issue in sequential scan cost.  We have
> zero startup cost for sequential scan.  But why?

Because it's what the mental model of startup cost says it should be.
Also, I do not think we can change that without a whole lot of unpleasant
side effects on cases that work well today.  Have you even checked to
see how much of the regression tests break with this change?

regards, tom lane



Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Alexander Kukushkin
Hi,

2018-08-30 15:39 GMT+02:00 Michael Paquier :

> That's indeed obvious by reading the code.  The bgwriter would be
> started only once a consistent point has been reached, so the startup
> process would have normally already updated the control file to the
> consistent point.  Something like the attached should take care of the
> problem.  As the updates of the local copy of minRecoveryPoint strongly
> rely on if the startup process is used, I think that we should use
> InRecovery for the sanity checks.
>
> I'd like to also add a TAP test for that, which should be easy enough if
> we do sanity checks by looking up at the output of the control file.
> I'll try to put more thoughts on that.
>
> Does it take care of the problem?

Yep, with the patch applied bgwriter acts as expected!

Breakpoint 1, UpdateControlFile () at xlog.c:4536
4536INIT_CRC32C(ControlFile->crc);
(gdb) bt
#0  UpdateControlFile () at xlog.c:4536
#1  0x5646d071ddb2 in UpdateMinRecoveryPoint (lsn=26341965784,
force=0 '\000') at xlog.c:2597
#2  0x5646d071de65 in XLogFlush (record=26341965784) at xlog.c:2632
#3  0x5646d09d693a in FlushBuffer (buf=0x7f8e1ca523c0,
reln=0x5646d2e86028) at bufmgr.c:2729
#4  0x5646d09d63d6 in SyncOneBuffer (buf_id=99693,
skip_recently_used=1 '\001', wb_context=0x7ffd07757380) at
bufmgr.c:2394
#5  0x5646d09d6172 in BgBufferSync (wb_context=0x7ffd07757380) at
bufmgr.c:2270
#6  0x5646d097c266 in BackgroundWriterMain () at bgwriter.c:279
#7  0x5646d073b38c in AuxiliaryProcessMain (argc=2,
argv=0x7ffd07758840) at bootstrap.c:424
#8  0x5646d098dc4a in StartChildProcess (type=BgWriterProcess) at
postmaster.c:5300
#9  0x5646d098d672 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:4999
#10 <signal handler called>
#11 0x7f8e5f5a6ff7 in __GI___select (nfds=5,
readfds=0x7ffd07759060, writefds=0x0, exceptfds=0x0,
timeout=0x7ffd07758fd0) at ../sysdeps/unix/sysv/linux/select.c:41
#12 0x5646d09890ca in ServerLoop () at postmaster.c:1685
#13 0x5646d0988799 in PostmasterMain (argc=17,
argv=0x5646d2e53390) at postmaster.c:1329
#14 0x5646d08d2880 in main (argc=17, argv=0x5646d2e53390) at main.c:228



Regards,
--
Alexander Kukushkin



Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-30 Thread Tom Lane
Alexander Korotkov  writes:
> I'm going to check this patchset on Windows and commit if no objections.

These error messages do not conform to our message style guidelines:
you've copied an errdetail message as primary error message, but the
rules are different for that (no complete sentences, no initial cap,
no period).

Using ERRCODE_ARRAY_ELEMENT_ERROR seems pretty random as well --- so far
as I can see, that's generally used for cases like "this array has the
wrong type of data elements".  Perhaps ERRCODE_PROGRAM_LIMIT_EXCEEDED
would be the best choice.
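
For illustration only, a message following those rules might look something
like this (a hedged sketch -- the variable name, wording, and placement of the
check are assumptions, not the actual patch):

/* hypothetical check; wording chosen per the message style guidelines */
if (dim > CUBE_MAX_DIM)
    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("cube cannot have more than %d dimensions",
                    CUBE_MAX_DIM)));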

regards, tom lane



Proposal for disk quota feature

2018-08-30 Thread Hubert Zhang
Hi all,
We want to introduce disk quota feature into Postgres.

Why disk quota
In a multi-tenant environment, there is a requirement to limit the disk
space that a database/schema/table or a user can consume, for different
organizations.
Meanwhile, other databases such as Oracle, Teradata and DB2 already
support a disk quota feature.

Heikki has already implemented a disk quota feature in Postgres as an
extension, pg_quota.  We plan to enhance the disk quota feature based on
Heikki's implementation.

Scope

The scope of the disk quota feature is to limit the disk usage of
database/schema/table objects and users.

Here table means heap table, ao table, index table, partition table and
associated tables (toast table, visible table, large object, etc.).  Schema
disk quota is the disk quota of all the tables in this schema.  Database
disk quota is the disk quota of all the tables in this database.

A user's quota is the size of all the tables whose owner is this user.
Out of scope: note that spill files, xlogs, clogs and logs are not
considered for database-object-level disk quota at this stage.


Design
We propose disk quota with the following components:

1. Quota Setting Store is where the disk quota settings are stored and
accessed. The DBA or object owner uses SQL queries to configure the disk
quota for each database object.



2. Quota Change Detector is the monitor of size changes of database objects
in Postgres. It will write change information to shared memory to notify
the Quota Size Checker. The implementation of the Change Detector could be
hooks in smgr_extend/smgr_unlink/smgr_truncate when modifying the size of a
heap table. The hooks will write to shared memory in a batched way to
reduce the impact on OLTP performance.

3. Quota Size Checker is implemented as a worker process. It maintains the
current disk usage for each database object and user, and compares them
with the settings in the Quota Setting Store. If it detects that the disk
usage hits the quota redzone (either above or below), it will notify the
Quota Enforcement Operator.

4. Quota Enforcement Operator has two roles: one is to check the disk quota
permission before queries are executed (QueryBeforeRun check), the other is
to cancel running queries when they reach the disk quota limit dynamically
(QueryRunning check). The Quota Enforcement Operator uses shared memory to
store the enforcement information to guarantee a quick check.


To implement the right proof of concept, we want to receive feedback from
the community on the following aspects:
Q1. User interface: when setting a disk quota,
Option 1 is to use: insert into quota.config(user1, 10G)
Option 2 is to use a UDF: select set_quota("role", "user1", 10G)
Option 3 is to use native SQL syntax: create disk quota on ROLE user1
10G, or create disk quota on SCHEMA s1 25G;
Q2. Should the Quota Setting Store use a user table or the catalog?
Option 1 is to create a schema called quota for each database and write
quota settings into a quota.config table; only the DBA could modify it.
This corresponds to Q1 option 1.
Option 2 is to store quota settings in the catalog. For schema and
table, write them to the database-level catalog. For database or user,
write them to either the database-level or the global catalog.

I personally prefer Q1's option 3 and Q2's option 2, since they make disk
quota more like a native feature. We could support the quota worker process
implementation as an extension for now, but in the long term disk quota is
a fundamental feature of a database and should be a native feature, just
like in other databases. So storing the quota configuration in the catalog
and supplying native syntax is better.

Open Problem
We intend to implement the Quota Size Checker as a worker process. A worker
process needs to connect to a database to build the disk usage map and
quota map (e.g. a user/schema's disk usage in a given database). But one
worker process can only bind to one database (InitPostgres(dbname)) at the
initialization stage. As a result, we need a separate worker process for
each database. The solution to this problem is not straightforward; here
are some ideas:
1. Make the worker process able to retrieve and cache information from all
the databases.  As Tom Lane pointed out, it would need to flush all the
database-specific state, like the relcache, syscache, etc.
2. Keep one worker process for each database, but use a parent/global
quota worker process to manage the lifecycle of the database-level worker
processes. It could handle newly created databases (avoiding a database
restart) and save resources when a database is not used. But this needs
the worker processes to be hierarchical; the postmaster becomes the
grandfather of the database-level worker processes in this case.

Any better ideas on it?



-- 
Thanks

Hubert Zhang


Startup cost of sequential scan

2018-08-30 Thread Alexander Korotkov
Hi!

Our customer has a bad plan problem, which can be reduced to the
following example.

create table t1 (id int primary key, k int);
create table t2 (id int);

insert into t1 (select i, i from generate_series(1,100) i);
insert into t2 (select 0 from generate_series(1,100)i);
insert into t2 values (50);

For the following query the plan is OK despite number of resulting
rows is very much overestimated.

# explain select * from
t1 join t2 x1 on x1.id = t1.k
join t2 x2  on x2.id = t1.k
join t2 x3  on x3.id = t1.k
join t2 x4  on x4.id = t1.k
join t2 x5  on x5.id = t1.k
join t2 x6  on x6.id = t1.k
join t2 x7  on x7.id = t1.k
join t2 x8  on x8.id = t1.k
join t2 x9  on x9.id = t1.k
where t1.id = 50;

  QUERY PLAN

 Hash Join  (cost=265.98..104071348014760.11 rows=9240411823550620 width=44)
   Hash Cond: (x1.id = x8.id)
   ->  Hash Join  (cost=22.80..25.20 rows=942425865760 width=36)
 Hash Cond: (x1.id = t1.k)
 ->  Seq Scan on t2 x1  (cost=0.00..2.01 rows=101 width=4)
 ->  Hash  (cost=22.79..22.79 rows=1 width=32)
   ->  Hash Join  (cost=20.39..22.79 rows=1 width=32)
 Hash Cond: (x7.id = t1.k)
 ->  Seq Scan on t2 x7  (cost=0.00..2.01 rows=101 width=4)
 ->  Hash  (cost=20.38..20.38 rows=1 width=28)
   ->  Hash Join  (cost=17.98..20.38 rows=1 width=28)
 Hash Cond: (x6.id = t1.k)
 ->  Seq Scan on t2 x6
(cost=0.00..2.01 rows=101 width=4)
 ->  Hash  (cost=17.96..17.96 rows=1 width=24)
   ->  Hash Join
(cost=15.57..17.96 rows=1 width=24)
 Hash Cond: (x5.id = t1.k)
 ->  Seq Scan on t2 x5
(cost=0.00..2.01 rows=101 width=4)
 ->  Hash
(cost=15.55..15.55 rows=1 width=20)
   ->  Hash Join
(cost=13.15..15.55 rows=1 width=20)
 Hash Cond:
(x4.id = t1.k)
 ->  Seq Scan
on t2 x4  (cost=0.00..2.01 rows=101 width=4)
 ->  Hash
(cost=13.14..13.14 rows=1 width=16)
   ->
Hash Join  (cost=10.74..13.14 rows=1 width=16)

Hash Cond: (x3.id = t1.k)

->  Seq Scan on t2 x3  (cost=0.00..2.01 rows=101 width=4)

->  Hash  (cost=10.73..10.73 rows=1 width=12)

->  Hash Join  (cost=8.46..10.73 rows=1 width=12)

  Hash Cond: (x2.id = t1.k)

  ->  Seq Scan on t2 x2  (cost=0.00..2.01 rows=101 width=4)

  ->  Hash  (cost=8.44..8.44 rows=1 width=8)

->  Index Scan using t1_pkey on t1  (cost=0.42..8.44
rows=1 width=8)

  Index Cond: (id = 50)
   ->  Hash  (cost=118.17..118.17 rows=10001 width=8)
 ->  Hash Join  (cost=3.27..118.17 rows=10001 width=8)
   Hash Cond: (x8.id = x9.id)
   ->  Seq Scan on t2 x8  (cost=0.00..2.01 rows=101 width=4)
   ->  Hash  (cost=2.01..2.01 rows=101 width=4)
 ->  Seq Scan on t2 x9  (cost=0.00..2.01 rows=101 width=4)
(38 rows)

But when you add LIMIT clause, index scan on t1 turns out into sequential scan.

# explain select * from
t1 join t2 x1 on x1.id = t1.k
join t2 x2  on x2.id = t1.k
join t2 x3  on x3.id = t1.k
join t2 x4  on x4.id = t1.k
join t2 x5  on x5.id = t1.k
join t2 x6  on x6.id = t1.k
join t2 x7  on x7.id = t1.k
join t2 x8  on x8.id = t1.k
join t2 x9  on x9.id = t1.k
where t1.id = 50 limit 100;
QUERY PLAN
---
 Limit  (cost=0.00..1.55 rows=100 width=44)
   ->  Nested Loop  (cost=0.00..142805795187416.44
rows=9240411823550620 width=44)
 Join Filter: (x1.id = x9.id)
 ->  Nested Loop  (cost=0.00..1427775203576.57
rows=93318825071840 width=40)
   Join Filter: (x1.id = x8.id)
   ->  Nested Loop  (cost=0.00..16947.91 rows=942425865760 width=36)
 Join Filter: (t1.k = x1.id)
 ->  Nested Loop  (cost=0.00..16944.63 rows=1 width=32)
   Join Filter: (t1.k = x7.id)
   ->  Nested Loop  (cost=0.00..16941.36
rows=1 width=28)
 Join Filter: (t1.k = x6.id)
 ->  Nested Loop  (cost=0.00..16938.09
rows=1 width=24)
   Join Filter: (t1.k = x5.id)
   ->  Nested Loop

Re: BUG #15346: Replica fails to start after the crash

2018-08-30 Thread Michael Paquier
On Thu, Aug 30, 2018 at 10:55:23AM +0200, Alexander Kukushkin wrote:
> Bgwriter itself never changes updateMinRecoveryPoint to true and boom,
> we can get a lot of pages written to disk, but minRecoveryPoint in the
> pg_control won't be updated!

That's indeed obvious by reading the code.  The bgwriter would be
started only once a consistent point has been reached, so the startup
process would have normally already updated the control file to the
consistent point.  Something like the attached should take care of the
problem.  As the updates of the local copy of minRecoveryPoint strongly
rely on if the startup process is used, I think that we should use
InRecovery for the sanity checks.

I'd like to also add a TAP test for that, which should be easy enough if
we do sanity checks by looking up at the output of the control file.
I'll try to put more thoughts on that.

Does it take care of the problem?
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..f9c52fb9d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,14 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery.  We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too. The
 	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-	 * updated until crash recovery finishes.
+	 * updated until crash recovery finishes.  We only do this for the startup
+	 * process replaying WAL which needs to keep clean references of
+	 * minRecoveryPoint until it is made sure that it has replayed all WAL
+	 * available for crash recovery.  Other processes would be started after
+	 * the consistency point has been reached on a standby, so it would be
+	 * actually safe to update the control file in this case.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 	{
 		updateMinRecoveryPoint = false;
 		return;
@@ -2737,7 +2742,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	minRecoveryPoint = ControlFile->minRecoveryPoint;
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-	if (force || minRecoveryPoint < lsn)
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+	else if (force || minRecoveryPoint < lsn)
 	{
 		XLogRecPtr	newMinRecoveryPoint;
 		TimeLineID	newMinRecoveryPointTLI;


signature.asc
Description: PGP signature


Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-08-30 Thread Etsuro Fujita

(2018/08/30 20:37), Kyotaro HORIGUCHI wrote:

At Fri, 24 Aug 2018 21:45:35 +0900, Etsuro Fujita  wrote 
in<5b7ffdef.6020...@lab.ntt.co.jp>

(2018/08/21 11:01), Kyotaro HORIGUCHI wrote:

At Tue, 14 Aug 2018 20:49:02 +0900, Etsuro
Fujita  wrote
in<5b72c1ae.8010...@lab.ntt.co.jp>

(2018/08/09 22:04), Etsuro Fujita wrote:

(2018/08/08 17:30), Kyotaro HORIGUCHI wrote:



I spent more time looking at the patch.  ISTM that the patch well
suppresses the effect of the tuple-descriptor expansion by making
changes to code in the planner and executor (and ruleutils.c), but I'm
still not sure that the patch is the right direction to go in, because
ISTM that expanding the tuple descriptor on the fly might be a wart.



The expansion should be safe if the expanded descriptor has the
same definitions for base columns and all the extended columns are
junks. The junk columns should be ignored by unrelated nodes and
they are passed safely as far as ForeignModify passes tuples as
is from underlying ForeignScan to ForeignUpdate/Delete.


I'm not sure that would be really safe.  Does that work well when
EvalPlanQual, for example?


Nothing. The reason was that core just doesn't know about the
extended portion. So the only problematic case was
ExprEvalWholeRowVar, where an explicit sanity check is
performed. But I think it is an ugly wart, as you said. So the
latest patch generates a full fdw_scan_tlist.


Will review.


If we take the Param-based approach suggested by Tom, I suspect there
would be no need to worry about at least those things, so I'll try to
update your patch as such, if there are no objections from you (or
anyone else).



PARAM_EXEC is a single-storage side channel that can work only as long as
it is set and read while each tuple is handled. In this case
postgresExecForeignUpdate/Delete must be called before
postgresIterateForeignScan returns the next tuple. An apparent
failure case for this usage is the join-update case below.

https://www.postgresql.org/message-id/20180605.191032.256535589.horiguchi.kyot...@lab.ntt.co.jp


What I have in mind would be to 1) create a tlist that contains not
only Vars/PHVs but Params, for each join rel involving the target rel
so we ensure that the Params will propagate up through all join plan
steps, and 2) convert a join rel's tlist Params into Vars referencing
the same Params in the tlists for the outer/inner rels, by setrefs.c.
I think that would probably work well even for the case you mentioned
above. Maybe I'm missing something, though.


As I wrote above, the problem was not param id propagation but
the per-query storage for a parameter held in econtext.

PARAM_EXEC is assumed to be used between outer and inner
relations of a nestloop or retrieval from sub-query retrieval as
commented in primnodes.h.


PARAM_EXEC:  The parameter is an internal executor parameter, used
for passing values into and out of sub-queries or from
nestloop joins to their inner scans.
For historical reasons, such parameters are numbered from 0.
These numbers are independent of PARAM_EXTERN numbers.


Yeah, but IIUC, I think that #2 would allow us to propagate up the param 
values, not the param ids.



I'm waiting for the patch.


OK, but I will review your patch first.

Best regards,
Etsuro Fujita



Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-08-30 Thread Alexander Korotkov
On Thu, Aug 30, 2018 at 12:41 PM Alexander Korotkov
 wrote:
> Right, performance regression appears to be caused by queue memory
> context allocation.  I've tried to apply the same approach that we've
> in GiST: allocate separate memory context for queue only at second
> rescan call.  And it appears to work, there is no measurable
> regression in comparison to master.
>
> Patch v8
> x run 1 run 2 run 3
> 0.1 680   660   662
> 0.01   1403  1395  1508
> 0.003  6561  6475  6223
>
> Revised version of patch is attached.  I've squashed patchset into one
> patch.  Later I'll split it into fewer parts before committing.  I'm
> going to also make some benchmarking of KNN itself: GiST vs SP-GiST.

I've made KNN GiST vs SP-GiST benchmarking similar to what I've done
previously for plain search.

create table knn_test as (select point(random(), random()) p from
generate_series(1,100) i);
create index knn_test_idx on knn_test using spgist(p);
create index knn_test_idx on knn_test using gist(p);

CREATE FUNCTION knn_bench(step float8, lim int) RETURNS void AS $$
DECLARE
x float8;
y float8;
BEGIN
y := 0.0;
WHILE y <= 1.0 LOOP
x := 0.0;
WHILE x <= 1.0 LOOP
PERFORM * FROM knn_test ORDER BY p <-> point(x,y) LIMIT lim;
x := x + step;
END LOOP;
y := y + step;
END LOOP;
END;
$$ LANGUAGE plpgsql;

And the results are following.

# GiST
step limit run 1 run 2 run 3 buffers
 0.1  1000   156   161   158  123191
0.03   100   298   305   301  122902
0.0110  1216  1214  1212  138591

# SP-GiST
step limit run 1 run 2 run 3 buffers
 0.1  1000   147   152   153  126446
0.03   100   227   223   226  131241
0.0110   734   733   730  175908

This looks as expected to me.  SP-GiST takes less CPU, but uses more
buffers.  It's pretty much the same as what we have for plain scans.  So,
I see no anomaly here.

Generally patch looks close to committable shape for me.  I'm going to
revise code and documentation again, split it up, and then propose to
commit.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: rare crash - FailedAssertion snapbuild.c Line: 580

2018-08-30 Thread Erik Rijkers

On 2018-08-29 21:15, Andres Freund wrote:

Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:

To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used 
last
year [1] - now in a perl incarnation).  Most of the time the 
replication is

stable and finishes correctly but sometimes (rarely) I get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I 
wanted to
report it anyway as in the course of the last few months I have seen 
it
several times, on several machines. Always rarely, always postgres 11 
(I did

not try other versions).


Thanks for testing! Could you possibly run the tests with core files
enabled, so we at least get a backtrace in case of trouble?  Knowing what
the values here are would be tremendously helpful...


ok, is this any use?

$ gdb --quiet 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres  
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core
Reading symbols from 
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done.

[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library 
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in 
transaction   '.

Program terminated with signal SIGABRT, Aborted.
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or 
directory.

(gdb) bt
#0  0x7f0fd20e7067 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

#1  0x7f0fd20e8448 in __GI_abort () at abort.c:89
#2  0x008880bf in ExceptionalCondition 
(conditionName=conditionName@entry=0xa417f8 
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))", 
errorType=errorType@entry=0x8d365d "FailedAssertion", 
fileName=fileName@entry=0xa41223 "snapbuild.c", 
lineNumber=lineNumber@entry=580) at assert.c:54

#3  0x0072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
#4  0x0072ed54 in CreateReplicationSlot (cmd=0x223bae0) at 
walsender.c:951
#5  exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8 
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY 
LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527
#6  0x0077e8ee in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x21cbbc8, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4155
#7  0x00704fde in BackendRun (port=0x21c4520) at 
postmaster.c:4361

#8  BackendStartup (port=0x21c4520) at postmaster.c:4033
#9  ServerLoop () at postmaster.c:1706
#10 0x00705e0f in PostmasterMain (argc=argc@entry=12, 
argv=argv@entry=0x219c470) at postmaster.c:1379

#11 0x00478d80 in main (argc=12, argv=0x219c470) at main.c:228
(gdb)






Re: A strange GiST error message or fillfactor of GiST build

2018-08-30 Thread Andrey Borodin
Hello!

> On 30 Aug 2018, at 2:42, Kyotaro HORIGUCHI wrote:
> 
> At Wed, 29 Aug 2018 10:42:59 -0300, Andrey Borodin  
> wrote in <6fbe12b2-4f59-4db9-bde9-62c880118...@yandex-team.ru>
>> 
>> We are passing freespace everywhere. Also, we pass GistInsertState, and 
>> GistState.
>> Maybe let's put GistState into GistInsertState, GistState already has free 
>> space, and pass just GistInsertState everywhere?
> 
> Yeah, I thought something like that first. GISTSTATE doesn't have
> freespace size but we could refactor so that all insert-related
> routines use GISTInsertState and make GISTBuildState have
> it. (patch 1) But this prevents future back-patching so I don't
> think this acceptable.

The patch looks good to me. Making the code better for future development
seems to me more important than backpatching this: the error per se is
cryptic but not threatening, and it will be prevented by the cube
construction routines' limitations.  But that is just IMHO.

Best regards, Andrey Borodin.


Re: Catalog corruption

2018-08-30 Thread Andrew Gierth
> "Mariel" == Mariel Cherkassky  writes:

 Mariel> Hi Andrew,
 Mariel> what is the name of the channel ?

The name of the channel is #postgresql  (including the # character)

-- 
Andrew (irc:RhodiumToad)



Re: Reopen logfile on SIGHUP

2018-08-30 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 30 Aug 2018 13:42:42 +0300, Alexander Korotkov 
 wrote in 

> It seems that http://commitfest.cputube.org/ runs only "make check" on
> Windows.  But my Postgres Pro colleagues checked that tests passed on
> 32-bit and 64-bit versions of Windows Server 2008.  Also I made some
> minor beautifications on code and documentation.
> 
> This patch seems to have good shape and generally being quite
> harmless.  Do we have any objections to committing this?

I checked that on my Win7 box and worked. Of course I have no
objection.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Catalog corruption

2018-08-30 Thread Mariel Cherkassky
Hi Andrew,
what is the name of the channel ?

Thanks , Mariel.

On Wed, 29 Aug 2018 at 14:31, Andrew Gierth <and...@tao11.riddles.org.uk> wrote:

> > "Mariel" == Mariel Cherkassky  writes:
>
>  Mariel> Hi,
>
>  Mariel> I sent already an email about this topic to pgsql-admins but I
>  Mariel> think that it might be more relevant to this mailing list.
>
> The -hackers mailing list is about the development of postgresql.
>
>  Mariel> I'm trying to investigate a corruption that happened on a
>  Mariel> machine of one of our clients.
>
> A good place to get help with that is actually the IRC channel
> (#postgresql on chat.freenode.net, or http://webchat.freenode.net for
> the web interface - note that currently you need to register a nickname,
> see http://freenode.net/kb/answer/registration because of some annoying
> spammers, but we can help you with that if you just go ahead and try and
> join the channel anyway).
>
> --
> Andrew (irc:RhodiumToad)
>


Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-08-30 Thread Kyotaro HORIGUCHI
Hello.

At Fri, 24 Aug 2018 21:45:35 +0900, Etsuro Fujita  
wrote in <5b7ffdef.6020...@lab.ntt.co.jp>
> (2018/08/21 11:01), Kyotaro HORIGUCHI wrote:
> > At Tue, 14 Aug 2018 20:49:02 +0900, Etsuro
> > Fujita wrote
> > in<5b72c1ae.8010...@lab.ntt.co.jp>
> >> (2018/08/09 22:04), Etsuro Fujita wrote:
> >>> (2018/08/08 17:30), Kyotaro HORIGUCHI wrote:
> 
> >> I spent more time looking at the patch.  ISTM that the patch well
> >> suppresses the effect of the tuple-descriptor expansion by making
> >> changes to code in the planner and executor (and ruleutils.c), but I'm
> >> still not sure that the patch is the right direction to go in, because
> >> ISTM that expanding the tuple descriptor on the fly might be a wart.
> 
> > The expansion should be safe if the expanded descriptor has the
> > same definitions for base columns and all the extended columns are
> > junks. The junk columns should be ignored by unrelated nodes and
> > they are passed safely as far as ForeignModify passes tuples as
> > is from underlying ForeignScan to ForeignUpdate/Delete.
> 
> I'm not sure that would be really safe.  Does that work well when
> EvalPlanQual, for example?

Nothing. The reason was that core just doesn't know about the
extended portion. So the only problematic case was
ExprEvalWholeRowVar, where an explicit sanity check is
performed. But I think it is an ugly wart, as you said. So the
latest patch generates a full fdw_scan_tlist.

> > https://www.postgresql.org/docs/devel/static/fdw-planning.html
> >
> > Hmm. Thanks for the pointer, it seems to need rewrite. However,
> > it doesn't seem to work for non-join foreign scans, since the
> > core ignores it and uses the local table definition.
> 
> Really?

No, I was wrong here. The core doesn't consider the case where
fdw_scan_tlist has attributes that are not a part of the base relation,
but it doesn't affect the description.

> >> You wrote:
> >>> I'm not sure whether the following ponits are valid.
> >>>
> >>> - If fdw_scan_tlist is used for simple relation scans, this would
> >>> break the case. (ExecInitForeignScan,  set_foreignscan_references)
> >>
> >> Some FDWs might already use that list for the improved efficiency for
> >> simple foreign table scans as explained above, so we should avoid
> >> breaking that.
> >
> > I considered using fdw_scan_tlist in that way, but the core is
> > assuming that foreign scans with scanrelid > 0 use the relation
> > descriptor.
> 
> Could you elaborate a bit more on this?

After all I found that core uses fdw_scan_tlist if any and the
attached patch doen't modify the "affected" part. Sorry, it's
still hot here:p

> > Do you have any example for that?
> 
> I don't know such an example, but in my understanding, the core allows
> the FDW to do that.

As above, I agreed. Sorry for the bogosity.

> >> If we take the Param-based approach suggested by Tom, I suspect there
> >> would be no need to worry about at least those things, so I'll try to
> >> update your patch as such, if there are no objections from you (or
> >> anyone else).
> 
> > PARAM_EXEC is a single-storage side channel that can work only as long as
> > it is set and read while each tuple is handled. In this case
> > postgresExecForeignUpdate/Delete must be called before
> > postgresIterateForeignScan returns the next tuple. An apparent
> > failure case for this usage is the join-update case below.
> >
> > https://www.postgresql.org/message-id/20180605.191032.256535589.horiguchi.kyot...@lab.ntt.co.jp
> 
> What I have in mind would be to 1) create a tlist that contains not
> only Vars/PHVs but Params, for each join rel involving the target rel
> so we ensure that the Params will propagate up through all join plan
> steps, and 2) convert a join rel's tlist Params into Vars referencing
> the same Params in the tlists for the outer/inner rels, by setrefs.c.
> I think that would probably work well even for the case you mentioned
> above. Maybe I'm missing something, though.

As I wrote above, the problem was not param id propagation but
the per-query storage for a parameter held in econtext.

PARAM_EXEC is assumed to be used between outer and inner
relations of a nestloop or retrieval from sub-query retrieval as
commented in primnodes.h.

>PARAM_EXEC:  The parameter is an internal executor parameter, used
>for passing values into and out of sub-queries or from
>nestloop joins to their inner scans.
>For historical reasons, such parameters are numbered from 0.
>These numbers are independent of PARAM_EXTERN numbers.

Anyway the odds are high that I'm missing far more than you.

> Sorry for the delay.

Nope. Thank you for the comment and I'm waiting for the patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_verify_checksums vs windows

2018-08-30 Thread Magnus Hagander
On Thu, Aug 30, 2018 at 1:32 PM, Amit Kapila 
wrote:

> On Wed, Aug 29, 2018 at 5:17 PM Magnus Hagander 
> wrote:
> >
> > On Wed, Aug 29, 2018 at 1:44 PM, Amit Kapila 
> wrote:
> >>
> >> On Wed, Aug 29, 2018 at 5:05 PM Magnus Hagander 
> wrote:
> >> >
> >> > On Wed, Aug 29, 2018 at 1:31 PM, Amit Kapila 
> wrote:
> >> >>
> >> >> So, I think we need to open the file in binary mode as in other parts
> >> >> of the code.  The attached patch fixes the problem for me.
> >> >>
> >> >> Thoughts?
> >> >
> >> >
> >> > Yikes. Yes, I believe you are correct, and that looks like the
> correct fix.
> >> >
> >> > I wonder why this was not caught on the buildfarm. We do have
> regression tests for it, AFAIK?
> >> >
> >>
> >> I am not able to find regression tests for it, but maybe I am not
> >> seeing it properly.  By any chance, you have removed it during revert
> >> of ""Allow on-line enabling and disabling of data checksums".
> >>
> >
> > Oh meh. You are right, it's in the reverted patch, I was looking in the
> wrong branch :/ Sorry about that. And that certainly explains why we don't
> have it.
> >
>
> Okay.  I will commit this in a day or so after once verifying it on
> PG11 as well.  I think this needs to be backpatched, let me know if
> you think otherwise.
>
>
Definitely a bug so yes, it needs backpatching.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_verify_checksums vs windows

2018-08-30 Thread Amit Kapila
On Wed, Aug 29, 2018 at 5:17 PM Magnus Hagander  wrote:
>
> On Wed, Aug 29, 2018 at 1:44 PM, Amit Kapila  wrote:
>>
>> On Wed, Aug 29, 2018 at 5:05 PM Magnus Hagander  wrote:
>> >
>> > On Wed, Aug 29, 2018 at 1:31 PM, Amit Kapila  
>> > wrote:
>> >>
>> >> So, I think we need to open the file in binary mode as in other parts
>> >> of the code.  The attached patch fixes the problem for me.
>> >>
>> >> Thoughts?
>> >
>> >
>> > Yikes. Yes, I believe you are correct, and that looks like the correct fix.
>> >
>> > I wonder why this was not caught on the buildfarm. We do have regression 
>> > tests for it, AFAIK?
>> >
>>
>> I am not able to find regression tests for it, but maybe I am not
>> seeing it properly.  By any chance, you have removed it during revert
>> of ""Allow on-line enabling and disabling of data checksums".
>>
>
> Oh meh. You are right, it's in the reverted patch, I was looking in the wrong 
> branch :/ Sorry about that. And that certainly explains why we don't have it.
>

Okay.  I will commit this in a day or so after once verifying it on
PG11 as well.  I think this needs to be backpatched, let me know if
you think otherwise.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Dimension limit in contrib/cube (dump/restore hazard?)

2018-08-30 Thread Alexander Korotkov
Hi!

On Tue, Aug 28, 2018 at 10:30 PM Andrey Borodin  wrote:
> > On 28 Aug 2018, at 14:18, Alexander Korotkov wrote:
> >
> > OK, but I think cube_c_f8() and cube_c_f8_f8() also need to be
> > revised.  Also, I think this behavior should be covered by regression
> > tests.
> True. Also there's one case in cube_subset.

In general this looks good to me.  Personally, I get tired of cube.out
and cube_2.out.  They differ in only a few checks involving
scientific notation, but all patches touching the cube regression
tests have to update both cube.out and cube_2.out.  I propose to split
the scientific notation checks into a separate test.  I've also added a
check for cube_subset().

I'm going to check this patchset on Windows and commit if no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-cube-split-scientific-notation-test-v1.patch
Description: Binary data


0002-cube-enforce-dimension-checks-v1.patch
Description: Binary data


Re: pg_verify_checksums failure with hash indexes

2018-08-30 Thread Amit Kapila
On Wed, Aug 29, 2018 at 4:05 PM Dilip Kumar  wrote:
>
> On Wed, Aug 29, 2018 at 3:39 PM, Dilip Kumar  wrote:
> >> SHOW block_size ;
> >>  block_size
> >> 
> >>  4096
> >>
> >> CREATE TABLE foo(val text);
> >> INSERT INTO foo VALUES('bernd');
> >>
> >> CREATE INDEX ON foo USING hash(val);
> >> ERROR:  index "foo_val_idx" contains corrupted page at block 0
> >> HINT:  Please REINDEX it.
> >>
> >> I have no idea wether this could be related, but  i thought it won't
> >> harm to share this here.
> >>
> >
> > This issue seems different than the one got fixed in this thread.  The
> > reason for this issue is that the size of the hashm_mapp in
> > HashMetaPageData is 4096, irrespective of the block size.  So when the
> > block size is big enough (i.e. 8192) then there is no problem, but
> > when you set it to 4096, in that case, the hashm_mapp of the meta page
> > is overwriting the special space of the meta page.  That's the reason
> > its showing corrupted page while checking the hash_page.
>

Your analysis appears correct to me.

> Just to verify this I just hacked it like below and it worked.  I
> think we need a more thoughtful value for HASH_MAX_BITMAPS.
>
> diff --git a/src/include/access/hash.h b/src/include/access/hash.h
..
> -#define HASH_MAX_BITMAPS   1024
> +#define HASH_MAX_BITMAPS   Min(BLCKSZ / 8, 1024)
>

We have previously changed this define in 620b49a1 with the intent to
allow many non-unique values in hash indexes without worrying about
reaching the limit on the number of overflow pages.  I think it didn't
occur to us that this won't work for smaller block sizes.  As such, I
don't see any problem with the suggested fix.  It will allow us the same
limit on the number of overflow pages at 8K block size and a smaller
limit at smaller block sizes.  I am not sure if we can do any better
with the current design.  As it will change the metapage, I think we
need to bump HASH_VERSION.
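
For reference, the back-of-the-envelope arithmetic (assuming, as described
upthread, that hashm_mapp is an array of HASH_MAX_BITMAPS uint32 block
numbers): 1024 * sizeof(uint32) = 4096 bytes, i.e. the bitmap array alone
fills an entire 4K block, leaving no room for the remaining metapage fields
or the special space.  The suggested scaling keeps it proportional:

/* Shown only to illustrate the scaling suggested upthread. */
#define HASH_MAX_BITMAPS    Min(BLCKSZ / 8, 1024)
/* BLCKSZ = 8192  ->  1024 bitmap pages, 4096 bytes of hashm_mapp (as today) */
/* BLCKSZ = 4096  ->   512 bitmap pages, 2048 bytes                          */
/* BLCKSZ = 1024  ->   128 bitmap pages,  512 bytes                          */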

Robert, others, any thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Extra word in src/backend/optimizer/README

2018-08-30 Thread Etsuro Fujita
Hi,

Here is a small patch to remove $SUBJECT: s/has contains/contains/

Best regards,
Etsuro Fujita
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index 0db3d36..9c852a1 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -1109,7 +1109,7 @@ of joining relations.
 Partitionwise aggregates/grouping
 -
 
-If the GROUP BY clause has contains all of the partition keys, all the rows
+If the GROUP BY clause contains all of the partition keys, all the rows
 that belong to a given group must come from a single partition; therefore,
 aggregation can be done completely separately for each partition. Otherwise,
 partial aggregates can be computed for each partition, and then finalized


Re: pg_verify_checksums and -fno-strict-aliasing

2018-08-30 Thread Fabien COELHO




I noticed that pg_verify_checksums computes bogus checksums if I compile
it with '-O2 -Wall' but without -fno-strict-aliasing. Also I am getting
a compile warning then:

[...]

If I add -fno-strict-aliasing to $CFLAGS, the problem goes away.

Is this something to worry about, or just pilot error cause I am not
using the same $CFLAGS as for the rest of the build? I originally
noticed this problem with my external fork of pg_verify_checksums.


It looks like the code is using aliasing where the standard says it should
not, which breaks compiler optimization, and the added option tells the
compiler not to assume that the code conforms to the standard...


As the PostgreSQL source is expected to conform to some C standard (unsure
which one right now, possibly c89 but maybe it is beginning to switch to
c99, a young 19-year-old standard), I'd suggest that the right fix is
rather to actually remove the aliasing issue.


--
Fabien.



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-08-30 Thread Etsuro Fujita

(2018/08/29 18:40), Etsuro Fujita wrote:

(2018/08/29 0:21), Jonathan S. Katz wrote:

On Aug 24, 2018, at 8:38 AM, Etsuro
Fujita wrote:
(2018/08/24 11:47), Michael Paquier wrote:

On Thu, Aug 23, 2018 at 10:00:49PM +0900, Etsuro Fujita wrote:

I tried this today, but doing git behind the corporate firewall
doesn't
work. I don't know the clear cause of that, so I'll investigate that
tomorrow.


You may be able to tweak that by using https as origin point or proper
git proxy settings?


Yeah, my proxy settings were not correct. With the help of my
colleagues Horiguchi-san and Yamada-san, I corrected them but still
can't clone the master repository. Running git with GIT_CURL_VERBOSE
shows that there is another issue in my terminal environment, so I'm
trying to resolve that.


Are there any updates on getting this patch committed?


That investigation has shown that the cause is my company firewall, not
my terminal environment; that firewall has to be configured to allow me
to access to that repository. So, I'm asking my company about that.


I got the approval from my boss today, so I think that I can have that 
access soon.


Best regards,
Etsuro Fujita



Re: pg_verify_checksums -d option (was: Re: pg_verify_checksums -r option)

2018-08-30 Thread Yugo Nagata
On Thu, 30 Aug 2018 07:18:13 -0300
Alvaro Herrera  wrote:

> On 2018-Aug-30, Yugo Nagata wrote:
> 
> > On Thu, 30 Aug 2018 06:52:58 -0300
> > Alvaro Herrera  wrote:
> 
> > > This should have been listed in the pg11 open items.  Please list there
> > > everything that should apply be applied branch 11 before release, so
> > > that they get fixed (or at least considered) before we release.
> > > 
> > > https://wiki.postgresql.org/wiki/Open_Items
> > 
> > I don't have the editor privilege now, so I'll add this discussion to the 
> > wiki (Fixed issues or Resolve issues?) after I get the privilege.
> 
> You're an editor now, though IMO adding it as a resolved issue is
> pointless.  I meant if there are any items pending resolution, then
> please add them.

Sure. I will do that from next time.

Regards,
-- 
Yugo Nagata 



Re: Reopen logfile on SIGHUP

2018-08-30 Thread Alexander Korotkov
On Wed, Aug 29, 2018 at 12:01 PM Alexander Korotkov
 wrote:
> On Wed, Aug 29, 2018 at 5:05 AM Kyotaro HORIGUCHI
>  wrote:
> > At Tue, 28 Aug 2018 18:50:31 +0300, Alexander Korotkov 
> >  wrote in 
> > 
> > > Also I found that this new pg_ctl isn't covered with tests at all.  So
> > > I've added very simple tap tests, which ensures that when log file was
> > > renamed, it reappeared again after pg_ctl logrotate.  I wonder how
> > > that would work on Windows.  Thankfully commitfest.cputube.org have
> > > Windows checking facility now.
> >
> > Thanks for the test. Documentation and help message look fine
> > including the changed ordering. (180 seconds retry may be a bit
> > too long but I'm fine with it.)
>
> Thank you for the comments.  My idea about retry logic was to provide
> the similar behavior to poll_query_until().

It seems that http://commitfest.cputube.org/ runs only "make check" on
Windows.  But my Postgres Pro colleagues checked that tests passed on
32-bit and 64-bit versions of Windows Server 2008.  Also I made some
minor beautifications on code and documentation.

This patch seems to have good shape and generally being quite
harmless.  Do we have any objections to committing this?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg_ctl_logrotate_v7.patch
Description: Binary data


Re: [HACKERS] Proposal to add work_mem option to postgres_fdw module

2018-08-30 Thread Kyotaro HORIGUCHI
At Mon, 27 Aug 2018 19:38:19 -0700, Michael Paquier  wrote 
in <20180828023819.ga29...@paquier.xyz>
> On Mon, Aug 27, 2018 at 02:34:31PM -0400, Robert Haas wrote:
> > I like the direction of your thinking, but it seems to me that this
> > would cause a problem if you want to set search_path=foo,bar.
> 
> proconfig already handles such cases, that's what I was coming at.  We
> could reuse most of what it does, no?

IIUC, proconfig is a text array, options is a text, and the
options psql-option is in the format of '-cvar=val -cvar=val...'.
Thus I think we still need something to convert them.

As an experiment, just with the following mod:

contrib/postgres_fdw/option.c
   /* Hide debug options, as well as settings we override internally. */
-  if (strchr(lopt->dispchar, 'D') ||
+  if ((strchr(lopt->dispchar, 'D') &&
+   strcmp(lopt->keyword, "options") != 0) ||
   strcmp(lopt->keyword, "fallback_application_name") == 0 ||

and the following setup:

=# alter server sv1
options (set options '-csearch_path=\\"hoge,hage\\"\\ -cwork_mem=16MB');

actually works. Converting array representation into the
commandline options format shown above internally might make it
look a bit better..

=# alter server sv1
   options (set options '{"search_path=\"hoge,hage\"",work_mem=16MB}');

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



