Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Dilip Kumar
Hi Thomas,

I have started reviewing 0003-Add-undo-log-manager.  I haven't finished the
review yet, but in some places I noticed that instead of UndoRecPtr you are
directly using UndoLogOffset, which seems like a bug to me.

1.
+UndoRecPtr
+UndoLogAllocateInRecovery(UndoLogAllocContext *context,
+   TransactionId xid,
+   uint16 size,
+   bool *need_xact_header,
+   UndoRecPtr *last_xact_start,

+ *need_xact_header =
+ context->try_location == InvalidUndoRecPtr &&
+ slot->meta.unlogged.insert == slot->meta.unlogged.this_xact_start;
+ *last_xact_start = slot->meta.unlogged.last_xact_start;

The output parameter last_xact_start is of type UndoRecPtr, whereas
slot->meta.unlogged.last_xact_start is of type UndoLogOffset.
Shouldn't we use MakeUndoRecPtr(logno, offset) here?

2.
+ slot = find_undo_log_slot(logno, false);
+ if (UndoLogOffsetPlusUsableBytes(try_offset, size) <= slot->meta.end)
+ {
+ *need_xact_header = false;
+ return try_offset;
+ }

Here also you are directly returning try_offset instead of an UndoRecPtr.
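Something like this is what I have in mind (just a sketch; I am assuming
slot->logno is the right log number to use in the recovery path):

    *last_xact_start = MakeUndoRecPtr(slot->logno,
                                      slot->meta.unlogged.last_xact_start);
    ...
    return MakeUndoRecPtr(logno, try_offset);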

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Thomas Munro
On Wed, Jul 24, 2019 at 9:15 PM Amit Kapila  wrote:
> I have done some more review of undolog patch series and here are my comments:

Hi Amit,

Thanks!  There are a number of actionable changes in your review.  I'll be
posting a new patch set soon that will address most of your complaints
individually.  In this message I want to respond to one topic area, because
the answer is long enough already:

> 2.
> allocate_empty_undo_segment()
> {
> ..
> ..
> /* Flush the contents of the file to disk before the next checkpoint. */
> + undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
> ..
> }
>
> +void
> +undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
> +{
> + char path[MAXPGPATH];
> + FileTag tag;
> +
> + INIT_UNDOFILETAG(tag, logno, tablespace, segno);
> +
> + /* Try to send to the checkpointer, but if out of space, do it here. */
> + if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
>
>
> The comment in allocate_empty_undo_segment indicates that the code
> wants to flush before checkpoint, but the actual function tries to
> register the request with checkpointer.  Shouldn't this be similar to
> XLogFileInit where we use pg_fsync to flush the contents immediately?
> I guess that will avoid what you have written in comments in the same
> function (we just want to make sure that the filesystem has allocated
> physical blocks for it so that non-COW filesystems will report ENOSPC
> now rather than later when space is needed).  OTOH, I think it is
> performance-wise better to postpone the work to checkpointer.  If we
> want to push this work to checkpointer, then we might need to change
> comments or alternatively, we might want to use bigger segment sizes
> to mitigate the performance effect.

In an early version I was doing the fsync() immediately.  While
testing zheap, Mithun CY reported that whenever segments couldn't be
recycled in the background, such as during a long-running
transaction, he could measure ~6% of the time spent waiting for
fsync(), and throughput increased with bigger segments (and thus fewer
files to fsync()).  Passing the work off to the checkpointer is better
not only because it's done in the background but also because there is
a chance that the work can be consolidated with other sync requests,
and perhaps even avoided completely if the file is discarded and
unlinked before the next checkpoint.

I'll update the comment to make it clearer.
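To spell out the intended behaviour, the shape of undofile_request_sync()
is roughly this (a sketch only, with the error handling and file access
details elided):

    /*
     * Ask the checkpointer to fsync() the segment before the next
     * checkpoint completes; if its request queue is full, fall back to
     * doing the fsync() here and now.
     */
    FileTag tag;

    INIT_UNDOFILETAG(tag, logno, tablespace, segno);
    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */))
    {
        /* open the segment file and pg_fsync() it ourselves */
        ...
    }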

> If my above understanding is correct and the reason to fsync
> immediately is to reserve space now, then we also need to think
> whether we are always safe in postponing the work?  Basically, if this
> means that it can fail when we are actually trying to write undo, then
> it could be risky because we could be in the critical section at that
> time.  I am not sure about this point, rather it is just to discuss if
> there are any impacts of postponing the fsync work.

Here is my theory for why this arrangement is safe, and why it differs
from what we're doing with WAL segments and regular relation files.
First, let's review why those things work the way they do (as I
understand it):

1.  WAL's use of fdatasync():  The reason we fill and then fsync()
newly created WAL files up front is because we want to make sure the
blocks are definitely on disk.  The comment doesn't spell out exactly
why the author considered later fdatasync() calls to be insufficient,
but they were: it was many years after commit 33cc5d8a4d0d that Linux
ext3/4 filesystems began flushing file size changes to disk in
fdatasync()[1][2].  I don't know if its original behaviour was
intentional or not.  So, if you didn't use the bigger fsync() hammer
on that OS, you might lose the end of a recently extended file in a
power failure even though fdatasync() had returned success.

By my reading of POSIX, that shouldn't be necessary on a conforming
implementation of fdatasync(), and that was fixed years ago in Linux.
I'm not proposing any changes there, and I'm not proposing to take
advantage of that in the new code.  I'm pointing out that we
don't have to worry about that for these undo segments, because they
are already flushed with fsync(), not fdatasync().

(To understand POSIX's descriptions of fsync() and fdatasync() you
have to find the meanings of "Synchronized I/O Data Integrity
Completion" and "Synchronized I/O File Integrity Completion" elsewhere
in the spec.  TL;DR: fdatasync() is only allowed to skip flushing
attributes like the modified time, it's not allowed to skip flushing a
file size change since that would interfere with retrieving the data.)

2.  Time of reservation:  Although they don't call fsync(), regular
relations and these new undo files still write zeroes up front
(respectively, for a new block and for a new segment).  One reason for
that is that most popular filesystems reserve space at write time, so
you'll get ENOSPC when trying to allocate undo space, and that's a
non-fatal ERROR.  If we deferred until 

Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Kuntal Ghosh
Hello Thomas,

Here are some review comments on 0003-Add-undo-log-manager.patch. I've
tried to avoid duplicate comments as much as possible.

1. In UndoLogAllocate,
+ * time this backend as needed to write to an undo log at all or because
s/as/has

+ * Maintain our tracking of the and the previous transaction start
Do you mean the current log's transaction start as well?

2. In UndoLogAllocateInRecovery,
we try to find the current log from the first undo buffer. So, after a
log switch, we always have to register at least one buffer from the
current undo log first. If we're updating something in the previous
log, the respective buffer should be registered after that. I think we
should document this in the comments.

3. In UndoLogGetOldestRecord(UndoLogNumber logno, bool *full),
it seems the 'full' parameter is not used anywhere. Do we still need this?

+ /* It's been recycled.  SO it must have been entirely discarded. */
s/SO/So

4. In CleanUpUndoCheckPointFiles,
we can emit a debug2 message with something similar to : 'removed
unreachable undo metadata files'

+ if (unlink(path) != 0)
+ elog(ERROR, "could not unlink file \"%s\": %m", path);
According to my observation, whenever we deal with a file operation,
we usually emit an ereport message with errcode_for_file_access().
Should we change it to ereport?  There are other file operations as
well, including read(), OpenTransientFile(), etc.
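For reference, the usual form looks roughly like this:

    if (unlink(path) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not unlink file \"%s\": %m", path)));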

5. In CheckPointUndoLogs,
+ /* Capture snapshot while holding each mutex. */
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ serialized[num_logs++] = slot->meta;
+ LWLockRelease(&slot->mutex);
why do we need an exclusive lock to read something from the slot? A
share lock seems to be sufficient.

pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC) is called
after pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE)
without calling pgstat_report_wait_end().  I think you've done this to
avoid an extra function call.  But it differs from other places in the
PG code.  Perhaps we should follow this approach everywhere.

6. In StartupUndoLogs,
+ if (fd < 0)
+ elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
assuming you agree to change the above elog to ereport, the message
should be more user-friendly.  Maybe something like 'cannot open
pg_undo file'.

+ if ((size = read(fd, &slot->meta, sizeof(slot->meta))) != sizeof(slot->meta))
The usage of sizeof doesn't look like a problem.  But we can save
some extra padding bytes at the end if we use the (offsetof + sizeof)
approach, similar to other places in PG.

7. In free_undo_log_slot,
+ /*
+ * When removing an undo log from a slot in shared memory, we acquire
+ * UndoLogLock, log->mutex and log->discard_lock, so that other code can
+ * hold any one of those locks to prevent the slot from being recycled.
+ */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ Assert(slot->logno != InvalidUndoLogNumber);
+ slot->logno = InvalidUndoLogNumber;
+ memset(&slot->meta, 0, sizeof(slot->meta));
+ LWLockRelease(&slot->mutex);
+ LWLockRelease(UndoLogLock);
you've not taken the discard_lock as mentioned in the comment.
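i.e. something along these lines (only a sketch; the correct ordering of
discard_lock relative to the mutex needs to be confirmed):

    LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
    LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
    LWLockAcquire(&slot->discard_lock, LW_EXCLUSIVE);
    ...
    LWLockRelease(&slot->discard_lock);
    LWLockRelease(&slot->mutex);
    LWLockRelease(UndoLogLock);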

8. In find_undo_log_slot,
+ * 1.  If the calling code knows that it is attached to this lock or is the
s/lock/slot

+ * 2.  All other code should acquire log->mutex before accessing any members,
+ * and after doing so, check that the logno hasn't moved.  If it is not, the
+ * entire undo log must be assumed to be discarded (as if this function
+ * returned NULL) and the caller must behave accordingly.
Perhaps you meant '...check that the logno remains the same.  If it does not...'.

+ /*
+ * If we didn't find it, then it must already have been entirely
+ * discarded.  We create a negative cache entry so that we can answer
+ * this question quickly next time.
+ *
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
This is an interesting thought. But, I'm wondering how we are going to
search the discarded logno in the simple hash. I guess that's why it's
in the TODO list.

9. In attach_undo_log,
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level.  We'll grovel though it to find something for the
+ * tablespace you asked for.  If you're not using multiple tablespaces
s/though/through

+ if (slot == NULL)
+ {
+ if (UndoLogShared->next_logno > MaxUndoLogNumber)
+ {
+ /*
+ * You've used up all 16 exabytes of undo log addressing space.
+ * This is a difficult state to reach using only 16 exabytes of
+ * WAL.
+ */
+ elog(ERROR, "undo log address space exhausted");
+ }
looks like a potential unlikely() condition.
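That is, roughly:

    if (unlikely(UndoLogShared->next_logno > MaxUndoLogNumber))
        elog(ERROR, "undo log address space exhausted");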

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Initdb failure

2019-07-24 Thread vignesh C
Hi,

Initdb fails when the following path is provided as input:
datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/datasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafddsdatasadfasfdsafdds/

The subsequent cleanup also tends to fail.

Could be something to do with path handling.
I'm not sure if this is already known.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write

2019-07-24 Thread Tsunakawa, Takayuki
From: Matsumura, Ryo [mailto:matsumura@jp.fujitsu.com]
> Detail:
> If target_session_attrs is set to read-write, PQconnectPoll() calls
> PQsendQuery("SHOW transaction_read_only") although the previous return value
> was PGRES_POLLING_READING not WRITING.

The current code probably assumes that the PQsendQuery() call that sends "SHOW 
transaction_read_only" shouldn't block, because the message is small enough to 
fit in the socket send buffer.  Likewise, the code in 
CONNECTION_AWAITING_RESPONSE case sends authentication data using 
pg_fe_sendauth() without checking for the write-ready status.  OTOH, the code 
in CONNECTION_MADE case waits for write-ready status in advance before sending 
the startup packet.  That's because the startup packet could get large enough 
to cause pqPacketSend() to block.

So, I don't think the fix is necessary.

> In result, PQsendQuery() may be blocked in pqsecure_raw_write().

FWIW, if PQsendQuery() blocked during connection establishment, I think it 
should block in poll() called from pqWait(), because libpq's 
socket is set non-blocking.


Regards
Takayuki Tsunakawa





Re: double free in ExecHashJoin, 9.6.12

2019-07-24 Thread Thomas Munro
On Thu, Jul 25, 2019 at 2:39 AM Merlin Moncure  wrote:
> Server is generally running pretty well, and is high volume.  This
> query is not new and is also medium volume.  Database rebooted in
> about 4 seconds with no damage; fast enough we didn't even trip alarms
> (I noticed this troubleshooting another issue).  We are a couple of
> bug fixes releases behind but I didn't see anything obvious in the
> release notes suggesting a resolved issue. Anyone have any ideas?
> thanks in advance.

> postgres: rms ysconfig 10.33.190.21(36788) 
> SELECT(ExecHashJoin+0x5a2)[0x5e2d32]

Hi Merlin,

Where's the binary from (exact package name, if installed with a
package)?  Not sure if this is going to help, but is there any chance
you could disassemble that function so we can try to see what it's
doing at that offset?  For example on Debian if you have
postgresql-9.6 and postgresql-9.6-dbg installed you could run "gdb
/usr/lib/postgresql/9.6/bin/postgres" and then "disassemble
ExecHashJoin".  The code at "<+1442>" (0x5a2) is presumably calling
free or some other libc thing (though I'm surprised not to see an
intervening palloc thing).

--
Thomas Munro
https://enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread vignesh C
On Thu, Jul 25, 2019 at 7:48 AM Amit Kapila  wrote:
>
> On Wed, Jul 24, 2019 at 11:04 PM vignesh C  wrote:
> >
> > Hi,
> >
> > I have done some review of undolog patch series
> > and here are my comments:
> > 0003-Add-undo-log-manager.patch
> >
> > 1) As undo log is being created in tablespace,
> >  if the tablespace is dropped later, will it have any impact?

Thanks Amit, that clarifies the problem I was thinking of.
I have another question regarding drop tablespace failure, but I
don't have a better solution for that problem yet.
Let me think more about it and then discuss.
>
> Yes, it drops the undo logs present in tablespace being dropped.  See
> DropUndoLogsInTablespace() in the same patch.
>
> >
> > 4) Should we add a readme file for undolog as it does a fair amount of work
> > and is core part of the undo system?
> >
Thanks Amit, I was able to get the details from the README.
>
> The Readme is already present in the patch series posted by Thomas.
> See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in
> email [1].
>
> [1] - 
> https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

-- 
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan  wrote:
> There seems to be a kind of "synergy" between the nbtsplitloc.c
> handling of pages that have lots of duplicates and posting list
> compression. It seems as if the former mechanism "sets up the bowling
> pins", while the latter mechanism "knocks them down", which is really
> cool. We should try to gain a better understanding of how that works,
> because it's possible that it could be even more effective in some
> cases.

I found another important way in which this synergy can fail to take
place, which I can fix.

By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost-page-split-heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.

I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.

-- 
Peter Geoghegan




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-07-24 Thread Noah Misch
On Wed, Jul 24, 2019 at 05:27:18PM +0900, Kyotaro Horiguchi wrote:
> Sorry in advance for link-breaking message forced by gmail..

Using the archives page "Resend email" link avoids that.

> https://www.postgresql.org/message-id/flat/20190202083822.gc32...@gust.leadboat.com
> 
> > 1. The result of the test is valid only until we release the SLRU 
> > ControlLock,
> >which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to 
> > evaluate
> >segments for deletion.  Once we release that lock, latest_page_number can
> >advance.  This creates a TOCTOU race condition, allowing excess deletion:
> >
> >
> >[local] test=# table trunc_clog_concurrency ;
> >ERROR:  could not access status of transaction 2149484247
> >DETAIL:  Could not open file "pg_xact/0801": No such file or directory.
> 
> It seems like some other vacuum process saw larger cutoff page?

No, just one VACUUM suffices.

> If I'm
> not missing something, the missing page is no longer the
> "recently-populated" page at the time (As I understand it as the last
> page that holds valid data). Couldn't we just ignore ENOENT there?

The server reported this error while attempting to read CLOG to determine
whether a tuple's xmin committed or aborted.  That ENOENT means the relevant
CLOG page is not available.  To ignore that ENOENT, the server would need to
guess whether to consider the xmin committed or consider it aborted.  So, no,
we can't just ignore the ENOENT.




Re: dropdb --force

2019-07-24 Thread Tom Lane
Pavel Stehule  writes:
> [ drop-database-force-20190708.patch ]

I took a brief look at this, but I don't think it's really close to
being committable.

* The documentation claims FORCE will fail if you don't have privileges
to terminate the other session(s) in the target DB.  This is a lie; the
code issues kill() without any regard for such niceties.  You could
perhaps make that better by going through pg_signal_backend().

* You've hacked CountOtherDBBackends to do something that's completely
outside the charter one would expect from its name, and not even
bothered to update its header comment.  This requires more attention
to not confusing future hackers; I'd say you can't even use that
function name anymore.

* You've also ignored the existing code in CountOtherDBBackends
that is careful *not* to issue a kill() while holding the critical
ProcArrayLock.  That problem would get enormously worse if you
tried to sub in pg_signal_backend() there, since that function
may do catalog accesses --- it's pretty likely that you could
get actual deadlocks, never mind just trashing performance.

* I really dislike the addition of more hard-wired delays and
timeouts to dropdb().  It's bad enough that CountOtherDBBackends
has got that.  Two layers of timeout are a rickety mess, and
who's to say that a 1-minute timeout is appropriate for anything?

* I'm concerned that the proposed syntax is not future-proof.
FORCE is not a reserved word, and we surely don't want to make
it one; but just appending it to the end of the command without
any decoration seems like a recipe for problems if anybody wants
to add other options later.  (Possible examples: RESTRICT/CASCADE,
or a user-defined timeout.)  Maybe some parentheses would help?
Or possibly I'm being overly paranoid, but ...


I hadn't been paying any attention to this thread before now,
but I'd assumed from the thread title that the idea was to implement
any attempted kills in the dropdb app, not on the backend side.
(As indeed it looks like the first version did.)  Maybe it would be
better to go back to that, instead of putting dubious behaviors
into the core server.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Amit Kapila
On Wed, Jul 24, 2019 at 11:04 PM vignesh C  wrote:
>
> Hi,
>
> I have done some review of undolog patch series
> and here are my comments:
> 0003-Add-undo-log-manager.patch
>
> 1) As undo log is being created in tablespace,
>  if the tablespace is dropped later, will it have any impact?
>

Yes, it drops the undo logs present in tablespace being dropped.  See
DropUndoLogsInTablespace() in the same patch.

>
> 4) Should we add a readme file for undolog as it does a fair amount of work
> and is core part of the undo system?
>

The Readme is already present in the patch series posted by Thomas.
See 0019-Add-developer-documentation-for-the-undo-log-storage.patch in
email [1].

[1] - 
https://www.postgresql.org/message-id/CA%2BhUKGKni7EEU4FT71vZCCwPeaGb2PQOeKOFjQJavKnD577UMQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: benchmarking Flex practices

2019-07-24 Thread Tom Lane
Chapman Flack  writes:
> On 07/24/19 03:45, John Naylor wrote:
>> On Sun, Jul 21, 2019 at 3:14 AM Tom Lane  wrote:
>>> However, my second reaction was that maybe you were on to something
>>> upthread when you speculated about postponing de-escaping of
>>> Unicode literals into the grammar.  If we did it like that then

> With the de-escaping postponed, I think we'd be able to move beyond the
> current odd situation where Unicode escapes can't describe non-ascii
> characters, in exactly and only the cases where you need them to.

How so?  The grammar doesn't really have any more context information
than the lexer does.  (In both cases, it would be ugly but not really
invalid for the transformation to depend on the database encoding,
I think.)

regards, tom lane




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-24 Thread Kyotaro Horiguchi
I found the CF-bot complaining about this.

It seems that some comment fixes by the recent 21039555cd are the
cause.

No substantial changes have been made by this rebasing.

regards.

On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi
 wrote:
>
> At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
>  wrote in 
> <20190712.173041.236938840.horikyota@gmail.com>
> > The v16 seems no longer works so I'll send further rebased version.
>
> It's just by renaming of TestLib::real_dir to perl2host.
> This is rebased version v17.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



--
Kyotaro Horiguchi
NTT Open Source Software Center


v18-0001-TAP-test-for-copy-truncation-optimization.patch
Description: Binary data


v18-0002-Fix-WAL-skipping-feature.patch
Description: Binary data


v18-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch
Description: Binary data


[PATCH] Race condition in logical walsender causes long postgresql shutdown delay

2019-07-24 Thread Craig Ringer
Hi folks

I recently tracked down a race in shutdown of logical walsenders that can
cause PostgreSQL shutdown to hang for wal_sender_timeout/2 before it
continues to a normal shutdown. With a long timeout that can be quite
disruptive.

TL;DR: The logical walsender may be signalled to stop, then read the last
WAL record before the shutdown checkpoint is due to be written and go to
sleep. The checkpointer will wait for it to acknowledge the shutdown and
the walsender will wait for new WAL. The deadlock is eventually broken by
the walsender timeout keepalive timer.

Patch attached.

The issue arises from the difference between logical and physical walsender
shutdown as introduced by commit c6c3334364 "Prevent possibility of panics
during shutdown checkpoint". It's probably fairly hard to trigger. I ran
into a case where it happened regularly only because of an unrelated patch
that caused some WAL to be written just before the checkpointer issued
walsender shutdown requests. But it's still a legit bug.

If you hit the issue you'll see that walsender(s) can be seen to be
sleeping in WaitLatchOrSocket in WalSndLoop. They'll keep sleeping until
woken by the keepalive timeout. The checkpointer will be waiting in
WalSndWaitStopping() for the walsenders to enter WALSNDSTATE_STOPPING or
exit, whichever happens first. The postmaster will be waiting in ServerLoop
for the checkpointer to finish the shutdown checkpoint.

The checkpointer waits in WalSndWaitStopping() for all walsenders to either
exit or enter WALSNDSTATE_STOPPING state. Logical walsenders never enter
WALSNDSTATE_STOPPING, they go straight to exiting, so the checkpointer
can't finish WalSndWaitStopping() and write the shutdown checkpoint. A
logical walsender usually notices the shutdown request and exits as soon as
it has flushed all WAL up to the server's flushpoint, while physical
walsenders enter WALSNDSTATE_STOPPING.

But there's a race where a logical walsender may read the final available
record and notice it has caught up - but not notice that it has reached
end-of-WAL and check whether it should exit. This happens on the following
(simplified) code path in XLogSendLogical:

if (record != NULL)
{
XLogRecPtr  flushPtr = GetFlushRecPtr();
LogicalDecodingProcessRecord(...);
sentPtr = ...;
if (sentPtr >= flushPtr)
WalSndCaughtUp = true;// <-- HERE
}

because the test for got_STOPPING that sets got_SIGUSR2 is only on the
other branch where getting a record returns `NULL`; this branch can sleep
before checking if shutdown was requested.
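For illustration, the shape of the fix is roughly this (a sketch that
mirrors the simplified code path above, not the attached patch verbatim):

if (record != NULL)
{
    XLogRecPtr  flushPtr = GetFlushRecPtr();

    LogicalDecodingProcessRecord(...);
    sentPtr = ...;
    if (sentPtr >= flushPtr)
    {
        WalSndCaughtUp = true;
        /* also honour a pending stop request here rather than sleeping */
        if (got_STOPPING)
            got_SIGUSR2 = true;
    }
}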

So if the walsender read the last WAL record available, when it's >= the
flush pointer and it already handled the SIGUSR1 latch wakeup for the WAL
write, it might go back to sleep and not wake up until the timeout.

The checkpointer already sent PROCSIG_WALSND_INIT_STOPPING to the
walsenders in the prior WalSndInitStopping() call so the walsender won't be
woken by a signal from the checkpointer. No new WAL will be written because
the walsender just consumed the final record written before the
checkpointer went to sleep, and the checkpointer won't write anything more
until the walsender exits.  The client might not be due a keepalive for some
time.  The only reason this doesn't turn into a total deadlock is that
keepalive wakeup.

An alternative fix would be to have the logical walsender set
WALSNDSTATE_STOPPING instead of faking got_SIGUSR2, then go to sleep
waiting for more WAL. Logical decoding would need to check if it was
running during shutdown and Assert(...) then ERROR if it saw any WAL
records that result in output plugin calls or snapshot management calls. I
avoided this approach as it's more intrusive and I'm not confident I can
concoct a reliable test to trigger it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise
From 559dda09b35870d3630a65cbca682e50343c6f0f Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 25 Jul 2019 09:14:58 +0800
Subject: [PATCH] Fix a delay in PostgreSQL shutdown caused by logical
 replication

Due to a race with WAL writing during shutdown, if logical walsenders were
running then PostgreSQL's shutdown could be delayed by up to
wal_sender_timeout/2 while it waits for the walsenders to shut down. The
walsenders wait for new WAL or end-of-wal which won't come until shutdown so
there's a deadlock. The walsender timeout eventually breaks the deadlock.

The issue was introduced by PostgreSQL 10 commit c6c3334364
"Prevent possibility of panics during shutdown checkpoint".

A logical walsender never enters WALSNDSTATE_STOPPING and allows the
checkpointer to continue shutdown. Instead it exits when it reads end-of-WAL.
But if it reads the last WAL record written before shutdown and that record
doesn't generate a client network write, it can mark itself caught up and go to
sleep without checking to see if it's been asked to shut 

Re: Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Thomas Munro
On Thu, Jul 25, 2019 at 12:42 PM Tom Lane  wrote:
> Andres Freund  writes:
> > On 2019-07-24 20:34:39 -0400, Tom Lane wrote:
> >> Yeah, I would absolutely NOT recommend that you open that can of worms
> >> right now.  We have looked at adding unsigned integer types in the past
> >> and it looked like a mess.
>
> > I assume Thomas was thinking more of another bespoke type like xid, just
> > wider.  There's some notational advantage in not being able to
> > immediately do math etc on xids.
>
> Well, we could invent an xid8 type if we want, just don't try to make
> it part of the numeric hierarchy (as indeed xid isn't).

Yeah, I meant an xid64/xid8/fxid/pg_something/... type that isn't a
kind of number.

-- 
Thomas Munro
https://enterprisedb.com




Re: On the stability of TAP tests for LDAP

2019-07-24 Thread Michael Paquier
On Wed, Jul 24, 2019 at 09:01:47PM +1200, Thomas Munro wrote:
> Huh, yeah, I don't know why slapd requires credentials on Debian, when
> the version that ships with FreeBSD is OK with an anonymous
> connection.  Rather than worrying about that, I just adjusted it to
> supply the credentials.  It works on both for me.

Thanks for the updated patch, this looks good.  I have done a series
of tests keeping my laptop busy and I haven't seen a failure where I
usually see problems 10%~20% of the time.  So that seems to help,
thanks!
--
Michael




Re: Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-24 20:34:39 -0400, Tom Lane wrote:
>> Yeah, I would absolutely NOT recommend that you open that can of worms
>> right now.  We have looked at adding unsigned integer types in the past
>> and it looked like a mess.

> I assume Thomas was thinking more of another bespoke type like xid, just
> wider.  There's some notational advantage in not being able to
> immediately do math etc on xids.

Well, we could invent an xid8 type if we want, just don't try to make
it part of the numeric hierarchy (as indeed xid isn't).

regards, tom lane




Re: Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-24 20:34:39 -0400, Tom Lane wrote:
> Yeah, I would absolutely NOT recommend that you open that can of worms
> right now.  We have looked at adding unsigned integer types in the past
> and it looked like a mess.

I assume Thomas was thinking more of another bespoke type like xid, just
wider.  There's some notational advantage in not being able to
immediately do math etc on xids.

- Andres




Re: Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-25 12:20:58 +1200, Thomas Munro wrote:
>> On Thu, Jul 25, 2019 at 12:06 PM Andres Freund  wrote:
>>> Seems easiest to just add xid_current(), or add a cast from int8 to xid
>>> (probably explicit?) that handles the wraparound logic correctly?

>> Yeah, I was wondering about that.  int8 isn't really the right type,
>> since FullTransactionId is unsigned.

> For now that doesn't seem that big an impediment...

Yeah, I would absolutely NOT recommend that you open that can of worms
right now.  We have looked at adding unsigned integer types in the past
and it looked like a mess.

I think an explicit cast is a reasonable thing to add, though.

regards, tom lane




Re: Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-25 12:20:58 +1200, Thomas Munro wrote:
> On Thu, Jul 25, 2019 at 12:06 PM Andres Freund  wrote:
> > we have txid_current(), which returns an int8. But there's no convenient
> > way to convert that to type 'xid'. Which is fairly inconvenient, given
> > that we expose xids in various places.
> >
> > My current need for this was just a regression test to make sure that
> > system columns (xmin/xmax in particular) don't get broken again for ON
> > CONFLICT. But I've needed this before in other scenarios - e.g. age(xid)
> > can be useful to figure out how old a transaction is, but age() doesn't
> > work with txid_current()'s return value.
> >
> > Seems easiest to just add xid_current(), or add a cast from int8 to xid
> > (probably explicit?) that handles the wraparound logic correctly?
> 
> Yeah, I was wondering about that.  int8 isn't really the right type,
> since FullTransactionId is unsigned.

For now that doesn't seem that big an impediment...


> If we had a SQL type for 64 bit xids, it should be convertible to xid,
> and the reverse conversion should require a more complicated dance.
> Of course we can't casually change txid_current() without annoying
> people who are using it, so perhaps if we invent a new SQL type we
> should also make a new function that returns it.

Possibly we could add a fullxid or xid8, xid64, pg_xid64, ... type, and
have an implicit cast to int8?

Greetings,

Andres Freund




Re: Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Thomas Munro
On Thu, Jul 25, 2019 at 12:06 PM Andres Freund  wrote:
> we have txid_current(), which returns an int8. But there's no convenient
> way to convert that to type 'xid'. Which is fairly inconvenient, given
> that we expose xids in various places.
>
> My current need for this was just a regression test to make sure that
> system columns (xmin/xmax in particular) don't get broken again for ON
> CONFLICT. But I've needed this before in other scenarios - e.g. age(xid)
> can be useful to figure out how old a transaction is, but age() doesn't
> work with txid_current()'s return value.
>
> Seems easiest to just add xid_current(), or add a cast from int8 to xid
> (probably explicit?) that handles the wraparound logic correctly?

Yeah, I was wondering about that.  int8 isn't really the right type,
since FullTransactionId is unsigned.  If we had a SQL type for 64 bit
xids, it should be convertible to xid, and the reverse conversion
should require a more complicated dance.  Of course we can't casually
change txid_current() without annoying people who are using it, so
perhaps if we invent a new SQL type we should also make a new function
that returns it.

-- 
Thomas Munro
https://enterprisedb.com




Re: ON CONFLICT (and manual row locks) cause xmax of updated tuple to unnecessarily be set

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 4:24 PM Andres Freund  wrote:
> as you can see the same xmax is set for both row version, with the new
> infomask being  HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_LOCK_ONLY | HEAP_UPDATED.

Meta remark about your test case: I am a big fan of microbenchmarks
like this, which execute simple DML queries using a single connection,
and then consider if the on-disk state looks as good as expected, for
some value of "good". I had a lot of success with this approach while
developing the v12 work on nbtree, where I went to the trouble of
automating everything. The same test suite also helped with the nbtree
compression/deduplication patch just today.

I like to call these tests "wind tunnel tests". It's far from obvious
that you can take a totally synthetic, serial test, and use it to
measure something that is important to real workloads. It seems to
work well when there is a narrow, specific thing that you're
interested in. This is especially true when there is a real risk of
regressing performance in some way.

> but we really don't need to do any of that in this case - the only
> locker is the current backend, after all.
>
> I think this isn't great, because it'll later will cause unnecessary
> hint bit writes (although ones somewhat likely combined with setting
> XMIN_COMMITTED), and even extra work for freezing.
>
> Based on a quick look this wasn't the case before the finer grained
> tuple locking - which makes sense, there was no cases where locks would
> need to be carried forward.

I agree that this is unfortunate. Are you planning on working on it?

-- 
Peter Geoghegan




Should we add xid_current() or a int8->xid cast?

2019-07-24 Thread Andres Freund
Hi,

we have txid_current(), which returns an int8. But there's no convenient
way to convert that to type 'xid'. Which is fairly inconvenient, given
that we expose xids in various places.

My current need for this was just a regression test to make sure that
system columns (xmin/xmax in particular) don't get broken again for ON
CONFLICT. But I've needed this before in other scenarios - e.g. age(xid)
can be useful to figure out how old a transaction is, but age() doesn't
work with txid_current()'s return value.

Seems easiest to just add xid_current(), or add a cast from int8 to xid
(probably explicit?) that handles the wraparound logic correctly?
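Roughly, since txid_current() returns the epoch in the high 32 bits and the
xid in the low 32 bits, the conversion itself is trivial (sketch only; the
function name is made up, and any epoch/range validation is left out):

static TransactionId
int8_to_xid(int64 txid)
{
    /* the low 32 bits of the epoch-extended value are the xid */
    return (TransactionId) (txid & 0xFFFFFFFF);
}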

Greetings,

Andres Freund




Re: Compile from source using latest Microsoft Windows SDK

2019-07-24 Thread Michael Paquier
On Wed, Jul 24, 2019 at 10:38:47AM -0400, Andrew Dunstan wrote:
> Yeah, on consideration I think Peifeng's patch upthread looks OK.
> (Incidentally, this variable is not set in the very old version of VC
> running on currawong).

Interesting.  I am not actually sure in which version of VS this has
been introduced.  But it would be fine enough to do nothing if the
variable is not defined and rely on the default.  Except for the
formatting and indentation, the patch looks right.  Andrew, perhaps
you would prefer doing the final touch on it and commit it?
--
Michael




RE: seems like a bug in pgbench -R

2019-07-24 Thread Imai, Yoshikazu
On Wed, July 24, 2019 at 7:02 PM, Fabien COELHO wrote:
> > but I have one question. Is it better adding any check like if(maxsock
> > != -1) before the select?
> >
> > else/* no explicit delay, select without timeout */
> > {
> >    nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
> > }
> 
> I think that it is not necessary because this case cannot happen: If some
> clients are still running (remains > 0), they are either sleeping, in
> which case there would be a timeout, or they are waiting for something
> from the server, otherwise the script could be advanced further so there
> would be something else to do for the thread.

Ah, I understand.

> We could check this by adding "Assert(maxsock != -1);" before this select,
> but I would not do that for a released version.

Yeah, I also imagined that we could use an Assert, but ah, it's a released version.
I got it.  Thanks for pointing that out.

So I'll mark this ready for committer.

--
Yoshikazu Imai





Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-24 Thread Michael Paquier
On Wed, Jul 24, 2019 at 11:23:30AM -0400, Alvaro Herrera wrote:
> Heh, yesterday I revised the original patch as attached and was about to
> push when the bell rang.  I like this one because it keeps the comment
> to one line and it mentions the function name in charge of the
> validation (useful for grepping later on).  It's a bit laconic because
> of the long function name and the desire to keep it to one line, but it
> seems sufficient to me.

Looks fine to me.  A nit: addition of braces for the if block.  Even
if that is a one-liner, there is a comment, so I think this makes the
code more readable.
--
Michael




Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-07-24 Thread David Rowley
On Thu, 25 Jul 2019 at 11:33, Andres Freund  wrote:
>
> On 2019-07-25 10:36:26 +1200, David Rowley wrote:
> > 2) Planner trying to give nodeAgg.c a sorted path to work with on
> > DISTINCT / ORDER BY aggs
>
> That'll have to be a best effort thing though, i.e. there'll always be
> cases where we'll have to retain the current logic (or just regress
> performance really badly)?

It's something we already do for windowing functions. We just don't do
it for aggregates. It's slightly different since windowing functions
just chain nodes together to evaluate multiple window definitions.
Aggregates can't/don't do that since aggregates... well, aggregate
(i.e. the input to the 2nd one can't be aggregated by the 1st one), but
there's likely not much of a reason why standard_qp_callback couldn't
choose some pathkeys for the first Aggref with an ORDER BY / DISTINCT
clause. nodeAgg.c would still need to know how to change the sort
order in order to evaluate other Aggrefs in some different order.

I'm not quite sure where the regression would be. nodeAgg.c must
perform the sort, or if we give the planner some pathkeys, then worst
case the planner adds a Sort node. That seems equivalent to me.
However, in the best case, there's a suitable index and no sorting is
required anywhere. Probably then we can add combine function support
for the remaining built-in aggregates. There was trouble doing that in
[1] due to some concerns about messing up results for people who rely
on the order of an aggregate without actually writing an ORDER BY.

[1] 
https://www.postgresql.org/message-id/cakjs1f9sx_6gtcvd6tmuznntch0vhbzhx6fzqw17tgvfh-g...@mail.gmail.com


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-25 10:36:26 +1200, David Rowley wrote:
> I'd like to do much more in nodeAgg.c, TBH. It would be good to remove
> some code from nodeAgg.c and put it in the planner.

Indeed!


> I'd like to see:
> 
> 1) Planner doing the Aggref merging for aggregates with the same
> transfn etc.

Makes sense.

I assume this would entail associating T_Aggref expressions with the
corresponding Agg at an earlier stage? The whole business of having to
prepare expression evaluation, just so ExecInitAgg() can figure out
which aggregates it has to compute always has struck me as
architecturally bad.


> 2) Planner trying to give nodeAgg.c a sorted path to work with on
> DISTINCT / ORDER BY aggs

That'll have to be a best effort thing though, i.e. there'll always be
cases where we'll have to retain the current logic (or just regress
performance really badly)?

Greetings,

Andres Freund




ON CONFLICT (and manual row locks) cause xmax of updated tuple to unnecessarily be set

2019-07-24 Thread Andres Freund
Hi,


Scenario is a very plain upsert:

CREATE TABLE upsert(key int primary key);
INSERT INTO upsert VALUES(1) ON CONFLICT (key) DO UPDATE SET key = excluded.key;
INSERT INTO upsert VALUES(1) ON CONFLICT (key) DO UPDATE SET key = excluded.key;
INSERT 0 1
INSERT 0 1

postgres[8755][1]=# SELECT page_items.* FROM generate_series(0, 
pg_relation_size('upsert'::regclass::text) / 8192 - 1) blkno, 
get_raw_page('upsert'::regclass::text, blkno::int4) AS raw_page, 
heap_page_items(raw_page) as page_items;
 lp │ lp_off │ lp_flags │ lp_len │  t_xmin  │  t_xmax  │ t_field3 │ t_ctid │ t_infomask2 │ t_infomask │ t_hoff │ t_bits │ t_oid  │ t_data
────┼────────┼──────────┼────────┼──────────┼──────────┼──────────┼────────┼─────────────┼────────────┼────────┼────────┼────────┼────────
  1 │   8160 │        1 │     28 │ 19742591 │ 19742592 │        0 │ (0,2)  │       24577 │        256 │     24 │ (null) │ (null) │ \x0100
  2 │   8128 │        1 │     28 │ 19742592 │ 19742592 │        0 │ (0,2)  │       32769 │       8336 │     24 │ (null) │ (null) │ \x0100
(2 rows)

As you can see, the same xmax is set for both row versions, with the new
infomask being  HEAP_XMAX_KEYSHR_LOCK | HEAP_XMAX_LOCK_ONLY | HEAP_UPDATED.


The reason that happens is that ExecOnConflictUpdate() needs to lock the
row to be able to reliably compute a new tuple version based on that
row. heap_update() then decides it needs to carry that xmax forward, as
it's a valid lock:


    /*
     * If the tuple we're updating is locked, we need to preserve the locking
     * info in the old tuple's Xmax.  Prepare a new Xmax value for this.
     */
    compute_new_xmax_infomask(HeapTupleHeaderGetRawXmax(oldtup.t_data),
                              oldtup.t_data->t_infomask,
                              oldtup.t_data->t_infomask2,
                              xid, *lockmode, true,
                              &xmax_old_tuple, &infomask_old_tuple,
                              &infomask2_old_tuple);

    /*
     * And also prepare an Xmax value for the new copy of the tuple.  If there
     * was no xmax previously, or there was one but all lockers are now gone,
     * then use InvalidXid; otherwise, get the xmax from the old tuple.  (In
     * rare cases that might also be InvalidXid and yet not have the
     * HEAP_XMAX_INVALID bit set; that's fine.)
     */
    if ((oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
        HEAP_LOCKED_UPGRADED(oldtup.t_data->t_infomask) ||
        (checked_lockers && !locker_remains))
        xmax_new_tuple = InvalidTransactionId;
    else
        xmax_new_tuple = HeapTupleHeaderGetRawXmax(oldtup.t_data);

but we really don't need to do any of that in this case - the only
locker is the current backend, after all.

I think this isn't great, because it'll later will cause unnecessary
hint bit writes (although ones somewhat likely combined with setting
XMIN_COMMITTED), and even extra work for freezing.

Based on a quick look this wasn't the case before the finer grained
tuple locking - which makes sense, there was no cases where locks would
need to be carried forward.


It's worthwhile to note that this *nearly* already works, because of the
following codepath in compute_new_xmax_infomask():

     * If the lock to be acquired is for the same TransactionId as the
     * existing lock, there's an optimization possible: consider only the
     * strongest of both locks as the only one present, and restart.
     */
    if (xmax == add_to_xmax)
    {
        /*
         * Note that it's not possible for the original tuple to be updated:
         * we wouldn't be here because the tuple would have been invisible and
         * we wouldn't try to update it.  As a subtlety, this code can also
         * run when traversing an update chain to lock future versions of a
         * tuple.  But we wouldn't be here either, because the add_to_xmax
         * would be different from the original updater.
         */
        Assert(HEAP_XMAX_IS_LOCKED_ONLY(old_infomask));

        /* acquire the strongest of both */
        if (mode < old_mode)
            mode = old_mode;
        /* mustn't touch is_update */


Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-07-24 Thread David Rowley
On Thu, 25 Jul 2019 at 06:52, Andres Freund  wrote:
> Now that master is open for development, and you have a commit bit, are
> you planning to go forward with this on your own?

I plan to, but it's not a high priority at the moment.  I'd like to do
much more in nodeAgg.c, TBH. It would be good to remove some code from
nodeAgg.c and put it in the planner.

I'd like to see:

1) Planner doing the Aggref merging for aggregates with the same transfn etc.
2) Planner trying to give nodeAgg.c a sorted path to work with on
DISTINCT / ORDER BY aggs
3) Planner providing nodeAgg.c with the order that the aggregates
should be evaluated in order to minimise sorting for DISTINCT  / ORDER
BY aggs.

I'd take all those up on a separate thread though.

If you're in a rush to see the cleanup proposed a few months ago then
please feel free to take it up. It might be a while before I can get a
chance to look at it again.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: pgbench - allow to create partitioned tables

2019-07-24 Thread Fabien COELHO



  # and look at latency:
  # no parts = 0.071 ms
  #   1 hash = 0.071 ms (did someone optimize this case?!)
  #   2 hash ~ 0.126 ms (+ 0.055 ms)
  #  50 hash ~ 0.155 ms
  # 100 hash ~ 0.178 ms
  # 150 hash ~ 0.232 ms
  # 200 hash ~ 0.279 ms
  # overhead ~ (0.050 + [0.0005-0.0008] * nparts) ms


It is linear?


Good question. I would have hoped affine, but this is not very clear on these 
data, which are the median of about five runs, hence the bracket on the slope 
factor. At least it is increasing with the number of partitions. Maybe it 
would be clearer on the minimum of five runs.


Here is a follow-up.

On the minimum of all available runs the query time on hash partitions is 
about:


   0.64375 * nparts + 118.30979  (in µs).

So the overhead is about 47.30979 + 0.64375 * nparts µs, and it is indeed
pretty convincingly linear, as suggested by the attached figure.


--
Fabien.

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-24 Thread Peter Geoghegan
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan  wrote:
> Attached is a revised version of your v2 that fixes this issue -- I'll
> call this v3.

Remember that index that I said was 5.5x smaller with the patch
applied, following retail insertions (a single big INSERT ... SELECT
...)? Well, it's 6.5x faster with this small additional patch applied
on top of the v3 I posted yesterday. Many of the indexes in my test
suite are about ~20% smaller __in addition to__ very big size
reductions. Some are even ~30% smaller than they were with v3 of the
patch. For example, the fair use implementation of TPC-H that my test
data comes from has an index on the "orders" o_orderdate column, named
idx_orders_orderdate, which is made ~30% smaller by the addition of
this simple patch (once again, this is following a single big INSERT
... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller
than it is with master/Postgres 12, in case you were wondering.

This new patch teaches nbtsplitloc.c to subtract posting list overhead
when sizing the new high key for the left half of a candidate split
point, since we know for sure that _bt_truncate() will at least manage
to truncate away that much from the new high key, even in the worst
case. Since posting lists are often very large, this can make a big
difference. This is actually just a bugfix, not a new idea -- I merely
made nbtsplitloc.c understand how truncation works with posting lists.

There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.

-- 
Peter Geoghegan


0003-Account-for-posting-list-overhead-during-splits.patch
Description: Binary data


Re: Built-in connection pooler

2019-07-24 Thread Ryan Lambert
Hello Konstantin,

> Concerning testing: I do not think that connection pooler needs some kind
of special tests.
> The idea of built-in connection pooler is that it should be able to
handle all requests normal postgres can do.
> I have added to regression tests extra path with enabled connection
proxies.
> Unfortunately, pg_regress is altering some session variables, so the
> backend becomes tainted and so
> pooling is not actually used (but communication through proxy is tested).

Thank you for your work on this patch, I took some good time to really
explore the configuration and do some testing with pgbench.  This round of
testing was done against patch 10 [1] and master branch commit a0555ddab9
from 7/22.

Thank you for explaining, I wasn't sure.


make installcheck-world: tested, passed
Implements feature:  tested, passed
Documentation:  I need to review again, I saw typos when testing but didn't
make note of the details.

Applying the patch [1] has improved from v9, still getting these:

git apply -p1 < builtin_connection_proxy-10.patch
<stdin>:1536: indent with spaces.
/* Has pending clients: serve one of them */
<stdin>:1936: indent with spaces.
/* If we attach new client to the existed backend,
then we need to send handshake response to the client */
<stdin>:2208: indent with spaces.
if (port->sock == PGINVALID_SOCKET)
<stdin>:2416: indent with spaces.
FuncCallContext* srf_ctx;
<stdin>:2429: indent with spaces.
ps_ctx = (PoolerStateContext*)palloc(sizeof(PoolerStateContext));
warning: squelched 5 whitespace errors
warning: 10 lines add whitespace errors.


I used a DigitalOcean droplet with 2 CPU and 2 GB RAM and SSD for this
testing, Ubuntu 18.04.  I chose the smaller server size based on the
availability of similar and recent results around connection pooling [2]
that used AWS EC2 m4.large instance (2 cores, 8 GB RAM) and pgbouncer.
Your prior pgbench tests [3] also focused on larger servers so I wanted to
see how this works on smaller hardware.

Considering this from connpool.sgml:
"connection_proxies specifies number of connection proxy
processes which will be spawned. Default value is zero, so connection
pooling is disabled by default."

That hints to me that connection_proxies is the main configuration to start
with so that was the only configuration I changed from the default for this
feature.  I adjusted shared_buffers to 500MB (25% of total) and
max_connections to 1000.  Only having one proxy gives subpar performance
across the board, as did setting this value to 10.  My hunch is this value
should roughly follow the number of CPUs available, but that's just a hunch.
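For reference, the non-default settings used for these runs were roughly
the following (connection_proxies was the value varied between tests):

    # postgresql.conf (test configuration)
    shared_buffers = 500MB        # 25% of the droplet's 2 GB RAM
    max_connections = 1000
    connection_proxies = 2        # also tested with 1 and 10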

I tested with 25, 75, 150, 300 and 600 connections.  Initialized with a
scale of 1000 and ran read-only tests.  Basic pgbench commands look like
the following, I have full commands and results from 18 tests included in
the attached MD file.  Postgres was restarted between each test.

pgbench -i -s 1000 bench_test
pgbench -p 6543 -c 300 -j 2 -T 600 -P 60 -S bench_test

Tests were all ran from the same server.  I intend to do further testing
with external connections using SSL.

General observations:
For each value of connection_proxies, the TPS observed at 25 connections
held up reliably through 600 connections. For this server using
connection_proxies = 2 was the fastest, but setting to 1 or 10 still
provided **predictable** throughput.  That seems to be a good attribute for
this feature.

Also predictable was the latency increase, doubling the connections roughly
doubles the latency.  This was true with or without connection pooling.

Focusing on disabled connection pooling vs the feature with two proxies,
the results are promising, the breakpoint seems to be around 150
connections.

Low connections (25):  -15% TPS; +45% latency
Medium connections (300):  +21% TPS; +38% latency
High connections (600):  Couldn't run without pooling... aka: Win for
pooling!

The two attached charts show the TPS and average latency of these two
scenarios.  This feature does a great job of maintaining a consistent TPS
as connection numbers increase.  This comes with tradeoffs of lower
throughput with < 150 connections, and higher latency across the board.
The increase in latency seems reasonable to me based on the testing I have
done so far. Compared to the results from [2] it seems latency is affecting
this feature a bit more than it does pgbouncer, yet not unreasonably so
given the benefit of having the feature built in and the reduced complexity.

I don't understand yet how max_sessions ties in.
Also, having both session_pool_size and connection_proxies seemed confusing
at first.  I still haven't figured out exactly how they relate together in
the overall operation and their impact on performance.  The new view
helped; I get the concept of **what** it is doing (connection_proxies =
more rows, session_pool_size = n_backends for each row); it's more a lack
of understanding of the **why** regarding how it will 

Re: Memory Accounting

2019-07-24 Thread Tomas Vondra

On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote:

On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis  wrote:


Previous discussion:
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop

This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
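
For illustration, here is a minimal standalone sketch of that block-level
counter.  The struct and function names below are invented for the example;
they are not the actual MemoryContextData fields or allocator entry points.

#include <stdint.h>
#include <stdlib.h>

typedef struct ToyContext
{
	uint64_t	mem_allocated;	/* bytes obtained via malloc, not palloc'd chunks */
} ToyContext;

static void *
toy_block_alloc(ToyContext *cxt, size_t blksize)
{
	void	   *block = malloc(blksize);

	if (block != NULL)
		cxt->mem_allocated += blksize;	/* count the whole malloc'd block */
	return block;
}

static void
toy_block_free(ToyContext *cxt, void *block, size_t blksize)
{
	cxt->mem_allocated -= blksize;	/* give the bytes back when the block is freed */
	free(block);
}

Asking the context how much memory it holds is then just a read of
cxt->mem_allocated, with no need to walk chunks or freelists.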



Cool! I like how straight-forward this approach is. It seems easy to
build on, as well.

Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory intensive operator, we might palloc in
a per tuple context and consistently get chunks without having to
malloc and then later, where we to try and check the bytes allocated
for one of these per tuple contexts to decide on some behavior, the
number would not be representative?



I think there's plenty of places where we quickly get into a state with
enough chunks in the freelist - the per-tuple contexts are a good
example of that, I think.


I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.



I think Heikki was asking about places with a lot of sub-contexts, which is a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).




The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.



This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?



Yes, that might be a good candidate (and it would be much simpler than
the manual accounting we use now).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Index Skip Scan

2019-07-24 Thread Dmitry Dolgov
> On Mon, Jul 22, 2019 at 7:10 PM Jesper Pedersen  
> wrote:
>
> On 7/22/19 1:44 AM, David Rowley wrote:
> > Here are the comments I noted down during the review:
> >
> > cost_index:
> >
> > I know you've not finished here, but I think it'll need to adjust
> > tuples_fetched somehow to account for estimate_num_groups() on the
> > Path's unique keys. Any Eclass with an ec_has_const = true does not
> > need to be part of the estimate there as there can only be at most one
> > value for these.
> >
> > For example, in a query such as:
> >
> > SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;
> >
> > you only need to perform estimate_num_groups() on "y".
> >
> > I'm really not quite sure on what exactly will be required from
> > amcostestimate() here.  The cost of the skip scan is not the same as
> > the normal scan. So other that API needs adjusted to allow the caller
> > to mention that we want skip scans estimated, or there needs to be
> > another callback.
> >
>
> I think this part will become more clear once the index skip scan patch
> is rebased, as we got the uniquekeys field on the Path, and the
> indexskipprefixy info on the IndexPath node.

Here is what I came up with to address the problems mentioned above in this
thread. It passes tests, but I haven't yet tested it more thoroughly (e.g. it
occurred to me that `_bt_read_closest` probably wouldn't work if the next key
that passes an index condition is a few pages away - I'll try to tackle it soon).
Just another small step forward, but I hope it's enough to rebase the planner
changes on top of it.

Also I've added a few tags, mostly to mention reviewers' contributions.


v22-0001-Index-skip-scan.patch
Description: Binary data


Re: initdb recommendations

2019-07-24 Thread Peter Eisentraut
On 2019-07-24 22:18, Tom Lane wrote:
>> I think we could just define that if geteuid == getpeereid, then
>> authentication succeeds.  Possibly make that a setting if someone wants
>> to turn it off.
> 
> We would still need to make the proposed buildfarm changes, though,
> because Windows.  (And HPUX, though if it were the only holdout
> maybe we could consider blowing it off.)
> 
> I'm not that excited about weakening our authentication rules
> just to make things easier for the buildfarm.

Yes, this idea is separate from those buildfarm changes.

> It's possible that what you suggest is a good idea anyway to reduce
> the user impact of switching from trust to peer as default auth.
> However, I'm a little worried that we'll start getting a lot of "it
> works in psql but I can't connect via JDBC-or-whatever" complaints.

Well, the existence of "local" vs. "host" already has that effect anyway.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_upgrade version checking questions

2019-07-24 Thread Peter Eisentraut
On 2019-07-23 17:30, Daniel Gustafsson wrote:
> The reason for moving is that we print default values in usage(), and that
> requires the value to be computed before calling usage().  We already do this
> for resolving environment values in parseCommandLine().  If we do it in setup,
> then we’d have to split out resolving the new_cluster.bindir into its own
> function exposed to option.c, or do you have any other suggestions there?

I think doing nontrivial work in order to print default values in
usage() is bad practice, because in unfortunate cases it would even
prevent you from calling --help.  Also, in this case, it would probably
very often exceed the typical line length of --help output and create
some general ugliness.  Writing something like "(default: same as this
pg_upgrade)" would probably achieve just about the same.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Support for jsonpath .datetime() method

2019-07-24 Thread Peter Eisentraut
On 2019-07-24 00:48, Nikita Glukhov wrote:
> It seems that our YY works like RR should:
> 
> SELECT to_date('69', 'YY');
>   to_date   
> 
>  2069-01-01
> (1 row)
> 
> SELECT to_date('70', 'YY');
>   to_date   
> 
>  1970-01-01
> (1 row)
> 
> But by the standard, the first two digits of the current year should be used in YY.

Is this behavior even documented anywhere in our documentation?  I
couldn't find it.  What's the exact specification of what it does in
these cases?

> So it's unclear what we should do: 
>  - implement YY and RR strictly following the standard only in .datetime()
>  - fix YY implementation in to_date()/to_timestamp() and implement RR
>  - use our non-standard templates in .datetime()

I think we definitely should try to use the same template system in both
the general functions and in .datetime().  This might involve some
compromises between existing behavior, Oracle behavior, SQL standard.
So far I'm not worried: If you're using two-digit years like above,
you're playing with fire anyway.  Also some of the other cases like
dealing with trailing spaces are probably acceptable as slight
incompatibilities or extensions.

We should collect a list of test cases that illustrate the differences
and then work out how to deal with them.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: initdb recommendations

2019-07-24 Thread Tom Lane
Peter Eisentraut  writes:
> If I'm logged in as the OS user that owns the data directory, I should
> be able to log in to the database system via local socket as any user.
> Because why stop me?  I can just change pg_hba.conf to let me in.

Hmm ... there's probably some minor loss of safety there, but not
much, as you say.

> I think we could just define that if geteuid == getpeereid, then
> authentication succeeds.  Possibly make that a setting if someone wants
> to turn it off.

We would still need to make the proposed buildfarm changes, though,
because Windows.  (And HPUX, though if it were the only holdout
maybe we could consider blowing it off.)

I'm not that excited about weakening our authentication rules
just to make things easier for the buildfarm.

It's possible that what you suggest is a good idea anyway to reduce
the user impact of switching from trust to peer as default auth.
However, I'm a little worried that we'll start getting a lot of "it
works in psql but I can't connect via JDBC-or-whatever" complaints.
So I dunno if it will really make things easier for users.

regards, tom lane




Re: initdb recommendations

2019-07-24 Thread Peter Eisentraut
On 2019-07-22 19:40, Andres Freund wrote:
> On 2019-07-22 13:02:13 -0400, Andrew Dunstan wrote:
>> There are a few things we could do. We could force trust auth, or we
>> could add an ident map that allowed $USER to login as buildfarm. Finding
>> all the places we would need to fix that could be a fun project ...
> 
> Perhaps we could actually do so automatically when the initdb invoking
> user isn't the same as the OS user? Imo that'd be generally quite
> useful, and not just for the regression tests.

It seems to me that there is something missing in our client
authentication system here.

If I'm logged in as the OS user that owns the data directory, I should
be able to log in to the database system via local socket as any user.
Because why stop me?  I can just change pg_hba.conf to let me in.

That would also address this problem that when you use the initdb -U
option, the proposed default "peer" setting doesn't help you much.
Making a pg_ident.conf map automatically helps for that particular user
combination, but then not for other users.  (There is no "sameuser plus
these additional mappings".)

I think we could just define that if geteuid == getpeereid, then
authentication succeeds.  Possibly make that a setting if someone wants
to turn it off.
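
For illustration, a minimal sketch of what that check might look like in the
peer-authentication path, assuming a platform that provides getpeereid()
(PostgreSQL's port library otherwise supplies an equivalent).  The function
below is invented for the example; it is not the actual auth.c code.

#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* true if the connecting peer is the same OS user the server runs as */
static bool
peer_is_server_owner(int sock)
{
	uid_t		uid;
	gid_t		gid;

	if (getpeereid(sock, &uid, &gid) != 0)
		return false;			/* could not identify the peer */

	return uid == geteuid();
}

Peer authentication would then succeed immediately whenever this returns
true, before consulting any pg_ident.conf mapping.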

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: initdb recommendations

2019-07-24 Thread Andrew Dunstan

On 7/24/19 10:00 AM, Andrew Dunstan wrote:
> On 7/23/19 2:12 AM, Peter Eisentraut wrote:
>> On 2019-07-22 21:16, Andrew Dunstan wrote:
>>> Modulo this issue, experimentation shows that adding '-A trust' to the
>>> line in run_build.pl where initdb is called fixes the issue. If we're
>>> going to rely on a buildfarm client fix that one seems simplest.
>> Yes, that is the right fix.  It's what the in-tree test drivers
>> (pg_regress, PostgresNode.pm) do.
>>
>
> I have done that, I will put out a new release probably right after the
> CF closes.
>
>
> I think we also need to change vcregress.pl to use trust explicitly for
> upgrade checks, just like the Unix upgrade test script does. That should
> help to future-proof us a bit.
>
>

Here's a patch along those lines that pretty much syncs up
vcregress.pl's initdb with pg_upgrade's test.sh.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index 5495066b4d..05446c3f59 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -496,12 +496,28 @@ sub recoverycheck
 }
 
 # Run "initdb", then reconfigure authentication.
+# mimics what is done in src/bin/pg_upgrade/test.sh:standard_initdb()
 sub standard_initdb
 {
-	return (
-		system('initdb', '-N') == 0 and system(
-			"$topdir/$Config/pg_regress/pg_regress", '--config-auth',
-			$ENV{PGDATA}) == 0);
+	my @opts = qw(-N --wal-segsize 1 -g -A trust);
+	my $status = system('initdb',@opts);
+	if ($status == 0)
+	{
+		if (defined($ENV{TEMP_CONFIG}) && -r $ENV{TEMP_CONFIG})
+		{
+			open(my $handle, '<', $ENV{TEMP_CONFIG})
+			  || die "file $ENV{TEMP_CONFIG} not found";
+			my $contents = <$handle>;
+			close($handle);
+			open($handle, '>>', "$ENV{PGDATA}/postgresql.conf")
+			  || die "file $ENV{PGDATA}/postgresql.conf not found";
+			print $handle $contents;
+			close($handle);
+		}
+		$status = system("$topdir/$Config/pg_regress/pg_regress",
+		 '--config-auth', $ENV{PGDATA})
+	}
+	return $status;
 }
 
 # This is similar to appendShellString().  Perl system(@args) bypasses


Re: initdb recommendations

2019-07-24 Thread Peter Eisentraut
On 2019-07-24 16:00, Andrew Dunstan wrote:
> I think we also need to change vcregress.pl to use trust explicitly for
> upgrade checks, just like the Unix upgrade test script does. That should
> help to future-proof us a bit.

Right, I'll add that to my patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: psql - add SHOW_ALL_RESULTS option

2019-07-24 Thread Fabien COELHO


Bonjour Daniel,


I kind of agree as well, but I was pretty sure that someone would complain
if the current behavior was changed.


If queries in a compound statement must be kept silent,
they can be converted to CTEs or DO-blocks to produce the
same behavior without having to configure anything in psql.
That cost on users doesn't seem too bad, compared to introducing
a knob in psql, and presumably maintaining it forever.


Ok.

Attached a "do it always version", which does the necessary refactoring. 
There is seldom new code, it is rather moved around, some functions are 
created for clarity.


--
Fabien.
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7789fc6177..5e4f653f88 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3355,10 +3355,8 @@ select 1\; select 2\; select 3;
 commands included in the string to divide it into multiple
 transactions.  (See 
 for more details about how the server handles multi-query strings.)
-psql prints only the last query result
-it receives for each request; in this example, although all
-three SELECTs are indeed executed, psql
-only prints the 3.
+psql prints all results it receives, one
+after the other.
 
 
   
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 44a782478d..4534c45854 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -472,6 +472,16 @@ ResetCancelConn(void)
 #endif
 }
 
+static void
+ShowErrorMessage(const PGresult *result)
+{
+	const char *error = PQerrorMessage(pset.db);
+
+	if (strlen(error))
+		pg_log_info("%s", error);
+
+	CheckConnection();
+}
 
 /*
  * AcceptResult
@@ -482,7 +492,7 @@ ResetCancelConn(void)
  * Returns true for valid result, false for error state.
  */
 static bool
-AcceptResult(const PGresult *result)
+AcceptResult(const PGresult *result, bool show_error)
 {
 	bool		OK;
 
@@ -513,15 +523,8 @@ AcceptResult(const PGresult *result)
 break;
 		}
 
-	if (!OK)
-	{
-		const char *error = PQerrorMessage(pset.db);
-
-		if (strlen(error))
-			pg_log_info("%s", error);
-
-		CheckConnection();
-	}
+	if (!OK && show_error)
+		ShowErrorMessage(result);
 
 	return OK;
 }
@@ -701,7 +704,7 @@ PSQLexec(const char *query)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		res = NULL;
@@ -743,7 +746,7 @@ PSQLexecWatch(const char *query, const printQueryOpt *opt)
 
 	ResetCancelConn();
 
-	if (!AcceptResult(res))
+	if (!AcceptResult(res, true))
 	{
 		ClearOrSaveResult(res);
 		return 0;
@@ -999,199 +1002,113 @@ loop_exit:
 	return success;
 }
 
-
 /*
- * ProcessResult: utility function for use by SendQuery() only
- *
- * When our command string contained a COPY FROM STDIN or COPY TO STDOUT,
- * PQexec() has stopped at the PGresult associated with the first such
- * command.  In that event, we'll marshal data for the COPY and then cycle
- * through any subsequent PGresult objects.
- *
- * When the command string contained no such COPY command, this function
- * degenerates to an AcceptResult() call.
- *
- * Changes its argument to point to the last PGresult of the command string,
- * or NULL if that result was for a COPY TO STDOUT.  (Returning NULL prevents
- * the command status from being printed, which we want in that case so that
- * the status line doesn't get taken as part of the COPY data.)
- *
- * Returns true on complete success, false otherwise.  Possible failure modes
- * include purely client-side problems; check the transaction status for the
- * server-side opinion.
+ * Marshal the COPY data.  Either subroutine will get the
+ * connection out of its COPY state, then call PQresultStatus()
+ * once and report any error.
+ *
+ * For COPY OUT, direct the output to pset.copyStream if it's set,
+ * otherwise to pset.gfname if it's set, otherwise to queryFout.
+ * For COPY IN, use pset.copyStream as data source if it's set,
+ * otherwise cur_cmd_source.
+ *
+ * Update result if further processing is necessary.  (Returning NULL
+ * prevents the command status from being printed, which we want in that
+ * case so that the status line doesn't get taken as part of the COPY data.)
  */
 static bool
-ProcessResult(PGresult **results)
+HandleCopyResult(PGresult **result)
 {
 	bool		success = true;
-	bool		first_cycle = true;
+	FILE	   *copystream;
+	PGresult   *copy_result;
+	ExecStatusType result_status = PQresultStatus(*result);
 
-	for (;;)
+	Assert(result_status == PGRES_COPY_OUT ||
+		   result_status == PGRES_COPY_IN);
+
+	SetCancelConn();
+	if (result_status == PGRES_COPY_OUT)
 	{
-		ExecStatusType result_status;
-		bool		is_copy;
-		PGresult   *next_result;
+		bool		need_close = false;
+		bool		is_pipe = false;
 
-		if (!AcceptResult(*results))
+		if (pset.copyStream)
 		{
-			/*
-			 * Failure at this point is always a server-side failure or a
-			 * failure to submit the command 

RE: seems like a bug in pgbench -R

2019-07-24 Thread Fabien COELHO



Hello Yoshikazu,


I could not reproduce this issue on head, but I confirm on 11.2.


I could reproduce the hang on 11.4.


Attached is a fix to apply on pg11.


I confirm the hang doesn't happen after applying your patch.


Ok, thanks for the feedback.


+   /* under throttling we may have finished the last client above */
+   if (remains == 0)
+   break;


If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients,
a thread needs to wait for results or sleep. In that logic, there is a case
where a thread tries to wait for results when no clients are waiting for
results, and this causes the issue. This happens when there are only
CSTATE_THROTTLE clients and the pgbench timeout occurs. Those clients will be
finished and "remains" will be 0.

I confirmed the above code prevents such a case.


Yep.


I almost think this is ready for committer,


Almost great, then.

but I have one question. Is it better to add a check like if (maxsock != -1)
before the select?


else			/* no explicit delay, select without timeout */
{
	nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
}


I think that it is not necessary because this case cannot happen: If some 
clients are still running (remains > 0), they are either sleeping, in 
which case there would be a timeout, or they are waiting for something 
from the server, otherwise the script could be advanced further so there 
would be something else to do for the thread.


We could check this by adding "Assert(maxsock != -1);" before this select, 
but I would not do that for a released version.
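
Just to make the placement concrete, the assertion would sit directly before
the blocking select in the branch quoted earlier.  This is only an
illustration of where it would go, not a tested or proposed change:

	else			/* no explicit delay, select without timeout */
	{
		/* some client is still running, so it must be waiting on a socket */
		Assert(maxsock != -1);
		nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
	}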


--
Fabien.




add a MAC check for TRUNCATE

2019-07-24 Thread Yuli Khodorkovskiy
Hackers,

Since all DAC checks should have a corresponding MAC check, this patch adds a
hook to allow extensions to implement a MAC check on TRUNCATE. I have
also implemented this access check in the sepgsql extension.

One important thing to note is that refpolicy [1] and Redhat based
distributions do not have the SELinux permission for db_table {truncate}
implemented. This patch is the first step to add this permission to the
upstream SELinux policy. If this permission does not exist in the
policy, sepgsql is being used, and `deny_unknown` is set to 1, the
TRUNCATE will be denied.

As a workaround for this behavior, the SELinux aware system would need
to have `/sys/fs/selinux/deny_unknown` set to 0 until the permission has
been added to refpolicy/Redhat SELinux policy.

The deny_unknown behavior can be set using CIL [2] by extracting the
base SELinux module, and setting how the kernel handles unknown
permissions.  The dependencies for overriding handle_unknown are
policycoreutils, selinux-policy-targeted, and a libsemanage version that
supports CIL (CentOS 7+).

$ sudo semodule -cE base
$ sed -Ei 's/(handleunknown )deny/\1allow/g' base.cil
$ sudo semodule -i base.cil

Thanks,

Yuli

[1] 
https://github.com/SELinuxProject/refpolicy/blob/master/policy/flask/access_vectors#L794
[2] 
https://github.com/SELinuxProject/selinux/blob/master/secilc/docs/cil_policy_config_statements.md#handleunknown
0001-Use-MAC-in-addition-to-DAC-for-TRUNCATE.patch


0001-Use-MAC-in-addition-to-DAC-for-TRUNCATE.patch
Description: Binary data


Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-07-24 Thread Andres Freund
Hi,

On 2019-05-20 17:27:10 +1200, David Rowley wrote:
> On Mon, 20 May 2019 at 13:20, Andres Freund  wrote:
> > How
> > about we have something roughly like:
> >
> > int numTransFnArgs = -1;
> > int numCombineFnArgs = -1;
> > Oid transFnInputTypes[FUNC_MAX_ARGS];
> > Oid combineFnInputTypes[2];
> >
> > if (DO_AGGSPLIT_COMBINE(...)
> >numCombineFnArgs = 1;
> >combineFnInputTypes = list_make2(aggtranstype, aggtranstype);
> > else
> >numTransFnArgs = get_aggregate_argtypes(aggref, transFnInputTypes);
> >
> > ...
> >
> > if (DO_AGGSPLIT_COMBINE(...))
> > build_pertrans_for_aggref(pertrans, aggstate, estate,
> >   aggref, combinefn_oid, aggtranstype,
> >   serialfn_oid, deserialfn_oid,
> >   initValue, initValueIsNull,
> >   combineFnInputTypes, numCombineFnArgs);
> > else
> > build_pertrans_for_aggref(pertrans, aggstate, estate,
> >   aggref, transfn_oid, aggtranstype,
> >   serialfn_oid, deserialfn_oid,
> >   initValue, initValueIsNull,
> >   transFnInputTypes, numTransFnArgs);
> >
> > seems like that'd make the code clearer?
> 
> I think that might be a good idea... I mean apart from trying to
> assign a List to an array :)  We still must call
> get_aggregate_argtypes() in order to determine the final function, so
> the code can't look exactly like you've written.
> 
> >  I wonder if we shouldn't
> > strive to have *no* DO_AGGSPLIT_COMBINE specific logic in
> > build_pertrans_for_aggref (except perhaps for an error check or two).
> 
> Just so we have a hard copy to review and discuss, I think this would
> look something like the attached.
> 
> We do miss out on a few very small optimisations, but I don't think
> they'll be anything we could measure. Namely
> build_aggregate_combinefn_expr() called make_agg_arg() once and used
> it twice instead of calling it once for each arg.  I don't think
> that's anything we could measure, especially in a situation where
> two-stage aggregation is being used.
> 
> I ended up also renaming aggtransfn to transfn_oid in
> build_pertrans_for_aggref(). Having it called aggtransfn seems a bit
> too close to the pg_aggregate.aggtransfn column, which is confusing
> given that we might pass it the value of the aggcombinefn column.

Now that master is open for development, and you have a commit bit, are
you planning to go forward with this on your own?

Greetings,

Andres Freund




Re: Adding a test for speculative insert abort case

2019-07-24 Thread Andres Freund
Hi,

On 2019-06-05 15:49:47 -0700, Melanie Plageman wrote:
> On Thu, May 16, 2019 at 8:46 PM Melanie Plageman 
> wrote:
> 
> >
> > Good idea.
> > I squashed the changes I suggested in previous emails, Ashwin's patch, my
> > suggested updates to that patch, and the index order check all into one
> > updated
> > patch attached.
> >
> >
> I've updated this patch to make it apply on master cleanly. Thanks to
> Alvaro for format-patch suggestion.

Planning to push this, now that v12 is branched off. But only to master, I
don't think it's worth backpatching at the moment.


> The second patch in the set is another suggestion I have. I noticed
> that the insert-conflict-toast test mentions that it is "not
> guaranteed to lead to a failed speculative insertion" and, since it
> seems to be testing the speculative abort but with TOAST tables, I
> thought it might work to kill that spec file and move that test case
> into insert-conflict-specconflict so the test can utilize the existing
> advisory locks being used for the other tests in that file to make it
> deterministic which session succeeds in inserting the tuple.

Seems like a good plan.
> diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec 
> b/src/test/isolation/specs/insert-conflict-specconflict.spec
> index 3a70484fc2..7f29fb9d02 100644
> --- a/src/test/isolation/specs/insert-conflict-specconflict.spec
> +++ b/src/test/isolation/specs/insert-conflict-specconflict.spec
> @@ -10,7 +10,7 @@ setup
>  {
>   CREATE OR REPLACE FUNCTION blurt_and_lock(text) RETURNS text IMMUTABLE 
> LANGUAGE plpgsql AS $$
>   BEGIN
> -RAISE NOTICE 'called for %', $1;
> +RAISE NOTICE 'blurt_and_lock() called for %', $1;
>  
>   -- depending on lock state, wait for lock 2 or 3
>  IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 
> 1) THEN
> @@ -23,9 +23,16 @@ setup
>  RETURN $1;
>  END;$$;
>  
> +CREATE OR REPLACE FUNCTION blurt_and_lock2(text) RETURNS text IMMUTABLE 
> LANGUAGE plpgsql AS $$
> +BEGIN
> +RAISE NOTICE 'blurt_and_lock2() called for %', $1;
> +PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 
> 4);
> +RETURN $1;
> +END;$$;
> +

Any chance for a bit more descriptive naming than *2? I can live with
it, but ...


> +step "controller_print_speculative_locks" { SELECT 
> locktype,classid,objid,mode,granted FROM pg_locks WHERE locktype='speculative
> +token' ORDER BY granted; }

I think showing the speculative locks is possibly going to be unreliable
- the release time of speculative locks is IIRC not that reliable. I
think it could e.g. happen that speculative locks are held longer
because autovacuum spawned an analyze in the background.


> +   # Should report s1 is waiting on speculative lock
> +   "controller_print_speculative_locks"

Hm, I might be missing something, but I don't think it currently
does. Looking at the expected file:

+step controller_print_speculative_locks: SELECT 
locktype,classid,objid,mode,granted FROM pg_locks WHERE locktype='speculative
+token' ORDER BY granted;
+locktype   classidobjid  mode   granted
+

And if it showed something, it'd make the test not work, because
classid/objid aren't necessarily going to be the same from test to test.

Greetings,

Andres Freund




Re: GiST VACUUM

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas  wrote:
> That's probably how it's going to go, but hey, doesn't hurt to ask :-).

I think that it would be fine to be conservative with nbtree, and only
target the master branch. The problem is annoying, certainly, but it's
not likely to make a huge difference for most real world workloads.
OTOH, perhaps the risk is so low that we might as well target
backbranches.

How do you feel about it?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-07-24 Thread Heikki Linnakangas

On 24/07/2019 21:02, Peter Geoghegan wrote:

On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas  wrote:

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?


I was hoping that you'd work on it.   :-)


That's probably how it's going to go, but hey, doesn't hurt to ask :-).


Any reason to think that it's much different to what you've done with GiST?


No, it should be very similar.

- Heikki




Re: GiST VACUUM

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas  wrote:
> Pushed this now, to master and REL_12_STABLE.
>
> Now, B-tree indexes still have the same problem, in all versions. Any
> volunteers to write a similar fix for B-trees?

I was hoping that you'd work on it.   :-)

Any reason to think that it's much different to what you've done with GiST?

-- 
Peter Geoghegan




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-24 Thread Tom Lane
David Rowley  writes:
> Here's a more polished version with the debug code removed, complete
> with benchmarks.

A few gripes:

You're measuring the number of locks held at completion of the
transaction, which fails to account for locks transiently taken and
released, so that the actual peak usage will be more.  I think we mostly
only do that for system catalog accesses, so *maybe* the number of extra
locks involved isn't very large, but that's a very shaky assumption.

I don't especially like the fact that this seems to have a hard-wired
(and undocumented) assumption that buckets == entries, ie that the
fillfactor of the table is set at 1.0.  lock.c has no business knowing
that.  Perhaps instead of returning the raw bucket count, you could have
dynahash.c return bucket count times fillfactor, so that the number is in
the same units as the entry count.

This:
running_avg_locks -= running_avg_locks / 10.0;
running_avg_locks += numLocksHeld / 10.0;
seems like a weird way to do the calculation.  Personally I'd write
running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0;
which is more the way I'm used to seeing exponential moving averages
computed --- at least, it seems clearer to me why this will converge
towards the average value of numLocksHeld over time.  It also makes
it clear that it wouldn't be sane to use two different divisors,
whereas your formulation makes it look like maybe they could be
set independently.
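
For what it's worth, the two formulations are algebraically the same as long
as a single divisor is used; a small standalone program (illustrative only,
not part of the patch) shows the two staying in lockstep, up to floating-point
rounding:

#include <stdio.h>

int
main(void)
{
	double		a = 0.0;	/* two-step formulation */
	double		b = 0.0;	/* single-expression formulation */
	int			samples[] = {100, 120, 80, 200, 150};

	for (int i = 0; i < 5; i++)
	{
		int			numLocksHeld = samples[i];

		/* decay by a tenth, then add a tenth of the new observation */
		a -= a / 10.0;
		a += numLocksHeld / 10.0;

		/* move a tenth of the way toward the new observation */
		b += (numLocksHeld - b) / 10.0;

		printf("%d: %f %f\n", numLocksHeld, a, b);
	}
	return 0;
}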

Your compiler might not complain that LockReleaseAll's numLocksHeld
is potentially uninitialized, but other people's compilers will.


On the whole, I don't especially like this approach, because of the
confusion between peak lock count and end-of-xact lock count.  That
seems way too likely to cause problems.

regards, tom lane




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread vignesh C
Hi,

I have done some review of the undolog patch series
and here are my comments:
0003-Add-undo-log-manager.patch

1) As the undo log is being created in a tablespace,
 if the tablespace is dropped later, will it have any impact?
+void
+UndoLogDirectory(Oid tablespace, char *dir)
+{
+ if (tablespace == DEFAULTTABLESPACE_OID ||
+ tablespace == InvalidOid)
+ snprintf(dir, MAXPGPATH, "base/undo");
+ else
+ snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo",
+ tablespace, TABLESPACE_VERSION_DIRECTORY);
+}

2) Header file exclusion
a) The following headers can be excluded in undolog.c
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "access/xlogreader.h"
+#include "catalog/catalog.h"
+#include "nodes/execnodes.h"
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "storage/standby.h"
+#include "storage/sync.h"
+#include "utils/memutils.h"

b) The following headers can be excluded from undofile.c
+#include "access/undolog.h"
+#include "catalog/database_internal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"

3) Some macro replacements.
a)Session.h

+++ b/src/include/access/session.h
@@ -17,6 +17,9 @@
 /* Avoid including typcache.h */
 struct SharedRecordTypmodRegistry;

+/* Avoid including undolog.h */
+struct UndoLogSlot;
+
 /*
  * A struct encapsulating some elements of a user's session.  For now this
  * manages state that applies to parallel query, but it principle it could
@@ -27,6 +30,10 @@ typedef struct Session
  dsm_segment *segment; /* The session-scoped DSM segment. */
  dsa_area   *area; /* The session-scoped DSA area. */

+ /* State managed by undolog.c. */
+ struct UndoLogSlot *attached_undo_slots[4]; /* UndoLogCategories */
+ bool need_to_choose_undo_tablespace;
+

Should we change 4 to UndoLogCategories or a suitable macro?
b)
+static inline size_t
+UndoLogNumSlots(void)
+{
+ return MaxBackends * 4;
+}
Should we change 4 to UndoLogCategories or a suitable macro?

c)
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
+{
+ struct stat stat_buffer;
+ off_t size;
+ char path[MAXPGPATH];
+ void   *zeroes;
+ size_t nzeroes = 8192;
+ int fd;

should we use BLCKSZ instead of 8192?

4) Should we add a readme file for undolog as it does a fair amount of work
and is a core part of the undo system?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


On Wed, Jul 24, 2019 at 5:44 PM Amit Kapila  wrote:
>
> On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila  wrote:
> >
> > On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila  wrote:
> >
> > 7.
> > +attach_undo_log(UndoLogCategory category, Oid tablespace)
> > {
> > ..
> > if (candidate->meta.tablespace == tablespace)
> > + {
> > + logno = *place;
> > + slot = candidate;
> > + *place = candidate->next_free;
> > + break;
> > + }
> >
> > Here, the code is breaking from the loop, so why do we need to set
> > *place?  Am I missing something obvious?
> >
>
> I think I know what I was missing.  It seems here you are removing an
> element from the freelist.
>
> One point related to detach_current_undo_log.
>
> + LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
> + slot->pid = InvalidPid;
> + slot->meta.unlogged.xid = InvalidTransactionId;
> + if (full)
> + slot->meta.status = UNDO_LOG_STATUS_FULL;
> + LWLockRelease(&slot->mutex);
>
> If I read the comments in structure UndoLogMetaData, it is mentioned
> that 'status' is changed by explicit WAL record whereas there is no
> WAL record in code to change the status.  I see the problem as well if
> we don't WAL log this change.  Suppose after changing the status of
> this log, we allocate a new log and insert some records in that log as
> well for the same transaction for which we have inserted records in
> the log which we just marked as FULL.  Now, here we form the link
> between two logs as the same transaction has overflowed into a new
> log.  Say, we crash after this.  Now, after recovery the log won't be
> marked as FULL which means there is a chance that it can be used for
> some other transaction, if that happens, then our link for a
> transaction spanning to different log will break and we won't be able
> to access the data in another log.  In short, I think it is important
> to WAL log this status change unless I am missing something.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>


--
Regards,
vignesh
  Have a nice day




Re: Seek failure at end of FSM file during WAL replay (in 11)

2019-07-24 Thread Tom Lane
Michael Paquier  writes:
> Recently, one of the test beds we use has blown up once when doing
> streaming replication like that:
> FATAL:  could not seek to end of file "base/16386/19817_fsm": No such
>file or directory 
> CONTEXT:  WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
> LOG:  startup process (PID 44886) exited with exit code 1

> All the WAL records have been wiped out since, so I don't know exactly
> what happened, but I could track down that this FSM file got removed
> a couple of hours before as I got my hands on some FS-level logs which
> showed a deletion.

Hm.  AFAICS the immediate issuer of the error must have been
_mdnblocks(); there are other matches to that error string but
they are in places where we can tell which file the seek must
have been applied to, and it wasn't a FSM file.

> Before blaming a lower level of
> the application stack, I am wondering if we have some issues with
> mdfd_vfd meaning that the file has been removed but that it is still
> tracked as opened.

lseek() per se presumably would never return ENOENT.  A more likely
theory is that the file wasn't actually open but only had a leftover
VFD entry, and when FileSize() -> FileAccess() tried to open it,
the open failed with ENOENT --- but _mdnblocks() would still call it
a seek failure.

So I'd opine that this is a pretty high-level failure --- what are
we doing trying to replay WAL against a table that's been dropped?
Or if it wasn't dropped, why was the FSM removed?

regards, tom lane




Re: GiST VACUUM

2019-07-24 Thread Heikki Linnakangas

On 22/07/2019 16:09, Heikki Linnakangas wrote:

Unless something comes up, I'll commit this tomorrow.


Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any 
volunteers to write a similar fix for B-trees?


- Heikki




Re: "localtime" value in TimeZone

2019-07-24 Thread Tom Lane
Shay Rojansky  writes:
> In (certain) out-of-the-box PostgreSQL installations, the timezone GUC is
> set to "localtime", which seems to mean to query the OS for the value.
> Unless I'm mistaken, the issue with this is that it doesn't allow clients
> inspecting the TimeZone GUC to actually know what timezone the server is
> in, making the GUC largely useless (and creates friction as the GUC can't
> be expected to always contain valid IANA/Olson values). It would be more
> useful if PostgreSQL exposed the actual timezone provided by the OS.

> Does this make sense?

Yeah, this is something that some tzdb packagers do --- they put a
"localtime" file into /usr/share/zoneinfo that is a symlink or hard link
to the active zone file, and then initdb tends to seize on that as being
the shortest available spelling of the active zone.

I opined in
https://www.postgresql.org/message-id/27991.1560984...@sss.pgh.pa.us
that we should avoid choosing "localtime", but that thread seems
stalled on larger disagreements about how complicated we want that
mechanism to be.

> As a side note, there doesn't seem to be any specific documentation on the
> special "localtime" value of this GUC

That's because it's nonstandard and platform-specific.  It's also
not special from our standpoint --- it's just another zone file.

regards, tom lane




Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-24 Thread Alvaro Herrera
On 2019-Jul-24, Ian Barwick wrote:

> It'd be better if such a hypothetical option validated the provided
> slot name anyway, to prevent later surprises.

Hmm, but what would we do if the validation failed?

> Revised patch attached, which as Alvaro suggests removes the escaping
> and adds a comment explaining why the raw value can be passed as-is.

Heh, yesterday I revised the original patch as attached and was about to
push when the bell rang.  I like this one because it keeps the comment
to one line and it mentions the function name in charge of the
validation (useful for grepping later on).  It's a bit laconic because
of the long function name and the desire to keep it to one line, but it
seems sufficient to me.

BTW upper case letters are not allowed :-)

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From e258d242fc0ef653aa7654af6fb065c7133ba430 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 23 Jul 2019 17:48:18 -0400
Subject: [PATCH v3] Don't uselessly escape a string that doesn't need escaping

Per gripe from Ian Barwick
Discussion: https://postgr.es/m/cabvvfjwnnnkb8chstlhktsvl1+g6bvcv+57+w1jz61p8ygp...@mail.gmail.com
---
 src/bin/pg_basebackup/pg_basebackup.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 77a7c148ba..e7755e807d 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1715,11 +1715,9 @@ GenerateRecoveryConf(PGconn *conn)
 	free(escaped);
 
 	if (replication_slot)
-	{
-		escaped = escape_quotes(replication_slot);
-		appendPQExpBuffer(recoveryconfcontents, "primary_slot_name = '%s'\n", replication_slot);
-		free(escaped);
-	}
+		/* unescaped: ReplicationSlotValidateName allows [a-z0-9_] only */
+		appendPQExpBuffer(recoveryconfcontents, "primary_slot_name = '%s'\n",
+		  replication_slot);
 
 	if (PQExpBufferBroken(recoveryconfcontents) ||
 		PQExpBufferDataBroken(conninfo_buf))
-- 
2.17.1



"localtime" value in TimeZone

2019-07-24 Thread Shay Rojansky
Greetings everyone.

In (certain) out-of-the-box PostgreSQL installations, the timezone GUC is
set to "localtime", which seems to mean to query the OS for the value.
Unless I'm mistaken, the issue with this is that it doesn't allow clients
inspecting the TimeZone GUC to actually know what timezone the server is
in, making the GUC largely useless (and creates friction as the GUC can't
be expected to always contain valid IANA/Olson values). It would be more
useful if PostgreSQL exposed the actual timezone provided by the OS.

Does this make sense?

As a side note, there doesn't seem to be any specific documentation on the
special "localtime" value of this GUC (e.g.
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES
).

Shay


Re: [bug fix] Produce a crash dump before main() on Windows

2019-07-24 Thread Alvaro Herrera
On 2019-Jul-24, Kyotaro Horiguchi wrote:

> Hello.
> 
> On Wed, Jul 24, 2019 at 2:31 AM Alvaro Herrera  
> wrote:
> >
> > On 2019-Jul-23, Tom Lane wrote:
> >
> > > Kyotaro Horiguchi  writes:
> >
> > > > My investigation convinced me that there is no way for a process
> > > > to detect whether it is running as a service (except the process
> > > > directly called from system (aka entry function)).
> >
> > Maybe we can try calling pgwin32_is_service()?
> 
> Ah, it might be viable. (I'm not sure though.) I didn't think that
> some process property we are enforcing is available.

Well, ereport() relies on this pretty extensively, so it seems okay to
rely on it for this too.

> The difference in internal behavior is added to unify the external
> (apparent) behavior. It prevents the WER dialog from showing while
> running on a console. A service doesn't show a dialog even if WER is
> enabled.

Yeah, that seems to be what we want (namely, not to get the server
process blocked waiting for the user to click on the dialog box).

> The remaining issue is that we cannot obtain a dump when running on a
> console without being blocked by a dialog, but Windows just doesn't
> allow it.  (Might be possible if we ignore Windows 7? 8? and earlier)

I don't know what a "dump" is in this context, but in the errbacktrace
thread I linked to a Stackoverflow question where they mentioned the API
to obtain a stack backtrace for a process in Windows.  Maybe logging
that much info is sufficient for most cases.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Compile from source using latest Microsoft Windows SDK

2019-07-24 Thread Andrew Dunstan


On 7/22/19 4:23 AM, Michael Paquier wrote:
> On Mon, Jul 22, 2019 at 04:01:46PM +0800, Peifeng Qiu wrote:
>>> but it's really only a major issue for VS2019
>> VS2019 will use the latest v10 SDK by default. So no need to install 8.1
>> for VS2019.
> Yes, FWIW, I have tested with VS2019 when committing 2b1394f, and in
> this case only the v10 SDK got installed, with no actual issues
> related to the dependency of the SDK reported.  In this case I have
> installed VS using the community installer provided by Microsoft.
>
>>> I guess we might need a test for what SDK is available?
>> We can just use the WindowsSDKVersion environment variable to
>> determine the SDK for current cmd session. It's set when you start
>> the Visual Studio Prompt or call one bat script.  Developers can
>> choose the right version best suit their need. Detecting all
>> installed SDK version can be done with some registry magic but I
>> think that's not necessary in this case.
> This looks more sensible to do if the environment variable is
> available.  Looking around this variable is available when using the
> command prompt for native tools.  So using it sounds like a good idea
> to me if it exists.



Yeah, on consideration I think Peifeng's patch upthread looks OK.
(Incidentally, this variable is not set in the very old version of VC
running on currawong).


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





double free in ExecHashJoin, 9.6.12

2019-07-24 Thread Merlin Moncure
Server is generally running pretty well, and is high volume.  This
query is not new and is also medium volume.  Database rebooted in
about 4 seconds with no damage; fast enough we didn't even trip alarms
(I noticed this troubleshooting another issue).  We are a couple of
bug-fix releases behind, but I didn't see anything obvious in the
release notes suggesting a resolved issue. Anyone have any ideas?
thanks in advance.

merlin

*** glibc detected *** postgres: rms ysconfig 10.33.190.21(36788)
SELECT: double free or corruption (!prev): 0x01fb2140 ***
=== Backtrace: =
/lib64/libc.so.6(+0x75dee)[0x7f4fde053dee]
/lib64/libc.so.6(+0x78c80)[0x7f4fde056c80]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(ExecHashJoin+0x5a2)[0x5e2d32]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(ExecProcNode+0x208)[0x5cf728]
postgres: rms ysconfig 10.33.190.21(36788)
SELECT(standard_ExecutorRun+0x18a)[0x5cd1ca]
postgres: rms ysconfig 10.33.190.21(36788) SELECT[0x6e5607]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(PortalRun+0x188)[0x6e67d8]
postgres: rms ysconfig 10.33.190.21(36788) SELECT[0x6e2af3]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(PostgresMain+0x75a)[0x6e456a]
postgres: rms ysconfig 10.33.190.21(36788)
SELECT(PostmasterMain+0x1875)[0x6840b5]
postgres: rms ysconfig 10.33.190.21(36788) SELECT(main+0x7a8)[0x60b528]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f4fddffcd1d]
postgres: rms ysconfig 10.33.190.21(36788) SELECT[0x46c589]

2019-07-23 09:41:41 CDT[:@]:LOG:  server process (PID 18057) was
terminated by signal 6: Aborted
2019-07-23 09:41:41 CDT[:@]:DETAIL:  Failed process was running:
SELECT JR.job_id as jobId,JR.job_execution_id as
jobResultId,JR.created as lastRunDate, JR.status as status,
JR.status_message as statusMessage, JR.output_format as outputFormat,
JS.schedule_name as scheduleName, JS.job_name as reportName,
JS.created_by as scheduledBy, JS.product as source FROM (SELECT
JR.job_id, MAX(JR.created) AS MaxCreated FROM job_schedule JS JOIN
job_result JR ON JR.job_id=JS.job_id WHERE (lower(JS.recepients) like
lower($1) OR lower(JS.created_by) = lower($2)) GROUP BY JR.job_id) TMP
JOIN job_result JR ON JR.job_id = TMP.job_id AND JR.created =
TMP.MaxCreated JOIN job_schedule JS ON JS.job_id = JR.job_id AND
JS.job_type='CRON'

merlin




Re: initdb recommendations

2019-07-24 Thread Andrew Dunstan


On 7/23/19 2:12 AM, Peter Eisentraut wrote:
> On 2019-07-22 21:16, Andrew Dunstan wrote:
>> Modulo this issue, experimentation shows that adding '-A trust' to the
>> line in run_build.pl where initdb is called fixes the issue. If we're
>> going to rely on a buildfarm client fix that one seems simplest.
> Yes, that is the right fix.  It's what the in-tree test drivers
> (pg_regress, PostgresNode.pm) do.
>


I have done that, I will put out a new release probably right after the
CF closes.


I think we also need to change vcregress.pl to use trust explicitly for
upgrade checks, just like the Unix upgrade test script does. That should
help to future-proof us a bit.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Psql patch to show access methods info

2019-07-24 Thread Alexander Korotkov
On Wed, Jul 24, 2019 at 9:01 AM Andres Freund  wrote:
> Based on a quick skim of the thread - which means I most definitely
> missed things - there's not been discussion of why we actually want to
> add this.  Who's the prospective user of this facility? And why wouldn't
> they just query pg_am[proc]?  None of this information seems like it's
> going to be even remotely targeted towards even advanced users.  For
> developers it's not clear what these add?

I see your point regarding pg_am details.  Probably nobody except
developers needs this.  And probably even developers don't need this,
because it's easier to see IndexAmRoutine directly with more details.
So, +1 for removing this.

pg_amproc for gin/gist/sp-gist/brin is probably for developers.  But I
think pg_amproc for btree/hash could be useful for advanced users.
btree/hash opclasses could be written by advanced users using
pl/something, I've faced that several times.

> Adding stuff to psql isn't free. It adds clutter to psql's help output,
> the commands need to be maintained (including cross-version code).

Sure.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: initdb recommendations

2019-07-24 Thread Andrew Dunstan


On 7/22/19 1:40 PM, Andres Freund wrote:
> Hi,
>
> On 2019-07-22 13:02:13 -0400, Andrew Dunstan wrote:
>> There are a few things we could do. We could force trust auth, or we
>> could add an ident map that allowed $USER to login as buildfarm. Finding
>> all the places we would need to fix that could be a fun project ...
> Perhaps we could actually do so automatically when the initdb invoking
> user isn't the same as the OS user? Imo that'd be generally quite
> useful, and not just for the regression tests.
>

yeah, although I think that's a separate exercise.


So we'd have something like


in pg_hba.conf


    local    all all    peer map=datadir_owner


and in pg_ident.conf


    datadir_owner $USER $superuser


cheers


andrew




-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Psql patch to show access methods info

2019-07-24 Thread Alexander Korotkov
Hi!

On Wed, Jul 24, 2019 at 9:00 AM Andres Freund  wrote:
> On 2019-07-23 01:57:29 +0300, Alexander Korotkov wrote:
> > It was always scary that there is no way in psql to see am/opclass/opfamily
> > information other than by querying the catalogs directly.
>
> What does make that scary?

Because it's unclear why we have backslash commands for observing
almost every part of the system catalog, but this quite large part is
missing.

> > I'm going to push it.  Probably, someone will find that the command syntax
> > and output formats are not well discussed yet.  But we're pretty early in
> > the 13 release cycle.  So, we will have time to work out any criticism that
> > comes up.
>
> Please don't before we've had some discussion as to why we want this
> additional code, and who'd be helped by it.

OK.  Given that few senior developers participate in the discussion of
details, I thought we kind of agreed that we need this.  Now you've
explicitly expressed a different opinion, so let's discuss.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Support for jsonpath .datetime() method

2019-07-24 Thread Alexander Korotkov
On Wed, Jul 24, 2019 at 1:50 AM Nikita Glukhov  wrote:
> So it's unclear what we should do:
>  - implement YY and RR strictly following the standard only in .datetime()
>  - fix YY implementation in to_date()/to_timestamp() and implement RR
>  - use our non-standard templates in .datetime()

Also it appears that, according to the standard, .datetime() should treat
spaces and delimiters differently than our to_date()/to_timestamp().
It requires strict matching of spaces and delimiters in the input and
format strings.  We don't have such behavior in either non-FX or FX
mode.  Also, the standard doesn't define FX mode at all.  These rules
cover the jsonpath .datetime() method and CAST(... FORMAT ...), a new cast
clause defined by the standard.

So, I think that for compatibility reasons it isn't worth trying to
make the behavior of our to_date()/to_timestamp() fit the requirements for
jsonpath .datetime() and CAST(... FORMAT ...).  I propose to leave
these functions as is (maybe add new patterns), but introduce another
datetime parsing mode, which would fit the standard.  Opinions?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: pgbench tests vs Windows

2019-07-24 Thread Andrew Dunstan


On 7/24/19 3:56 AM, Fabien COELHO wrote:
>
> Hello Andrew,
>
>> Unfortunately, this isn't portable, as I've just discovered at the cost
>> of quite a bit of time. In particular, you can't assume expr is present
>> and in the path on Windows. The Windows equivalent would be something
>> like:
>>
>>    \setshell two\
>>      @set /a c = 1 + :one  && echo %c%
>
> Hmmm... Can we assume that echo is really always there on Windows? If
> so, the attached patch does something only with "echo".


Yes, it's built into the cmd processor (as is "set /a", to answer Tom's
earlier question about portability - I tested the above back to XP.)


echo is much more universal, and I can confirm that the suggested fix
works on the Windows test box I'm using.


I'll apply and backpatch that. Thanks


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: pg_receivewal documentation

2019-07-24 Thread Jehan-Guillaume de Rorthais
Hi,

On Wed, 24 Jul 2019 11:29:28 +0900
Michael Paquier  wrote:

> On Tue, Jul 23, 2019 at 08:00:41AM -0400, Jesper Pedersen wrote:
> > Sure.  
> 
> Thanks.  Applied down to 9.6 where remote_apply has been introduced,
> with tweaks for 9.6 as the tool is named pg_receivexlog there.

Sorry to step in so lately.

Unless I am missing something, another solution might be to use a dedicated
role for pg_receive{xlog|wal} with synchronous_commit lower than
remote_apply.

Not sure we want to add such detail, but if you consider it useful, you'll find
a patch attached.

Regards,
>From 01a7de92dbaae5a61d5ec7bd04bef1553467f29d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Wed, 24 Jul 2019 14:58:41 +0200
Subject: [PATCH] Add doc details for pg_receivewal with remote_apply

---
 doc/src/sgml/ref/pg_receivewal.sgml | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 177e9211c0..abf55ce713 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -62,7 +62,12 @@ PostgreSQL documentation
application_name for
pg_receivewal that does not match it, or
change the value of synchronous_commit to
-   something other than remote_apply.
+   something other than remote_apply either cluster wide or
+   just for a dedicated role to pg_receivewal:
+
+CREATE ROLE receivewal WITH LOGIN REPLICATION;
+ALTER ROLE receivewal SET synchronous_commit TO on;
+
   
 
   
-- 
2.20.1



Re: psql - add SHOW_ALL_RESULTS option

2019-07-24 Thread Daniel Verite
Fabien COELHO wrote:

> >> I'd go further and suggest that there shouldn't be a variable
> >> controlling this. All results that come in should be processed, period.
> >
> > I agree with that.
> 
> I kind of agree as well, but I was pretty sure that someone would complain 
> if the current behavior was changed.

If queries in a compound statement must be kept silent,
they can be converted to CTEs or DO-blocks to produce the
same behavior without having to configure anything in psql.
That cost on users doesn't seem too bad, compared to introducing
a knob in psql, and presumably maintaining it forever.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: Issue in to_timestamp/to_date while handling the quoted literal string

2019-07-24 Thread Brendan Jurd
Hi Suraj,

I think the documentation is reasonably clear about this behaviour, quote:

" In to_date, to_number, and to_timestamp, literal text and double-quoted
strings result in skipping the number of characters contained in the
string; for example "XX" skips two input characters (whether or not they
are XX)."

I can appreciate that this isn't the behaviour you intuitively expected
from to_timestamp, and I don't think you'd be the first or the last.  The
purpose of these functions was never to validate that your input string
precisely matches the non-coding parts of your format pattern.  For that, I
think you'd be better served by using regular expressions.

Just as an aside, in the example you gave, the string '2019-05-24T23:12:45'
will cast directly to timestamp just fine, so it isn't the kind of
situation to_timestamp was really intended for.  It's more for when your
input string is in an obscure (or ambiguous) format that is known to you in
advance.

I hope that helps.

Cheers
Brendan

On Wed, 24 Jul 2019 at 21:38, Suraj Kharage wrote:

> Hi,
>
> I noticed an issue in to_timestamp()/to_date() while handling double-quoted
> literal strings. If any double-quoted literal characters are found in the
> format, we generate NODE_TYPE_CHAR in the parsed format and store the
> actual character in FormatNode->character. In DCH_from_char, we just
> increment the input string by the length of the character for NODE_TYPE_CHAR.
> We are actually not matching these characters against the input string, and
> because of this, date values get changed if the quoted literal string is not
> identical in the input and format strings.
>
> e.g:
>
> postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
> '-mm-dd"TT"hh24:mi:ss');
>to_timestamp
> ---
>  2019-05-24 03:12:45+05:30
> (1 row)
>
>
> In the above example, the quoted string is 'TT', so it just increments the
> input string by 2 while handling these characters and returns the wrong
> hour value.
>
> My suggestion is to match the exact characters from the quoted literal string
> against the input string, and throw an error if they don't match.
>
> Attached is the POC patch which almost works for all scenarios except for
> whitespace - as a quote character.
>
> Suggestions?
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
>


Re: Fetching timeline during recovery

2019-07-24 Thread Jehan-Guillaume de Rorthais
Hello Michael,

On Wed, 24 Jul 2019 09:49:05 +0900
Michael Paquier  wrote:

> On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:
> > Please, find in attachment a first trivial patch to support
> > pg_walfile_name() and pg_walfile_name_offset() on a standby.
> > Previous restriction on this functions seems related to ThisTimeLineID not
> > being safe on standby. This patch is fetching the timeline from
> > WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
> > this is updated each time some data are flushed to the WAL.  
[...]
> Your patch does not count for the case of archive recovery, where
> there is no WAL receiver, and as the shared memory state of the WAL
> receiver is not set 0 would be set.

Indeed. I tested this topic with the following query and was fine with the
NULL result:

  select pg_walfile_name(pg_last_wal_receive_lsn());

I was fine with this result because my use case requires replication anyway. A
NULL result would mean that the node never streamed from the old primary since
its last startup, so a failover should ignore it anyway.

However, NULL just comes from pg_last_wal_receive_lsn() here. The following
query result is wrong:

  > select pg_walfile_name('0/1')
  

I fixed that. See patch 0001-v2-* in attachment.


> The replay timeline is something we could use here instead via
> GetXLogReplayRecPtr(). CreateRestartPoint actually takes the latest WAL
> receiver or replayed point for its end LSN position, whichever is newer.

I did consider GetXLogReplayRecPtr() or even XLogCtl->replayEndTLI (which is
updated right before the replay). However, both depend on read activity on the
standby. That's why I picked WalRcv->receivedTLI, which is updated regardless
of the read activity on the standby.

> > Last, I plan to produce an extension to support this on older release. Is
> > it something that could be integrated in official source tree during a minor
> > release or should I publish it on eg. pgxn?  
> 
> Unfortunately no.  This is a behavior change so it cannot find its way
> into back branches.

Yes, my patch is a behavior change. But here, I was talking about an
extension, not core itself, to support this feature in older releases.

> The WAL receiver state is in shared memory and published, so that's easy
> enough to get.  We don't do that for XLogCtl unfortunately.

Both are in shared memory, but WalRcv has a public function to get its
receivedTLI member.

XLogCtl has nothing public to expose its ThisTimeLineID member. However, from
a module, I'm able to fetch it using:

  XLogCtl = ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &found);
  SpinLockAcquire(&XLogCtl->info_lck);
  tl = XLogCtl->ThisTimeLineID;
  SpinLockRelease(&XLogCtl->info_lck);

As the "XLOG Ctl" index entry already exists in shmem, ShmemInitStruct returns
the correct structure from there. Not sure this was supposed to be used this
way though...Adding a public function might be cleaner, but it will not help
for older releases.
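
For illustration, the kind of public accessor mentioned here might look
roughly like the sketch below (the function name is invented; it would have
to live in xlog.c, where XLogCtl is visible):

  /* Hypothetical accessor, not in core. */
  TimeLineID
  GetXLogCtlTimeLine(void)
  {
      TimeLineID  tli;

      SpinLockAcquire(&XLogCtl->info_lck);
      tli = XLogCtl->ThisTimeLineID;
      SpinLockRelease(&XLogCtl->info_lck);

      return tli;
  }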

> I think that there are arguments for being more flexible with it, and perhaps
> have a system-level view to be able to look at some of its fields.

Great idea. I'll give it a try to keep the discussion going.

> There is also a downside with get_controlfile(), which is that it
> fetches directly the data from the on-disk pg_control, and
> post-recovery this only gets updated at the first checkpoint.

Indeed, that's why I started this patch and thread.

Thanks,
>From fdf133645b8cc2728cca3677e71bdd5cb69cdbd4 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Tue, 23 Jul 2019 17:28:44 +0200
Subject: [PATCH] Support pg_walfile_name on standby

Support executing both SQL functions pg_walfile_name() and
pg_walfile_name_offset() on a standby.
---
 src/backend/access/transam/xlogfuncs.c | 39 +-
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..86c4d8382b 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -460,13 +460,7 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 	TupleDesc	resultTupleDesc;
 	HeapTuple	resultHeapTuple;
 	Datum		result;
-
-	if (RecoveryInProgress())
-		ereport(ERROR,
-(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("recovery is in progress"),
- errhint("%s cannot be executed during recovery.",
-		 "pg_walfile_name_offset()")));
+	TimeLineID  tl;
 
 	/*
 	 * Construct a tuple descriptor for the result row.  This must match this
@@ -480,11 +474,24 @@ pg_walfile_name_offset(PG_FUNCTION_ARGS)
 
 	resultTupleDesc = BlessTupleDesc(resultTupleDesc);
 
+	if (RecoveryInProgress())
+	{
+		GetWalRcvWriteRecPtr(NULL, &tl);
+
+		if (!tl)
+		{
+			isnull[0] = isnull[1] = true;
+			goto result;
+		}
+	}
+	else
+		tl = ThisTimeLineID;
+
 	/*
 	 * xlogfilename
 	 */
 	XLByteToPrevSeg(locationpoint, xlogsegno, 

Re: benchmarking Flex practices

2019-07-24 Thread Chapman Flack
On 07/24/19 03:45, John Naylor wrote:
> On Sun, Jul 21, 2019 at 3:14 AM Tom Lane  wrote:
>> However, my second reaction was that maybe you were on to something
>> upthread when you speculated about postponing de-escaping of
>> Unicode literals into the grammar.  If we did it like that then

Wow, yay. I hadn't been following this thread, but I had just recently
looked over my own earlier musings [1] and started thinking "no, it would
be outlandish to ask the lexer to return utf-8 always ... but what about
postponing the de-escaping of Unicode literals into the grammar?" and
had started to think about when I might have a chance to try making a
patch.

With the de-escaping postponed, I think we'd be able to move beyond the
current odd situation where Unicode escapes can't describe non-ascii
characters, in exactly and only the cases where you need them to.

-Chap


[1]
https://www.postgresql.org/message-id/6688474e-7c28-b352-bcec-ea0ef59d7a1a%40anastigmatix.net




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Amit Kapila
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila  wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila  wrote:
>
> 7.
> +attach_undo_log(UndoLogCategory category, Oid tablespace)
> {
> ..
> if (candidate->meta.tablespace == tablespace)
> + {
> + logno = *place;
> + slot = candidate;
> + *place = candidate->next_free;
> + break;
> + }
>
> Here, the code is breaking from the loop, so why do we need to set
> *place?  Am I missing something obvious?
>

I think I know what I was missing.  It seems here you are removing an
element from the freelist.

One point related to detach_current_undo_log.

+ LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+ slot->pid = InvalidPid;
+ slot->meta.unlogged.xid = InvalidTransactionId;
+ if (full)
+ slot->meta.status = UNDO_LOG_STATUS_FULL;
+ LWLockRelease(&slot->mutex);

If I read the comments in structure UndoLogMetaData, it is mentioned
that 'status' is changed by an explicit WAL record, whereas there is no
WAL record in the code to change the status.  I see a problem as well if
we don't WAL log this change.  Suppose after changing the status of
this log, we allocate a new log and insert some records into that log as
well for the same transaction for which we have inserted records into
the log we just marked as FULL.  Now, here we form the link
between the two logs, as the same transaction has overflowed into a new
log.  Say we crash after this.  Now, after recovery the log won't be
marked as FULL, which means there is a chance that it can be used for
some other transaction.  If that happens, then our link for a
transaction spanning different logs will break and we won't be able
to access the data in the other log.  In short, I think it is important
to WAL log this status change unless I am missing something.
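
To make that concrete, the WAL logging could follow the usual pattern, as in
the sketch below.  This is only an illustration: the record struct, rmgr id
and info flag (xl_undolog_mark_full, RM_UNDOLOG_ID, XLOG_UNDOLOG_MARK_FULL)
are names assumed here, and the real patch set may spell all of this
differently.

  static void
  UndoLogMarkFull(UndoLogSlot *slot)
  {
      xl_undolog_mark_full xlrec;    /* assumed record type */

      xlrec.logno = slot->logno;

      START_CRIT_SECTION();

      LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
      slot->meta.status = UNDO_LOG_STATUS_FULL;
      LWLockRelease(&slot->mutex);

      XLogBeginInsert();
      XLogRegisterData((char *) &xlrec, sizeof(xlrec));
      (void) XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_MARK_FULL);

      END_CRIT_SECTION();
  }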

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_receivewal documentation

2019-07-24 Thread Jesper Pedersen

Hi,

On 7/23/19 10:29 PM, Michael Paquier wrote:

Thanks.  Applied down to 9.6 where remote_apply has been introduced,
with tweaks for 9.6 as the tool is named pg_receivexlog there.


Thanks to everybody involved !

Best regards,
 Jesper




Issue in to_timestamp/to_date while handling the quoted literal string

2019-07-24 Thread Suraj Kharage
Hi,

I noticed an issue in to_timestamp()/to_date() while handling double-quoted
literal strings. If any double-quoted literal characters are found in the
format, we generate NODE_TYPE_CHAR in the parsed format and store the
actual character in FormatNode->character. In DCH_from_char, we just
increment the input string by the length of the character for NODE_TYPE_CHAR.
We are actually not matching these characters against the input string, and
because of this, date values get changed if the quoted literal string is not
identical in the input and format strings.

e.g:

postgres@78619=#select to_timestamp('2019-05-24T23:12:45',
'-mm-dd"TT"hh24:mi:ss');
   to_timestamp
---
 2019-05-24 03:12:45+05:30
(1 row)


In the above example, the quoted string is 'TT', so it just increments the input
string by 2 while handling these characters and returns the wrong hour
value.

My suggestion is to match the exact characters from the quoted literal string
against the input string, and throw an error if they don't match.

Attached is the POC patch which almost works for all scenarios except for
whitespace - as a quote character.

Suggestions?
-- 
--

Thanks & Regards,
Suraj kharage,
EnterpriseDB Corporation,
The Postgres Database Company.


to_timestamp_quoted_string_POC.patch
Description: Binary data


RE: seems like a bug in pgbench -R

2019-07-24 Thread Imai, Yoshikazu
Hi Fabien,

On Fri, Mar 15, 2019 at 4:17 PM, Fabien COELHO wrote:
> >> echo 'select 1' > select.sql
> >> 
> >> while /bin/true; do
> >>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
> >>  date;
> >> done;
> >
> > Indeed. I'll look at it over the weekend.
> >
> >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> >> one of the other commits touching this part of the code.
> 
> I could not reproduce this issue on head, but I confirm on 11.2.

I could reproduce the hang on 11.4.

On Sat, Mar 16, 2019 at 10:14 AM, Fabien COELHO wrote:
> Attached is a fix to apply on pg11.

I confirm the hang doesn't happen after applying your patch.

It passes make check-world.

This change seems not to affect performance, so I didn't do any performance
test.

> + /* under throttling we may have finished the last client above 
> */
> + if (remains == 0)
> + break;

If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP or CSTATE_THROTTLE clients,
a thread needs to wait for results or sleep. In that logic, there is a case
where a thread tries to wait for results when there are no clients waiting for
results, and this causes the issue. This happens when there are only
CSTATE_THROTTLE clients and the pgbench timeout occurs. Those clients will be
finished and "remains" will be 0.

I confirmed the above code prevents such a case.


I almost think this is ready for committer, but I have one question.

Would it be better to add a check like if (maxsock != -1) before the select?


else						/* no explicit delay, select without timeout */
{
	nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
}

--
Yoshikazu Imai




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Amit Kapila
On Wed, Jul 24, 2019 at 2:45 PM Amit Kapila  wrote:
>
> On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila  wrote:
> >
> > On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro  wrote:
> > > Yep, that was completely wrong.  Here's a new version.
>
> 10.
> I think UndoLogAllocate can leak allocation of slots.  It first
> allocates the slot for a new log from the free pool if there is no
> existing slot/log, writes a WAL record, and then at a later point of
> time it actually creates the required physical space in the log via
> extend_undo_log, which also writes a separate WAL record.  Now, if there is
> an error between these two operations, then we will have a redundant slot
> allocated.  What if there are repeated errors of this kind from
> multiple backends, after which the system crashes?  Now, after restart, we
> will allocate multiple slots for different lognos which don't have any
> actual (physical) logs.  This might not be a big problem in practice
> because the chances of an error between the two operations are low, but
> can't we delay the WAL logging for the allocation of a slot for a new log?
>

After sending this email, I was browsing the previous comments I raised
for this patch, and it seems this same point was raised
previously [1] as well, and there were a few additional questions related
to it (see point 1 in email [1]).


[1] - 
https://www.postgresql.org/message-id/CAA4eK1LDctrYeZ8ev1N1v-8KwiigAmNMx%3Dt-UTs9qgEFt%2BP0XQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Amit Kapila
On Thu, Jul 18, 2019 at 5:10 PM Amit Kapila  wrote:
>
> On Mon, Jul 1, 2019 at 1:24 PM Thomas Munro  wrote:
> >
> > On Fri, Jun 28, 2019 at 6:09 AM Robert Haas  wrote:
> > > I happened to open up 0001 from this series, which is from Thomas, and
> > > I do not think that the pg_buffercache changes are correct.  The idea
> > > here is that the customer might install version 1.3 or any prior
> > > version on an old release, then upgrade to PostgreSQL 13. When they
> > > do, they will be running with the old SQL definitions and the new
> > > binaries.  At that point, it sure looks to me like the code in
> > > pg_buffercache_pages.c is going to do the Wrong Thing.  [...]
> >
> > Yep, that was completely wrong.  Here's a new version.
> >
>
> One comment/question related to
> 0022-Use-undo-based-rollback-to-clean-up-files-on-abort.patch.
>

I have done some more review of undolog patch series and here are my comments:
0003-Add-undo-log-manager.patch
1.
allocate_empty_undo_segment()
{
..
..
if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+ {
+ char*parentdir;
+
+ if (errno != ENOENT || !InRecovery)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ undo_path)));
+
+ /*
+ * In recovery, it's possible that the tablespace directory
+ * doesn't exist because a later WAL record removed the whole
+ * tablespace.  In that case we create a regular directory to
+ * stand in for it.  This is similar to the logic in
+ * TablespaceCreateDbspace().
+ */
+
+ /* create two parents up if not exist */
+ parentdir = pstrdup(undo_path);
+ get_parent_directory(parentdir);
+ get_parent_directory(parentdir);
+ /* Can't create parent and it doesn't already exist? */
+ if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)


All of this code is almost the same as the code we have in
TablespaceCreateDbspace; still, there are small differences, like here you
are using mkdir instead of MakePGDirectory, which as far as I can see
uses similar permissions for creating the directory.  Also, TablespaceCreateDbspace
checks whether the directory exists before trying to create it.  Is there a
reason why we need to do a few things differently here?  If not, can't
both places use one common function?

2.
allocate_empty_undo_segment()
{
..
..
/* Flush the contents of the file to disk before the next checkpoint. */
+ undofile_request_sync(logno, end / UndoLogSegmentSize, tablespace);
..
}

+void
+undofile_request_sync(UndoLogNumber logno, BlockNumber segno, Oid tablespace)
+{
+ char path[MAXPGPATH];
+ FileTag tag;
+
+ INIT_UNDOFILETAG(tag, logno, tablespace, segno);
+
+ /* Try to send to the checkpointer, but if out of space, do it here. */
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))


The comment in allocate_empty_undo_segment indicates that the code
wants to flush before checkpoint, but the actual function tries to
register the request with checkpointer.  Shouldn't this be similar to
XLogFileInit where we use pg_fsync to flush the contents immediately?
I guess that will avoid what you have written in comments in the same
function (we just want to make sure that the filesystem has allocated
physical blocks for it so that non-COW filesystems will report ENOSPC
now rather than later when space is needed).  OTOH, I think it is
performance-wise better to postpone the work to checkpointer.  If we
want to push this work to checkpointer, then we might need to change
comments or alternatively, we might want to use bigger segment sizes
to mitigate the performance effect.

If my above understanding is correct and the reason to fsync
immediately is to reserve space now, then we also need to think
whether we are always safe in postponing the work?  Basically, if this
means that it can fail when we are actually trying to write undo, then
it could be risky because we could be in the critical section at that
time.  I am not sure about this point, rather it is just to discuss if
there are any impacts of postponing the fsync work.

Another thing is that recently, in commit 475861b261 (committed by you),
we have introduced a mechanism to not fill the files with zeros for
certain filesystems like ZFS.  Do we want similar behavior for undo
files?

3.
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+ UndoLogSlot *slot;
+ size_t end;
+
+ slot = find_undo_log_slot(logno, false);
+
+ /* TODO review interlocking */
+
+ Assert(slot != NULL);
+ Assert(slot->meta.end % UndoLogSegmentSize == 0);
+ Assert(new_end % UndoLogSegmentSize == 0);
+ Assert(InRecovery ||
+CurrentSession->attached_undo_slots[slot->meta.category] == slot);

Can you write some comments explaining the above Asserts?  Also, can
you explain what interlocking issues you are worried about here?

4.
while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, slot->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
+
+ /* Flush the directory entries before next checkpoint. */
+ undofile_request_sync_dir(slot->meta.tablespace);

I see that at two places after 

Re: On the stability of TAP tests for LDAP

2019-07-24 Thread Thomas Munro
On Wed, Jul 24, 2019 at 7:50 PM Michael Paquier  wrote:
> Perhaps this worked on freebsd?  Now that I test it, the test gets
> stuck on my Debian box:
> # waiting for slapd to accept requests...
> # Running: ldapsearch -h localhost -p 49534 -s base -b
> dc=example,dc=net -n 'objectclass=*'
> SASL/DIGEST-MD5 authentication started
> Please enter your password:
> ldap_sasl_interactive_bind_s: Invalid credentials (49)
> additional info: SASL(-13): user not found: no secret in
> database

Huh, yeah, I don't know why slapd requires credentials on Debian, when
the version that ships with FreeBSD is OK with an anonymous
connection.  Rather than worrying about that, I just adjusted it to
supply the credentials.  It works on both for me.

> pgperltidy complains about the patch indentation using perltidy
> v20170521 (version mentioned in tools/pgindent/README).

Fixed.

-- 
Thomas Munro
https://enterprisedb.com


wait-for-slapd-v3.patch
Description: Binary data


Re: [bug fix] Produce a crash dump before main() on Windows

2019-07-24 Thread Kyotaro Horiguchi
Hello.

On Wed, Jul 24, 2019 at 2:31 AM Alvaro Herrera  wrote:
>
> On 2019-Jul-23, Tom Lane wrote:
>
> > Kyotaro Horiguchi  writes:
>
> > > My investigation convinced me that there is no way for a process
> > > to detect wheter it is running as a service (except the process
> > > directly called from system (aka entry function)).
>
> Maybe we can try calling pgwin32_is_service()?

Ah, it might be viable. (I'm not sure, though.)  I didn't think that
a process property we enforce ourselves was available.

> > But I will say that in my experience, behavioral differences between
> > Postgres started manually and Postgres started as a daemon are bad.
> > So I think going out of our way to make the cases behave differently
> > on Windows is probably not a good plan.
>
> We already have such a difference in Windows -- elog.c uses it
> extensively to use the eventlog to print if a service, console print if
> not.
>
> Given this thread's OP, ISTM failing to do likewise for this case makes
> debugging problems difficult for no good reason.

The difference in internal behavior is added to unify the external
(apparent) behavior. It prevents the WER dialog from showing while
running on a console. A service doesn't show a dialog even if WER is
enabled. The remaining issue is that we cannot obtain a dump when running
on a console without being blocked by a dialog, but Windows just doesn't
allow it. (It might be possible if we ignore Windows 7/8 and earlier.)

regards
-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)

2019-07-24 Thread Martijn van Oosterhout
On Tue, 23 Jul 2019 at 23:32, Tom Lane  wrote:
>
> Martijn van Oosterhout  writes:
> > I mean tracking the listening backends specifically, so you can
> > replace the loops:
> > for (i=0; i < MaxBackends; i++)
> > with
> > for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = 
> > QUEUE_NEXT_LISTENING_BACKEND(i))
>
> Ah ... but surely you would not put such info in AsyncQueueEntry,
> where there'd be a separate copy for each message.  I think you
> meant to add the info to AsyncQueueControl.

Umm, yeah. Got that mixed up.

> It might be better to redefine the backends[] array as being mostly
> contiguous (ie, a new backend takes the first free slot not the one
> indexed by its own BackendId), at the price of needing to store
> BackendId in each slot explicitly instead of assuming it's equal to
> the array index.  I suspect the existing data structure is putting too
> much of a premium on making sizeof(QueueBackendStatus) a power of 2.

This would require adding a "MyListenerId" to each backend, which I'm not sure
helps readability. And there's a chance of mixing the ids up. The
power-of-2-ness is, I think, indeed overrated.

I'll give it a shot and see how it looks.

Have a nice day,

-- 
Martijn van Oosterhout  http://svana.org/kleptog/




Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2019-07-24 Thread Kyotaro Horiguchi
Sorry in advance for link-breaking message forced by gmail..

https://www.postgresql.org/message-id/flat/20190202083822.gc32...@gust.leadboat.com

> 1. The result of the test is valid only until we release the SLRU ControlLock,
>which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to evaluate
>segments for deletion.  Once we release that lock, latest_page_number can
>advance.  This creates a TOCTOU race condition, allowing excess deletion:
>
>
>[local] test=# table trunc_clog_concurrency ;
>ERROR:  could not access status of transaction 2149484247
>DETAIL:  Could not open file "pg_xact/0801": No such file or directory.

It seems like some other vacuum process saw a larger cutoff page? If I'm
not missing something, the missing page is no longer the
"recently-populated" page at that time (which I understand as the last
page that holds valid data). Couldn't we just ignore ENOENT there?

> 2. By the time the "apparent wraparound" test fires, we've already WAL-logged
>the truncation.  clog_redo() suppresses the "apparent wraparound" test,
>then deletes too much.  Startup then fails:

I agree that if the truncation is skipped after issuing the WAL record, it will
lead to data loss at the next recovery. But the following log:

>881997 2019-02-10 02:53:32.105 GMT FATAL:  could not access status of 
> transaction 708112327
>881997 2019-02-10 02:53:32.105 GMT DETAIL:  Could not open file 
> "pg_xact/02A3": No such file or directory.
>881855 2019-02-10 02:53:32.107 GMT LOG:  startup process (PID 881997) 
> exited with exit code 1

If it came from the same cause as 1, the message is simply ignorable, so
having recovery stop on the error is unacceptable; the ENOENT is just
ignorable for the same reason.

As a result, I agree with (a) (fix rounding) and (c) (test
wrap-around before writing WAL), but I'm not sure about the others. An
additional fix for the ignorable ENOENT is also needed.

What do you think about this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pgbench - allow to create partitioned tables

2019-07-24 Thread Fabien COELHO


Hello Simon,


While doing some performance tests and reviewing patches, I needed to
create partitioned tables. Given the current syntax this is time
consuming.


Good idea. I wonder why we didn't have it already.


Probably because I did not have to create partitioned tables for some
testing. :-)



  sh> pgbench -i -s 1 --partition-number=$N --partition-type=hash


Given current naming of options, I would call this
--partitions=number-of-partitions and --partition-method=hash


Ok.


  # then run
  sh> pgbench -S -M prepared -P 1 -T 10

  # and look at latency:
  # no parts = 0.071 ms
  #   1 hash = 0.071 ms (did someone optimize this case?!)
  #   2 hash ~ 0.126 ms (+ 0.055 ms)
  #  50 hash ~ 0.155 ms
  # 100 hash ~ 0.178 ms
  # 150 hash ~ 0.232 ms
  # 200 hash ~ 0.279 ms
  # overhead ~ (0.050 + [0.0005-0.0008] * nparts) ms


Is it linear?


Good question. I would have hoped affine, but this is not very clear on 
these data, which are the median of about five runs, hence the bracket on 
the slope factor. At least it is increasing with the number of partitions. 
Maybe it would be clearer on the minimum of five runs.


--
Fabien.
diff --git a/doc/src/sgml/ref/pgbench.sgml
index 816f9cc4c7..3e8e292e39 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -306,6 +306,32 @@ pgbench  options  d
   
  
 
+ 
+  --partitions=NUM
+  
+   
+Create a partitioned pgbench_accounts table with
+NUM partitions of nearly equal size for
+the scaled number of accounts.
+Default is 0, meaning no partitioning.
+   
+  
+ 
+
+ 
+  --partition-method=NAME
+  
+   
+Create a partitioned pgbench_accounts table with
+NAME method.
+Expected values are range or hash.
+This option is only taken into account if
+--partitions is non-zero.
+Default is range.
+   
+  
+ 
+
  
   --tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 570cf3306a..6819b4e433 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -186,6 +186,11 @@ int64		latency_limit = 0;
 char	   *tablespace = NULL;
 char	   *index_tablespace = NULL;
 
+/* partitioning for pgbench_accounts table, 0 for no partitioning */
+int 		partitions = 0;
+enum { PART_RANGE, PART_HASH }
+			partition_method = PART_RANGE;
+
 /* random seed used to initialize base_random_sequence */
 int64		random_seed = -1;
 
@@ -617,6 +622,9 @@ usage(void)
 		   "  --foreign-keys   create foreign key constraints between tables\n"
 		   "  --index-tablespace=TABLESPACE\n"
 		   "   create indexes in the specified tablespace\n"
+		   "  --partitions=NUM partition account table in NUM parts (defaults: 0)\n"
+		   "  --partition-method=(range|hash)\n"
+		   "   partition account table with this method (default: range)\n"
 		   "  --tablespace=TABLESPACE  create tables in the specified tablespace\n"
 		   "  --unlogged-tablescreate tables as unlogged tables\n"
 		   "\nOptions to select what to run:\n"
@@ -3601,6 +3609,17 @@ initDropTables(PGconn *con)
 	 "pgbench_tellers");
 }
 
+/*
+ * add fillfactor percent option if not 100.
+ */
+static void
+append_fillfactor(char *opts, int len)
+{
+	if (fillfactor < 100)
+		snprintf(opts + strlen(opts), len - strlen(opts),
+ " with (fillfactor=%d)", fillfactor);
+}
+
 /*
  * Create pgbench's standard tables
  */
@@ -3625,6 +3644,7 @@ initCreateTables(PGconn *con)
 		const char *bigcols;	/* column decls if accountIDs are 64 bits */
 		int			declare_fillfactor;
 	};
+
 	static const struct ddlinfo DDLs[] = {
 		{
 			"pgbench_history",
@@ -3651,11 +3671,10 @@ initCreateTables(PGconn *con)
 			1
 		}
 	};
-	int			i;
 
 	fprintf(stderr, "creating tables...\n");
 
-	for (i = 0; i < lengthof(DDLs); i++)
+	for (int i = 0; i < lengthof(DDLs); i++)
 	{
 		char		opts[256];
 		char		buffer[256];
@@ -3664,9 +3683,17 @@ initCreateTables(PGconn *con)
 
 		/* Construct new create table statement. */
 		opts[0] = '\0';
-		if (ddl->declare_fillfactor)
+
+		/* Partition pgbench_accounts table */
+		if (partitions >= 1 && strcmp(ddl->table, "pgbench_accounts") == 0)
+		{
 			snprintf(opts + strlen(opts), sizeof(opts) - strlen(opts),
-	 " with (fillfactor=%d)", fillfactor);
+	 " partition by %s (aid)",
+	 partition_method == PART_RANGE ? "range" : "hash");
+		}
+		else if (ddl->declare_fillfactor)
+			append_fillfactor(opts, sizeof(opts));
+
 		if (tablespace != NULL)
 		{
 			char	   *escape_tablespace;
@@ -3686,6 +3713,54 @@ initCreateTables(PGconn *con)
 
 		executeStatement(con, buffer);
 	}
+
+	if (partitions >= 1)
+	{
+		int64		part_size = (naccounts * (int64) scale + partitions - 1) / partitions;
+		char		ff[64];
+		ff[0] = '\0';
+		append_fillfactor(ff, sizeof(ff));
+
+		fprintf(stderr, "creating %d 

Re: pgbench tests vs Windows

2019-07-24 Thread Fabien COELHO


Hello Andrew,


Unfortunately, this isn't portable, as I've just discovered at the cost
of quite a bit of time. In particular, you can't assume expr is present
and in the path on Windows. The Windows equivalent would be something like:

   \setshell two\
     @set /a c = 1 + :one  && echo %c%


Hmmm... Can we assume that echo is really always there on Windows? If so, 
the attached patch does something only with "echo".



I propose to prepare a patch along these lines. Alternatively we could
just drop it - I don't think the test matters all that hugely.


The point is to have some minimal coverage so that unexpected changes are 
caught. This is the only call to a working \setshell.


--
Fabien.
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 5a2fdb9acb..b82d3f65c4 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -512,7 +512,7 @@ pgbench(
qr{processed: 1/1},
qr{shell-echo-output}
],
-   [qr{command=8.: int 2\b}],
+   [qr{command=8.: int 1\b}],
'pgbench backslash commands',
{
'001_pgbench_backslash_commands' => q{-- run set
@@ -524,10 +524,10 @@ pgbench(
 \sleep 0 s
 \sleep :zero
 -- setshell and continuation
-\setshell two\
-  expr \
-1 + :one
-\set n debug(:two)
+\setshell another_one\
+  echo \
+:one
+\set n debug(:another_one)
 -- shell
 \shell echo shell-echo-output
 }


Re: On the stability of TAP tests for LDAP

2019-07-24 Thread Michael Paquier
On Wed, Jul 24, 2019 at 05:47:13PM +1200, Thomas Munro wrote:
> Thanks, here's v2.

Perhaps this worked on freebsd?  Now that I test it, the test gets
stuck on my Debian box:
# waiting for slapd to accept requests...
# Running: ldapsearch -h localhost -p 49534 -s base -b
dc=example,dc=net -n 'objectclass=*'
SASL/DIGEST-MD5 authentication started
Please enter your password:
ldap_sasl_interactive_bind_s: Invalid credentials (49)
additional info: SASL(-13): user not found: no secret in
database

pgperltidy complains about the patch indentation using perltidy
v20170521 (version mentioned in tools/pgindent/README).
--
Michael


signature.asc
Description: PGP signature


Re: benchmarking Flex practices

2019-07-24 Thread John Naylor
On Sun, Jul 21, 2019 at 3:14 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > The pre-existing ecpg var "state_before" was a bit confusing when
> > combined with the new var "state_before_quote_stop", and the former is
> > also used with C-comments, so I decided to go with
> > "state_before_lit_start" and "state_before_lit_stop". Even though
> > comments aren't literals, it's less of a stretch than referring to
> > quotes. To keep things consistent, I went with the latter var in psql
> > and core.
>
> Hm, what do you think of "state_before_str_stop" instead?  It seems
> to me that both "quote" and "lit" are pretty specific terms, so
> maybe we need something a bit vaguer.

Sounds fine to me.

> While poking at that, I also came across this unhappiness:
>
> regression=# select u&'foo' uescape 'bogus';
> regression'#
>
> that is, psql thinks we're still in a literal at this point.  That's
> because the uesccharfail rule eats "'b" and then we go to INITIAL
> state, so that consuming the last "'" puts us back in a string state.
> The backend would have thrown an error before parsing as far as the
> incomplete literal, so it doesn't care (or probably not, anyway),
> but that's not an option for psql.
>
> My first reaction as to how to fix this was to rip the xuend and
> xuchar states out of psql, and let it just lex UESCAPE as an
> identifier and the escape-character literal like any other literal.
> psql doesn't need to account for the escape character's effect on
> the meaning of the Unicode literal, so it doesn't have any need to
> lex the sequence as one big token.  I think the same is true of ecpg
> though I've not looked really closely.
>
> However, my second reaction was that maybe you were on to something
> upthread when you speculated about postponing de-escaping of
> Unicode literals into the grammar.  If we did it like that then
> we would not need to have this difference between the backend and
> frontend lexers, and we'd not have to worry about what
> psql_scan_in_quote should do about the whitespace before and after
> UESCAPE, either.
>
> So I'm feeling like maybe we should experiment to see what that
> solution looks like, before we commit to going in this direction.
> What do you think?

Given the above wrinkles, I thought it was worth trying. Attached is a
rough patch (don't mind the #include mess yet :-) ) that works like
this:

The lexer returns UCONST from xus and UIDENT from xui. The grammar has
rules that are effectively:

SCONST { do nothing}
| UCONST { esc char is backslash }
| UCONST UESCAPE SCONST { esc char is from $3 }

...where UESCAPE is now an unreserved keyword. To prevent shift-reduce
conflicts, I added UIDENT to the %nonassoc precedence list to match
IDENT, and for UESCAPE I added a %left precedence declaration. Maybe
there's a more principled way. I also added an unsigned char type to
the %union, but it worked fine on my compiler without it.

litbuf_udeescape() and check_uescapechar() were moved to gram.y. The
former had to be massaged to give error messages similar to HEAD. They're
not quite identical, but the position info is preserved. Some of the
functions I moved around don't seem to have any test coverage, so I
should eventually do some work in that regard.

Notes:

-Binary size is very close to v6. That is to say the grammar tables
grew by about the same amount the scanner table shrank, so the binary
is still about  200kB smaller than HEAD.
-Performance is very close to v6 with the information_schema and
pgbench-like queries with standard strings, which is to say also very
close to HEAD. When the latter was changed to use Unicode escapes,
however, it was about 15% slower than HEAD. That's a big regression
and I haven't tried to pinpoint why.
-psql was changed to follow suit. It doesn't think it's inside a
string with your too-long escape char above, and it removes all blank
lines from this query output:

$ cat >> test-uesc-lit.sql
SELECT

u&'!0041'

uescape

'!'

as col
;


On HEAD and v6 I get this:

$ ./inst/bin/psql -a -f test-uesc-lit.sql

SELECT
u&'!0041'

uescape
'!'
as col
;
 col
-
 A
(1 row)


-The ecpg changes here are only the bare minimum from HEAD to get it
to compile, since I'm borrowing its additional token names (although
they mean slightly different things). After a bit of experimentation,
it's clear there's a bit more work needed to get it functional, and
it's not easy to debug, so I'm putting that off until we decide
whether this is the way forward.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v7-draft-handle-uescapes-in-parser.patch
Description: Binary data


Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-24 Thread David Rowley
On Wed, 24 Jul 2019 at 16:16, David Rowley  wrote:
>
> On Wed, 24 Jul 2019 at 15:05, David Rowley  
> wrote:
> > To be able to reduce the threshold down again we'd need to make a
> > hash_get_num_entries(LockMethodLocalHash) call before performing the
> > guts of LockReleaseAll(). We could then weight that onto some running
> > average counter with a weight of, say... 10, so we react to changes
> > fairly quickly, but not instantly. We could then have some sort of
> > logic like "rebuild the hash table if running average 4 times less
> > than max_bucket"
> >
> > I've attached a spreadsheet of that idea and the algorithm we could
> > use to track the running average.  Initially, I've mocked it up a
> > series of transactions that use 1000 locks, then at row 123 dropped
> > that to 10 locks. If we assume the max_bucket is 1000, then it takes
> > until row 136 for the running average to drop below the max_bucket
> > count, i.e 13 xacts. There we'd reset there at the init size of 16. If
> > the average went up again, then we'd automatically expand the table as
> > we do now.  To make this work we'd need an additional call to
> > hash_get_num_entries(), before we release the locks, so there is more
> > overhead.
>
> Here's a patch with this implemented. I've left a NOTICE in there to
> make it easier for people to follow along at home and see when the
> lock table is reset.

Here's a more polished version with the debug code removed, complete
with benchmarks.

-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 14) x;
analyze hp;

select3.sql: select * from hp where b = 1

-
Master:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2098.628975 (excluding connections establishing)
tps = 2101.797672 (excluding connections establishing)
tps = 2085.317292 (excluding connections establishing)
tps = 2094.931999 (excluding connections establishing)
tps = 2092.328908 (excluding connections establishing)

Patched:

$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2101.691459 (excluding connections establishing)
tps = 2104.533249 (excluding connections establishing)
tps = 2106.499123 (excluding connections establishing)
tps = 2104.033459 (excluding connections establishing)
tps = 2105.463629 (excluding connections establishing)

(I'm surprised there is not more overhead in the additional tracking
added to calculate the running average)
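
For anyone following along, the running-average bookkeeping discussed
up-thread boils down to something like the sketch below.  The names are
invented and the threshold test is simplified to a high-water mark rather
than dynahash's internal bucket count, so treat it purely as illustration
(the real code is in the attached patch):

  #define LOCKS_EMA_WEIGHT    10
  #define LOCKHASH_INIT_SIZE  16

  static double  average_locks = 0.0;
  static long    peak_locks = 0;    /* high-water mark since last rebuild */

  static void
  MaybeShrinkLocalLockHash(void)
  {
      long    nlocks = hash_get_num_entries(LockMethodLocalHash);

      /* running average reacting over roughly ten transactions */
      average_locks += ((double) nlocks - average_locks) / LOCKS_EMA_WEIGHT;

      if (nlocks > peak_locks)
          peak_locks = nlocks;

      /* table grew far beyond what recent transactions need: start over */
      if (peak_locks > average_locks * 4)
      {
          RebuildLocalLockHash(LOCKHASH_INIT_SIZE);    /* hypothetical helper */
          peak_locks = 0;
      }
  }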

-- Test 2. Tests a prepared query which will build a generic plan on
the 6th execution, then fall back to a custom plan due to it pruning all
but one partition.  Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

Master:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 10207.780843 (excluding connections establishing)
tps = 10205.772688 (excluding connections establishing)
tps = 10214.896449 (excluding connections establishing)
tps = 10157.572153 (excluding connections establishing)
tps = 10147.764477 (excluding connections establishing)

Patched:

$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 14677.636570 (excluding connections establishing)
tps = 14661.437186 (excluding connections establishing)
tps = 14647.202877 (excluding connections establishing)
tps = 14784.165753 (excluding connections establishing)
tps = 14850.355344 (excluding connections establishing)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


shrink_bloated_locallocktable_v8.patch
Description: Binary data


RE: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs

2019-07-24 Thread Kyotaro Horiguchi
Sorry in advance for a link-breaking message forced by gmail.

https://www.postgresql.org/message-id/flat/cy4pr2101mb0804ce9836e582c0702214e8aa...@cy4pr2101mb0804.namprd21.prod.outlook.com

I assume that we have a consensus about the problem we are to fix
here.

> 0a 0004`8080cc30 0004`80dcf917 postgres!PGSemaphoreLock+0x65 
> [d:\orcasqlagsea10\14\s\src\backend\port\win32_sema.c @ 158]
> 0b 0004`8080cc90 0004`80db025c postgres!LWLockAcquire+0x137 
> [d:\orcasqlagsea10\14\s\src\backend\storage\lmgr\lwlock.c @ 1234]
> 0c 0004`8080ccd0 0004`80db25db postgres!AbortBufferIO+0x2c 
> [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 3995]
> 0d 0004`8080cd20 0004`80dbce36 postgres!AtProcExit_Buffers+0xb 
> [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 2479]
> 0e 0004`8080cd50 0004`80dbd1bd postgres!shmem_exit+0xf6 
> [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 262]
> 0f 0004`8080cd80 0004`80dbccfd postgres!proc_exit_prepare+0x4d 
> [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 188]
> 10 0004`8080cdb0 0004`80ef9e74 postgres!proc_exit+0xd 
> [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 141]
> 11 0004`8080cde0 0004`80ddb6ef postgres!errfinish+0x204 
> [d:\orcasqlagsea10\14\s\src\backend\utils\error\elog.c @ 624]
> 12 0004`8080ce50 0004`80db0f59 postgres!mdread+0x12f 
> [d:\orcasqlagsea10\14\s\src\backend\storage\smgr\md.c @ 806]

OK, we are fixing this. The proposed patch arranges for LWLockReleaseAll()
to be called before the callback registered by InitBufferPoolBackend(), by
registering the former after the latter in the on_shmem_exit list. Even if it
works, I think it's neither clean nor safe to register multiple
order-sensitive callbacks.

AtProcExit_Buffers has the following comment:

> * During backend exit, ensure that we released all shared-buffer locks and
> * assert that we have no remaining pins.

And the only caller of it is shmem_exit. Moreover, all the other
call sites call LWLockReleaseAll() just before calling it. If
that's the case, why don't we just release all LWLocks in shmem_exit,
or in AtProcExit_Buffers before calling AbortBufferIO()? I think it's
sufficient that AtProcExit_Buffers calls it at the beginning. (The
comment for the function needs editing.)
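
A minimal sketch of that reordering, based on how AtProcExit_Buffers() looks
today (this is an illustration of the suggestion, not the proposed patch):

  static void
  AtProcExit_Buffers(int code, Datum arg)
  {
      /*
       * Release any LWLocks the exiting backend still holds first, so that
       * AbortBufferIO() can take the buffer's I/O lock without blocking on
       * ourselves.
       */
      LWLockReleaseAll();

      AbortBufferIO();
      UnlockBuffers();

      CheckForBufferLeaks();

      /* localbuf.c needs this chance too */
      AtProcExit_LocalBuffers();
  }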

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Dilip Kumar
On Wed, Jul 24, 2019 at 11:28 AM Rushabh Lathia wrote:
>
> Hi,
>
> I have stated review of
> 0008-Provide-interfaces-to-store-and-fetch-undo-records.patch, here are few
> quick comments.

Thanks for the review. I will work on them soon and post the updated
patch along with other comments.  I have noticed that some comments are
pointing to code which is not part of this patch,
for example:
>
> 5) In function UndoBlockGetFirstUndoRecord() below code:
>
> /* Calculate the size of the partial record. */
> partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
>phdr->tuple_len + phdr->payload_len -
>phdr->record_offset;
>
> can directly use UndoPagePartialRecSize().

UndoBlockGetFirstUndoRecord is added in the 0014 patch; I think you got
confused because this code is in the undoaccess.c file.  But actually, a
later patch in the set added some code to undoaccess.c.  Basically, this
comment needs to be addressed, but under another patch.  I am pointing it out
so that we don't miss this.

> 8)
>
> /*
>  * Defines the number of times we try to wait for rollback hash table to get
>  * initialized.  After these many attempts it will return error and the user
>  * can retry the operation.
>  */
> #define ROLLBACK_HT_INIT_WAIT_TRY  60
>
> %s/error/an error
>
This macro is also added under 0014.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Seek failure at end of FSM file during WAL replay (in 11)

2019-07-24 Thread Michael Paquier
Hi all,

Recently, one of the test beds we use has blown up once when doing
streaming replication like that:
FATAL:  could not seek to end of file "base/16386/19817_fsm": No such
   file or directory 
CONTEXT:  WAL redo at 60/8DA22448 for Heap2/CLEAN: remxid 65751197
LOG:  startup process (PID 44886) exited with exit code 1

All the WAL records have been wiped out since, so I don't know exactly
what happened, but I could track down that this FSM file got removed
a couple of hours before as I got my hands on some FS-level logs which
showed a deletion.

This happens in the context of a WAL record XLOG_HEAP2_CLEAN, and the
redo logic is in heap_xlog_clean(), where there are FSM lookups within
XLogRecordPageWithFreeSpace() -> XLogReadBufferExtended().  At the
subsequent restart, recovery has been able to move on after the
failing record, so the FSM has been rebuilt correctly, still that
caused an HA setup to be less...  Available.  However, we are rather
careful in those code paths to call smgrcreate() so as the file gets
created at redo if it is not around.  Before blaming a lower level of
the application stack, I am wondering if we have some issues with
mdfd_vfd meaning that the file has been removed but that it is still
tracked as open.  A quick look at the code does not show any
issues; has anyone seen this particular error recently?

The last commit on REL_11_STABLE which touched this area is this one
FWIW:
commit: 6872c2be6a97057aa736110e31f0390a53305c9e
author: Alvaro Herrera 
date: Wed, 15 Aug 2018 18:09:29 -0300
Update FSM on WAL replay of page all-visible/frozen

Also, this setup was using 11.2 (I know this one lags behind a bit,
anyway...).

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: Problem during Windows service start

2019-07-24 Thread Kyotaro Horiguchi
Sorry in advance for link-breaking message, but the original mail was
too old and gmail doesn't allow me to craft required headers to link
to it.

https://www.postgresql.org/message-id/CAKm4Xs71Ma8bV1fY6Gfz9Mg3AKmiHuoJNpxeDVF_KTVOKoy1WQ%40mail.gmail.com

> Please find the proposed patch for review. I will attach it to
> commitfest as well

Pacemaker suffers from the same thing. We suggest to our customers: "if
recovery is expected to take long enough to reach the timeout, start the
server alone to perform recovery, then start Pacemaker".

I don't think it is a good thing to report the status SERVICE_RUNNING although
it actually is not (yet). I think the right direction here is that, if
pg_ctl returns due to timeout, pgwin32_ServiceMain kills the starting
server and then reports something like "timed out and the server was stopped;
please make sure the server does not take a long time to perform
recovery.".

Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_basebackup delays closing of stdout

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-23 22:16:26 -0400, Jeff Janes wrote:
> Ever since pg_basebackup was created, it had a comment like this:
> 
>  * End of chunk. If requested, and this is the base tablespace
>  * write configuration file into the tarfile. When done, close the
>  * file (but not stdout).
> 
> But, why make the exception for output going to stdout?  If we are done
> with it, why not close it?

I think closing stdout is a bad idea that can cause issues in a lot of
situations. E.g. because a later open() will then use that fd (the
lowest unused fd always gets used), and then the next time somebody
wants to write something to stdout, there's normal output interspersed
with some random file.  You'd at the least have to reopen /dev/null into
its place or such.
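
A tiny standalone illustration of that fd-reuse trap (this is generic C, not
pg_basebackup's actual code):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int
  main(void)
  {
      close(STDOUT_FILENO);

      /* The lowest free descriptor is handed out, so this is very likely fd 1. */
      int     fd = open("scratch.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

      /* Meant for the terminal, but it now lands in scratch.dat. */
      printf("surprise: stdout output goes to fd %d's file\n", fd);
      fflush(stdout);

      /* The usual remedy: park /dev/null on fd 1 instead of leaving it closed
       * (leaks the temporary descriptor, which is fine for a demo). */
      dup2(open("/dev/null", O_WRONLY), STDOUT_FILENO);

      return 0;
  }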

It also seems likely to be a trap for some future feature additions that
want to write another file to stdout or such - in contrast to the normal
files it can't be reopened.


> After a massive maintenance operation, I want to re-seed a streaming
> standby, which I start to do by:
> 
> pg_basebackup -D - -Ft -P -X none  | pxz > base.tar.xz
> 
> But the archiver is way behind, so when it finishes the basebackup part, I
> get:
> 
> NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to
> be archived
> WARNING:  pg_stop_backup still waiting for all required WAL segments to be
> archived (60 seconds elapsed)
> ...
> 
> The base backup file is not finalized, because pg_basebackup has not closed
> its stdout while waiting for the WAL segment to be archived. The file is
> incomplete due to data stuck in buffers, so I can't copy it to where I want
> and bring up a new streaming replica (which bypasses the WAL archive, so
> would otherwise work).

That seems more like an argument for sticking a fflush() there, than
closing stdout.


Greetings,

Andres Freund




Re: POC: Cleaning up orphaned files using undo logs

2019-07-24 Thread Andres Freund
On 2019-07-22 14:21:36 +0530, Amit Kapila wrote:
> Another thing is changing subxids to fxids can increase the size of
> two-phase file for a xact having many sub-transactions which again
> might be okay, but not completely sure.

I can't see that being a problem.




Re: Psql patch to show access methods info

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-15 22:03:31 +0300, Nikita Glukhov wrote:
> +  
> +
> +  \dAc[+]
> +[ class="parameter">access-method-pattern
> +  [ class="parameter">input-type-pattern]]
> +  
> +
> +
> +
> +Shows info index access method operator classes listed in
> +.
> +If  class="parameter">access-method-patttern
> +is specified, only operator classes associated with access method 
> whose
> +name matches pattern are shown.
> +If input-type-pattern
> +is specified, only procedures associated with families whose input 
> type
> +matches the pattern are shown.
> +If + is appended to the command name, operator 
> family
> +and owner are listed.
> +
> +
> +  
> +
> +  
> +
> +  \dAo[+]
> +[ class="parameter">access-method-pattern
> +  [ class="parameter">operator-family-pattern]]
> +  
> +
> +
> +
> +
> +Lists operators () associated
> +with access method operator families. If
> +access-method-patttern 
> is
> +specified, only operators associated with access method whose name
> +matches pattern are shown. If
> +operator-family-pattern 
> is
> +specified, only operators associated with families whose name matches
> +the pattern are shown.
> +If + is appended to the command name, displays
> +additional info.
> +
> +
> +  
> +
> +  
> +
> +  \dAp[+]
> +[ class="parameter">access-method-pattern
> +  [ class="parameter">operator-family-pattern]]
> +  
> +
> +
> +
> +Lists procedures () 
> associated
> +with access method operator families.
> +If  class="parameter">access-method-patttern
> +is specified, only procedures associated with access method whose 
> name
> +matches pattern are shown.
> +If  class="parameter">operator-family-pattern
> +is specified, only procedures associated with families whose name
> +matches the pattern are shown.
> +If + is appended to the command name, procedures
> +listed with its names.
>  

Based on a quick skim of the thread - which means I most definitely
missed things - there's not been discussion of why we actually want to
add this.  Who's the prospective user of this facility? And why wouldn't
they just query pg_am[proc]?  None of this information seems like it's
going to be even remotely targeted towards even advanced users.  For
developers it's not clear what these add?

Adding stuff to psql isn't free. It adds clutter to psql's help output,
the commands need to be maintained (including cross-version code).

Greetings,

Andres Freund




Re: initdb recommendations

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-22 13:02:13 -0400, Andrew Dunstan wrote:
> There are a few things we could do. We could force trust auth, or we
> could add an ident map that allowed $USER to login as buildfarm. Finding
> all the places we would need to fix that could be a fun project ...

Perhaps we could actually do so automatically when the initdb invoking
user isn't the same as the OS user? Imo that'd be generally quite
useful, and not just for the regression tets.

Greetings,

Andres Freund




Re: ANALYZE: ERROR: tuple already updated by self

2019-07-24 Thread Andres Freund
Hi,

On 2019-06-18 17:08:37 -0700, Andres Freund wrote:
> On 2019-06-18 18:48:58 -0500, Justin Pryzby wrote:
> > Ah: the table is an inheritance parent.  If I uninherit its child, there's 
> > no
> > error during ANALYZE.  MV stats on the child are ok:
> 
> It's a "classical" inheritance parent, not a builtin-partitioning type
> of parent, right? And it contains data?
> 
> I assume it ought to not be too hard to come up with a reproducer
> then...
> 
> I think the problem is pretty plainly that for inheritance tables we'll
> try to store extended statistics twice. And thus end up updating the
> same row twice.
> 
> > #6  0x00588346 in do_analyze_rel (onerel=0x7fee16140de8, options=2, 
> > params=0x7ffe5b6bf8b0, va_cols=0x0, acquirefunc=0x492b4, relpages=36, 
> > inh=true, in_outer_xact=false, elevel=13) at analyze.c:627
> 
>   /* Build extended statistics (if there are any). */
>   BuildRelationExtStatistics(onerel, totalrows, numrows, rows, 
> attr_cnt,
>  
> vacattrstats);
> 
> Note that for plain statistics we a) pass down the 'inh' flag to the
> storage function, b) stainherit is part of pg_statistics' key:
> 
> Indexes:
> "pg_statistic_relid_att_inh_index" UNIQUE, btree (starelid, staattnum, 
> stainherit)
> 
> 
> Tomas, I think at the very least extended statistics shouldn't be built
> when building inherited stats. But going forward I think we should
> probably have it as part of the key for pg_statistic_ext.

Tomas, ping?

Greetings,

Andres Freund




Re: Psql patch to show access methods info

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-23 01:57:29 +0300, Alexander Korotkov wrote:
> It was always scary that there is no way in psql to see am/opclass/opfamily
> information other than by querying the catalogs directly.

What does make that scary?


> I'm going to push it.  Probably, someone will find that the command syntax and
> output formats are not well discussed yet.  But we're pretty early
> in the 13 release cycle.  So, we will have time to work out any criticism
> there may be.

Please don't before we've had some discussion as to why we want this
additional code, and who'd be helped by it.

Greetings,

Andres Freund




Re: Psql patch to show access methods info

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-15 22:03:31 +0300, Nikita Glukhov wrote:
> +  
> +
> +  \dAc[+]
> +[access-method-pattern
> +  [input-type-pattern]]
> +  
> +
> +
> +
> +Shows index access method operator classes listed in
> +.
> +If access-method-pattern
> +is specified, only operator classes associated with access methods whose
> +name matches the pattern are shown.
> +If input-type-pattern
> +is specified, only operator classes whose input type matches the
> +pattern are shown.
> +If + is appended to the command name, the operator
> +family and owner are listed as well.
> +
> +
> +  
> +
> +  
> +
> +  \dAo[+]
> +[access-method-pattern
> +  [operator-family-pattern]]
> +  
> +
> +
> +
> +
> +Lists operators () associated
> +with access method operator families. If
> +access-method-pattern is
> +specified, only operators associated with access methods whose name
> +matches the pattern are shown. If
> +operator-family-pattern is
> +specified, only operators associated with families whose name matches
> +the pattern are shown.
> +If + is appended to the command name, additional
> +information is displayed.
> +
> +
> +  
> +
> +  
> +
> +  \dAp[+]
> +[access-method-pattern
> +  [operator-family-pattern]]
> +  
> +
> +
> +
> +Lists procedures () associated
> +with access method operator families.
> +If access-method-pattern
> +is specified, only procedures associated with access methods whose
> +name matches the pattern are shown.
> +If operator-family-pattern
> +is specified, only procedures associated with families whose name
> +matches the pattern are shown.
> +If + is appended to the command name, procedures are
> +listed with their names.
> 

Based on a quick skim of the thread - which means I most definitely
missed things - there's not been any discussion of why we actually want
to add this.  Who's the prospective user of this facility?  And why
wouldn't they just query pg_am / pg_amproc directly?  None of this
information seems even remotely targeted at advanced users, and for
developers it's not clear what these commands add.
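
For instance, the kind of information \dAc proposes to show can already
be obtained with a plain catalog query over pg_am, pg_opclass and
pg_opfamily, along these lines (the column selection here is just
illustrative):

SELECT am.amname               AS access_method,
       opc.opcname             AS opclass,
       opc.opcintype::regtype  AS input_type,
       opf.opfname             AS opfamily,
       opc.opcdefault          AS is_default
FROM pg_opclass opc
JOIN pg_am am        ON am.oid = opc.opcmethod
JOIN pg_opfamily opf ON opf.oid = opc.opcfamily
ORDER BY am.amname, opc.opcname;

Similar joins against pg_amop and pg_amproc cover what \dAo and \dAp
would show.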

Adding stuff to psql isn't free. It adds clutter to psql's help output,
the commands need to be maintained (including cross-version code).

Greetings,

Andres Freund




Re: Change atoi to strtol in same place

2019-07-24 Thread Andres Freund
On 2019-07-24 16:57:42 +1200, David Rowley wrote:
> On Wed, 24 Jul 2019 at 16:02, Joe Nelson  wrote:
> > > In general, I think it's a good idea to fix those places, but I wonder
> > > if we need to change the error messages too.
> >
> > I'll leave that decision for the community to debate. I did, however,
> > remove the newlines for the new error messages being passed to
> > pg_log_error().
> 
> I'd like to cast my vote against adding this complex code to each option
> validation that requires an integer.  I'm not sure there currently is a
> home for it, but if there were, wouldn't it be better to write a function
> that takes a lower and upper bound, sets an output parameter with the
> value, and returns a bool to indicate whether it's within range?

+many




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-21 21:37:28 +1200, David Rowley wrote:
> select.sql:
> \set p 1
> select * from ht where a = :p
> 
> Master:
> 
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 10172.035036 (excluding connections establishing)
> tps = 10192.780529 (excluding connections establishing)
> tps = 10331.306003 (excluding connections establishing)
> 
> Patched:
> 
> $ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 15080.765549 (excluding connections establishing)
> tps = 14994.404069 (excluding connections establishing)
> tps = 14982.923480 (excluding connections establishing)
> 
> That seems fine, 46% faster.
> 
> v6 is attached.
> 
> I plan to push this in a few days unless someone objects.
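
(The definition of "ht" isn't included in the quoted excerpt; presumably
it's a partitioned table with a large number of partitions.  The
partitioning scheme and partition count below are only assumptions for
illustration:)

CREATE TABLE ht (a int, b int) PARTITION BY HASH (a);
-- create the partitions; 1024 is an arbitrary choice here
SELECT format('CREATE TABLE ht_%s PARTITION OF ht FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', x, x)
FROM generate_series(0, 1023) x
\gexec

The point of such a setup, as I understand it, is that once many
partitions have been locked in a transaction the backend-local lock hash
table grows large, which is the case this patch targets.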

It does seem far less objectionable than the other case.  I hate to
throw one more wrench into a topic finally making progress, but: have
either of you considered just replacing the dynahash table with a
simplehash-style one?  Given the obvious speed sensitivity, and the fact
that for it (in contrast to the shared lock table) no partitioning is
needed, that seems like a good thing to try.  It seems quite possible
that both iteration and plain manipulations would be faster, due to far
fewer indirections - e.g. iterating would just be an array walk with a
known stride, which is far easier for the CPU to prefetch.

Greetings,

Andres Freund