Re: Preallocation changes in Postgresql 16

2024-04-25 Thread Thomas Munro
On Fri, Apr 26, 2024 at 4:37 AM Riku Iki  wrote:
> I am wondering if there were preallocation related changes in PG16, and if it 
> is possible to disable preallocation in PostgreSQL 16?

I have no opinion on the btrfs details, but I was wondering if someone
might show up with a system that doesn't like that change.  Here is a
magic 8, tuned on "some filesystems":

/*
 * If available and useful, use posix_fallocate() (via
 * FileFallocate()) to extend the relation. That's often more
 * efficient than using write(), as it commonly won't cause the kernel
 * to allocate page cache space for the extended pages.
 *
 * However, we don't use FileFallocate() for small extensions, as it
 * defeats delayed allocation on some filesystems. Not clear where
 * that decision should be made though? For now just use a cutoff of
 * 8, anything between 4 and 8 worked OK in some local testing.
 */
if (numblocks > 8)

I wonder if it wants to be a GUC.
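
If it did become a GUC, the mechanics would be small.  Here is a minimal
sketch (the variable name is invented for illustration; 0 would mean
"never use posix_fallocate()"):

#include <stdbool.h>

/* hypothetical setting, defaulting to the current hard-coded cutoff of 8 */
static int fallocate_extend_threshold = 8;

/* decide whether extending by numblocks should go through FileFallocate() */
static bool
use_fallocate_for_extend(int numblocks)
{
    return fallocate_extend_threshold > 0 &&
           numblocks > fallocate_extend_threshold;
}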




Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-04-03 Thread Thomas Munro
On Sat, Mar 23, 2024 at 3:01 AM Nick Renders  wrote:
> We now have a second machine with this issue: it is an Intel Mac mini running 
> macOS Sonoma (14.4) and PostgreSQL 16.2.
> This one only has a single Data directory, so there are no multiple instances 
> running.

BTW if you're running databases on mains-powered Macs, I have a patch
that you might be interested in, which so far hasn't attracted any
reviews.  The short version is that I bet you can lose at least many
seconds' worth of commits (because the WAL doesn't really reach durable
storage), and possibly also fail to recover (pg_control hits the disk
before the WAL; I'm not sure if that is really possible), if you yank
the power while using the default settings for wal_sync_method.  I'd like to
rationalise the settings for that stuff and make it safe by default.

I don't know anything about the USB storage pathway but I'd be
surprised if it's different.

https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com
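
In the meantime, the existing knob can be flipped by hand if durability
matters more than throughput.  A sketch using real settings
(fsync_writethrough asks macOS for F_FULLFSYNC):

SHOW wal_sync_method;
ALTER SYSTEM SET wal_sync_method = 'fsync_writethrough';
SELECT pg_reload_conf();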




Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-04-03 Thread Thomas Munro
On Thu, Apr 4, 2024 at 3:11 AM Nick Renders  wrote:
> In the macOS Finder, when you show the Info (command+i) for an external drive 
> (or any partition that is not the boot drive), there is a checkbox "Ignore 
> ownership on this volume" in the Permissions section. I think it is by 
> default "on" for external drives.

Hmm.  Sounds suspicious, but why would only this file be affected?

> The external SSD is an Orico drive that is connected with USB-C. It is 
> initialised as a GUID Partition Map with a single AFPS partition.
>
> We have run PostgreSQL 15 and earlier, before upgrading to 16 when it came 
> out last year. We didn't have any problems with 16 until recently, after 
> upgrading to Sonoma.

Interesting.  So the rename might have something to do with it, though
I don't have a theory for how.

Can you show what the permissions and ownership looks like for pg_*
under there, normally, and once the system reaches this state?
Something like:

tmunro@phonebox postgresql % ls -slap pgdata/global/pg_*
16 -rw---  1 tmunro  staff  8192  4 Apr 09:50 pgdata/global/pg_control
 8 -rw---  1 tmunro  staff   524  4 Apr 09:50 pgdata/global/pg_filenode.map

I'm asking for "pg_*" because I want to see pg_control as well, to
understand the permissions for the other files in the cluster, and
because I want to see if there are any stray remnants of a temporary
file, which would be called pg_filenode.map.tmp.




Re: could not open file "global/pg_filenode.map": Operation not permitted

2024-03-28 Thread Thomas Munro
On Fri, Mar 29, 2024 at 4:47 AM Nick Renders  wrote:
> Looking at the 2 machines that are having this issue (and the others that 
> don't), I think it is somehow related to the following setup:
> - macOS Sonoma (14.4 and 14.4.1)
> - data directory on an external drive
>
> That external drive (a Promise RAID system in one case, a simple SSD in the 
> other) has the option "ignore ownership" on by default. I have tried 
> disabling that, and updating the data directory to have owner + read/write 
> access for the postgres user. It seemed to work at first, but just now the 
> issue re-appeared again.
>
> Any other suggestions?

I don't have any specific ideas and I have no idea what "ignore
ownership" means ... what kind of filesystem is running on it?  For
the simple SSD, is it directly connected, running a normal Apple APFS
filesystem, or something more complicated?

I wonder if this could be related to the change in 16 which started to
rename that file:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d8cd0c6c95c0120168df93aae095df4e0682a08a

Did you ever run 15 or earlier on that system?




Re: How do you optimize the disk IO when you cannot assume a file will start at a boundary then?

2024-02-22 Thread Thomas Munro
On Thu, Feb 22, 2024 at 2:23 PM Siddharth Jain  wrote:
> I understand the storage layer in databases goes to great lengths to ensure:
> - a row does not cross a block boundary
> - read/writes/allocation happen in units of blocks
> etc. The motivation is that at the OS level, it reads and writes pages 
> (blocks), not individual bytes. I am only concerned about SSDs but I think 
> the principle applies to HDD as well.
>
> but how can we do all this when we are not even guaranteed that the beginning 
> of a file will be aligned with a block boundary? refer this.

Interesting question.  I was aware of FFS/UFS fragments 'cause I've
poked at that code a bit (one of my learn-to-hack-kernels side
projects was to add O_DSYNC support), but not some of those others
mentioned.  I don't think they are a problem for anything PostgreSQL
does.  Even with O_DIRECT, FFS (at least in its FreeBSD descendant)
just quietly skips its raw fast path and falls back to the usual
buffered path if the I/O is not *sector* aligned;
actual interesting sized tables wouldn't use that feature anyway, so
really it's just an internal edge case space optimisation that we
don't have to worry about.  Generally, that family of systems
interprets O_DIRECT to mean "*try* to bypass buffers" anyway, so
there is no way for it to bite us.  On the other hand, it does perform
pretty badly if you use logical blocks that are too large: that's why
in src/tools/ci/gcp_freebsd_repartition.sh I set up a new partition
with 8KB blocks for CI testing (I originally did that in a branch
doing a lot more with O_DIRECT, where the slow down due to block
mismatch is worse).  I think we just have to call that an
administrator's problem to worry about.

I have idly wondered before about some way to set a preferred 'grain'
size on a tablespace so that, for example, a parallel seq scan over a
very large read-only compressed ZFS/BTRFS/whatever file with 128KB
blocks could try to align the allocation of block ranges with those
boundaries in mind, just as a sort of performance hint.  Or perhaps
read it directly from st_blksize[1].  Combined with the I/O merging
thing I'm working on right now which can build up single 128KB
preadv() calls, that'd allow parallel workers to line those up neatly,
and avoid accessing the same underlying block from two processes,
which might have advantages in terms of ZFS's decompression threads
and other kinds of contentions.  I haven't looked into the potential
benefits of that though, that is just speculation...
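
As a trivial illustration of that last idea (plain POSIX, not PostgreSQL
code), the preferred I/O size is already visible to userspace:

#include <stdio.h>
#include <sys/stat.h>

/* print a file's preferred I/O size, e.g. 131072 on a ZFS dataset with
 * recordsize=128k -- the sort of "grain" hint discussed above */
int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2 || stat(argv[1], &st) != 0)
    {
        perror("stat");
        return 1;
    }
    printf("%s: st_blksize = %ld\n", argv[1], (long) st.st_blksize);
    return 0;
}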

For direct I/O, which we're still working on bringing to PostgreSQL in
useful form, we had to do this:

commit faeedbcefd40bfdf314e048c425b6d9208896d90
Author: Thomas Munro 
Date:   Sat Apr 8 10:38:09 2023 +1200

Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.

... to avoid EINVAL errors, falling back to buffered mode or
pathologically bad performance (depending on the system).  It probably
also helps buffered I/O performance for non-huge-pages a tiny bit (by
touching the minimum number of memory pages).  There is no industry
standard to tell us what exactly is required for O_DIRECT, but we make
sure addresses, offsets and lengths are all aligned to our macro
PG_IO_ALIGN_SIZE, which is 4KB, because this is enough to satisfy all
the systems we are aware of today (on some systems the true
offset/length requirement is based on real or historical sectors,
either 4KB or 512 bytes, which may be smaller than the filesystem
block size, while on others it may be the filesystem block size which
is usually some power-of-two multiple of those numbers).  Since you
can theoretically set PostgreSQL's data or WAL block size to less than
that at compile time, we just don't let you turn direct I/O on in that
case (perhaps we should, in case it does work, and just let it error
out if it's going to?  It's early days...).  If the file system's
block size is larger than that, you may have some issues, though,
depending on the file system.
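
To make the alignment rules concrete, here is a small standalone sketch
(Linux-flavoured, not PostgreSQL code; the 4096 plays the role of
PG_IO_ALIGN_SIZE) of reading through O_DIRECT with a suitably aligned
buffer:

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MY_IO_ALIGN 4096

int main(int argc, char **argv)
{
    void   *buf;
    int     fd;
    ssize_t nread;

    if (argc < 2)
        return 1;
    /* the buffer address must be aligned ... */
    if (posix_memalign(&buf, MY_IO_ALIGN, 8192) != 0)
        return 1;
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    /* ... and so must the offset and length, or expect EINVAL (or a
     * silent fallback to buffered I/O, depending on the system) */
    nread = pread(fd, buf, 8192, 0);
    printf("read %zd bytes\n", nread);
    close(fd);
    free(buf);
    return 0;
}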

Nobody standardised this stuff, and we certainly can't make any
guarantees that anything we come up with will work on future systems,
eg if someone invents 8KB sectors maybe it will need some work.
That's OK, because direct I/O will likely always be a non-default
optional mode, and many systems don't support it at all today.

> Further, I don't see any APIs exposing I/O operations in terms of blocks. All 
> File I/O APIs I see expose a file as a randomly accessible contiguous byte 
> buffer. Would it not have been easier if there were APIs that exposed I/O 
> operations in terms of blocks?

There are/were operating systems with block or record oriented I/O (eg
IBM ones).  Unix was about minimalism and simplicity and explicitly
rejected that idea (first bullet point in [2]).  Buffered,
byte-oriented, synchronous I/O suited most applications.  But yeah,
not so much databases.  Dr Stonebraker, who started this project,
wrote a bunch of papers 

Re: Query crash with 15.5 on debian bookworm/armv8

2023-12-26 Thread Thomas Munro
On Wed, Dec 27, 2023 at 5:17 AM Clemens Eisserer  wrote:
> > FWIW, since this crash is inside LLVM you could presumably dodge the bug
> > by setting "jit" to off.
>
> Thanks, this indeed solved the crash.
> Just to make sure this crash doesn't have anything to do with my
> setup/config (I'd changed quite a few settings in postgresql.conf),
> I gave it a try on a fresh bookworm install and it also crashed immediately.
>
> > As for an actual fix, perhaps a newer version of LLVM is needed?
> > I don't see a problem testing this query on my RPI with Ubuntu 23.10
> > (LLVM 16).
>
> I also gave Ubuntu 23.10 a try (15.4 built with llvm-15) and it worked
> as expected, explain analyze even mentioned the JIT was active.

I can't reproduce this on LLVM 14 on an aarch64 Mac FWIW (after
setting jit_*_cost to 0, as required since the table is empty).
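
For anyone trying to reproduce it, these are the settings in question
(real GUCs; the table name is just a placeholder):

SET jit = on;
SET jit_above_cost = 0;             -- JIT even the cheapest plans
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
EXPLAIN (ANALYZE) SELECT count(*) FROM some_empty_table;  -- look for a "JIT:" section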

> I've filed a debian bug report with a link to this discussion and a
> plea to build postgresql against llvm >= 15:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1059476

I doubt they'll change that, and in any case we'll need to get to the
bottom of this.  Perhaps an assertion build of LLVM will fail in some
illuminating internal assertion?  Unfortunately it's a non-trivial
business to get a debug build of LLVM going (it takes oodles of disk
and CPU and a few confusing-to-me steps)...

. o O ( It would be wonderful if assertion-enabled packages were
readily available for a common platform like Debian.  I've finally
been spurred on to reach out to the maintainer of apt.llvm.org to ask
about that.  It'd also be very handy for automated next-version
monitoring. )




Re: How to generate random bigint

2023-12-20 Thread Thomas Munro
On Thu, Dec 21, 2023 at 7:21 PM Tom Lane  wrote:
> Phillip Diffley  writes:
> > Postgres's random() function generates a random double. That can be
> > converted to a random int for smaller integers, but a double can't
> > represent all of the values in a bigint. Is there a recommended way to
> > generate a random bigint in Postgres?
>
> Doesn't look like there's anything directly exposed for that.
> Since PG v13 you could use gen_random_uuid, if you're careful
> to extract only bits that aren't fixed by the v4 UUID spec.
> pgcrypto's gen_random_bytes() function offers another
> some-assembly-required solution that'd work considerably
> further back.  Or you could make a custom C function that
> leverages pg_strong_random().

Also pg_read_binary_file('/dev/urandom', 0, 8) could be useful
(assuming you're on Unix) if you can figure out how to cast it...
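
One way to do that cast, sketched here (pg_read_binary_file() requires
superuser or membership in pg_read_server_files):

SELECT ('x' || encode(pg_read_binary_file('/dev/urandom', 0, 8), 'hex'))::bit(64)::bigint;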




Re: psql crash with custom build on RedHat 7

2023-12-19 Thread Thomas Munro
On Wed, Dec 20, 2023 at 4:41 AM Dominique Devienne  wrote:
> On Tue, Dec 19, 2023 at 2:02 PM Thomas Munro  wrote:
>> On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne  
>> wrote:
>> > Program received signal SIGSEGV, Segmentation fault.
>> > 0x004232b8 in slash_yylex ()
>>
>> I think this might have to do with flex changing.  Does it help if you
>> "make maintainer-clean"?
>
> My colleague who did the custom build double-checked the flex/bison 
> requirements,
> and the version of the packages on the RH7 machine he built on, and they 
> check out (see below).
>
> He also tells me he builds debug and release versions off different 
> workspaces/checkouts,
> thus there are no remnants of previous builds, assuming that's what `make 
> maintainer-clean` is for.

OK but be warned that if you're using tarballs, we shipped lexer
remnants in the tree (until
https://github.com/postgres/postgres/commit/721856ff, an interesting
commit to read).  The slash lexer is a kind of extension that (IIRC)
shares the same PsqlScanState (opaque pointer to private lexer state),
but if these two things are compiled to C by different flex versions,
they may contain non-identical 'struct yyguts_t' (and even if the
structs were identical, what the code does with them might still be
incompatible, but I guess the struct itself would be a good first
thing to look at along with the versions mentioned near the top of the
.c):

src/fe_utils/psqlscan.l -> psqlscan.c
src/bin/psql/psqlscanslash.l -> psqlscanslash.c

The usual "clean" doesn't remove those .c files in PG < 17, which
means that if your pipeline involves tarballs but you finished up
regenerating one of the files, or some other sequence involving
different flex versions, you could get that.  I've seen it myself on a
few systems, a decade ago when I guess flex rolled out an incompatible
change (maybe contemporaneous with RHEL7) and flex was upgraded
underneath my feet.  I remember that "maintainer-clean" (or maybe I'm
misremembering and it was "distclean") fixed it.
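
If that turns out to be the cause, either of these in the source tree
should let a single flex version regenerate both files consistently
(the second is just the surgical version of the first):

make maintainer-clean
rm -f src/fe_utils/psqlscan.c src/bin/psql/psqlscanslash.c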




Re: psql crash with custom build on RedHat 7

2023-12-19 Thread Thomas Munro
On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne  wrote:
> Program received signal SIGSEGV, Segmentation fault.
> 0x004232b8 in slash_yylex ()

I think this might have to do with flex changing.  Does it help if you
"make maintainer-clean"?




Re: fsync data directory after DB crash

2023-07-20 Thread Thomas Munro
On Wed, Jul 19, 2023 at 2:09 PM Pandora  wrote:
> Yes, I saw the usage of syncfs in PG14, but it is recommended to use it on 
> Linux 5.8 or higher. If my OS version is lower than 5.8, can I still enable 
> it?

Nothing stops you from enabling it; syncfs() is fairly ancient and should
work.  It just doesn't promise to report write-back errors before Linux
5.8, which is why we don't recommend it there, so you have to weigh the
risks yourself.  One way to think about those risks: even when errors are
reported, all we do is log them, and you could probably also check the
kernel logs for I/O errors.
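
Concretely, that is one line of configuration (a real setting since
PostgreSQL 14):

# postgresql.conf
recovery_init_sync_method = syncfs    # default is 'fsync'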

The edge cases around writeback failure are a tricky subject.  If the
reason we are running crash recovery is because we experienced an I/O
error and PANIC'd before, then it's possible for
recovery_init_sync_method=fsync to succeed while there is still
phantom data in the page cache masquerading as "clean" (ie will never
be sent to the disk by Linux).  So at least in some cases, it's no
better than older Linux's syncfs for our purposes.

(I think the comment that Michael quoted assumes the default FreeBSD
caching model: that cached data stays dirty until it's transferred to
disk or the file system is force-removed, whereas the Linux model is:
cached data stays dirty until the kernel has attempted to transfer it
to disk just once, and then it'll report an error to user space one
time (or, in older versions, sometimes fewer) and it is undefined (ie
depends on file system) whether the affected data is forgotten from
cache, or still present as phantom data that is bogusly considered
clean.  The reason this probably isn't a bigger deal than it sounds
may be that "transient" I/O failures are probably rare -- it's more
likely that a system with failing storage just completely
self-destructs and you never reach these fun edge cases.  But as
database hackers, we try to think about this stuff... perhaps one day
soon we'll be able to just go around this particular molehill with
direct I/O.)




Re: fsync data directory after DB crash

2023-07-18 Thread Thomas Munro
On Wed, Jul 19, 2023 at 12:41 PM Michael Paquier  wrote:
> On Tue, Jul 18, 2023 at 04:50:25PM +0800, Pandora wrote:
> > I found that starting from version 9.5, PostgreSQL will do fsync on
> > the entire data directory after DB crash. Here's a question: if I
> > have FPW = on, why is this step still necessary?
>
> Yes, see around the call of SyncDataDirectory() in xlog.c:
>  * - There might be data which we had written, intending to fsync it, but
>  *   which we had not actually fsync'd yet.  Therefore, a power failure in
>  *   the near future might cause earlier unflushed writes to be lost, even
>  *   though more recent data written to disk from here on would be
>  *   persisted.  To avoid that, fsync the entire data directory.

FTR there was some discussion and experimental patches that would add
recovery_init_sync_method=none and recovery_init_sync_method=wal,
which are based on the OP's observation + an idea for how to make it
work even without FPWs enabled:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKgj%2BSN6z91nVmOmTv2KYrG7VnAGdTkWdSjbOPghdtooQ%40mail.gmail.com#576caccf21cb6c3e883601fceb28d36b

Only recovery_init_sync_method=syncfs actually went in from that
thread.  It works better for some setups (systems where opening
squillions of files just to perform a no-op fsync() is painfully
expensive).




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Thomas Munro
On Tue, May 9, 2023 at 10:04 AM Tom Lane  wrote:
> Michael Paquier  writes:
> > One thing I was wondering about to improve the odds of the hits is to
> > be more aggressive with the number of relations created at once, so as
> > we are much more aggressive with the number of pages extended in
> > pg_class from the origin database.
>
> Andres seems to think it's a problem with aborting a DROP DATABASE.
> Adding more data might serve to make the window wider, perhaps.

Here's an easy way:

@@ -1689,6 +1689,14 @@ dropdb(const char *dbname, bool missing_ok, bool force)
/* Close all smgr fds in all backends. */

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

+    /* XXX pretend one of the above steps got interrupted by a statement timeout or ^C */
+    if (random() % 2 == 0)
+    {
+        QueryCancelPending = true;
+        InterruptPending = true;
+        CHECK_FOR_INTERRUPTS();
+    }

postgres=# create database db2;
CREATE DATABASE
postgres=# drop database db2;
ERROR:  canceling statement due to user request

$ psql db2
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432"
failed: PANIC:  could not open critical system index 2662

$ od -t x1 base/17/2662
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000
$ od -t x1 base/17/2837
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000
$ od -t x1 base/17/2840
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0100000




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Thomas Munro
On Mon, May 8, 2023 at 2:24 PM Michael Paquier  wrote:
> I can reproduce the same backtrace here.  That's just my usual laptop
> with ext4, so this would be a Postgres bug.  First, here are the four
> things running in parallel so as I can get a failure in loading a
> critical index when connecting:

That sounds like good news, but I'm still confused: do you see all 0s
in the target database (popo)'s catalogs, as reported (and if so can
you explain how they got there?), or is it regression that is
corrupted in more subtle ways also involving 1s?




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Thomas Munro
On Mon, May 8, 2023 at 4:10 AM Evgeny Morozov
 wrote:
> On 6/05/2023 11:13 pm, Thomas Munro wrote:
> > Would you like to try requesting FILE_COPY for a while and see if it 
> > eventually happens like that too?
> Sure, we can try that.

Maybe you could do some one way and some the other, so that we try to
learn more?

> Ubuntu 18.04.6
> Kernel 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023
> x86_64 x86_64 x86_64 GNU/Linux
> zfsutils-linux package version 0.7.5-1ubuntu16.12 amd64

I tried for a few hours to reproduce this by trying to make as many
things as similar to yours as I could based on info in this thread
(Linux: up-to-date Ubuntu 18.04 in vagrant which has nearly the same
kernel 4.15.0-208-generic and a close but slightly different version
of ancient ZFS 0.7.5-1ubuntu15, not sure why, ZFS: mirror (I used a
pair of loopback files), recordsize=16kB, compression=lz4, PG:
compiled from tag REL_15_2, data_checksums=on, full_page_writes=off,
wal_recycle=off, wal_init_zero=off), with what I thought might be
roughly what you're doing (creating three DBs, two clones of the
first, with various modification at various points, with various
overlapping activities, and then checking for catalog corruption).  No
cigar.  Hrmph.




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-07 Thread Thomas Munro
On Sun, May 7, 2023 at 1:21 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Did you previously run this same workload on versions < 15 and never
> > see any problem?  15 gained a new feature CREATE DATABASE ...
> > STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
> > bug somewhere near that, though I have no specific idea.
>
> Per the release notes I was just writing ...
>
> 
> 
>  
>   Fix potential corruption of the template (source) database after
>   CREATE DATABASE with the STRATEGY
>   WAL_LOG option (Nathan Bossart, Ryo Matsumura)
>  

Hmm.  That bug seems to have caused corruption (backwards time travel)
of blocks in the *source* DB's pg_class, by failing to write back
changes.  We seem to have zeroed pages in the *target* database, for
all catalogs (apparently everything copied by
RelationCopyStorageUsingBuffer()), even though the template is still
fine.  It is as if RelationCopyStorageUsingBuffer() created the
zero-filled file with smgrextend(), but then the buffer data was never
written out even though we memcpy'd it into a buffer and set the
buffer dirty.

Bug-in-PostgreSQL explanations could include that we forgot it was
dirty, or some backend wrote it out to the wrong file; but if we were
forgetting something like permanent or dirty, would there be a more
systematic failure?  Oh, it could require special rare timing if it is
similar to 8a8661828's confusion about permanence level or otherwise
somehow not setting BM_PERMANENT, but in the target blocks, so I think
that'd require a checkpoint AND a crash.  It doesn't reproduce for me,
but perhaps more unlucky ingredients are needed.

Bug-in-OS/FS explanations could include that a whole lot of writes
were mysteriously lost in some time window, so all those files still
contain the zeroes we write first in smgrextend().  I guess this
previously rare (limited to hash indexes?) use of sparse
file hole-punching could be a factor in an it's-all-ZFS's-fault
explanation:

openat(AT_FDCWD,"base/16390/2662",O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC,0600)
= 36 (0x24)
openat(AT_FDCWD,"base/1/2662",O_RDWR|O_CLOEXEC,00) = 37 (0x25)
lseek(37,0x0,SEEK_END)   = 32768 (0x8000)
lseek(37,0x0,SEEK_END)   = 32768 (0x8000)
pwrite(36,"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,8192,0x6000) = 8192 (0x2000)  <-- smgrextend(final block)
lseek(36,0x0,SEEK_END)   = 32768 (0x8000)

I  was trying to think about how I might go about trying to repro the
exact system setup.  Evgeny, do you mind sharing your "zfs get all
/path/to/pgdata" (curious to see block size, compression settings,
anything else etc) and your postgresql.conf? And your exact Ubuntu
kernel version and ZFS package versions?




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-06 Thread Thomas Munro
On Sun, May 7, 2023 at 10:23 AM Jeffrey Walton  wrote:
> This may be related... I seem to recall the GNUlib folks talking about
> a cp bug on sparse files. It looks like it may be fixed in coreutils
> release 9.2 (2023-03-20):
> https://github.com/coreutils/coreutils/blob/master/NEWS#L233
>
> If I recall correctly, it had something to do with the way
> copy_file_range worked. (Or maybe, it did not work as expected).
>
> According to the GNUlib docs
> (https://www.gnu.org/software/gnulib/manual/html_node/copy_005ffile_005frange.html):
>
> This function has many problems on Linux
> kernel versions before 5.3

That's quite interesting, thanks (we've been talking about making
direct use of copy_file_range() in a few threads, I'll definitely be
looking into that history), but we don't currently use
copy_file_range() or any coreutils stuff in the relevant code paths
here -- this data is copied by plain old pread() and pwrite().




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-06 Thread Thomas Munro
On Sun, May 7, 2023 at 12:29 AM Evgeny Morozov
 wrote:
> On 6/05/2023 12:34 pm, Thomas Munro wrote:
> > So it does indeed look like something unknown has replaced 32KB of
> > data with 32KB of zeroes underneath us.  Are there more non-empty
> > files that are all-zeroes?  Something like this might find them:
> >
> > for F in base/1414389/*
> > do
> >   if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
> >   then
> > echo $F
> >   fi
> > done
>
> Yes, a total of 309 files are all-zeroes (and 52 files are not).
>
> I also checked the other DB that reports the same "unexpected zero page
> at block 0" error, "test_behavior_638186280406544656" (OID 1414967) -
> similar story there. I uploaded the lists of zeroed and non-zeroed files
> and the ls -la output for both as
> https://objective.realityexists.net/temp/pgstuff3.zip
>
> I then searched recursively such all-zeroes files in $PGDATA/base and
> did not find any outside of those two directories (base/1414389 and
> base/1414967). None in $PGDATA/global, either.

So "diff -u zeroed-files-1414967.txt zeroed-files-1414389.txt" shows
that they have the same broken stuff in the range cloned from the
template database by CREATE DATABASE STRATEGY=WAL_LOG, and it looks
like it's *all* the cloned catalogs, and then they have some
non-matching relfilenodes > 140, presumably stuff you created
directly in the new database (I'm not sure if I can say for sure that
those files are broken, without knowing what they are).

Did you previously run this same workload on versions < 15 and never
see any problem?  15 gained a new feature CREATE DATABASE ...
STRATEGY=WAL_LOG, which is also the default.  I wonder if there is a
bug somewhere near that, though I have no specific idea.  If you
explicitly added STRATEGY=FILE_COPY to your CREATE DATABASE commands,
you'll get the traditional behaviour.  It seems like you have some
kind of high frequency testing workload that creates and tests
databases all day long, and just occasionally detects this corruption.
Would you like to try requesting FILE_COPY for a while and see if it
eventually happens like that too?

My spidey sense is leaning away from filesystem bugs.  We've found
plenty of filesystem bugs on these mailing lists over the years and of
course it's not impossible, but I dunno... it seems quite suspicious
that all the system catalogs have apparently been wiped during or
moments after the creation of a new database that's running new
PostgreSQL 15 code...




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-06 Thread Thomas Munro
On Sat, May 6, 2023 at 9:58 PM Evgeny Morozov
 wrote:
> Right - I should have realised that! base/1414389/2662 is indeed all
> nulls, 32KB of them. I included the file anyway in
> https://objective.realityexists.net/temp/pgstuff2.zip

OK so it's not just page 0, you have 32KB or 4 pages of all zeroes.
That's the expected length of that relation when copied from the
initial template, and consistent with the pg_waldump output (it uses
FPIs to copy blocks 0-3).  We can't see the block contents but we know
that block 2 definitely is not all zeroes at that point because there
are various modifications to it, which not only write non-zeroes but
must surely have required a sane page 0.

So it does indeed look like something unknown has replaced 32KB of
data with 32KB of zeroes underneath us.  Are there more non-empty
files that are all-zeroes?  Something like this might find them:

for F in base/1414389/*
do
  if [ -s $F ] && ! xxd -p $F | grep -qEv '^(00)*$' > /dev/null
  then
echo $F
  fi
done




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-05 Thread Thomas Munro
On Fri, May 5, 2023 at 7:50 PM Evgeny Morozov
 wrote:
> The OID of the bad DB ('test_behavior_638186279733138190') is 1414389 and 
> I've uploaded base/1414389/pg_filenode.map and also base/5/2662 (in case 
> that's helpful) as https://objective.realityexists.net/temp/pgstuff1.zip

Thanks.  That pg_filenode.map looks healthy to me.

tmunro@build1:~/junk $ od -t x1 pg_filenode.map
0000000  17  27  59  00  11  00  00  00  eb  04  00  00  eb  04  00  00
0000020  e1  04  00  00  e1  04  00  00  e7  04  00  00  e7  04  00  00
0000040  df  04  00  00  df  04  00  00  14  0b  00  00  14  0b  00  00
0000060  15  0b  00  00  15  0b  00  00  4b  10  00  00  4b  10  00  00
0000100  4c  10  00  00  4c  10  00  00  82  0a  00  00  82  0a  00  00
0000120  83  0a  00  00  83  0a  00  00  8f  0a  00  00  8f  0a  00  00
0000140  90  0a  00  00  90  0a  00  00  62  0a  00  00  62  0a  00  00
0000160  63  0a  00  00  63  0a  00  00  66  0a  00  00  66  0a  00  00
...

hex(2662) is 0xa66, and we see 63 0a 00 00 followed by 63 0a 00 00 in
that last line as expected, so that rules out the idea that it's
somehow trashed that map file and points to the wrong relation file.

Next can you share the file base/1414389/2662?  ("5" was from the
wrong database.)

> > Maybe you still have enough WAL if it happened recently?
>
> Maybe! What should I do with pg_waldump? I've never used it before.

Try something like:

pg_waldump -R 1663/1414389/2662 -F main 000000010000000000000001 000000010000000000000007

... but change that to the range of files you have in your pg_wal.




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-04 Thread Thomas Munro
On Fri, May 5, 2023 at 11:15 AM Thomas Munro  wrote:
> What does select
> pg_relation_filepath('pg_class_oid_index') show in the corrupted
> database, base/5/2662 or something else?

Oh, you can't get that far, but perhaps you could share the
pg_filenode.map file?  Or alternatively strace -f PostgreSQL while
it's starting up to see which file it's reading, just to be sure.  One
way to find clues about whether PostgreSQL did something wrong, once
we definitely have the right relfilenode for the index, aside from
examining its contents, would be to search the WAL for references to
that block with pg_waldump.  Maybe you still have enough WAL if it
happened recently?




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-04 Thread Thomas Munro
On Fri, May 5, 2023 at 11:15 AM Thomas Munro  wrote:
> Now *that* is a piece of
> logic that changed in PostgreSQL 15.  It changed from sector-based
> atomicity assumptions to a directory entry swizzling trick, in commit
> d8cd0c6c95c0120168df93aae095df4e0682a08a.  Hmm.

I spoke too soon, that only changed in 16.  But still, it means there
are two files that could be corrupted here, pg_filenode.map which
might somehow be pointing to the wrong file, and the relation (index)
main fork file.




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-04 Thread Thomas Munro
On Fri, May 5, 2023 at 6:11 AM Evgeny Morozov
 wrote:
> Meanwhile, what do I do with the existing server, though? Just try to
> drop the problematic DBs again manually?

That earlier link to a FreeBSD thread is surely about bleeding edge
new ZFS stuff that was briefly broken then fixed, being discovered by
people running code imported from OpenZFS master branch into FreeBSD
main branch (ie it's not exactly released, not following the details
but I think it might soon be 2.2?), but you're talking about an LTS
Ubuntu release from 2018, which shipped "ZFS on Linux" version 0.7.5,
unless you installed a newer version somehow?  So it doesn't sound
like it could be related.

That doesn't mean it couldn't be a different ZFS bug though.  While
looking into file system corruption issues that had similar symptoms
on some other file system (which turned out to be a bug in btrfs) I
did bump into a claim that ZFS could produce unexpected zeroes in some
mmap coherency scenario, OpenZFS issue #14548.  I don't immediately
see how PostgreSQL could get tangled up with that problem though, as
we aren't doing that...

It seems quite interesting that it's always pg_class_oid_index block 0
(the btree meta-page), which feels more like a PostgreSQL bug, unless
the access pattern of that particular file/block is somehow highly
unusual compared to every other block and tickling bugs elsewhere in
the stack.  How does that file look, in terms of size, and how many
pages in it are zero?  I think it should be called base/5/2662.

Oooh, but this is a relation that goes through
RelationMapOidToFilenumber.  What does select
pg_relation_filepath('pg_class_oid_index') show in the corrupted
database, base/5/2662 or something else?  Now *that* is a piece of
logic that changed in PostgreSQL 15.  It changed from sector-based
atomicity assumptions to a directory entry swizzling trick, in commit
d8cd0c6c95c0120168df93aae095df4e0682a08a.  Hmm.




Re: src/test/examples/testlibpq2.c where the HAVE_SYS_SELECT_H is defined.

2023-03-17 Thread Thomas Munro
On Fri, Mar 17, 2023 at 7:48 PM jian he  wrote:
> Hi,
> playing around with $[0] testlibpq2.c example. I wondered where 
> HAVE_SYS_SELECT_H is defined?
>
> I searched on the internet, founded that people also asked the same question 
> in $[1].
>
> On my machine, I do have <sys/select.h>.
> system version: Ubuntu 22.04.1 LTS
> gcc version: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
> gcc compile command: gcc pg_testlibpq2.c -I/home/jian/postgres/pg16/include \
> -L/home/jian/postgres/pg16/lib -lpq
>
> [0]https://www.postgresql.org/docs/current/libpq-example.html
> [1]https://stackoverflow.com/questions/37876850/in-compilation-time-how-to-find-the-macro-is-defined-in-which-header-file

In 15 and earlier, it is defined in pg_config.h, which is created by
configure.  But in 16, that particular macro was removed by commit
7e50b4e3c.  It looks like you are using PostgreSQL 16 sources, but
looking at PostgreSQL 15 examples?




Re: 13.x, stream replication and locale(?) issues

2023-03-03 Thread Thomas Munro
On Wed, Mar 1, 2023 at 10:30 AM Thomas Munro  wrote:
> On Wed, Mar 1, 2023 at 12:09 AM Eugene M. Zheganin  wrote:
> > 3) how do I fix it ? Should I take locale sources for ru_RU.utf8 on Linux 
> > and compile it on FreeBSD  - will it help ?
>
> Out of curiosity (I'm not saying it's a good idea!), do you know if
> FreeBSD's localedef can compile glibc's collation definitions?  In
> theory they are in a format standardised by POSIX...  I suspect there
> may be extensions and quirks...

Another thought: if you upgrade to 15, you could use ICU as the
default collation provider, and then make sure you have the same major
version of ICU on your Linux and FreeBSD systems (which probably means
compiling from source on FreeBSD, as FreeBSD tends to have a newish
ICU in packages, while popular stable Linux distributions tend to have
a fairly old one).  As far as I know, Linux and FreeBSD systems on
amd64 arch should match in every other important detail already (they
both follow the System V ABI, so there shouldn't be any layout
differences in eg structs that are sent over the wire AFAIK).

For what it's worth, for the future we are trying to figure out how to
support multi-version ICU so that you could explicitly set the
provider to ICU v72 to get that sort of thing working across OS
versions and even "close enough" cousins like your case, without
having to compile anything from source, but unfortunately we didn't
manage to get it ready in time for 16.




Re: Interval in hours but not in days Leap second not taken into account

2023-02-28 Thread Thomas Munro
On Mon, Feb 27, 2023 at 8:26 PM PALAYRET Jacques
 wrote:
> # PostgreSQL does not take into account the additional second (leap second) 
> in some calendar days ; eg. 2016, 31 dec. :
> SELECT to_timestamp('20170102 10:11:12','yyyymmdd hh24:mi:ss') -
> to_timestamp('20161230 00:00:00','yyyymmdd hh24:mi:ss') intervalle ;
>    intervalle
> ------------------
>  3 days 10:11:12

Bonjour Jacques,

Just for fun:

postgres=# SELECT utc_to_tai(to_timestamp('20170102 10:11:12','yyyymmdd hh24:mi:ss')) -
           utc_to_tai(to_timestamp('20161230 00:00:00','yyyymmdd hh24:mi:ss')) intervalle;
   intervalle
------------------
 3 days 10:11:13
(1 row)

PostgreSQL could, in theory, provide built-in UTC/TAI conversion
functions using a leap second table that would be updated in each
minor release, considering that the leap second table is included in
the tzdata package that PostgreSQL vendors (ie includes a copy of),
but it doesn't do anything like that or know anything about leap
seconds.  Here's a quick and dirty low technology version of the
above:

CREATE TABLE leap_seconds (time timestamptz primary key, off int);

-- refresh leap second table from ietf.org using not-very-secure hairy shell code
BEGIN;
CREATE TEMP TABLE import_leap_seconds (s int8, off int);
COPY import_leap_seconds FROM PROGRAM 'curl -s https://www.ietf.org/timezones/data/leap-seconds.list | grep -v ''^#'' | cut -f1,2';
TRUNCATE TABLE leap_seconds;
INSERT INTO leap_seconds (time, off)
SELECT '1900-01-01 00:00:00Z'::timestamptz + interval '1 second' * s, off
  FROM import_leap_seconds;
DROP TABLE import_leap_seconds;
COMMIT;

CREATE OR REPLACE FUNCTION leap_seconds_before_utc_time(t timestamptz)
RETURNS int STRICT LANGUAGE SQL AS
$$
  SELECT off FROM leap_seconds WHERE time <= t ORDER BY time DESC
FETCH FIRST ROW ONLY
$$;

CREATE OR REPLACE FUNCTION utc_to_tai(t timestamptz)
RETURNS timestamptz STRICT LANGUAGE SQL AS
$$
  SELECT t + interval '1 second' * coalesce(leap_seconds_before_utc_time(t), 0);
$$;

CREATE OR REPLACE FUNCTION tai_to_utc(t timestamptz)
RETURNS timestamptz STRICT LANGUAGE SQL AS
$$
  SELECT t - interval '1 second' * coalesce(leap_seconds_before_utc_time(t), 0);
$$;




Re: 13.x, stream replication and locale(?) issues

2023-02-28 Thread Thomas Munro
On Wed, Mar 1, 2023 at 12:09 AM Eugene M. Zheganin  wrote:
> 3) how do I fix it ? Should I take locale sources for ru_RU.utf8 on Linux and 
> compile it on FreeBSD  - will it help ?

Out of curiosity (I'm not saying it's a good idea!), do you know if
FreeBSD's localedef can compile glibc's collation definitions?  In
theory they are in a format standardised by POSIX...  I suspect there
may be extensions and quirks...

At a wild guess, the data you showed doesn't even look like it
contains non-ASCII characters (it looks like machine-readable
identifiers or something, and perhaps it's the sort order of '-' that
is causing you trouble), so it might also be possible to use the
"ucs_basic" locale for that column, and then all computers will agree
on the sort order.  Of course that doesn't address the more general
problem; presumably you also have Russian-language text in your
system.

As for ".utf8" vs ".UTF-8", which one is selected by initdb as the
database default seems to be something that varies between Linux
distributions, so I guess maybe the installers use different
techniques for discovering and selecting default locale names.  Unlike
glibc, FreeBSD doesn't do any name mangling at all when mapping
LC_COLLATE to a pathname to find the file, whereas glibc downcases and
removes '-' so you can find both formats of name in the various
places...




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-11-22 Thread Thomas Munro
On Tue, Nov 22, 2022 at 4:25 AM Kirk Wolak  wrote:
>   In researching this problem, it appears that the decision was made like 
> 17yrs ago, when windows did not have a realistic "terminal" type interface.  
> Assuming we target Windows 8.1 or higher, I believe this goes away.

FWIW PostgreSQL 16 will require Windows 10+.  Not a Windows user
myself, but I read somewhere that Windows 8.1 has already started
showing full screen warnings at startup that security patches etc end
in January, while PostgreSQL 16 (the soonest release that could
include your changes) is expected late in the year.

>   What we are looking for is a simple bullet point list of what would make 
> such a contribution acceptable.  And how far should we get through that list 
> on our own, before getting some help, especially from the build teams?  [Our 
> goal would be an NEW Config type flag: READLINE_FOR_WINDOWS (you guys name 
> it, and control the default setting)]

Some thoughts:

Re configuration flags: don't waste time with the old perl-based build
system.  The configuration should be done only with the new meson
build system (soon to be the only way to build on Windows).

I didn't quite understand if you were saying that readline itself
needs patches for this (I gather from earlier threads about this that
there were some problems with dll symbol export stuff, so maybe that's
it?).  In passing, I noticed that there is also a Windows port of
editline AKA libedit, the BSD-licensed alternative to readline.  It
has a compatible API and PostgreSQL can use that too.  I have no idea
which is easier, more supported, etc on Windows.

It's OK to post a work-in-progress patch to pgsql-hackers, even if it
doesn't work right yet.  With any luck, people will show up to help
with problems.  I am 100% sure that our Windows user community would
love this feature.  It would be good if the tests in
src/bin/psql/t/010_tab_completion.pl pass on Windows, but if that's
hard, don't let that stop you sharing a patch.




Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-15 Thread Thomas Munro
On Wed, Nov 16, 2022 at 1:24 AM  wrote:
> Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV.
> Kernel is 5.15.0-52-generic.
>
> We have not seen this with Ubutnu 18.04 and 20.04 (although we might not
> have noticed it).
>
> I guess upgrading to postgresql 13/14/15 does not help as the problem
> happens in the kernel.
>
> Do you have any advice how to go further? Shall I lookout for certain
> kernel changes? In the kernel itself or in ext4 changelog?

It'd be good to figure out what is up with Linux or tuning.  I'll go
write a patch to reduce that error level for non-EIO errors, to
discuss for the next point release.  In the meantime, you could
experiment with setting checkpoint_flush_after to 0, so the
checkpointer/bgwriter/other backends don't call sync_file_range() all
day long.  That would have performance consequences for checkpoints
which might be unacceptable though.  The checkpointer will fsync
relations one after another, with less I/O concurrency.   Linux is
generally quite lazy at writing back dirty data, and doesn't know
about our checkpointer's plans to fsync files on a certain schedule,
which is why we ask it to get started on multiple files concurrently
using sync_file_range().

https://www.postgresql.org/docs/15/runtime-config-wal.html#RUNTIME-CONFIG-WAL-CHECKPOINTS
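
A sketch of that experiment in postgresql.conf; checkpoint_flush_after is
the one mentioned above, and the other two (also real settings) control
the same writeback hinting from the bgwriter and ordinary backends:

checkpoint_flush_after = 0
bgwriter_flush_after = 0
backend_flush_after = 0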




Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-14 Thread Thomas Munro
On Tue, Nov 15, 2022 at 10:54 AM Christoph Moench-Tegeder
 wrote:
> ## klaus.mailingli...@pernau.at (klaus.mailingli...@pernau.at):
> > On several servers we see the error message: PANIC:  could not flush
> > dirty data: Cannot allocate memory

> Of these three places, there's an sync_file_range(), an posix_fadvise()
> and an msync(), all in src/backend/storage/file/fd.c. "Cannot allocate
> memory" would be ENOMEM, which posix_fadvise() does not return (as per
> it's docs). So this would be sync_file_range(), which could run out
> of memory (as per the manual) or msync() where ENOMEM actually means
> "The indicated memory (or part of it) was not mapped". Both cases are
> somewhat WTF for this setup.

It must be sync_file_range().  The others are fallbacks that wouldn't
apply on a modern Linux.

It has been argued before that we might have been over-zealous
applying the PANIC promotion logic to sync_file_range().  It's used to
start asynchronous writeback to make the later fsync() call fast, so
it's "only a hint", but I have no idea if it could report a writeback
error from the kernel that would then be consumed and not reported to
the later fsync(), so I defaulted to assuming that it could.




Re: Segmentation Fault PG 14

2022-11-07 Thread Thomas Munro
On Tue, Nov 8, 2022 at 11:45 AM Willian Colognesi
 wrote:
> root@ip-10-x-x-x:/home/ubuntu# pg_config --configure
> ... --with-extra-version= (Ubuntu 14.5-2.pgdg20.04+2)' ...
> ... '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-10' ...

> There is no llvm installed on ubuntu server, postgresql was installed via apt 
> package `apt install postgresql-14`

We can see from the pg_config output that it's built with LLVM 10.
Also that looks like it's the usual pgdg packages which are certainly
built against LLVM and will install it automatically.




Re: Strange collation names ("hu_HU.UTF-8")

2022-08-02 Thread Thomas Munro
On Wed, Aug 3, 2022 at 1:43 AM Tom Lane  wrote:
> I believe most if not all variants of Unix are
> permissive about the spelling of the encoding part.

I've only seen glibc doing that downcase-and-strip-hyphens thing to
the codeset part of a locale name when looking for locale definition
files.  Other systems like FreeBSD expect to be able to open
/usr/share/locale/$LC_COLLATE/LC_COLLATE directly without any kind of
munging.  On a Mac it's probably a little fuzzy because the filenames
are case insensitive...




Re: AIX and EAGAIN on open()

2022-07-03 Thread Thomas Munro
On Mon, Jun 20, 2022 at 9:53 PM Christoph Berg
 wrote:
> IBM's reply to the issue back in December 2020 was this:
>
>   The man page / infocenter document is not intended as an exhaustive
>   list of all possible error codes returned and their circumstances.
>   "Resource temporarily unavailable" may also be returned for
>   O_NSHARE, O_RSHARE with O_NONBLOCK.
>
> Afaict, PG does not use these flags either.
>
> We also ruled out that the system is using any anti-virus or similar
> tooling that would intercept IO traffic.
>
> Does anything of that ring a bell for someone? Is that an AIX bug, a
> PG bug, or something else?

No clue here.  Anything unusual about the file system (NFS etc)?  Can
you truss/strace the system calls, to sanity check the flags arriving
into open(), and see if there's any unexpected other activity around
open() calls that might be coming from something you're linked
against?




Re: Order of rows in statement triggers NEW/OLD tables

2022-05-31 Thread Thomas Munro
On Fri, May 6, 2022 at 6:20 PM hubert depesz lubaczewski
 wrote:
> when defining statement triggers on update I can use:
>
> REFERENCING OLD TABLE AS xxx NEW TABLE as YYY
>
> these "pseudo" tables contain rows that were before and after.
>
> Is the order guaranteed?
>
> Can I assume that "first" row returned by select from xxx, will be older
> version of first row returned by select from yyy?

Good question, and one I've wondered about before.  I *think* that is
currently true, due to implementation details, but it could change.
The trigger code fills up both tuplestores (= memory/disk containers
for transition tables) in sync with each other in
AfterTriggerSaveEvent(), and then NamedTuplestoreScan node emits
tuples in insertion order.  We already banned the use of transition
tables when there is "ON UPDATE OR INSERT" etc so you can't get mixed
INSERT/UPDATE/DELETE results which would desynchronise the old and new
tuples, and I also wondered if something tricky like FK ON DELETE
CASCADE/SET NULL in a self-referencing table could mix some
old-with-no-new into UPDATE results, but I can't see how to do that,
and I can't think of any other way off the top of my head.  Of course,
joins etc could destroy the order higher in your query plan.

While reading about proposed incremental materialized view ideas, I
once idly wondered if it'd be useful, as an extension to the standard,
to be able to use WITH ORDINALITY for transition tables (normally only
used for unnest(), and in PG also any SRF) and then you could use that
+ ORDER BY to explicitly state your desired order (and possibly teach
the planner that ORDINALITY attributes are path keys by definition so
it doesn't need to insert sort nodes in simple cases).  That is,
instead of relying on scan order.  In any case, an in-core IMV feature
is allowed to peek deeper into the implementation and doesn't even
need SQL here, so I didn't follow that thought very far...  I am not
sure about this, but I wonder if any portable-across-SQL-implementations,
user-level scheme for replication/materialization built on top of trigger
transition tables would need to require immutable unique keys in the rows
in order to be able to match up before/after tuples.
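
To make that last point concrete, here is a sketch (hypothetical table
and trigger names) of pairing old and new transition rows by an immutable
key rather than relying on any ordering:

CREATE TABLE t (id int PRIMARY KEY, val int);
CREATE TABLE t_log (id int, old_val int, new_val int);

CREATE FUNCTION t_audit() RETURNS trigger LANGUAGE plpgsql AS
$$
BEGIN
  -- pair rows explicitly by primary key; no assumption about scan order
  INSERT INTO t_log (id, old_val, new_val)
  SELECT o.id, o.val, n.val
    FROM old_rows o
    JOIN new_rows n USING (id);
  RETURN NULL;
END;
$$;

CREATE TRIGGER t_after_update
  AFTER UPDATE ON t
  REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
  FOR EACH STATEMENT EXECUTE FUNCTION t_audit();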




Re: Improve configurability for IO related behavoir

2022-05-28 Thread Thomas Munro
On Sun, May 29, 2022 at 4:29 AM 浩辰 何  wrote:
> Furthermore, the results above are also related to IO API supported by OS. 
> MySQL support synchronized IO and Linux libaio. It seems
> that PostgreSQL only supports synchronized IO, so shall we support more IO 
> engines? like io_uring which is very popular in recent years.

Hi Haochen,

There is an active project to bring these things to PostgreSQL.
https://wiki.postgresql.org/wiki/AIO has some information and links.
The short version is that there is a development patch set to add
these GUCs:

io_method=worker,io_uring,posix_aio,iocp,...
io_data_direct=on,off
io_wal_direct=on,off

It also adds a bunch of specialised logic that knows how to initiate
IO in key places (scans, writeback, recovery, ...), because it's not
enough to just turn off kernel I/O buffering, we also have to do all
the work the kernel is doing for us or performance will be terrible.




Re: Pg14 possible index corruption after reindex concurrently

2022-05-24 Thread Thomas Munro
On Wed, May 25, 2022 at 6:17 AM Aleš Zelený  wrote:
> SELECT format('REINDEX SCHEMA CONCURRENTLY %I;', n.nspname)

This may be related to bug #17485, discussed at:

https://www.postgresql.org/message-id/flat/17485-396609c6925b982d%40postgresql.org




Re: SELECT creates millions of temp files in a single directory

2022-04-23 Thread Thomas Munro
On Sun, Apr 24, 2022 at 8:00 AM Peter  wrote:
> More than a million files in a single directory, this is
> inacceptable.

You didn't show EXPLAIN (ANALYZE) but if [Parallel] Hash is making
insane numbers of temporary files then something is not working as
intended... and I can take a guess at what it is.  I tried to recreate
this with randomly distributed file.pathid, same size tables as you,
and I got 32 batches and a nice manageable number of temporary files.
Adding millions of extra file rows with duplicate pathid=42 gets me
something like "Batches: 524288 (originally 32)" in EXPLAIN (ANALYZE)
output.  I guess that's the sort of distribution you have here?

Extensive discussion of the cause of that and potential (hard)
solutions over here:

https://www.postgresql.org/message-id/flat/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com

To summarise, when the hash table doesn't fit in work_mem, then we
"partition" (spill part of the data to disk) to make twice as many
(hopefully) smaller hash tables that do fit.  Sometimes partitioning
produces one or more hash tables that are still too big because of
uneven key distribution, so we go around again, doubling the number of
partitions (and thus temporary files) every time.  I would say that
once we get past hundreds of partitions, things are really turning to
custard (file descriptors, directory entries, memory overheads, ... it
just doesn't work well anymore), but currently we don't give up until
it becomes very clear that repartitioning is not helping.

This algorithmic problem existed before parallelism was added, but
it's possible that the parallel version of the meltdown hurts a lot
more (it has extra per-process files, and in multi-attempt scenarios
the useless earlier attempts hang around until the end of the query
instead of being cleaned up sooner, which doubles the number of
files).

Hopefully that gives some clues about how one might rewrite the query
to avoid a massive unsplittable set of duplicate keys in hash tables,
assuming I'm somewhere close to the explanation here (maybe some
subselect with UNIQUE or GROUP BY in it, or some way to rewrite the
query to avoid having the problematic duplicates on the "inner" side,
or completely different plan..., or just crank up work_mem massively
for this query so that you don't need a partition step at all) .
Obviously it would be nice for PostgreSQL to be hardened against this
risk, eventually, though.
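
For example (table and column names guessed from the thread, so treat
this purely as a sketch), either of these directions might help:

-- give just this query enough memory that the hash table fits without
-- any repartitioning
SET work_mem = '1GB';

-- or deduplicate the skewed inner side before joining, if the query's
-- semantics allow it
SELECT p.*
  FROM path p
  JOIN (SELECT DISTINCT pathid FROM file) f USING (pathid);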

As for merge join planning, unfortunately they aren't fully
parallelisable and in the plan you showed, a separate copy of the big
sort runs in every process, which isn't nice (we don't have a Parallel
Sort that could fix that yet).




Re: frequency of positive and negative numbers, sizes of numbers and frequency of alteration of polarity

2022-02-17 Thread Thomas Munro
On Fri, Feb 18, 2022 at 10:42 AM Shaozhong SHI  wrote:
> Given 2 or more such columns, is there any measure that can be calculated to 
> tell which one alternates more than others?

Well, you could report non-flips as NULL and flips as magnitude, and
then wrap that query in another query to compute whatever statistical
properties you need... and you could have multiple columns so you're
computing those numbers for each input column...  I was mainly trying
to point out the LAG() facility, which lets you compare a row with the
preceding row, according to some sort order, which I think you'd want
to build your query on top of.  Hope that helps...

postgres=# with
  flips as (select time,
                   value,
                   case
                     when sign(lag(value) over (order by time)) != sign(value)
                     then abs(lag(value) over (order by time) - value)
                   end as flip_magnitude
              from time_series)
  select count(flip_magnitude) as num_flips,
         avg(flip_magnitude) as avg_magnitude
    from flips;
 count | avg
---+-
 2 | 14.
(1 row)




Re: frequency of positive and negative numbers, sizes of numbers and frequency of alteration of polarity

2022-02-17 Thread Thomas Munro
On Fri, Feb 18, 2022 at 9:11 AM Shaozhong SHI  wrote:
> How to calculate frequency of positive and negative numbers and define and 
> calculate frequency of alteration of polarity?
>
> Surely, we can use frequency of alteration of polarity and level of change 
> (e.g., size of positive and negative numbers) to measure degree and frequency 
> of alteration.
>
> Any ideas in doing so in postgres tables' columns full of positive and 
> negative numbers?

Window functions might be useful to detect polarity changes:

postgres=# create table time_series (time int, value int);
CREATE TABLE
postgres=# insert into time_series values (1, -5), (2, -5), (3, 10), (4, -3);
INSERT 0 4
postgres=# select time,
  value,
  sign(lag(value) over (order by time)) != sign(value)
as flipped
 from time_series;
 time | value | flipped
------+-------+---------
    1 |    -5 |
    2 |    -5 | f
    3 |    10 | t
    4 |    -3 | t
(4 rows)




Re: Compile 14.1 in EL5.8

2022-02-09 Thread Thomas Munro
On Thu, Feb 10, 2022 at 2:23 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > ... I wondered about also removing the leftover comment
> > "We assume that any system that has Linux epoll() also has Linux
> > signalfd()" which was my attempt to explain that there wasn't a
> > separate configure check for signalfd.h, but I guess the sentence is
> > still true in a more general sense, so we can just leave it there.
>
> Oh, I didn't notice that comment, or I probably would have tweaked it.
> Perhaps along the line of "there are too few systems that have epoll
> and not signalfd to justify maintaining a separate code path"?

WFM, though I remain a little unclear on whether our support policy is
stochastic or declarative :-D




Re: Compile 14.1 in EL5.8

2022-02-09 Thread Thomas Munro
On Thu, Feb 10, 2022 at 3:38 AM Tom Lane  wrote:
> Gabriela Serventi  writes:
> > Hi! I'm trying to compile release 14.1 in a very old Linux Server (Red Hat
> > Enterprise Linux Server release 5.8). I can run configure successfully, but
> > when I try to run make, I get the following error:
> > latch.c:85:26: error: sys/signalfd.h: No such file or directory
>
> It looks like since 6a2a70a02, latch.c effectively uses HAVE_SYS_EPOLL_H
> to decide whether it can include , which seems not too
> well thought out.  A proper fix will require another configure check,
> but you could hack it by adding -DWAIT_USE_POLL to CPPFLAGS in
> src/Makefile.global after running configure.

I mean, I did think about it, but I thought it wasn't worth the
configure cycles to detect EOL'd kernels manufactured 2005-2007, since
we've said before that we don't support RHEL5 or kernels that old[1].
My primary goal was not to have to write the support for the
epoll/self-pipe combination with zero coverage and zero users.
Apparently I was off by one on the users, and since it's so easy to
fall back to poll() I agree that we might as well!  BTW after seeing
your commit today I wondered about also removing the leftover comment
"We assume that any system that has Linux epoll() also has Linux
signalfd()" which was my attempt to explain that there wasn't a
separate configure check for signalfd.h, but I guess the sentence is
still true in a more general sense, so we can just leave it there.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGKL_%3DaO%3Dr30N%3Ds9VoDgTqHpRSzePRbA9dkYO7snc7HsxA%40mail.gmail.com




Re: sort order for UTF-8 char column with Japanese UTF-8

2022-02-03 Thread Thomas Munro
On Fri, Feb 4, 2022 at 8:11 AM Matthias Apitz  wrote:
> On my FreeBSD laptop the same file sorts as
>
> guru@c720-r368166:~ $ LANG=de_DE.UTF-8 sort swd
> A
> ゲアハルト・A・リッター
> ゲルハルト・A・リッター
> チャールズ・A・ビアード
> A010STRUKTUR
> A010STRUKTUR
> A010STRUKTUR
> A0150SUPRALEITER

Wow, so it's one thing to have a different default "script order" than
glibc and ICU (which is something you can customise IIRC), but isn't
something broken here if the Japanese text comes between "A" and
"A0..."??  Hmm, it's almost as if it completely ignored the Japanese
text.  From my FreeBSD box:

tmunro=> select * from t order by x collate "de_DE.UTF-8";
              x
--------------------------
 ゲアハルト
 A
 ゲアハルト・A・リッター
 A0
 A010STRUKTUR
 AA
 ゲアハルト・AA・リッター
 ゲアハルト・B・リッター
(8 rows)

tmunro=> select * from t order by x collate "ja_JP.UTF-8";
              x
--------------------------
 A
 A0
 A010STRUKTUR
 AA
 ゲアハルト
 ゲアハルト・AA・リッター
 ゲアハルト・A・リッター
 ゲアハルト・B・リッター
(8 rows)

Seems like something to investigate in FreeBSD land.


Re: could not accept SSL connection: Success

2022-01-19 Thread Thomas Munro
On Thu, Jan 20, 2022 at 12:06 AM Carla Iriberri
 wrote:
> On Wed, Jan 19, 2022 at 5:42 AM Michael Paquier  wrote:
>> "On an unexpected EOF, versions before OpenSSL 3.0 returned
>> SSL_ERROR_SYSCALL, nothing was added to the error stack, and errno was
>> 0. Since OpenSSL 3.0 the returned error is SSL_ERROR_SSL with a
>> meaningful error on the error stack."

> Thanks, Michael, that's it, indeed! I had missed that part of the
> OpenSSL docs. These PG instances are running on Ubuntu Focal hosts that come
> with OpenSSL 1.1.1.

Good news, I'm glad they nailed that down.  I recall that this
behaviour was a bit of a moving target in earlier versions:

https://www.postgresql.org/message-id/CAEepm%3D3cc5wYv%3DX4Nzy7VOUkdHBiJs9bpLzqtqJWxdDUp5DiPQ%40mail.gmail.com




Re: create database hangs forever on WSL - autovacuum deadlock?

2022-01-09 Thread Thomas Munro
On Sun, Jan 9, 2022 at 2:15 PM Alicja Kucharczyk
 wrote:
> sob., 8 sty 2022 o 22:40 Thomas Munro  napisał(a):
>> On Sat, Jan 8, 2022 at 8:11 PM Alicja Kucharczyk
>>  wrote:
>> > this is postgres running on windows wsl:  PostgreSQL 14.1 (Ubuntu 
>> > 14.1-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
>> > 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
>>
>> Is this WSL1 (some kind of Linux system call emulator running on an NT
>> kernel) or WSL2 (some kind of virtual machine running a Linux kernel)?
>
>
>  lookd like WSL1:
>   NAME  STATE   VERSION
> * UbuntuRunning 1

I don't think you're going to like this answer.  In PostgreSQL 14, we
started using a signalfd in an epoll set (multiplexed with other fds
for sockets etc) to receive "latch wakeups".  This works pretty well
on a real Linux kernel, but apparently it is not emulated well enough
to work on WSL1.  I don't know the exact reason why, but if someone
can figure it out I'd be interested in seeing if we can fix it (though
I thought everyone moved to WSL2 which has none of these problems?).
There is a WSL1 machine in the build farm, but it's using
-DWAIT_USE_POLL to avoid this problem; that's useless if you're trying
to run with stock PostgreSQL packages from Ubuntu or whatever, though,
it only helps if you compile from source.  For what it's worth,
running PostgreSQL compiled for Windows using Wine on a Unix system
also fails in various ways.  It turns out that emulating a foreign
kernel is super hard...  Previous discussion:

https://www.postgresql.org/message-id/flat/CAEP4nAymAZP1VEBNoWAQca85ZtU5YxuwS95%2BVu%2BXW%2B-eMfq_vQ%40mail.gmail.com




Re: create database hangs forever on WSL - autovacuum deadlock?

2022-01-08 Thread Thomas Munro
On Sat, Jan 8, 2022 at 8:11 PM Alicja Kucharczyk
 wrote:
> this is postgres running on windows wsl:  PostgreSQL 14.1 (Ubuntu 
> 14.1-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
> 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

Is this WSL1 (some kind of Linux system call emulator running on an NT
kernel) or WSL2 (some kind of virtual machine running a Linux kernel)?




Re: Need to know more about pg_test_fsync utility

2021-12-12 Thread Thomas Munro
On Mon, Dec 13, 2021 at 3:04 PM PGSQL DBA  wrote:
> As you mentioned in question-8, "I'd investigate whether data is being cached 
> unexpectedly, perhaps indicating that committed transactions be lost in a 
> system crash event." So, I would like to know that if we configure the disk 
> for the WALs with read+write disk cache then will it create any performance 
> issue and show the attached output?

Which OS and file system are you running and what kind of storage
system is it?  If you can turn write caching on and off in your
storage layer without affecting the ops/sec that's a pretty good clue
that it's respecting the cache control commands that the OS sends.
The numbers you showed are reassuringly low.  Hmm, I wonder why you
have such a low number for non-sync'd writes (the last line).

I have a concrete example of using this tool to learn something about
an unfamiliar-to-me operating system, from this morning: I saw a
system that reported ~6k IOPS for open_datasync and only ~600 IOPS for
fdatasync.  With some limited knowledge of file systems I expect the
latter to involve sending a SYNCHRONIZE CACHE command to flush device
caches, and the former to do that, or use a more efficient FUA flag to
flush just particular writes and not the whole write cache.  I didn't
expect it to go 10x faster, so something had to be wrong.  With some
more investigation and reading I learned that the storage drivers I
was using do indeed ignore FUA cache control flags, so that
wal_sync_method is not crash safe; if you lose power and come back up,
you might lose a bunch of committed transactions from the end of the
WAL.  I turned off write caches in the storage device, and then
open_datasync became a lot slower, matching my expectations.

These are primitive tools, but can help to check some assumptions...

Obligatory remark: PostgreSQL 9.5 is out of support, please see
https://www.postgresql.org/support/versioning/.

> I also would like to know is there any best Practice from PostgreSQL which 
> mentions what is the disk latency required for the WAL & DATA disk?

No particular latency is required by PostgreSQL, and I don't have a
general answer to this.  Is it fast enough for you?  What sort of
workload is it running, OLTP, reporting, ...?




Re: Need to know more about pg_test_fsync utility

2021-12-09 Thread Thomas Munro
On Fri, Dec 10, 2021 at 3:20 PM PGSQL DBA  wrote:
> 1) How to interpret the output of pg_test_fsync?

The main interesting area is probably the top section that compares
the different wal_sync_method settings.  For example, it's useful to
verify the claim that fdatasync() is faster than fsync() (because it
only flushes data, not meta-data like file modified time).  It may
also be useful for measuring the effects of different caching settings
on your OS and storage.  Unfortunately open_datasync is a bit
misleading; we don't actually use O_DIRECT with open_datasync anymore,
unless you set wal_level=minimal, which almost nobody ever does.

> 2) What is the meaning of ops/sec & usecs/op?

Number of times it managed to flush data to disk per second
sequentially, and the same information expressed as microseconds per
flush.

> 3) How does this utility work internally?

It just does a loop over some system calls, or to be more precise,

https://github.com/postgres/postgres/blob/master/src/bin/pg_test_fsync/pg_test_fsync.c

> 4) What is the IO pattern of this utility? serial/sequence IO or Multiple 
> thread with Parallel IO?

Sequential, no threads.

> 5) Can we change the testing like FIO with multiple threads and parallel IO?

Nope.  This is a simple tool.  Fio is much more general and useful.

> 6) How a commit happened in the background  while executing this utility?

Nothing happens in the background, it uses synchronous system calls
from one thread.

> 7) How can we use this tool to measure the I/O issue?

It's a type of micro-benchmark that gives you an idea of a sort of
baseline you can expect from a single PostgreSQL session committing to
the WAL.

> 8) In which area or section in the output do we need to focus while 
> troubleshooting I/O issues?

If PostgreSQL couldn't commit small sequential transactions about that
fast I'd be interested in finding out why, and if fdatasync is
performing faster than published/device IOPS suggest should be
possible then I'd investigate whether data is being cached
unexpectedly, perhaps indicating that committed transactions could be
lost in a system crash event.

> 9) What is the meaning of “Non-sync’ed 8kB writes?

Calling the pwrite() system call, which writes into your operating
system's page cache but (usually) doesn't wait for any I/O.  Should be
somewhere north of 1 million/sec.




Re: Wrong sorting on docker image

2021-10-16 Thread Thomas Munro
On Sun, Oct 17, 2021 at 4:42 AM Tom Lane  wrote:
> Speaking of ICU, if you are using an ICU-enabled Postgres build,
> maybe you could find an ICU collation that acts the way you want.
> This wouldn't be a perfect solution, because we don't yet have
> the ability to set an ICU collation as a database's default.
> But you can attach ICU collations to individual text columns,
> and maybe that would be a good enough workaround.

For what it's worth, ICU's "ru-RU-x-icu" and FreeBSD's libc agree with
glibc on these sort orders, so I suspect this might be coming from
CLDR/UCA/DUCET/ISO 14651 common/synchronised data.  It does look quite
suspicious to me, but I don't know Russian and I'm only speculating
wildly here: it does look as if ё is perhaps getting a lower weight
than it should.  That said, it seems strange that something so basic
should be wrong.  Nosing around in the unicode.org issue tracker, it
seems as though some people might think there is something funny about
Ё (and I wonder if there are/were similar issues with й/Й):

https://unicode-org.atlassian.net/browse/CLDR-2745?jql=text%20~%20%22%D0%81%22
https://unicode-org.atlassian.net/browse/CLDR-1974?jql=text%20~%20%22%D0%81%22
(and more)

It's probably not a great idea, but for the record, you can build your
own collation for glibc and other POSIX-oid systems.  For example, see
glibc commit 159738548130d5ac4fe6178977e940ed5f8cfdc4, where they
previously had customisations on top of the iso14651_t1 file to
reorder a special Ukrainian character in ru_RU, so in theory you could
reorder ё/Ё with a similar local hack and call it ru_RU_X...  I also
wonder if there is some magic switch you can put after an @ symbol on
ICU collations that would change this, perhaps some way to disable the
"contractions" that are potentially implicated here.  Not sure.




Re: To JIT (or not!) in Ubuntu packages

2021-09-14 Thread Thomas Munro
On Wed, Sep 15, 2021 at 3:30 PM Ben Chobot  wrote:
> So I've installed 
> http://apt.postgresql.org/pub/repos/apt/pool/14/p/postgresql-14/postgresql-14_14~beta3-1.pgdg18.04+1_arm64.deb,
>  after which I see:

Ahhh, so you're on 18.04, an old LTS.  I remember now, there was this
issue in LLVM 3.9 on that aarch64 (which I saw on Debian 9 during
testing of the JIT stuff), resolved by later LLVM versions (maybe 6?):

https://www.postgresql.org/message-id/flat/CAEepm%3D0HqkxWk2w8N2nXQXC_43Mucn-v%3D8QdY8vOG5ojo9kJRA%40mail.gmail.com#5d656391e0b1d49d3e577b7a41e69b7c

I don't know Ubuntu versioning policies etc, but maybe something to do
with which LLVM versions it's allowed to depend on?




Re: To JIT (or not!) in Ubuntu packages

2021-09-14 Thread Thomas Munro
On Tue, Sep 14, 2021 at 10:11 AM Ben Chobot  wrote:
> We've noticed that the Ubuntu postgresql-12 package has --with-llvm
> enabled on x86_64, but not on aarch64. Does anybody know if this was
> intentional, or just an oversight?
>
> For what it's worth, it seems the beta postgresql-14 package for Ubuntu
> still doesn't have --with-llvm.
>
> I'm not sure if this explains why our aarch64 DBs are missing a
> jit_provider GUC variable, but I expect it does explain why
> pg_jit_available() tells me no.

Hmm.  No Ubuntu here and I don't know the answer (CC Christoph).  Can
you show exactly where the package is coming from, what pg_config
outputs.  For what it's worth, it does look like it's enabled in the
Ubuntu aarch64 packages from apt.postgresql.org, according to a quick
look using caveman techniques:

$ curl -sO 
http://apt.postgresql.org/pub/repos/apt/pool/14/p/postgresql-14/postgresql-14_14~beta3-1.pgdg+1_arm64.deb
$ ar x postgresql-14_14~beta3-1.pgdg+1_arm64.deb
$ tar xf data.tar.xz
$ strings usr/lib/postgresql/14/bin/postgres | grep 'with-llvm'
 '--build=aarch64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sysconfdir=/etc'
'--localstatedir=/var' '--disable-option-checking'
'--disable-silent-rules' '--libdir=${prefix}/lib/aarch64-linux-gnu'
'--runstatedir=/run' '--disable-maintainer-mode'
'--disable-dependency-tracking' '--with-icu' '--with-tcl'
'--with-perl' '--with-python' '--with-pam' '--with-openssl'
'--with-libxml' '--with-libxslt' 'PYTHON=/usr/bin/python3'
'--mandir=/usr/share/postgresql/14/man'
'--docdir=/usr/share/doc/postgresql-doc-14'
'--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/'
'--datadir=/usr/share/postgresql/14'
'--bindir=/usr/lib/postgresql/14/bin'
'--libdir=/usr/lib/aarch64-linux-gnu/'
'--libexecdir=/usr/lib/postgresql/'
'--includedir=/usr/include/postgresql/' '--with-extra-version= (Debian
14~beta3-1.pgdg+1)' '--enable-nls' '--enable-thread-safety'
'--enable-tap-tests' '--enable-debug' '--enable-dtrace'
'--enable-cassert' '--disable-rpath' '--with-uuid=e2fs'
'--with-gnu-ld' '--with-lz4' '--with-pgport=5432'
'--with-system-tzdata=/usr/share/zoneinfo' '--with-llvm'
'LLVM_CONFIG=/usr/bin/llvm-config-11' 'CLANG=/usr/bin/clang-11'
'--with-systemd' '--with-selinux' 'MKDIR_P=/bin/mkdir -p'
'PROVE=/usr/bin/prove' 'TAR=/bin/tar' 'XSLTPROC=xsltproc --nonet'
'CFLAGS=-g -O2 -fstack-protector-strong -Wformat
-Werror=format-security' 'LDFLAGS=-Wl,-z,relro -Wl,-z,now'
'--with-gssapi' '--with-ldap' 'build_alias=aarch64-linux-gnu'
'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2
-fstack-protector-strong -Wformat -Werror=format-security'
$ find . -name '*jit*.so'
./usr/lib/postgresql/14/lib/llvmjit.so

It's certainly expected to work on this arch, and we have snakefly and
eelpout testing it in the build farm.
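
For what it's worth, a quick sanity check from SQL on the running
server (these all exist in 11+):

show jit;                    -- defaults to off in 11, on in 12+
show jit_provider;           -- normally 'llvmjit' for a --with-llvm build
select pg_jit_available();   -- should be true if jit is on and the provider library loads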




Re: spiexceptions.UndefinedFile: could not open file "base/16482/681830": No such file or directory

2021-09-08 Thread Thomas Munro
On Thu, Sep 9, 2021 at 9:19 AM Celia McInnis  wrote:
> Note that the file does exist:! (How do I know if it is looking under the 
> correct directory? Other times I have done similar temporary table creations 
> with no problems!):

PostgreSQL internally uses relative paths.  It's probably not a very
good idea to use 'chdir' in a procedure.




Re: Is there something similar like flashback query from Oracle planned for PostgreSQL

2021-06-23 Thread Thomas Munro
On Thu, Jun 24, 2021 at 6:54 AM Dirk Krautschick
 wrote:
> Is there something planned to get a behaviour like Oracle's flashback query 
> based on the old values
> before deleted by vacuum?
>
> So a feature to recreate old versions of rows if still there?
>
> Or are there any related extensions or tools doing this?

There are some things like pg_dirtyread and probably more.  You might
be interested in some of the references in this thread:

https://www.postgresql.org/message-id/flat/CAKLmikOkK%2Bs0V%2B3Pi1vS2GUWQ0FAj8fEkVj9WTGSwZE9nRsCbQ%40mail.gmail.com

As for the SQL standard's approach to this, there are some active
-hackers threads on that with patches in development... look for
"temporal tables" and "system versioned".




Re: order by

2021-06-10 Thread Thomas Munro
On Thu, Jun 10, 2021 at 8:40 PM Luca Ferrari  wrote:
> On Thu, Jun 10, 2021 at 10:15 AM Vijaykumar Jain
> > when you provide a table in query in the order by clause, it is
> > ordered by cols of that table in that order.
>
> Clever, thanks!
> I also realized that this "table to tuples" expansion works for GROUP BY too.
> However, I'm not able to find this documented in GROUP BY, WHERE,
> ORDER BY clauses sections into select documentation
> https://www.postgresql.org/docs/12/sql-select.html>. Could be
> my fault, of course.

There's something about this here:

https://www.postgresql.org/docs/13/rowtypes.html#ROWTYPES-USAGE




Re: CREATE/REFRESH MATERIALIZED VIEW planner difference?

2021-06-01 Thread Thomas Munro
On Wed, Jun 2, 2021 at 7:15 AM Vijaykumar Jain
 wrote:
> i only get workers to create mv, but refresh mv plan does not use workers for 
> the same conf params.

Yeah, this changed in v14:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9e7ccd9ef64d05e87ceb1985d459bef9031205c0




Re: Chain Hashing

2021-05-06 Thread Thomas Munro
On Thu, May 6, 2021 at 9:48 PM Jian He  wrote:
> The following part is about the Chain Hashing.
>>
>>   Maintain a linked list of buckets for each slot in the hash table.
>> Resolve collisions by placing all elements with the same hash key into the 
>> same bucket.
>> → To determine whether an element is present, hash to its bucket and scan 
>> for it.
>>  → Insertions and deletions are generalizations of lookups.
>
>
>I still don't get it. Stackoverflow seems don't have good answers yet. So 
> I come here, asking

Not sure which part you're asking about, but here's an example.
Suppose you want to put three objects (in our case tuples) into a hash
table, where the first attribute is the key:

('AAA', 'Aardvark'),
('BBB', 'Bumblebee'),
('CCC', 'Cat')

Suppose your hash table has 4 buckets.  We can use the lower 2 bits to
map hash values to bucket numbers.

hash('AAA') = 0xfe7f3cba, hash('AAA') & 0x3 = 2
hash('BBB') = 0x87e3287b, hash('BBB') & 0x3 = 3
hash('CCC') = 0x194bcedf, hash('CCC') & 0x3 = 3

'BBB' and 'CCC' collided: they both want to be in bucket 3.

To insert, a chaining hash table just add each object to the computed
bucket's chain:

 +---+
 | 0 |
 +---+
 | 1 |
 +---+
 | 2 |->('AAA', 'Aardvark')
 +---+
 | 3 |->('BBB', 'Bumblebee')->('CCC', 'Cat')
 +---+

When looking up key 'CCC' during the "probe" phase of a hash join,
we'll again compute hash('CCC') & 0x3 = 3, look in bucket 3, and then
compare the key of every tuple in that list with 'CCC' to see if we
can find any matches.  That's called "lookup" in the text you quoted.

It also mentions deletion, which is pretty much just lookup followed
by removing the matching entry from the list, but that's a general
comment about hash tables.  It doesn't apply to their use in hash
joins: there is never a need to remove individual keys, we just build,
probe and destroy the whole table.  Another difference between general
purpose hash tables such as you might find in a typical programming
language standard library and the hash tables used to implement hash
joins is the latter need to be able to tolerate duplicate keys, so the
'scan' of a bucket doesn't give up as soon as it finds a match (unless
it's a semi-join): it normally has to emit all of the matches.
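
If you want to see that "emit all the matches" behaviour from SQL,
here's a tiny made-up demo (the SET lines just push the planner towards
a hash join):

create table letters (key text);
insert into letters values ('AAA'), ('BBB'), ('CCC');
create table many_bees (key text);
insert into many_bees select 'BBB' from generate_series(1, 1000);

set enable_nestloop = off;
set enable_mergejoin = off;
explain analyze
select * from many_bees join letters using (key);  -- 1000 rows out, all 'BBB'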

PostgreSQL uses chaining for hash joins, but it also uses Robin Hood
hash tables in some other places, including hash-based GROUP BY.




Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-04 Thread Thomas Munro
On Wed, May 5, 2021 at 2:12 PM Thomas Munro  wrote:
> It might be interesting to know how that 40ms time scales as you add
> more workers.  ...

Another thought: I'd also try tests like that in large databases (ie
large virtual memory) vs small ones, and with and without huge/locked
memory pages configured[1], to see how much of the overhead is due to
virtual memory work (though I know nothing about Windows VM, it's just
an idea).

[1] 
https://www.postgresql.org/docs/13/runtime-config-resource.html#GUC-HUGE-PAGES




Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-04 Thread Thomas Munro
On Wed, May 5, 2021 at 3:50 AM Hans Buschmann  wrote:
> (BTW: Is this cost multiplied by the real count of workers choosen 
> (max_parallel_workers_per_gather) or only a value independent of the number 
> of workers?. This would matter in windows-high-parallel scenarios)

It's not multiplied:

https://github.com/postgres/postgres/blob/50e17ad281b8d1c1b410c9833955bc80fbad4078/src/backend/optimizer/path/costsize.c#L398

It might be interesting to know how that 40ms time scales as you add
more workers.  For what it's worth, I see that the following query
takes around about ~6ms + ~1.5ms per worker on my FreeBSD machine, and
on Linux it's harder to pin down, varying a lot, usually a bit slower
(sorry I didn't have time to do proper statistics).

create table t ();
alter table t set (parallel_workers=8);
set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

set max_parallel_workers_per_gather = 1;
explain analyze select count(*) from t;

...

set max_parallel_workers_per_gather = 7;
explain analyze select count(*) from t;




Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-04 Thread Thomas Munro
On Tue, May 4, 2021 at 7:40 PM Hans Buschmann  wrote:
> The problem seems that this (probably inherent) performance disadvantage of 
> windows is not reflected in the cost model.

https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-PARALLEL-SETUP-COST
is for that.

It might be interesting to do some profiling to see exactly what is
slow.  Presumably CreateProcess(), but I wonder why exactly.

It'd be nice if we had a way to reuse parallel workers, but alas we do
not, yet.  Or threads.




Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

2021-05-03 Thread Thomas Munro
On Tue, May 4, 2021 at 4:05 AM Hans Buschmann  wrote:
> The main difference is the time shown for the Gather Merge step (65 ms vs. 7 
> ms)

No Windows here, but could it be super slow at launching workers?  How
does a trivial parallel query compare?  Something like:

SET force_parallel_mode = on;
EXPLAIN ANALYZE SELECT 42;




Re: -1/0 virtualtransaction

2021-04-30 Thread Thomas Munro
On Wed, Apr 28, 2021 at 2:25 AM Mike Beachy  wrote:
> On Tue, Apr 27, 2021 at 2:56 AM Laurenz Albe  wrote:
>>
>> Not sure, but do you see prepared transactions in "pg_prepared_xacts"?
>
> No, the -1 in the virtualtransaction 
> (https://www.postgresql.org/docs/11/view-pg-locks.html) for pg_prepared_xacts 
> was another clue I saw! But, it seems more or less a dead end as I have 
> nothing in pg_prepared_xacts.
>
> Thanks for the idea, though.

There is another way to get a pg_lock that shows -1/0 there: if we run
out of SERIALIZABLEXACT objects, we transfer the locks of the oldest
committed transaction to a single SERIALIZABLEXACT object called
OldCommittedSxact, and it has an invalid virtual xid.  You can see
this if you recompile with TEST_SUMMARIZE_SERIAL defined to force that
path, and then run three overlapping transactions like this:

tx1: BEGIN;
tx1: SELECT * FROM t WHERE id = 42;
tx2:   BEGIN;
tx2:   SELECT * FROM t WHERE id = 999;
tx1: COMMIT;

Even though it comitted, at this point we still see tx1's locks, along
with tx2's.

tx3: BEGIN;
tx3: SELECT * FROM t WHERE id = 1234;

At this point we see tx1's locks still but their vxid has changed to -1/0.

> I still need to put more effort into Tom's idea about SIReadLock hanging out 
> after the transaction, but some evidence pointing in this direction is that 
> I've reduced the number of db connections and found that the '-1/0' locks 
> will eventually go away! I interpret this as the db needing to find time when 
> no overlapping read/write transactions are present. This doesn't seem 
> completely correct, as I don't have any long lived transactions running while 
> these locks are hanging out. Confusion still remains, for sure.

But do you have lots of short overlapping transactions so that there
is never a moment where there are zero transactions running?

As mentioned, locks (and transactions, and conflicts) hang around
after you commit.  That's because things that your finished
transaction saw can cause transactions that are still running to
become uncommittable, by forming part of a cycle.   The rules for when
the locks can be freed change once you reach the degraded "summarised"
mode, though, due to lack of bookkeeping space.  Not sure of the exact
details without rereading the source code 42 times with much coffee,
but it's something like: summarised locks can only be freed at times
when there are zero active read/write serializable transactions,
because that's when "CanPartialClearThrough" advances, while in the
normal case they can be freed as soon as there are no SSI snapshots
that were taken before it committed, because that's when
"SxactGlobalXmin" advances.

The number of SERIALIZABLEXACT objects is (max_connections +
max_prepared_transactions) * 10.  So, you could try increasing
max_connections (without increasing the actual number of connections)
to see if you can get to a point where you don't see these invalid
virtual xids, and then maybe it'll be able to clean up locks more
aggressively.
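
If it helps to keep an eye on that while you experiment with
max_connections, something like this shows how many of the summarised
(-1/0) predicate locks are hanging around:

select count(*) as summarised_siread_locks
  from pg_locks
 where mode = 'SIReadLock'
   and virtualtransaction = '-1/0';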




Re: Postgres crashes at memcopy() after upgrade to PG 13.

2021-03-15 Thread Thomas Munro
On Mon, Mar 15, 2021 at 1:29 PM Avinash Kumar
 wrote:
> Is this expected when replication is happening between PostgreSQL databases 
> hosted on different OS versions like Ubuntu 16 and Ubuntu 20 ? Or, do we 
> think this is some sort of corruption ?

Is this index on a text datatype, and using a collation other than "C"?

https://wiki.postgresql.org/wiki/Locale_data_changes

Not that I expect it to crash if that's the cause, I thought it'd just
get confused.
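
If you want to check whether that index is structurally sane rather
than waiting for another crash, amcheck from contrib can be pointed at
it; the index name here is made up:

create extension if not exists amcheck;
select bt_index_check('my_text_index');          -- lighter check
select bt_index_parent_check('my_text_index');   -- stronger check, heavier locks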




Re: SV: Log files polluted with permission denied error messages after every 10 seconds

2021-03-11 Thread Thomas Munro
Hi Andrus,

On Thu, Mar 11, 2021 at 2:21 AM Andrus  wrote:
> Windows Resource manger shows that wal files are used by large number of 
> postgres processes:
>
> postgres.exe    22656    File    C:\Program Files\PostgreSQL\13\data\pg_wal\0001000A0075
> postgres.exe    30788    File    C:\Program Files\PostgreSQL\13\data\pg_wal\0001000A0075
> postgres.exe    14144    File    C:\Program Files\PostgreSQL\13\data\pg_wal\0001000A0074
...

This is normal -- postgres.exe holds open various files it's
interested in, and it's supposed to be OK for them to be renamed or
unlinked at any time by another process because they are opened with
special FILE_SHARE_XXX flags that allow that.  That list doesn't show
the open flags, but it looks like nothing *else* has the files open.
Usually when these types of errors are reported on the mailing list,
it turns out to be due to some other program opening the file without
those flags.

It's also possible that this is a real permissions problem, and not a
sharing violation.  I'd definitely look into this feedback:

On Tue, Mar 9, 2021 at 11:19 PM Thomas Kellerer  wrote:
> The data directory should not be stored in "C:\Program File"s on Windows.
>
> I wouldn't be surprised if "Program Files" has some additional security 
> settings that come into play here.
>
> %ProgramData% is a better location for the data directory.




Re: SV: Log files polluted with permission denied error messages after every 10 seconds

2021-03-09 Thread Thomas Munro
On Tue, Mar 9, 2021 at 9:43 PM Andrus  wrote:
>  > Any hints in Windows event viewer? Events occurring at the same time 
> showing up there.
>
> Looked into Administrative Events/Custom views and few others. There are no 
> messages about this. Windowsi perfomance monitor and Filemon show files 
> opened by process.
>
> How to do reverse: log processes and threads which use files in pg_wal 
> directory ?

Maybe this does the reverse?

https://docs.microsoft.com/en-us/sysinternals/downloads/handle




Re: Log files polluted with permission denied error messages after every 10 seconds

2021-03-08 Thread Thomas Munro
On Sat, Mar 6, 2021 at 2:36 PM Michael Paquier  wrote:
> On Fri, Mar 05, 2021 at 07:36:37PM +0200, Andrus wrote:
> > Then turned real-time protection off:
> >
> > Problem persists. New entry is written after every 10 seconds.
>
> On which files are those complaints?  It seems to me that you may have
> more going on in this system that interacts with your data folder than
> you think.

Suggestion received off-list from my colleague Bevan Arps, when I was
complaining about this general variety of problem: maybe we should
look into using Windows' RestartManager[1][2] API to find out which
processes (at least the pids, maybe also names) currently have a file
open?  Then, if it is indeed a sharing violation that's causing the
problem, we might at least be able to log message that says who's
blocking us once we reach that dreaded retry loop.  There are other
ways to get that information too, I believe, no idea which API would
be best, but this one looks to be the best documented.  I'm unlikely
to work on this myself as a card carrying Unix hacker, so I'm just
passing on this insight in case it's useful...

Another thought: if it's not a sharing violation, I wonder if we
should consider dumping more raw Windows error information in the
messages we log, because, if I recall correctly, we're converting many
Windows error codes into just a few Unix-style error numbers and thereby
throwing away valuable clues.  It makes it a bit more confusing when
trying to ask a Windows expert what might be happening.

[1] https://docs.microsoft.com/en-us/windows/win32/rstmgr/restart-manager-portal
[2] https://devblogs.microsoft.com/oldnewthing/20120217-00/?p=8283




Re: [LDAPS] Test connection user with ldaps server

2021-02-15 Thread Thomas Munro
On Tue, Feb 16, 2021 at 4:32 AM Laurenz Albe  wrote:
> What I would do is experiment with the "ldapsearch" executable from OpenLDAP 
> and see
> if you can reproduce the problem from the command line.

Also, maybe try doing this as the "postgres" user (or whatever user
PostgreSQL runs as), just in case there are some environmental
differences affecting the behaviour.




Re: How to post to this mailing list from a web based interface

2021-02-14 Thread Thomas Munro
On Fri, Jan 29, 2021 at 4:27 AM Alvaro Herrera  wrote:
> On 2021-Jan-28, Ravi Krishna wrote:
> > I recollect there use to be a website from where one can reply from web.
>
> The community does not maintain such a service.
>
> There used to be a Gmane archive of this list that you could use to
> post.  Seems it's still online at postgresql-archive.org.  They have a
> "Reply" button and it says to require your account, but given SPF and
> DMARC and other restrictions on email generation, it seems pretty
> uncertain that emails posted that way would work correctly.  I think we
> would even reject such emails if they reached our mailing list servers.

While catching up with some interesting new threads just now I was
quite confused by the opening sentence of this message (which also
arrived in my mailbox):

https://www.postgresql.org/message-id/1611355191319-0.post%40n3.nabble.com

... until I got to the last line.  I wonder if the "Resend" facility
on our own archives could be better advertised, via a "Want to join
this thread?" link in the Quick Links section that explains how to use
it and what problem it solves, or something...




Re: Unable To Drop Tablespace

2021-02-04 Thread Thomas Munro
On Fri, Feb 5, 2021 at 12:43 PM Ian Lawrence Barwick  wrote:
> 2021年2月5日(金) 3:52 Pavan Pusuluri :
>> We are trying to drop a table space on RDS Postgres . We have removed the 
>> objects etc, but it still won't drop.
>>
>> I have checked and there's no reference anywhere to this tablespace but it 
>> complains it's not empty.
>>
>> I checked if it is a default for a database, revoked all privileges on the 
>> tablespace.
>>
>> We dropped the database but underpinning tablespace remained but when I 
>> query to see if any reference i get no hits.
>>
>> "Select c.relname,t.spcname from pg_class c JOIN pg_tablespace t ON 
>> c.reltablespace=t.oid where t.spcname='mytablespace'
>>
>> I dont find any objects referencing. Kindly let me know if anything else 
>> needs to be checked?
>
>
> There's a handy function "pg_tablespace_databases()" to check which databases
> might still have objects in a tablespace. There are a couple of useful queries 
> demonstrating
> usage here:
>
>   https://pgpedia.info/p/pg_tablespace_databases.html

It's also possible for there to be stray files in there, in some crash
scenarios where PostgreSQL doesn't currently clean up relation files
that it ideally should.  The one with the widest window AFAIK is where
you crash after creating a table but before committing[1].  You'd need
a directory listing to investigate that.

[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG%2BXpssKqmezug%40mail.gmail.com
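
On a self-managed install, something like this (tablespace name taken
from your query) would tell you which directory to list for stray
files; on RDS you'd presumably have to ask support to look:

select spcname, pg_tablespace_location(oid) as on_disk_location
  from pg_tablespace
 where spcname = 'mytablespace';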




Re: Unable to compile postgres 13.1 on Slackware current x64

2020-11-17 Thread Thomas Munro
On Tue, Nov 17, 2020 at 8:02 PM Condor  wrote:
> I try to compile postgres again with (cd src/backend/commands; sed
> 's/TRUE/true/' collationcmds.c > collationcmds.c; ) and it's compiled
> but get new error on linking:

Doesn't that produce an empty file collationcmds.c? I think you want:

sed 's/TRUE/true/' collationcmds.c > collationcmds.c.tmp && mv
collationcmds.c.tmp collationcmds.c




Re: Unable to compile postgres 13.1 on Slackware current x64

2020-11-16 Thread Thomas Munro
On Mon, Nov 16, 2020 at 10:10 PM Laurenz Albe  wrote:
> On Mon, 2020-11-16 at 09:15 +0200, Condor wrote:
> > collationcmds.c: In function ‘get_icu_language_tag’:
> > collationcmds.c:467:51: error: ‘TRUE’ undeclared (first use in this
> > function); did you mean ‘IS_TRUE’?
> >467 |  uloc_toLanguageTag(localename, buf, sizeof(buf), TRUE, );
> >|   ^~~~
> >|   IS_TRUE
> >

> "UBool" and "TRUE" are defined in "umachine.h", which is a header file for the
> "libicu" library.
>
> PostgreSQL includes "unicode/ucol.h", which will include "umachine.h"
> (via "utypes.h"), so that should be fine.
>
> Are your libicu headers installed under /usr/include/unicode?
> Do you get any messages about missing include files earlier?

It looks like something happened to ICU's boolean macros .  See this
commit in FreeBSD ports:

https://github.com/freebsd/freebsd-ports/commit/81a88b4506ec06d07be10d199170ef4003eb0e30

... which references:

https://github.com/unicode-org/icu/commit/c3fe7e09d844




Re: PostgreSQL on Windows' state

2020-09-23 Thread Thomas Munro
On Wed, Sep 23, 2020 at 10:53 PM Alessandro Dentella
 wrote:
> Thanks Pavel, but psycopg2 (that I always use is just for Python). T
> hey claim/complain that from c# there's no native solution.

Maybe https://www.npgsql.org/?




Re: Check replication lag

2020-08-05 Thread Thomas Munro
On Thu, Aug 6, 2020 at 7:02 AM Sreejith P  wrote:
> IN SYNC mode of replication what would be the impact on Master DB server in 
> terms of over all performance  ?

The pg_stat_replication columns write_lag, flush_lag and replay_lag
are designed to tell you how long to expect commits to take for
synchronous standbys, based on recent history, if synchronous_commit
is set to remote_write, on or remote_apply respectively.  Those times
tell you about commit latency, which limits sequential commit rate for
each session.
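
For example, on the primary (the lag columns can be NULL if there
hasn't been recent activity to measure):

select application_name, sync_state,
       write_lag, flush_lag, replay_lag
  from pg_stat_replication;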




Re: PG 9.5.5 cores on AIX 7.1

2020-07-19 Thread Thomas Munro
On Sun, Jul 19, 2020 at 11:01 PM Abraham, Danny  wrote:
> Segmentation fault in _alloc_initial_pthread at 0x9521474
> 0x9521474 (_alloc_initial_pthread+0x1d4) e803 ld   r0,0x0(r3)
> (dbx) where
> _alloc_initial_pthread(??) at 0x9521474
> __pth_init(??) at 0x951f390
> uload(??, ??, ??, ??, ??, ??) at 0x9fff000ab70
> load1(??, ??, ??, ??) at 0x9000b74
> load(??, ??, ??) at 0x9001ef0
> loadAndInit(??, ??, ??) at 0x905b38c
> dlopen(??, ??) at 0x909bfe0
> internal_load_library(??) at 0x10014c684
> RestoreLibraryState(??) at 0x10014d79c
> ParallelWorkerMain(??) at 0x1000bb2d0
> StartBackgroundWorker() at 0x10026cd94
> maybe_start_bgworkers() at 0x10003834c
> sigusr1_handler(??) at 0x10003902c
> __fd_select(??, ??, ??, ??, ??) at 0x91567fc
> ServerLoop() at 0x1004cec90
> PostmasterMain(??, ??) at 0x10003a4e8
> main(??, ??) at 0x108f8

FWIW there was a report a decade ago that looked at least superficially similar:

https://www.postgresql.org/message-id/flat/09B23E7BF70425478C1330D893A722C602FEC019BD%40MailSVR.invera.com




Re: Same query taking less time in low configuration machine

2020-07-15 Thread Thomas Munro
On Tue, Jul 14, 2020 at 9:27 PM Vishwa Kalyankar
 wrote:
>  Function Scan on kseb_geometry_trace_with_barrier_partition  
> (cost=0.25..10.25 rows=1000 width=169) (actual time=11626.548..11626.568 
> rows=254 loops=1)

>  Function Scan on kseb_geometry_trace_with_barrier_partition  
> (cost=0.25..10.25 rows=1000 width=169) (actual time=22304.425..22304.448 
> rows=254 loops=1)

I have no idea what that function does, but perhaps it runs more
queries, and you can't see the plans for those here.  If you set up
auto_explain[1], and turn on auto_explain.log_nested_statements, then
you'll be able to see the query plans for the internal stuff happening
in that function.
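
A minimal way to turn that on just for your own session while testing
(LOAD needs superuser, or you can put the module in
session_preload_libraries instead):

load 'auto_explain';
set auto_explain.log_min_duration = 0;          -- log a plan for every statement
set auto_explain.log_nested_statements = on;    -- include statements run inside the function
set auto_explain.log_analyze = on;              -- actual row counts and timings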

I'd look at EXPLAIN (BUFFERS) or auto_explain.log_buffers to get more
information on cache hits and misses.  I'd look for settings
differences with EXPLAIN (SETTINGS) to see if there's anything
accidentally set differently (maybe JIT or paralelism or something
like that).  I'd look at pg_stat_activity repeatedly while it runs to
see what the processes are doing, especially the wait_event column.
I'd also look at the CPU and I/O on the systems with operating system
tools like iostat, htop, perf to try to find the difference.

[1] https://www.postgresql.org/docs/current/auto-explain.html




Re: Definition of REPEATABLE READ

2020-06-21 Thread Thomas Munro
On Mon, Jun 22, 2020 at 12:25 AM Peter J. Holzer  wrote:
> I've read http://jepsen.io/analyses/postgresql-12.3 which reports a
> problem in PostgreSQL's SERIALIZABLE isolation leven (which has since been
> fixed) and also shows an example of a violation of what they consider to
> be the correct definition of REPEATABLE READ.

In response to that report, we added a paragraph to the manual to
state explicitly that what we offer is snapshot isolation:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=1575fbf1ac3c8464b2dade245deff537a3aa2498

https://www.postgresql.org/docs/devel/transaction-iso.html#XACT-REPEATABLE-READ

> Since those papers are now 25 and 20 years old, respectively, and there
> have been several revisions of the SQL standard in the meantime, has the
> SQL committee come around to that view (SQL/2003 seems unchanged, I
> couldn't find a later revision online)? And if it has, did it follow
> those definitions or come up with different ones (it seems to me that
> G2-item is much stronger that warranted by the wording in the standard)?

SQL:2016 is the latest, and I don't have it to hand right now, but
SQL:2011 still defined four degrees of isolation in terms of the three
phenomena (pages 133-135 of part II, "Foundations", at least in the
draft copy you can find on the net).

As for what else PostgreSQL should do about this historical confusion,
I suspect that a patch to pgsql-hackers to accept the name SNAPSHOT
ISOLATION as an alternative would be popular, and we could probably
even spin it as the "true" name of the level (while still accepting
REPEATABLE READ; there is no point in breaking that), and maybe add a
second table to the manual's Transaction Isolation section to cover
the later formalisations in a digested user friendly format (if that
is possible).




Re: troubleshooting postgresql ldap authentication

2020-06-08 Thread Thomas Munro
On Tue, Jun 9, 2020 at 9:05 AM Chris Stephens  wrote:
> hostssl  all  all  0.0.0.0/0  ldap
> ldapserver="ldaps://xxx" ldapbasedn="yyy" ldaptls=1

> does anyone know what might be causing "LDAP: Bad parameter to an ldap 
> routine"

You probably want ldapurl="ldaps://xxx" (note: ldapurl, not
ldapserver).  Or you could use ldapscheme="ldaps" and
ldapserver="xxx".




Re: Shared memory error

2020-06-04 Thread Thomas Munro
On Fri, Jun 5, 2020 at 1:00 AM Sonam Sharma  wrote:
> The dynamic_shared_memory_type was set to POSIX . Because of this it was 
> using tmpfs /dev/shm. When the query was running I saw the file system was 
> filling. So I  extended the file system and luckily the query worked for that 
> time

Oh, hmm.  When you say "filling", maybe you hit the leaking shm
problem that was fixed in 11.7, that requires messages somewhere in
your log about lack of DSM slots?  I don't have an explanation for the
exact the error you're seeing though.  It's a bit tricky to speculate
on older releases with so many bug fixes missing though.  Please let
us know if you still see the problem after your next scheduled
upgrade.




Re: Shared memory error

2020-06-04 Thread Thomas Munro
>> Do you see any other errors around this one, in the PostgreSQL logs?
> No , only this is the error from db and jdbc end .. and queries are failing

If you need a workaround right now you could always set
max_parallel_workers_per_gather=0 so that it doesn't try to use
parallel query.  That could make some queries run more slowly.
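
Concretely, either of these (the second needs a reload to take effect):

-- per session:
set max_parallel_workers_per_gather = 0;

-- or cluster-wide until you're on a release with the fix:
alter system set max_parallel_workers_per_gather = 0;
select pg_reload_conf();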

When you say you extended the /dev/shm filesystem, what are the mount
options, or how did you configure it?  It's interesting that the
problem went away for a while when you did that.




Re: Shared memory error

2020-06-04 Thread Thomas Munro
On Thu, Jun 4, 2020 at 6:18 AM Sonam Sharma  wrote:
>>> 1) Postgres version : 11.2

FYI This is missing over a year's worth of bugfixes.  That said, I
don't know of anything fixed that has this symptom.

>>> 4) Is this only with one query and if so what is it doing? : No , few 
>>> queries work, few fails .. mostly all are complex select queries.

It's a long shot, because I expect you're running the usual packages
under a user set up by the package to be a "system" user, but just in
case: there is one known way for POSIX shm to be unlinked while you're
still using it: systemd.  For example, this might happen if you have
cronjobs running as the same user.  Some more info:

https://www.postgresql.org/docs/11/kernel-resources.html#SYSTEMD-REMOVEIPC

Do you see any other errors around this one, in the PostgreSQL logs?




Re: Can we get SQL Server-like cross database queries

2020-06-03 Thread Thomas Munro
On Thu, Jun 4, 2020 at 4:26 PM Ron  wrote:
> On 6/3/20 2:57 PM, Rob Sargent wrote:
> >> On Jun 3, 2020, at 1:46 PM, Tom Lane  wrote:
> >> Guyren Howe  writes:
> >>> Is it practical to provide the SQL Server-like feature in Postgres?
> >> No.
> > That got me chuckling.
> > I had just decided not to bother posting, but well, here goes.
> >
> > I call bs on the “cross db query” notion of tsql - but I admit I haven’t 
> > used it since it was a Sybase thing.
> >
> > Is db.owner.table (iirc) is really anything more than nuance on 
> > schema.table.  Does a db for automotive parts need to be 
> > up-close-and-personal with a db payroll?
>
> Those aren't the only two databases that exist.  Think of a federated system
> where you've got a "reference" database full of lookup tables, and one
> database for every 10,000 customers. For 45,000 customers you've got five
> databases, and they all need to access the reference database, plus "all
> customers" queries need to access all five databases.

There's no doubt it's useful, and it's also part of the SQL spec,
which says you can do catalog.schema.table.  I would guess that we
might get that as a byproduct of any project to make PostgreSQL
multithreaded.  That mountain moving operation will require us to get
rid of all the global state that currently ties a whole process to one
session and one database, and replace it with heap objects with names
like Session and Database that can be passed around between worker
threads.




Re: Help understanding SIReadLock growing without bound on completed transaction

2020-05-22 Thread Thomas Munro
On Fri, May 22, 2020 at 7:48 AM Mike Klaas  wrote:
> It's my understanding that these locks should be cleared when there are no 
> conflicting transactions.  These locks had existed for > 1 week and we have 
> no transactions that last more than a few seconds (the oldest transaction in 
> pg_stat_activity is always < 1minute old).
> Why would a transaction that is finished continue accumulating locks over 
> time?

Predicate locks are released by ClearOldPredicateLocks(), which
releases SERIALIZABLEXACTs once they are no longer interesting.  It
has a  conservative idea of what is no longer interesting: it waits
until the lowest xmin across active serializable snapshots is >= the
transaction's finishedBefore xid, which was the system's next xid (an
xid that hasn't been used yet*) at the time the SERIALIZABLEXACT
committed.  One implication of this scheme is that SERIALIZABLEXACTs
are cleaned up in commit order.  If you somehow got into a state where
a few of them were being kept around for a long time, but others
committed later were being cleaned up (which I suppose must be the
case or your system would be complaining about running out of
SERIALIZABLEXACTs), that might imply that there is a rare leak
somewhere in this scheme.  In the past I have wondered if there might
be a problem with wraparound in the xid tracking for finished
transactions, but I haven't worked out the details (transaction ID
wraparound is both figuratively and literally the Ground Hog Day of
PostgreSQL bug surfaces).

*Interestingly, it takes an unlocked view of that value, but that
doesn't seem relevant here; it could see a value that's too low, not
too high.




Re: Help understanding SIReadLock growing without bound on completed transaction

2020-05-21 Thread Thomas Munro
On Fri, May 22, 2020 at 7:48 AM Mike Klaas  wrote:
> locktype: page
> relation::regclass::text: _pkey
> virtualtransaction: 36/296299968
> granted:t
> pid:2263461

That's an unusually high looking pid.  Is that expected, for example
did you crank Linux's pid_max right up, or is this AIX, or something?




Re: Ubuntu 20.04: apt.postgresql.org/pub/repos/apt focal-pgdg InRelease' doesn't support architecture 'i386'

2020-05-16 Thread Thomas Munro
On Sun, May 17, 2020 at 10:45 AM Hugh  wrote:
> While this doesn't appear to be a bug that causes problems of any kind, I do 
> have a question about its cause.
>
> The "error" listed in the Subject: line is basically what I'm seeing. The 
> entire message is below, particularly the 'N:' at the end. Is there a repo 
> setting I should change to prevent the request for '386' architecture? Thank 
> you in advance for your assistance.

I'm not sure, but it seems related to this complaint and the answer
might be to tell your sources.list that that source has only amd64:

https://www.postgresql.org/message-id/flat/16402-1f2d77e819f9e1f2%40postgresql.org




Re: Is there a significant difference in Memory settings between 9.5 and 12

2020-05-11 Thread Thomas Munro
On Tue, May 12, 2020 at 2:52 PM Tory M Blue  wrote:
> It took the change but didn't help. So 10GB of shared_buffers in 12 is still 
> a no go. I'm down to 5GB and it works, but this is the same hardware, the 
> same exact 9.5 configuration. So I'm missing something. WE have not had to 
> mess with kernel memory settings since 9.4, so this is an odd one.
>
> I'll keep digging, but i'm hesitant to do my multiple TB db's with half of 
> their shared buffer configs, until I understand what 12 is doing differently 
> than 9.5

Which exact version of 9.5.x are you coming from?  What's the exact
error message on 12 (you showed the shared_memory_type=sysv error, but
with the default  value (mmap) how does it look)?  What's your
huge_pages setting?

Can you reproduce the problem with a freshly created test cluster?  As
a regular user, assuming regular RHEL packaging, something like
/usr/pgsql-12/bin/initdb -D test_pgdata, and then
/usr/pgsql-12/bin/postgres -D test_pgdata -c shared_buffers=10GB (then
^C to stop it).  If that fails to start in the same way, it'd be
interesting to see the output of the second command with strace in
front of it, in the part where it allocates shared memory.  And
perhaps it'd be interesting to see the same output with
/usr/pgsql-9.5/bin/XXX (if you still have the packages).  For example,
on my random dev laptop that looks like:

openat(AT_FDCWD, "/proc/meminfo", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(6, "MemTotal:   16178852 kB\nMemF"..., 1024) = 1024
read(6, ":903168 kB\nShmemHugePages:  "..., 1024) = 311
close(6)= 0
mmap(NULL, 11016339456, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = -1 ENOMEM (Cannot
allocate memory)
mmap(NULL, 11016003584, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS, -1, 0) = 0x7ff74e579000
shmget(0x52e2c1, 56, IPC_CREAT|IPC_EXCL|0600) = 3244038
shmat(3244038, NULL, 0) = 0x7ff9df5ad000

The output is about the same on REL9_5_STABLE and REL_12_STABLE for
me, only slightly different sizes.  If that doesn't fail in the same
way on your system with 12, perhaps there are some more settings from
your real clusters required to make it fail.  You could add them one
by one with -c foo=bar or in the throw away
test_pgdata/postgresql.conf, and perhaps that process might shed some
light?

I was going to ask if it might be a preloaded extension that is asking
for gobs of extra memory in 12, but we can see from your "Failed
system call was shmget(key=5432001, size=11026235392, 03600)" that
it's in the same ballpark as my total above for shared_buffers=10GB.




Re: 12.2: Howto check memory-leak in worker?

2020-05-04 Thread Thomas Munro
On Tue, May 5, 2020 at 10:13 AM Peter  wrote:
> BTW, I would greatly appreciate if we would reconsider the need for
> the server to read the postmaster.pid file every few seconds (probably
> needed for something, I don't know).
> That makes it necessary to set atime=off to get a spindown, and I
> usually prefer to have atime=on so I can see what my stuff is
> currently doing.

That'd be this:

/*
 * Once a minute, verify that postmaster.pid hasn't been removed or
 * overwritten.  If it has, we force a shutdown.  This avoids having
 * postmasters and child processes hanging around after their database
 * is gone, and maybe causing problems if a new database cluster is
 * created in the same place.  It also provides some protection
 * against a DBA foolishly removing postmaster.pid and manually
 * starting a new postmaster.  Data corruption is likely to ensue from
 * that anyway, but we can minimize the damage by aborting ASAP.
 */
if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
{
if (!RecheckDataDirLockFile())




Re: Transition tables for column-specific UPDATE triggers

2020-05-03 Thread Thomas Munro
On Wed, Oct 9, 2019 at 3:59 PM Guy Burgess  wrote:
> The manual says: 
> https://www.postgresql.org/docs/current/sql-createtrigger.html
>
> A column list cannot be specified when requesting transition relations.
>
> And (I think the same point):
>
> The standard allows transition tables to be used with column-specific UPDATE 
> triggers, but then the set of rows that should be visible in the transition 
> tables depends on the trigger's column list. This is not currently 
> implemented by PostgreSQL.
>
> Are there any plans to allow transition tables to be used with 
> column-specific UPDATE triggers?  Or, is there another way for a trigger 
> function to see the rows changed by a column-specific UPDATE trigger?

Hi Guy,

Answering an ancient message that went unanswered... I'm not aware of
anyone working on that, and there isn't another way: the transition
tables feature simply won't let you create such a trigger.  The last
I've seen anyone say about that was in the following commit:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0f79440fb0b4c5a9baa9a95570c01828a9093802

Excerpt:

"Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature."

The code preventing it is here:

/*
 * We currently don't allow column-specific triggers with
 * transition tables.  Per spec, that seems to require
 * accumulating separate transition tables for each combination of
 * columns, which is a lot of work for a rather marginal feature.
 */
if (stmt->columns != NIL)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("transition tables cannot be specified
for triggers with column lists")));

In theory you could do some filtering in your trigger procedure, by
comparing rows in OLD TABLE and NEW TABLE to see which row-pairs
represent changes to the columns you care about, but that's slightly
complicated: you can join OLD and NEW using whatever keys you have
defined, but that only works if there's no danger of the keys
themselves changing.  I wondered about inventing something like WITH
ORDINALITY so that you get unique ordered numbers in an extra column
that can be used to join the two transition tables without knowing
anything about the user defined keys, but among other problems I
couldn't figure out how to fit it in syntax-wise.
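
To sketch that filtering idea in SQL (everything here is invented: a
table "t" with a never-changing key "id", and "val" as the only column
you care about):

create table t (id int primary key, val text, other text);
create table val_audit (id int, old_val text, new_val text);

create function log_val_changes() returns trigger
language plpgsql as $$
begin
  insert into val_audit (id, old_val, new_val)
  select n.id, o.val, n.val
    from old_rows o
    join new_rows n using (id)            -- relies on id never changing
   where o.val is distinct from n.val;    -- keep only rows where val changed
  return null;
end;
$$;

create trigger track_val_changes
  after update on t                       -- note: no "OF val" column list allowed here
  referencing old table as old_rows new table as new_rows
  for each statement
  execute function log_val_changes();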

I suppose PostgreSQL could do this internally to make the feature you
want work: whenever you scan either table, in an UPDATE OF 
trigger, it could scan both transition tables in sync and filter out
rows that didn't change your columns of interest.  Or it could do that
filtering up front, before your trigger fires, to create two brand new
tuplestores just for your trigger invocation.  Or there could be a
third spillable data structure, that records which triggers should be
able to see each old/new-pair, or which columns changed, and is
scanned in sync with the others.  Just some first thoughts; I am not
planning to work on this any time soon.




Re: EINTR while resizing dsm segment.

2020-04-07 Thread Thomas Munro
On Tue, Apr 7, 2020 at 8:58 PM Nicola Contu  wrote:
> So that seems to be a bug, correct?
> Just to confirm, I am not using NFS, it is directly on disk.
>
> Other than that, is there a particular option we can set in the postgres.conf 
> to mitigate the issue?

Hi Nicola,

Yeah, I think it's a bug.  We're not sure exactly where yet.




Re: EINTR while resizing dsm segment.

2020-04-03 Thread Thomas Munro
On Thu, Apr 2, 2020 at 9:25 PM Kyotaro Horiguchi
 wrote:
> I provided the subject, and added -hackers.
>
> > Hello,
> > I am running postgres 11.5 and we were having issues with shared segments.
> > So I increased the max_connection as suggested by you guys and reduced my
> > work_mem to 600M.
> >
> > Right now instead, it is the second time I see this error :
> >
> > ERROR:  could not resize shared memory segment "/PostgreSQL.2137675995" to
> > 33624064 bytes: Interrupted system call
>
> The function posix_fallocate is protected against EINTR.
>
> | do
> | {
> |   rc = posix_fallocate(fd, 0, size);
> | } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));
>
> But not for ftruncate and write. Don't we need to protect them from
> EINTR as in the attached patch?

We don't handle EINTR for write() generally because that's not
supposed to be necessary on local files (local disks are not "slow
devices", and we document that if you're using something like NFS you
should use its "hard" mount option so that it behaves that way too).
As for ftruncate(), you'd think it'd be similar, and I can't think of
a more local filesystem than tmpfs (where POSIX shmem lives on Linux),
but I can't seem to figure that out from reading man pages; maybe I'm
reading the wrong ones.  Perhaps in low memory situations, an I/O wait
path reached by ftruncate() can return EINTR here rather than entering
D state (non-interruptable sleep) or restarting due to our SA_RESTART
flag... anyone know?

Another thought: is there some way for the posix_fallocate() retry
loop to exit because (ProcDiePending || QueryCancelPending), but then
for CHECK_FOR_INTERRUPTS() to do nothing, so that we fall through to
reporting the EINTR?




Re: \COPY to accept non UTF-8 chars in CHAR columns

2020-03-27 Thread Thomas Munro
On Sat, Mar 28, 2020 at 4:46 AM Tom Lane  wrote:
> Matthias Apitz  writes:
> > In short, it there a way to let \COPY accept such broken ISO bytes, just
> > complaining about, but not stopping the insert of the row?
>
> No.  We don't particularly believe in the utility of invalid data.
>
> If you don't actually care about what encoding your data is in,
> you could use SQL_ASCII as the database "encoding" and thereby
> disable all UTF8-specific behavior.  Otherwise, maybe this conversion
> is a good time to clean up the mess?

Something like this approach might be useful for fixing the CSV file:

https://codereview.stackexchange.com/questions/185821/convert-a-mix-of-latin-1-and-utf-8-to-proper-utf-8

I haven't tested that program but it looks like the right sort of
approach; I remember writing similar logic to untangle the strange
mixtures of Latin 1, Windows 1252, and UTF-8 that late 90s browsers
used to send.  That sort of approach can't fix every theoretical
problem (some valid Latin1 sequences are also valid UTF-8 sequences)
but it's doable with text in European languages.
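If you'd rather do the repair inside the database than with an external
program, the same idea can be sketched in SQL: stage each value's raw
bytes in a bytea column (for example by loading into a SQL_ASCII staging
database first), then convert row by row, falling back to Latin 1 when
the bytes aren't valid UTF-8.  The function name here is invented, and
like the program above it can't resolve genuinely ambiguous sequences:

CREATE OR REPLACE FUNCTION fix_mixed_encoding(raw bytea) RETURNS text
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
    -- Valid UTF-8 passes straight through...
    RETURN convert_from(raw, 'UTF8');
EXCEPTION
    WHEN character_not_in_repertoire OR untranslatable_character THEN
        -- ...anything else is assumed to be Latin 1, which always decodes.
        RETURN convert_from(raw, 'LATIN1');
END;
$$;

-- e.g.:  SELECT fix_mixed_encoding(raw_line) FROM staging;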




Re: LDAP with TLS is taking more time in Postgresql 11.5

2020-02-25 Thread Thomas Munro
On Wed, Feb 26, 2020 at 7:37 AM Adrian Klaver  wrote:
> On 2/25/20 10:23 AM, Mani Sankar wrote:
> > Hi Adrian,
> >
> > Both the machines are in same network and both are pointing towards the
> > same LDAP server
>
> I don't see any errors in the Postgres logs.
>
> You probably should take a look at the LDAP server logs to see if there
> is anything there.
>
> You could also turn up the logging detail in Postgres to see if it
> reveals anything.

A couple more ideas:

If you take PostgreSQL out of the picture and run the equivalent LDAP
queries with the ldapsearch command line tool, do you see the same
difference in response time?  If so, I'd trace that with strace etc
with timings to see where the time is spent -- for example, is it
simply waiting for a response from the LDAP (AD?) server?   If not,
I'd try tracing the PostgreSQL process and looking at the system calls
(strace -tt -T for high res times and elapsed times), perhaps using
PostgreSQL's pre_auth_delay setting to get time to attach strace.

A wild stab in the dark: if it's slow from one computer and not from
another, perhaps the problem has something to do with a variation in
reverse DNS lookup speed on the LDAP server side when it's verifying
the certificate.  Or something like that.




Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-10 Thread Thomas Munro
On Mon, Feb 10, 2020 at 4:35 AM Marc  wrote:
> We will keep 12.1 in place so that we can run additional tests to help
> pin-point the issue.
>
> Feel free to ask but allow us to recover from these hectic days ;-)

Here's how to get a stack so we can see what it was doing, assuming
you have the Apple developer tools installed:

1.  Find the PID of the backend you're connected to with SELECT
pg_backend_pid().
2.  "lldb -p PID" from a shell to attach to the process, then "cont"
to let it continue running.
3.  Run the query in that backend and wait for the SIGKILL.
4.  In the lldb session, type "bt".

It'll only make sense if your PostgreSQL build has debug symbols, but let's see.




Re: Postgres 12.1 : UPPER() in WHERE clause restarts server

2020-02-09 Thread Thomas Munro
On Sun, Feb 9, 2020 at 11:46 AM Tom Lane  wrote:
> "Nick Renders"  writes:
> > When we do the following statement:
> >   SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'
> > the Postgres service restarts.
>
> Hm.
>
> > Here is what is logged:
> > 2020-02-08 20:21:19.942 CET [83892] LOG:  server process (PID 85456) was
> > terminated by signal 9: Killed: 9
> > 2020-02-08 20:21:19.942 CET [83892] DETAIL:  Failed process was running:
> > SELECT * FROM f_gsxws_schedule WHERE UPPER(gwsc_dossier) = 'TEST'
>
> That's mighty interesting, because signal 9 is an external SIGKILL,
> not an internal-to-Postgres software fault.
>
> If you were running on Linux I would hypothesize that your process
> was getting killed by the infamous OOM killer, in which case we could
> guess that for some reason this query is consuming an unreasonable
> amount of memory and thereby attracting the wrath of the OOM killer.
> However, I'm not aware that any such mechanism exists on macOS.

macOS's thing like that appears as "kernel[0]: memorystatus_thread:
idle exiting pid XXX [some program]" in system.log, which seems like a
bit of an understatement to me but that's what they call it.  Some
details here:

http://newosxbook.com/articles/MemoryPressure.html

Nick, did you see that?




Re: Unable to startup postgres: Could not read from file "pg_clog/00EC"

2020-02-06 Thread Thomas Munro
On Fri, Feb 7, 2020 at 1:47 AM Nick Renders  wrote:
> Thank you for the feedback, Alvaro.
>
> Unfortunately, the database is no longer "dumpable". We were able to do
> a pg_dump yesterday morning (12 hours after the crash + purging the
> pg_clog) but if we try one now, we get the following error:
>
> unexpected chunk number 1 (expected 0) for toast value 8282331 in
> pg_toast_38651
>
> Looking at our data, there seem to be 6 tables that have corrupt
> records. Doing a SELECT * for one of those records, will return a
> similar error:
>
> missing chunk number 0 for toast value 8288522 in pg_toast_5572299
>
>
> What is the best way to go from here? Is tracking down these corrupt
> records and deleting them the best / only solution?
> Is there a way to determine of there are issues with new data (after the
> crash)?
>
> Any help and advice is very much appreciated.

This error indicates that the file did exist already, it was just
shorter than we expected:

2020-02-04 15:20:44 CET DETAIL:  Could not read from file
"pg_clog/00EC" at offset 106496: Undefined error: 0.

What was the length of the file before you overwrote it? Are there
00EB and 00ED files, and if so what size?  When your server rebooted,
did crash recovery run or had it shut down cleanly?  Do you know if
the machine lost power, or the kernel crashed, or if it was a normal
reboot?  What are your settings for "fsync" and "wal_sync_method"?
What is the output of pg_controldata -D pgdata?  I wonder if that part
of the clog file was supposed to be created before the checkpoint (ie
the checkpoint is somehow borked), or was supposed to be created
during recovery after that checkpoint (something else is borked, but I
don't know what), or if the xid is somehow corrupted.

Here's a dirty trick that might help rescue some data.  Assuming you
have a copy of the original file before you zeroed it, you could write
a 256kb file full of 0x55 (that's 01010101 and represents 4 commits,
so if you fill the file up with that it means 'all transactions in
this range committed', which is probably closer to the truth than all
zeroes), and then copy the original shorter file over the top of it,
so that at least the range of transactions represented by the earlier
part of the file that did make it to disk is preserved, and we just
have bogus force-everything-to-look-committed data after that.
But as Alvaro said, this is a pretty bad situation, this is key meta
data used to interpret all other data files, so all bets are off here,
this is restore-from-backups territory.




Re: ERROR: too many dynamic shared memory segments

2020-01-31 Thread Thomas Munro
On Fri, Jan 31, 2020 at 11:05 PM Nicola Contu  wrote:
> Do you still recommend to increase max_conn?

Yes, as a workaround of last resort.  The best thing would be to
figure out why you are hitting the segment limit, and see if there is
something we could tune to fix that. If you EXPLAIN your queries, do
you see plans that have a lot of "Gather" nodes in them, perhaps
involving many partitions?  Or are you running a lot of parallel
queries at the same time?  Or are you running queries that do very,
very large parallel hash joins?  Or something else?
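For example, a parallel aggregate on some hypothetical big table shows
up in EXPLAIN roughly like this (costs trimmed); each Gather or Gather
Merge that launches workers at run time needs at least one DSM segment,
so plans with many of them, or many such queries at once, add up
quickly:

EXPLAIN SELECT count(*) FROM big_table;

 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on big_table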




Re: ERROR: too many dynamic shared memory segments

2020-01-31 Thread Thomas Munro
On Thu, Jan 30, 2020 at 12:26 AM Thomas Munro  wrote:
> On Wed, Jan 29, 2020 at 11:24 PM Julian Backes  wrote:
> > we only had the "too many dynamic shared memory segments"
> > error but no segmentation faults. The error started occurring after 
> > upgrading from postgres 10 to postgres 12 (server has 24 cores / 48 
> > threads, i.e. many parallel workers). The error itself was not that much of 
> > a problem but /dev/shm started filling up with orphaned files which 
> > probably (?) had not been cleaned up by postgres after the parallel workers 
> > died. In consequence, after some time, /dev/shm was full and everything 
> > crashed.
>
> Oh, thanks for the report.  I think I see what was happening there, and
> it's a third independent problem.  The code in dsm_create() does
> DSM_OP_DESTROY (ie cleans up) in the DSM_CREATE_NULL_IF_MAXSEGMENTS
> case, but in the case where you see "ERROR: too many dynamic shared
> memory segments" it completely fails to clean up after itself.  I can
> reproduce that here.  That's a terrible bug, and has been sitting in
> the tree for 5 years.

I committed a fix for that.  It'll be in the new releases that are due out
in a couple of weeks.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=93745f1e019543fe7b742d0c5e971aad8d08fd56

> > Unfortunately, the only "solution" we found so far was to increase max 
> > connections from 100 to 1000. After that (about 2 months ago I think), the 
> > error had gone.
>
> I'll take that as a vote for increasing the number of slots.

I committed something to do this for 13 (due out end of year), but I'm
not game to back-patch it to the release branches.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d061ea21fc1cc1c657bb5c742f5c4a1564e82ee2




Re: ERROR: too many dynamic shared memory segments

2020-01-30 Thread Thomas Munro
On Wed, Jan 29, 2020 at 11:53 PM Thomas Munro  wrote:
> On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu  wrote:
> > This is the error on postgres log of the segmentation fault :
> >
> > 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server process 
> > (PID 2042) was terminated by signal 11: Segmentation fault
> > 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed 
> > process was running: select pid from pg_stat_activity where query ilike 
> > 'REFRESH MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> > 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:  terminating any 
> > other active server processes

> That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
> to decide whether to pretend there are no slots free (see below), and
> I managed to make it crash in the regression tests when doing a
> parallel index build.  It's late here now, but I'll look into that
> tomorrow.  It's possible that the parallel index code needs to learn
> to cope with that.

Hi Nicola,

Without more information I can't know if I found the same bug you
experienced, but I think it's likely.  I have committed a fix for
that, which will be available in the next release (mid-February).

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=74618e77b43cfce670b4725d5b9a300a2afd12d1




Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Thomas Munro
On Wed, Jan 29, 2020 at 10:37 PM Nicola Contu  wrote:
> This is the error on postgres log of the segmentation fault :
>
> 2020-01-21 14:20:29 GMT [] [4]: [108-1] db=,user= LOG:  server process 
> (PID 2042) was terminated by signal 11: Segmentation fault
> 2020-01-21 14:20:29 GMT [] [4]: [109-1] db=,user= DETAIL:  Failed process 
> was running: select pid from pg_stat_activity where query ilike 'REFRESH 
> MATERIALIZED VIEW CONCURRENTLY matview_vrs_request_stats'
> 2020-01-21 14:20:29 GMT [] [4]: [110-1] db=,user= LOG:  terminating any 
> other active server processes

Ok, this is a bug.  Do you happen to have a core file?  I don't recall
where CentOS puts them.

> > If you're on Linux, you can probably see them with "ls /dev/shm".
>
> I see a lot of files there, and doing a cat they are empty. What can I do 
> with them?

Not much, but they tell you approximately how many 'slots' are in use
at a given time (ie because of currently running parallel queries),
assuming they were created since PostgreSQL started up (older ones
could have leaked from a crashed server, but we try to avoid that by
cleaning them up when you restart).

> Those are two different problems I guess, but they are related because right 
> before the Segmentation Fault I see a lot of shared segment errors in the 
> postgres log.

That gave me an idea...  I hacked my copy of PostgreSQL to flip a coin
to decide whether to pretend there are no slots free (see below), and
I managed to make it crash in the regression tests when doing a
parallel index build.  It's late here now, but I'll look into that
tomorrow.  It's possible that the parallel index code needs to learn
to cope with that.

#2  0x00a096f6 in SharedFileSetInit (fileset=0x80b2fe14c, seg=0x0) at sharedfileset.c:71
#3  0x00c72440 in tuplesort_initialize_shared (shared=0x80b2fe140, nWorkers=2, seg=0x0) at tuplesort.c:4341
#4  0x005ab405 in _bt_begin_parallel (buildstate=0x7fffc070, isconcurrent=false, request=1) at nbtsort.c:1402
#5  0x005aa7c7 in _bt_spools_heapscan (heap=0x801ddd7e8, index=0x801dddc18, buildstate=0x7fffc070, indexInfo=0x80b2b62d0) at nbtsort.c:396
#6  0x005aa695 in btbuild (heap=0x801ddd7e8, index=0x801dddc18, indexInfo=0x80b2b62d0) at nbtsort.c:328
#7  0x00645b5c in index_build (heapRelation=0x801ddd7e8, indexRelation=0x801dddc18, indexInfo=0x80b2b62d0, isreindex=false, parallel=true) at index.c:2879
#8  0x00643e5c in index_create (heapRelation=0x801ddd7e8, indexRelationName=0x7fffc510 "pg_toast_24587_index", indexRelationId=24603, parentIndexRelid=0,

I don't know if that's the bug that you're hitting, but it definitely
could be: REFRESH MATERIALIZED VIEW could be rebuilding an index.

===

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 90e0d739f8..f0b49d94ee 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -468,6 +468,13 @@ dsm_create(Size size, int flags)
 	nitems = dsm_control->nitems;
 	for (i = 0; i < nitems; ++i)
 	{
+		/* BEGIN HACK */
+		if (random() % 10 > 5)
+		{
+			nitems = dsm_control->maxitems;
+			break;
+		}
+		/* END HACK */
 		if (dsm_control->item[i].refcnt == 0)
 		{
 			dsm_control->item[i].handle = seg->handle;




Re: ERROR: too many dynamic shared memory segments

2020-01-29 Thread Thomas Munro
On Wed, Jan 22, 2020 at 4:06 AM Nicola Contu  wrote:
> after a few months, we started having this issue again.
> So we revert the work_mem parameter to 600MB instead of 2GB.
> But the issue is still there. A query went to segmentation fault, the DB went 
> to recovery mode and our app went to read only for a few minutes.

Hi Nicola,
Hmm, a segmentation fault sounds like a different problem.  Can you
please share the exact error messages from PostgreSQL and OS logs?

> I understand we can increase max_connections so we can have many more 
> segments.
>
> My question is : is there a way to understand the number of segments we 
> reached?

If you're on Linux, you can probably see them with "ls /dev/shm".

> Currently we have max_connections = 220, so with your formula of 64 + 2 * 220
> we have about 500 shared segments.
> We would like to increase that number to 300 or 400, but it would be great to
> understand if there is a way to make sure that will solve the issue, as it
> requires a restart of the service.
>
> I know you were also talking about a redesign this part in PostgreSQL. Do you 
> know if anything has changed in any of the newer versions after 11.5?

It's possible that we should increase a couple of constants used the
formula -- I'll look into that again.  But first I'd like to see if
we're even investigating the right problem here.




Re: Writing Postgres Extensions in C on Windows

2020-01-07 Thread Thomas Munro
On Wed, Jan 8, 2020 at 4:32 AM İlyas Derse  wrote:
> I want to register C code with PostgreSQL on Windows, so I think I have to
> write an extension for PostgreSQL. But I could not find how to write an
> extension on Windows. Do you have any ideas?

I don't do Windows myself but this blog from Craig Ringer looks like a
good starting point:

https://www.2ndquadrant.com/en/blog/compiling-postgresql-extensions-visual-studio-windows/




Re: Row locks, SKIP LOCKED, and transactions

2019-12-17 Thread Thomas Munro
On Wed, Dec 18, 2019 at 5:12 AM Steven Winfield
 wrote:
> * I observe this even if I crank up the transaction isolation level to 
> repeatable read and serializable.

Huh.  SERIALIZABLE shouldn't allow two transactions to see no result
row for a given ID and then insert a result row for that ID.  One of
those transactions should have to roll back, because otherwise it'd be
incompatible with both serial orderings of the two transactions.

> I'm wondering if row locks are not obeying the same transactional semantics 
> as row data,

They are indeed a bit weird.  They sometimes check whether the conditions
still apply (since the row might have changed between the scan and the
LockRows node), which leads to some interesting effects, but only if
the row being locked was concurrently updated, and here that isn't the
case.  This is a source of a fair amount of confusion about FOR UPDATE
and joins/subselects.

> Perhaps this is a misuse of the locking system, since I'm locking a row "FOR 
> UPDATE" but not actually updating it, but as row locks are released at the 
> end of a transaction (according to the docs) then my expectation was for the 
> unlocking and the visibility of newly committed rows to be atomic.
> I've tried FOR NO KEY UPDATE too, without luck.
>
> If I'm doing something forbidden (and the docs say so) then I'd be grateful 
> if someone could point that out!

Conceptually, the thing you really need to lock for this to work is
the result row that isn't there yet, so that some overlapping
transaction doesn't try to lock the same absent thing.  Unfortunately,
our system for locking things that aren't there isn't there either.
Some articles on serializability talk about "materialising the
conflict", which means locking some other surrogate thing that
"covers" a gap you are interested in.  You might think the job row
would do the trick, but since we don't recheck the condition (that is,
recheck that there is no corresponding result because you don't update
the job row), no cigar.  You could also use plain old
pg_try_advisory_xact_lock(id), because it just locks integers, and
they're always there.
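For example, with a bigint job id (42 here standing in for the real
one), a worker can claim it like this before inserting the result; the
lock is released automatically at commit or rollback:

SELECT pg_try_advisory_xact_lock(42);  -- true means we own id 42 for this transaction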

SERIALIZABLE deals with that type of magic internally (it locks gaps
in key ranges by predicate-locking a physical btree or hash page that
you'd need to write on to insert a row with a matching key, which is
how it discovers a conflict between one transaction that went looking
for key=42 but didn't find it and another that later writes key=42),
but, as mentioned, SERIALIZABLE doesn't really allow concurrency with
this workload, and you specified that you wanted concurrency with SKIP
LOCKED (but I think you'd have the same problem without it; SKIP
LOCKED just gets you the wrong answer faster).

There are various ways you could deal with this, but I'd probably go
for a simple scheme where you only have to consult a single row to
know if you can claim it.  You could still put the results into a
separate table, but use job.state to find work, and set it to DONE
when you insert the result.  It may also be possible to add no new
columns but do a dummy update to the job row to get the join qual
rechecked, but I'm not sure if that'd work.  Another reason to add a
state column to the job table is so that you can put a conditional
index on it so you can find jobs to be done very quickly, if you're
not planning to remove the ones that are done.
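To make that concrete, here is a rough sketch of the single-table claim
scheme; the table and column names are invented, and "result" stands in
for your separate results table:

CREATE TABLE job (
    id      bigint PRIMARY KEY,
    state   text NOT NULL DEFAULT 'PENDING',   -- 'PENDING' or 'DONE'
    payload text
);

-- Conditional index so finding work stays fast even if done jobs are kept:
CREATE INDEX job_pending_idx ON job (id) WHERE state = 'PENDING';

-- Each worker, in a single transaction:
BEGIN;
SELECT id, payload
  FROM job
 WHERE state = 'PENDING'
 ORDER BY id
 LIMIT 1
   FOR UPDATE SKIP LOCKED;          -- claim one job; concurrent workers skip it
-- ... compute the result for the claimed id (say it was 42) ...
INSERT INTO result (job_id, value) VALUES (42, '...');
UPDATE job SET state = 'DONE' WHERE id = 42;
COMMIT;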




Re: When are Predicate Locks Freed

2019-12-12 Thread Thomas Munro
On Fri, Dec 13, 2019 at 5:00 AM Dave Halter  wrote:
> I have worked quite a bit with serializable transactions. I'm trying
> to understand when predicate locks are freed. I read the whole
> README-SSI [1], but I'm still not sure when a predicate lock gets
> dropped.
>
> What I learned from that README is that predicate locks are not freed
> at the moment where a transaction passes, but only at a later stage.
> This makes sense, because some concurrent transactions might be
> invalidated by the predicate locks that are not needed anymore for a
> transaction that was just committed.

Right.  Even though T1 has committed, T2 and T3 can create a
"dangerous cycle" by touching data that T1 accessed, meaning that they
can't all be allowed to commit because what they've seen isn't
consistent with any serial ordering.  So the ghost of T1 continues to
affect other transactions, and predicate.c needs to figure out the
correct poltergeist duration.

> What I don't understand is: We have only short transaction (<1s), but a
>
> select count(*) from pg_locks where mode = 'SIReadLock';
>
> would vary by a lot. It typically went up all the way to 300k. At this
> point it would drop to almost zero (always <10k) and rise again. This
> cycle pretty consistently happened every three minutes. (I think it
> looked a bit more like 2 minutes 50 seconds, but I wasn't measuring it
> exactly). Is there a "garbage collection" happening every few minutes?
> Since we scaled up our service by quite a bit, we increased
> `max_pred_locks_per_transaction`. Now we are at 1.0m to 1.3m
> predicate locks with the same ~3 minute cleanup period.

Garbage collection happens in ClearOldPredicateLocks() when a
SERIALIZABLE transaction ends.  This is triggered every time the
oldest serializable transaction finishes (according to the "xmin"):

/*
 * Check whether it's time to clean up old transactions. This can only be
 * done when the last serializable transaction with the oldest xmin among
 * serializable transactions completes.  We then find the "new oldest"
 * xmin and purge any transactions which finished before this transaction
 * was launched.
 */

> Does anybody have any insights in why this might be happening? I would
> also be very interested in good explanations about when predicate
> locks are actually freed. I tried reading the C code, but couldn't
> really figure out where that was happening.

Do you have long running serializable transactions?  They would
prevent cleanup of any number of short serializable transactions that
start later.  Can you tell us a bit about your workload, number of
concurrent queries, range of query durations?  Do you see a lot of
files under pg_serial (this is where finished transactions get spilled
to disk, and with the large numbers you're reporting, perhaps there
would be some of that happening too)?
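If it helps, something like this shows your oldest open transactions and
which backends currently hold the most SIReadLocks (bearing in mind, per
the above, that locks belonging to transactions that have already
committed can linger for a while):

SELECT pid, now() - xact_start AS xact_age, state, left(query, 60) AS query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
 ORDER BY xact_start
 LIMIT 10;

SELECT pid, count(*) AS sireadlocks
  FROM pg_locks
 WHERE mode = 'SIReadLock'
 GROUP BY pid
 ORDER BY count(*) DESC
 LIMIT 10;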




Re: tcp keep alive don't work when the backend is busy

2019-12-10 Thread Thomas Munro
On Wed, Dec 11, 2019 at 4:17 AM Fabio Ugo Venchiarutti
 wrote:
> On 10/12/2019 15:06, Tom Lane wrote:
> > Олег Самойлов  writes:
> >> According to the documentation
> >> https://www.postgresql.org/docs/12/runtime-config-connection.html
> >> A backend must check connection to the client by tcp_keepalive messages. 
> >> (Config option tcp_keepalives_idle).
> >
> >> But this is don't work if the backend is busy.
> >
> > You're reading something into the documentation that isn't there.
> >
> > The TCP keepalive mechanism is something that the OS does, independently
> > of backend processing.  The backend isn't going to notice loss of client
> > connection until it tries to read or write on the connection.
> >
> > If it were free to improve this, we might do so.  But it would be
> > very much not free.
>
> At what points does the backend bite the bullet to test the state of
> that file descriptor?
>
> I'd expect select() and poll() to return immediately when keepalive
> probes time out, so idling clients are covered (and that's the main use
> case); does any other code path go out of its way to ensure that there's
> still a client without actually needing to read()/write()/send()/recv()?
> (obviously at the cost you mentioned)

It has been proposed that busy backends should (optionally)
periodically try to do a MSG_PEEK so they can learn about a client
that has gone away some time before they eventually try to write:

https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru

More work is needed to move that forward, though.



