Re: "PANIC: could not open critical system index 2662" - twice

2023-07-13 Thread Andres Freund
Hi,

On 2023-06-19 10:04:35 +, Evgeny Morozov wrote:
> There haven't been any updates posted to
> https://www.postgresql.org/message-id/20230509040203.z6mvijumv7wxcuib%40awork3.anarazel.de
> so I just wanted to check if there is any update on the status of the
> patch? Can we expect it in PostgreSQL 15.4? Thanks.

I pushed the fixes to all branches just now. Thanks for the report!

Greetings,

Andres Freund




Re: Pg 16: will pg_dump & pg_restore be faster?

2023-05-31 Thread Andres Freund
Hi,

On 2023-05-30 21:13:08 -0400, Bruce Momjian wrote:
> On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote:
> > On Wed, 31 May 2023 at 08:54, Ron  wrote:
> > > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
> > > says "PostgreSQL 16 can also improve the performance of concurrent bulk
> > > loading of data using COPY up to 300%."
> > >
> > > Since pg_dump & pg_restore use COPY (or something very similar), will the
> > > speed increase translate to higher speeds for those utilities?
> > 
> > I think the improvements to relation extension only help when multiple
> > backends need to extend the relation at the same time.  pg_restore can
> > have multiple workers, but the tasks that each worker performs are
> > only divided as far as an entire table, i.e. 2 workers will never be
> > working on the same table at the same time. So there is no concurrency
> > in terms of 2 or more workers working on loading data into the same
> > table at the same time.
> > 
> > It might be an interesting project now that we have TidRange scans, to
> > have pg_dump split larger tables into chunks so that they can be
> > restored in parallel.
> 
> Uh, the release notes say:
> 
>   Allow more efficient addition of heap and index pages (Andres Freund)
> 
> There is no mention of concurrency being a requirement.  Is it wrong?  I
> think there was a question of whether you had to add _multiple_ blocks
> to get a benefit, not if concurrency was needed.  This email about the
> release notes didn't mention the concurrent requirement:

>   
> https://www.postgresql.org/message-id/20230521171341.jjxykfsefsek4kzj%40awork3.anarazel.de

There are multiple improvements that work together to get the overall
speedup. One part of that is the filesystem interactions, another is holding
the relation extension lock for a *much* shorter time. The former helps
regardless of concurrency, the latter only with concurrency.

Regards,

Andres




Re: Earliest streamed message from replication slot has LSN less than the `consistent_point`

2023-05-17 Thread Andres Freund
Hi,

On 2023-05-17 11:36:37 +0530, Shubham Dhama wrote:
> > consistent_point (text)
> > The WAL location at which the slot became consistent. This is the earliest 
> > location from which streaming can start on this replication slot.
> 
> But it doesn't seem true in my scenario. Here is how to reproduce,
> 1. Have parallel transactions on a database
> 2. Create a replication slot while the above step is ongoing.
> 3. Peek/get the replication messages from the slot and you will see
> the first (and a few more) messages' have LSN less than the
> `consistent_point`.
> 
> I'm attaching a sample schema and insert queries. After creating the
> schema please do the following,
> 
> Step 1: for j in {1..20}; do for i in {1..100}; do psql
> $CONNECTION_STRING -f ~/insert.sql; done & done
> 
> Step 2: select pg_create_logical_replication_slot('test_slot', 'wal2json');
> 
> Step 3: psql $CONNECTION_STRING -c "select
> pg_logical_slot_peek_changes('test_slot', null, null,
> 'format-version', '2', 'include-xids', 'true', 'include-schemas'
> ,'true', 'include-transaction','true', 'include-types', 'true');"
> >~/data/slot-data.txt
> 
> I am able to reproduce this in PG 14, and here is what I got,
> 
> all_types_2=> select pg_create_logical_replication_slot('hello3', 'wal2json');
>  pg_create_logical_replication_slot
> 
>  (hello3,4B/DB04C730)
> 
> and I've attached the replication data output to confirm my issue.

I'm somewhat confused - your output actually doesn't seem to contain changes
from before that point?

The earliest LSN in sort.txt is precisely 4B/DB04C730?

I also don't exactly know what wal2json even prints as an LSN - if you have an
actual problem, could you construct a test case that just uses test_decoding?
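
Something along these lines should do (untested, the slot name is just an
example):

  SELECT pg_create_logical_replication_slot('repro_slot', 'test_decoding');
  -- note the returned consistent_point, run the concurrent inserts, then:
  SELECT lsn, xid, data
  FROM pg_logical_slot_peek_changes('repro_slot', NULL, NULL);
  SELECT pg_drop_replication_slot('repro_slot');

If test_decoding shows every change at or after the consistent_point, the
problem is likely in what wal2json reports as the LSN.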

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-16 Thread Andres Freund
Hi,

On 2023-05-16 14:20:46 +, Evgeny Morozov wrote:
> On 9/05/2023 3:32 am, Andres Freund wrote:
> > Attached is a rough prototype of that idea (only using datconnlimit ==
> > -2 for now).
> > I guess we need to move this to -hackers. Perhaps I'll post subsequent
> > versions below
> > https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
> >  ?
> So now that a patch is in the works, can I drop the two corrupted
> databases?

Yes.


> Is there a workaround I can use to reduce the risk of running into this
> issue again until a patch is released? (Which I guess would be in August?)

Try to prevent the DROP DATABASE from getting cancelled :/. If you want to go
a bit further, you could rename the database to *_dropped before dropping it,
and then try to do the DROP DATABASE. That way you'd at least know that it's
corrupt because of a failed DROP DATABASE.
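
I.e. something like this, using one of the database names from your log as
an example:

  ALTER DATABASE test_behavior_638186279733138190
    RENAME TO test_behavior_638186279733138190_dropped;
  DROP DATABASE test_behavior_638186279733138190_dropped;

Any leftover *_dropped entry then at least tells you that its contents are
expected to be gone.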

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-08 17:46:37 -0700, Andres Freund wrote:
> My current gut feeling is that we should use datconnlimit == -2 to prevent
> connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
> column in 16, for both createdb() and dropdb().

Attached is a rough prototype of that idea (only using datconnlimit == -2 for
now).

This would need a fair bit more polish. The tests are crappy and output stuff
to stderr and don't validate enough. The error messages are bad. No docs
changes. More comments about why datconnlimit == -2 is used. Etc.

But I think it should be sufficient to discuss whether this is a viable path.


I guess we need to move this to -hackers. Perhaps I'll post subsequent
versions below
https://www.postgresql.org/message-id/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
 ?

Greetings,

Andres Freund
>From 5784ba0b21eaf05c2989c57ce1f73c13edf0111e Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 8 May 2023 18:28:33 -0700
Subject: [PATCH v1] Handle interrupted DROP DATABASE

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230509004637.cgvmfwrbht7xm...@awork3.anarazel.de
Backpatch:
---
 src/include/catalog/pg_database.h   |  2 +-
 src/backend/commands/dbcommands.c   | 50 ++---
 src/backend/utils/init/postinit.c   | 11 +
 src/test/recovery/meson.build   |  1 +
 src/test/recovery/t/037_invalid_database.pl | 48 
 5 files changed, 95 insertions(+), 17 deletions(-)
 create mode 100644 src/test/recovery/t/037_invalid_database.pl

diff --git a/src/include/catalog/pg_database.h b/src/include/catalog/pg_database.h
index d004f4dc8aa..5cd972da324 100644
--- a/src/include/catalog/pg_database.h
+++ b/src/include/catalog/pg_database.h
@@ -49,7 +49,7 @@ CATALOG(pg_database,1262,DatabaseRelationId) BKI_SHARED_RELATION BKI_ROWTYPE_OID
 	/* new connections allowed? */
 	bool		datallowconn;
 
-	/* max connections allowed (-1=no limit) */
+	/* max connections allowed (-1=no limit, -2=invalid database) */
 	int32		datconnlimit;
 
 	/* all Xids < this are frozen in this DB */
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff24..a0360d9ad80 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1569,6 +1569,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	bool		db_istemplate;
 	Relation	pgdbrel;
 	HeapTuple	tup;
+	Form_pg_database datform;
 	int			notherbackends;
 	int			npreparedxacts;
 	int			nslots,
@@ -1684,17 +1685,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 		dbname),
  errdetail_busy_db(notherbackends, npreparedxacts)));
 
-	/*
-	 * Remove the database's tuple from pg_database.
-	 */
-	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(db_id));
-	if (!HeapTupleIsValid(tup))
-		elog(ERROR, "cache lookup failed for database %u", db_id);
-
-	CatalogTupleDelete(pgdbrel, &tup->t_self);
-
-	ReleaseSysCache(tup);
-
 	/*
 	 * Delete any comments or security labels associated with the database.
 	 */
@@ -1711,6 +1701,31 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	dropDatabaseDependencies(db_id);
 
+	/*
+	 * Tell the cumulative stats system to forget it immediately, too.
+	 */
+	pgstat_drop_database(db_id);
+
+	tup = SearchSysCacheCopy1(DATABASEOID, ObjectIdGetDatum(db_id));
+	if (!HeapTupleIsValid(tup))
+		elog(ERROR, "cache lookup failed for database %u", db_id);
+	datform = (Form_pg_database) GETSTRUCT(tup);
+
+	/*
+	 * Subsequent actions are not transactional (consider
+	 * DropDatabaseBuffers() discarding modified buffers). But we might crash
+	 * or get interrupted below. To prevent accessing a database with invalid
+	 * contents, mark the database as invalid using an in-place update.
+	 */
+	datform->datconnlimit = -2;
+	heap_inplace_update(pgdbrel, tup);
+
+	/*
+	 * Also delete the tuple - transactionally. If this transaction commits,
+	 * the row will be gone, but if we fail, dropdb() can be invoked again.
+	 */
+	CatalogTupleDelete(pgdbrel, &tup->t_self);
+
 	/*
 	 * Drop db-specific replication slots.
 	 */
@@ -1723,11 +1738,6 @@ dropdb(const char *dbname, bool missing_ok, bool force)
 	 */
 	DropDatabaseBuffers(db_id);
 
-	/*
-	 * Tell the cumulative stats system to forget it immediately, too.
-	 */
-	pgstat_drop_database(db_id);
-
 	/*
 	 * Tell checkpointer to forget any pending fsync and unlink requests for
 	 * files in the database; else the fsyncs will fail at next checkpoint, or
@@ -2339,6 +2349,14 @@ AlterDatabase(ParseState *pstate, AlterDatabaseStmt *stmt, bool isTopLevel)
 	datform = (Form_pg_database) GETSTRUCT(tuple);
 	dboid = datform->oid;
 
+	if (datform->datconnlimit == -2)
+	{
+		ereport(FATAL,
+errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+errmsg("database \"%s\" is invalid", stmt->dbname),
+errdetail("Use DROP DATABASE to drop invalid dat

Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-08 14:04:00 -0700, Andres Freund wrote:
> But perhaps a similar approach could be the solution? My gut says that the
> rough direction might allow us to keep dropdb() a single transaction.

I started to hack on the basic approach of committing after the catalog
changes. But then I started wondering if we're not tackling this all wrong.


We don't actually want to drop the catalog contents early as that prevents the
database from being dropped again, while potentially leaving behind contents
in the case of a crash / interrupt.

Similarly, the fact that we only commit the transaction at the end of
createdb() leads to interrupts / crashes orphaning the contents of that
[partial] database.  We also have similar issues with movedb(), I think.


This is a non-transactional operation. I think we should copy the approach of
the CONCURRENTLY operations. Namely add a new column to pg_database,
indicating whether the database contents are valid. An invalid database can be
dropped, but not connected to.

Then we could have createdb() commit before starting to create the target
database directory (with invalid = true, of course). After all the filesystem
level stuff is complete, set invalid = false.

For dropping a database we'd use heap_inplace_update() to set invalid = true
just before the DropDatabaseBuffers(), preventing any connections after that
point.

Making movedb() safe is probably a bit harder - I think it'd temporarily
require two pg_database entries?


Of course we can't add a new column in the back branches. IIRC we had a
similar issue with CIC some point, and just ended up misusing some other
column for the backbranches?  We could e.g. use datconnlimit == -2 for that
purpose (but would need to make sure that ALTER DATABASE can't unset it).
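
With that hack, spotting databases left in the invalid state would just be:

  SELECT datname, datconnlimit
  FROM pg_database
  WHERE datconnlimit = -2;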


My current gut feeling is that we should use datconnlimit == -2 to prevent
connections after reaching DropDatabaseBuffers() in dropdb(), and use a new
column in 16, for both createdb() and dropdb().  In some ways handling
createdb() properly is a new feature, but it's also arguably a bug that we
leak the space - and I think the code will be better if we work on both
together.

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-08 20:27:14 +, Evgeny Morozov wrote:
> On 8/05/2023 9:47 pm, Andres Freund wrote:
> > Did you have any occasions where CREATE or DROP DATABASE was interrupted?
> > Either due the connection being terminated or a crash?
>
> I've uploaded an edited version of the PG log for the time as
> https://objective.realityexists.net/temp/log-extract-2023-05-02.txt
> (test_behavior_638186279733138190 and test_behavior_638186280406544656
> are the two DBs that got corrupted).
>
> I cannot see any crash in the test logs or the PG logs, but whether it
> was interrupted is less clear. I don't know whether the the tests ran
> successfully up to the point where they tried to drop the DBs (I've
> since added logging to show that next time), but DROP DATABASE did not
> return after 30 seconds and the client library (Npgsql) then tried to
> cancel the requests. We then tried to drop the DB again, with the same
> results in both cases. After the second attempts timed out we closed the
> connections anyway - so maybe that was the interruption?

Yep, that is the interruption.  I suspect that something was not processing
interrupts, which then led the WaitForProcSignalBarrier() in dropdb() to
block.

Are you using any extensions? Do you have any chance to figure out what
statements were running concurrently with the DROP DATABASE?
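
If it happens again, a snapshot like the following taken while the DROP
DATABASE is hanging would help:

  SELECT pid, datname, state, wait_event_type, wait_event, query
  FROM pg_stat_activity
  WHERE state <> 'idle';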


Seems we figured out what the problem is... Just need to figure out how to fix
it.

I think the minimal fix will entail moving the commit of the transaction into
dropdb(). We need to commit before executing DropDatabaseBuffers(), as the
database is inconsistent after that point.  Luckily DROP DATABASE already
can't be executed in a transaction, so that's easily possible.

That means we'd lose track of the files for the database in case of a crash,
but that's surely better than a corrupt database. And it's already the case
for CREATE DATABASE.


Additionally we probably need to prevent processing cancellations between
DropDatabaseBuffers() and remove_dbtablespaces(). But we can't just stop
accepting interrupts, because that'd break WaitForProcSignalBarrier(). Ah,
what fun.

Unfortunately QueryCancelHoldoffCount is not sufficient for this purpose - we
don't just need to prevent cancellations, we also can't accept terminations
either...

For a bit I thought we might be able to just reorder operations, moving the
WaitForProcSignalBarrier() earlier. Unfortunately I don't think that works,
because until DropDatabaseBuffers() has executed, backends might write such
buffers back, so we need to do WaitForProcSignalBarrier() after that.


I've toyed with the idea of splitting DropDatabaseBuffers() into two. In a
first pass, mark all buffers as IO_IN_PROGRESS (plus perhaps some other
flag). That'd prevent new IO from being started for those buffers. Then
WaitForProcSignalBarrier(), to make sure nobody has an open FD for those
files. Then actually drop all those buffers and unlink the files. That now can
happen with interrupts held, without any chance of being blocked, afaict.  In
case of being cancelled during WaitForProcSignalBarrier(), AbortBufferIO()
would remove IO_IN_PROGRESS, and everything would be fine.

I don't like the idea of WaitForProcSignalBarrier() while having buffers
marked as in-progress. I don't *immediately* see a deadlock danger, but I'd be
unsurprised if there were some.

But perhaps a similar approach could be the solution? My gut says that the
rough direction might allow us to keep dropdb() a single transaction.

Greetings,

Andres Freund




Re: "PANIC: could not open critical system index 2662" - twice

2023-05-08 Thread Andres Freund
Hi,

On 2023-05-07 16:10:28 +, Evgeny Morozov wrote:
> Yes, kind of. We have a test suite that creates one test DB and runs a
> bunch of tests on it. Two of these tests, however, create another DB
> each (also by cloning the same template DB) in order to test copying
> data between DBs. It's only these "extra" DBs that were corrupted, at
> least on this occasion.

Did you have any occasions where CREATE or DROP DATABASE was interrupted?
Either due the connection being terminated or a crash?

As described in
https://postgr.es/m/20230314174521.74jl6ffqsee5mtug%40awork3.anarazel.de
we don't handle that correctly for DROP DATABASE.

I think that might actually fit the symptoms - the DropDatabaseBuffers() will
throw away the dirty buffer contents from the WAL strategy CREATE DATABASE,
but if you then get cancelled at a point before all the files are removed, the
on-disk files with all-zeroes would remain.

Greetings,

Andres Freund




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-12-23 Thread Andres Freund
Hi,

On 2022-12-08 09:52:52 -0500, Kirk Wolak wrote:
> On Wed, Dec 7, 2022 at 2:18 PM Andres Freund  wrote:
> 
> > Hi,
> >
> > On 2022-11-28 17:22:19 -0500, Kirk Wolak wrote:
> > >   Thank you for this.  We have the build working with meson (w/o
> > readline).
> > > I was not able to get "readline" to be recognized by meson (I've attached
> > > the log file)
> > > It is "there" according to vcpkg.  (other projects have a .pc file this
> > one
> > > is missing?)
> >
> > The problem is likely indicated by this:
> > > env[PKG_CONFIG_PATH]:
> > c:/vcpkg/installed/x64-windows/tools/pkgconf/pkgconf.exe
> >
> > This needs to point to the directory containing the .pc files, not the
> > executable.

> Thanks..
>   The above, plus I ended up having to create the readline.pc files.

They should be automatically created by vcpkg.

Greetings,

Andres Freund




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-12-07 Thread Andres Freund
Hi,

On 2022-11-28 17:22:19 -0500, Kirk Wolak wrote:
>   Thank you for this.  We have the build working with meson (w/o readline).
> I was not able to get "readline" to be recognized by meson (I've attached
> the log file)
> It is "there" according to vcpkg.  (other projects have a .pc file this one
> is missing?)

The problem is likely indicated by this:
> env[PKG_CONFIG_PATH]: c:/vcpkg/installed/x64-windows/tools/pkgconf/pkgconf.exe

This needs to point to the directory containing the .pc files, not the
executable.

Greetings,

Andres Freund




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-11-23 Thread Andres Freund
Hi,

On 2022-11-23 20:55:18 -0500, Kirk Wolak wrote:
> Currently I have a lot of it working, but I need to partner up with some of
> the meson guys...
> if it is already close to working, a couple of my tweaks could accelerate
> it.

I just successfully compiled postgres with meson and readline, using msvc,
only tweaking psql_completion() contents.

The readline I used for building with msvc is from
vcpkg.exe install --triplet x64-windows readline-win32 pkgconf

I also installed icu zlib zstd libxml2 libxslt tcl that way, but shouldn't
matter here.

vcpkg is installed in c:/dev/vcpkg

Other steps:
# let meson know where to get pkgconf from, could also add it to PATH
$ENV:PKG_CONFIG="c:/dev/vcpkg/installed/x64-windows/tools/pkgconf/pkgconf.exe"

# specify location of pkgconfig files, so meson can get compilation flags
meson setup --buildtype debug 
-Dpkg_config_path=c:/dev/vcpkg/installed/x64-windows/debug/lib/pkgconfig ...

# when running tests, the readline dll needs to be somewhere in PATH
$ENV:PATH="$ENV:PATH;c:/dev/vcpkg/installed/x64-windows/debug/bin"


If you want to build an optimized postgres, you'd need to change the
pkg_config_path and PATH so it doesn't include debug/ (this is because of
needing to link to the same CRT).

Greetings,

Andres Freund




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-11-23 Thread Andres Freund
Hi,

On 2022-11-23 20:28:29 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-11-23 18:11:22 -0500, Tom Lane wrote:
> >> Huh ... do you recall the details?  Large as tab-complete is, it's
> >> far smaller than gram.y:
>
> > So it might just be that we need to split up that very long "else if" chain 
> > in
> > psql_completion().
>
> Could be, yeah.

It's not quite correct, but to test that I wrapped all the Matches("ALTER",
...) into an if (HeadMatches("ALTER")). That's enough to make the file compile
without any other changes.

The problem with that of course is that we then don't try the later
completions if there's no match for any of the Matches("ALTER", ...).

I think it'd be easier to deal with this if COMPLETE_WITH_* caused the
containing function to return. Then the completions wouldn't need to be in one
huge if-else if. Leaving msvc aside, that also seems nice for efficiency.


> Related point: I've wondered for awhile how long that
> chain can get before we start noticing performance issues on slower
> machines.

Same.


> Even when the standard is just human reaction time, there's
> a limit to how many cycles you can throw away.  So I'd like to see it
> refactored somehow to be a bit less stupid --- but I have a nagging
> feeling that we'd end up building some special-purpose program generator,
> which is something I've not wanted to do.  OTOH, in the best of all
> possible worlds such a tool might make it easier to add tab
> completions?

> (In the past I've fantasized about auto-generating tab completion
> logic from the backend grammar, but I fear it's just fantasy.
> The backend grammar isn't factored right, and too many of its names
> don't have clear traceability to what-kind-of-object-is-that.)

I've thought about a grammar based approach as well, but as you say, it's not
obvious how to make that work well.

I wonder if we could model the completions as something roughly akin to a
DFA. We don't even need to generate the real state machine at compile time,
it'd be fine to do it at psql startup.


But even just representing the current top-level conditions in an array, and
then having a few obvious optimizations when matching the input to the array,
should make it easy to beat the current approach. And it'd result in a much
smaller amount of .text.

There's a small number of else ifs that aren't just combinations of *Matches*
conditions, e.g. stuff like !ends_with(), but those could be dealt with via
optional callbacks.

I'm, very roughly, thinking of something like:

compl_match_rule match_rules[] = {
  {.match = Matches("CREATE"), .compl_func = create_command_generator},
  {.match = TailMatches("CREATE", "OR", "REPLACE"), .compl = {"FUNCTION", 
"..."}},
  ...
  {.match = MatchAnd(
HeadMatches("ALTER", "PUBLICATION", MatchAny),
TailMatches("WHERE")
   ),
   .compl = {")"}},
  ...
}

where of course Matches() etc wouldn't directly generate code, but evaluate to
a literal struct with const char* members for the different options etc.

I think this should make it easier to optimize evaluation. We could
e.g. require that the Matches() rules are sorted, allowing us to find the
appropriate Matches() / HeadMatches() starting with the word we're trying to
complete.

Greetings,

Andres Freund




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-11-23 Thread Andres Freund
Hi,

On 2022-11-23 18:11:22 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-11-21 10:58:06 -0500, Tom Lane wrote:
> >> It'd certainly be nice if we could use Readline on Windows.
> 
> > 2) The last time I checked, msvc couldn't preprocess tab-complete.c with
> >USE_READLINE defined, due to running into some preprocessor limits. We 
> > can
> >probably fix that on our end somehow.
> 
> Huh ... do you recall the details?  Large as tab-complete is, it's
> far smaller than gram.y:

I think the problem was that msvc couldn't quite deal with VA_ARGS_NARGS. Or
something like that.



Ah, yes:
../src/bin/psql/tab-complete.c(4064): fatal error C1026: parser stack overflow, 
program too complex

We discussed something around VA_ARGS_NARGS not working on windows:
https://postgr.es/m/f450fc57-a147-19d0-e50c-33571c52cc13%40postgrespro.ru
But now I don't immediately see the above error as being related.


If I remove all the completions after
/* Complete INSERT INTO with table names */
else if (TailMatches("INSERT", "INTO"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_updatables);

it builds with msvc as well. And when using a readline sourced from vcpkg, it
even links. Of course only a few of the completions work, given that I
truncated a whole lot of them away...


So it might just be that we need to split up that very long "else if" chain in
psql_completion().

Greetings,

Andres Freund




Re: Getting PSQL in Windows to support TAB/Autocomplete via modified readline...

2022-11-23 Thread Andres Freund
hi,

On 2022-11-21 10:58:06 -0500, Tom Lane wrote:
> Kirk Wolak  writes:
> >   We have our team of Windows developers, leveraging PSQL.  But honestly,
> > it feels crippled after using PSQL in Linux for any length of time.  Losing
> > auto-complete sucks (multi-line query recall/editing is lost as well).
>
> >   In researching this problem, it appears that the decision was made like
> > 17yrs ago, when windows did not have a realistic "terminal" type
> > interface.  Assuming we target Windows 8.1 or higher, I believe this goes
> > away.
>
> It'd certainly be nice if we could use Readline on Windows.

FWIW, when building postgres with meson and mingw, you end up with a working
readline today (provided readline is installed). That ended up working mostly
accidentally.  See here for a few more details:
https://www.postgresql.org/message-id/20220928022724.erzuk5v4ai4b53do%40awork3.anarazel.de

I think there might still be something weird with ctrl-c, but that's not
really related to readline.


Readline doesn't currently work when building msvc for two reasons:

1) Readline doesn't provide an import library. Mingw implements a workaround
   for that by basically generating one at link time, but msvc doesn't.

   An import library can fairly easily be generated. Or one could propose a
   patch to upstream readline to add support for generating one.

2) The last time I checked, msvc couldn't preprocess tab-complete.c with
   USE_READLINE defined, due to running into some preprocessor limits. We can
   probably fix that on our end somehow.


Greetings,

Andres Freund




Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-16 09:16:56 -0800, Andres Freund wrote:
> On 2022-11-15 13:23:56 +0100, klaus.mailingli...@pernau.at wrote:
> > Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. Kernel
> > is 5.15.0-52-generic.
> > 
> > We have not seen this with Ubuntu 18.04 and 20.04 (although we might not
> > have noticed it).
> 
> Did this start after upgrading to 22.04? Or after a certain kernel upgrade?
> 
> Do you use cgroups or such to limit memory usage of postgres?
> 
> It'd be helpful to see /proc/meminfo from one of the affected instances.

Another interesting thing would be to know the mount and file system options
for the FS that triggers the failures. E.g.
  tune2fs -l path/to/blockdev
and
  grep path/to/blockdev /proc/mounts

Greetings,

Andres Freund




Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-16 Thread Andres Freund
Hi,

On 2022-11-15 13:23:56 +0100, klaus.mailingli...@pernau.at wrote:
> Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. Kernel
> is 5.15.0-52-generic.
> 
> We have not seen this with Ubuntu 18.04 and 20.04 (although we might not
> have noticed it).

Did this start after upgrading to 22.04? Or after a certain kernel upgrade?

Do you use cgroups or such to limit memory usage of postgres?

It'd be helpful to see /proc/meminfo from one of the affected instances.

Greetings,

Andres Freund




Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal

2022-03-07 Thread Andres Freund
Hi,

On 2022-03-07 16:11:28 +0100, hubert depesz lubaczewski wrote:
> On Sun, Mar 06, 2022 at 11:10:00AM -0500, Tom Lane wrote:
> > > I tore these boxes down, so can't check immediately, but I think
> > > I remember that you're right - single-row queries didn't use JIT.
> 
> Got focal box up. Loaded schema for Pg.
> 
> Initially select didn't break anything, but when I tuned down
> jit_above_cost so that it will kick in - got fails immediately.

Could you set jit_debugging_support=on and show a backtrace with that?


> > FWIW, I went to the trouble of installing Ubuntu Focal on my
> > raspberry pi to see if I could duplicate this, and I couldn't.
> > However, what you get from a fresh install now is
> > $ dpkg -l | grep libllvm  
> > ii  libllvm10:arm64  1:10.0.0-4ubuntu1  arm64  Modular compiler and toolchain technologies, runtime library
> > not 9.0.1.  I also found that Fedora 31/aarch64 is still downloadable, and
> > that does contain LLVM 9 ... and I could not reproduce it there either.
> 
> Soo... plot thickens.
> 
> Looks like pg 12 supplied by pgdg required libllvm9:

What are the libllvm package versions on which you can reproduce the crash?


> Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb 
> to
> use/depend-on llvm10 would solve the problem, for now I'm not sure how to do
> it. Reached out to Christoph about it.

It'd require rebuilding.


It's possible that the problem is one of the patches applied by Ubuntu to llvm
and that's why others can't reproduce so far. There are quite a few...

Greetings,

Andres Freund




Re: bad JIT decision

2020-07-28 Thread Andres Freund
Hi,

On 2020-07-28 14:07:48 -0700, Andres Freund wrote:
> (I'm rebasing my tree that tries to reduce the overhead / allow caching
> / increase efficiency to current PG, but it's a fair bit of work)

FWIW, I created a demo workload for this, and repro'ed the issue with
that. Those improvements do make a very significant difference:

CREATE FUNCTION exec(text) returns text language plpgsql volatile
AS $f$
BEGIN
  EXECUTE $1;
  RETURN $1;
END;
$f$;
CREATE TABLE manypa(category text not null, data text not null) PARTITION BY 
LIST(category);
SELECT exec('CREATE TABLE manypa_'||g.i||' PARTITION OF manypa FOR VALUES 
IN('||g.i||')') FROM generate_series(1, 1000) g(i);
INSERT INTO manypa(category, data) VALUES('1', '1');

EXPLAIN ANALYZE SELECT * FROM manypa WHERE data <> '17' and data <> '15' and 
data <> '13' AND data <> '11' AND data <> '9' AND data <> '7' AND data <> '5' 
AND data <> '3' AND data <> '1';

Before:
Timing: Generation 335.345 ms, Inlining 51.025 ms, Optimization 11967.776 
ms, Emission 9201.499 ms, Total 21555.645 ms
IR size: unoptimized: 9022868 bytes, optimized: 6206368 bytes

After:
Timing: Generation 261.283 ms, Inlining 30.875 ms, Optimization 1671.969 
ms, Emission 18.557 ms, Total 1982.683 ms
IR size: unoptimized 8776100 bytes, optimized 115868 bytes

That obviously needs to be improved further, but it's already a lot
better. In particular after these changes the generated code could be
cached.


One thing that could make a huge difference here is to be able to
determine whether two expressions and/or tlists are equivalent
cheaply... I know that David has some need for that too.

Greetings,

Andres Freund




Re: bad JIT decision

2020-07-28 Thread Andres Freund
Hi,

On 2020-07-28 11:54:53 +1200, David Rowley wrote:
> Is there some reason that we can't consider jitting on a more granular
> basis?

There's a substantial "constant" overhead of doing JIT. And it's
nontrivial to determine which parts of the query should be JITed and
which not.


> To me, it seems wrong to have a jit cost per expression and
> demand that the plan cost > #nexprs * jit_expr_cost before we do jit
> on anything.  It'll make it pretty hard to predict when jit will occur
> and doing things like adding new partitions could suddenly cause jit
> to not enable for some query any more.

I think that's the right answer though:

> ISTM a more granular approach would be better. For example, for the
> expression we expect to evaluate once, there's likely little point in
> jitting it, but for the one on some other relation that has more rows,
> where we expect to evaluate it 1 billion times, there's likely good
> reason to jit that.  Wouldn't it be better to consider it at the
> RangeTblEntry level?

Because this'd still JIT if a query has 10k unconditional partition
accesses with the corresponding accesses, even if they're all just one
row?

(I'm rebasing my tree that tries to reduce the overhead / allow caching
/ increase efficiency to current PG, but it's a fair bit of work)

Greetings,

Andres Freund




Re: bad JIT decision

2020-07-28 Thread Andres Freund
Hi,

On 2020-07-27 19:02:56 -0400, Alvaro Herrera wrote:
> On 2020-Jul-27, Scott Ribe wrote:
> 
> > > On Jul 27, 2020, at 4:00 PM, Alvaro Herrera  
> > > wrote:
> > > 
> > > I don't quite understand why is it that a table with 1000 partitions
> > > means that JIT compiles the thing 1000 times.  Sure, it is possible that
> > > some partitions have a different column layout, but it seems an easy bet
> > > that most cases are going to have identical column layout, and so tuple
> > > deforming can be shared.  (I'm less sure about sharing a compile of an
> > > expression, since the varno would vary. But presumably there's a way to
> > > take the varno as an input value for the compiled expr too?)  Now I
> > > don't actually know how this works so please correct if I misunderstand
> > > it.
> > 
> > I'm guessing it's because of inlining. You could optimize a function
> > that takes parameters, no problem. But what's happening is inlining,
> > with parameters, then optimizing.

No, that's not what happens. The issue rather is that at execution time
there's simply nothing tying the partitioned parts of the query together
from the executor POV. Each table scan gets its own expressions to
evaluate quals etc. That's not a JIT specific thing, it's general.

Which then means a partitioned query with a projection and a where
clause applying on the partition level has > 2 expressions for each
partition. And they get a separate ExprState and get emitted separately.

One issue is that we don't take that into account for costing. The other
is the overhead, of course. Even when not JITed, that's a lot of work
that we don't actually need, except we don't know which partitions look
enough like others that we could reuse another expression.

One partial way to address this is to simply add a
LLVMAddMergeFunctionsPass() at the beginning of the optimization
pipeline. In my testing that can quite drastically cut down on
optimization time. But obviously solves the problem only to some degree,
since that's not free.


> Are you saying that if you crank jit_inline_above_cost beyond this
> query's total cost, the problem goes away?

FWIW, you can set the cost to -1 and it'll never inline.
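
E.g.:

  SET jit_inline_above_cost = -1;  -- never inline, but still JIT-compile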

Greetings,

Andres Freund




Re: bad JIT decision

2020-07-27 Thread Andres Freund
Hi,

On 2020-07-25 10:54:18 -0400, Tom Lane wrote:
> David Rowley  writes:
> > ... nested at the bottom level join, about 6 joins deep.  The lack of
> > any row being found results in upper level joins not having to do
> > anything, and the majority of the plan is (never executed).
> 
> On re-reading this, that last point struck me forcibly.  If most of
> the plan never gets executed, could we avoid compiling it?  That is,
> maybe JIT isn't JIT enough, and we should make compilation happen
> at first use of an expression not during executor startup.

That unfortunately has its own downsides, in that there's significant
overhead of emitting code multiple times. I suspect that taking the
cost of all the JIT emissions together into account is the more
promising approach.

Greetings,

Andres Freund




Re: bad JIT decision

2020-07-24 Thread Andres Freund
Hi,

On Fri, Jul 24, 2020, at 15:32, Scott Ribe wrote:
> > On Jul 24, 2020, at 4:26 PM, David Rowley  wrote:
> > 
> > It does not really take into account the cost of jitting.
> 
> That is what I was missing.
> 
> I read about JIT when 12 was pre-release; in re-reading after my post I 
> see that it does not attempt to estimate JIT cost. And in thinking 
> about it, I realize that would be next to impossible to anticipate how 
> expensive LLVM optimization was going to be.

We certainly can do better than now.

> In the case where a set of functions is replicated across partitions, 
> it would be possible to do them once, then project the cost of the 
> copies. 

Probably not - JITing functions separately is more expensive than doing them 
once... The bigger benefit there is to avoid optimizing functions that are 
likely to be the same.

> Perhaps for PG 14 as better support for the combination of JIT 
> optimization and highly-partitioned data ;-)

If I posted a few patches to test / address some of these issues, could you test 
them with your schema & queries?

Regards,

Andres




Re: bad JIT decision

2020-07-24 Thread Andres Freund
Hi,

On 2020-07-24 18:37:02 -0400, Tom Lane wrote:
> David Rowley  writes:
> > However, for now, you might just want to try raising various jit
> > thresholds so that it only is enabled for more expensive plans.
> 
> Yeah.  I'm fairly convinced that the v12 defaults are far too low,
> because we are constantly seeing complaints of this sort.

I think the issue is more that we need to take into account that the
overhead of JITing scales ~linearly with the number of JITed
expressions. And that's not done right now.  I've had a patch somewhere
that had a prototype implementation of changing the costing to be
#expressions * some_cost, and I think that's a lot more accurate.
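
For reference, raising the thresholds as suggested above looks like this
(the values are arbitrary examples, they can also go into postgresql.conf):

  SET jit_above_cost = 500000;
  SET jit_optimize_above_cost = 500000;
  SET jit_inline_above_cost = 1000000;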

Greetings,

Andres Freund




Re: Idle sessions keep deleted files open

2020-03-29 Thread Andres Freund
Hi,

On 2020-03-30 00:49:13 +0200, Peter J. Holzer wrote:
> I noticed this with PostgreSQL 11.7 on Linux:
> 
> I did a «vacuum full» on a table which had become quite bloated (size
> before the vacuum: 392 GB, after: 291 GB). I expected the disk usage to
> increase as the new table was written and then drop sharply after the
> vacuum was finished.
> 
> However, that didn't happen. While usage increased by about 300 GB
> during the vacuum, it stayed the same afterwards (as reported by «df»).
> 
> However, «du» reported about 400 GB less than «df». So apparently, files
> had been unlinked but were still in use by some process.
> 
> «lsof» confirmed this: There were some idle sessions which were keeping
> the files open. As soon as I got those sessions to execute some request,
> they closed the files, causing the disk space to be freed.
> 
> I think I noticed that before when dropping tables, but didn't draw the
> right conclusion at the time.
> 
> So, if you are reorganizing large tables, keep an eye on idle sessions -
> they may keep deleted files around for quite some time.

I think this is partially related to
https://postgr.es/m/20181003223725.elcu3t44fpd4lm56%40alap3.anarazel.de
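
If you want to hunt such sessions down from SQL rather than lsof, something
like this lists the long-idle ones (the one-hour threshold is arbitrary):

  SELECT pid, usename, state, state_change, query
  FROM pg_stat_activity
  WHERE state IN ('idle', 'idle in transaction')
    AND state_change < now() - interval '1 hour'
  ORDER BY state_change;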

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-28 Thread Andres Freund
Hi,

On 2020-03-28 17:47:19 +0900, Michael Paquier wrote:
> On Fri, Mar 27, 2020 at 05:10:03PM -0500, Justin King wrote:
> > This is encouraging. As I mentioned, we have a workaround in place for
> > the moment, but don't hesitate if you need anything else from me.
> > Thanks for jumping in on the thread, it was nice validation to know
> > that I wasn't the only one seeing the issue!
> 
> Yes.  We have been able to confirm that 2aa6e33 is a direct cause of
> your problem.

I assume you're still trying to track the actual cause of the problem
further?

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-26 10:43:36 +0900, Michael Paquier wrote:
> On Wed, Mar 25, 2020 at 10:39:17AM -0500, Justin King wrote:
> > Mar 25 14:48:26 cowtn postgres[39875]: [35298-1] 2020-03-25
> > 14:48:26.329 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_tablespace"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35299-1] 2020-03-25
> > 14:48:26.339 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_auth_members"
> > Mar 25 14:48:26 cowtn postgres[39875]: [35300-1] 2020-03-25
> > 14:48:26.350 GMT [39875] DEBUG:  skipping redundant vacuum to prevent
> > wraparound of table "postgres.pg_catalog.pg_replication_origin"

FWIW, this kind of thing is why I think the added skipping logic is a
bad idea. Silently skipping things like this (same with the "bogus"
logic in datfrozenxid computation) is dangerous. I think we should
seriously consider backing this change out.

And if not, then we should at least include enough detail in the message
to be able to debug this.


> >  postgres=# SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM
> > pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT
> > 1;
> > oid|age| relfrozenxid
> > ---+---+--
> >  pg_authid | 202793549 |   4284570172
> 
> Ugh.  I think that this is exactly the thing I was suspecting
> previously:
> - The database stats look sane.
> - The relation stats don't look good and visibly are put in such a
> state that only one type of jobs gets triggered (non-aggressive but
> anti-wraparound), which just keep being skipped and the relation stats
> don't get refreshed.  (Such autovacuum jobs should never happen and we
> have some underlying issues that will need separate care).

Hm. Why is this a state that is clearly invalid compared to pg_database?
Seems to precisely match

> postgres=# SELECT datname, age(datfrozenxid), datfrozenxid FROM
> pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;
>  datname  |age| datfrozenxid
> --+---+--
>  postgres | 202773709 |   4284570172

And why should this lead to anti-wraparound vacuums not happening? This
is older than the cutoff age?

xid 4284570172 having the age of 202 million xids suggests that
ReadNewTransactionId() is approx 192376585. Which comports with the log
saying: oldest xmin: 189591147.


Or are you saying that you conclude that the relcache entry is somehow
out of date? It sure is interesting that all of the tables that hit the
"skipping redundant vacuum" condition are shared tables.

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-24 Thread Andres Freund
Hi,

On 2020-03-24 15:12:38 +0900, Michael Paquier wrote:
> > Well, there's no logging of autovacuum launchers that don't do anything
> > due to the "skipping redundant" logic, with normal log level. If somehow
> > the horizon logic of autovacuum workers gets out of whack with what
> > vacuumlazy.c does, then you'd not get any vacuums. But a usage level
> > triggered analyze could still happen on such a table, I think.
> 
> What surprised me the most is that the same table happened to be
> analyzed again and again after the launcher began its blackout.

Well, if there's an issue with the "redundant" logic, that would be a
not too surprising outcome. It's quite plausible that one or two tables
in the database would get enough changes to occasionally need to be
analyzed. If the workload is steady, that could e.g. work out to every
~17 minutes. All tables that autovacuum thinks are not wraparound
threatened will be skipped, but ones that are will get both vacuum and
analyze queued. The redundant logic could then entirely skip all
vacuums - but there's no equivalent for analyze.

> > While looking at this issue I found a few problems, btw. That seems more
> > like a -hackers discussion, so I started:
> > https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de
> 
> Yes, let's discuss there.

Cool. Would also be good if you could expand on the thread introducing
the "redundant" logic.

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-24 14:26:06 +0900, Michael Paquier wrote:
> > Could you share what the config of the server was?
> 
> Nothing really fancy:
> - autovacuum_vacuum_cost_delay to 2ms (default of v12, but we used it
> in v11 as well).
> - autovacuum_naptime = 15s
> - autovacuum_max_workers = 6
> - log_autovacuum_min_duration = 0

Oh, so you're also involved in this? I'm starting to get a bit confused
as to who is reporting what.


> > Did you see any errors / fatals around the time autovacuum stopped
> > working?
> 
> Before going rogue (we are not sure if autovacuum didn't launch any
> workers or if the workers were spawned and exited early as we did not
> capture any worker information in pg_stat_activity), we saw a bunch of
> aggressive wraparound jobs.  Even after that, we have traces in the
> logs of one autovacuum analyze happening at equal interval of time (17
> minutes) on one single table, which is...  Er...  uncommon to say the
> least.

Well, there's no logging of autovacuum launchers that don't do anything
due to the "skipping redundant" logic, with normal log level. If somehow
the horizon logic of autovacuum workers gets out of whack with what
vacuumlazy.c does, then you'd not get any vacuums. But a usage level
triggered analyze could still happen on such a table, I think.

While looking at this issue I found a few problems, btw. That seems more
like a -hackers discussion, so I started:
https://postgr.es/m/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

I think I might just have figured out another one...

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 16:31:21 -0500, Justin King wrote:
> This is occurring in our environment right now (started about 30 min
> ago).  Here 's the latest logs (grepped by vacuum):
> 
> Mar 23 20:54:16 cowtn postgres[15569]: [12-1] 2020-03-23 20:54:16.542
> GMT [15569] LOG:  automatic vacuum of table "feedi.production.tita":
> index scans: 1
> Mar 23 20:54:27 cowtn postgres[15654]: [8-1] 2020-03-23 20:54:27.964
> GMT [15654] LOG:  automatic vacuum of table
> "feedi.production.distributed_virtual_schedule": index scans: 1

Hm, unfortunately you've cut off the details in the subsequent
lines. There's a few newlines in the output. Any chance you could
re-post with those included?


> > > > I wonder if what might be happening is that we're somehow missed/failed
> > > > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > > > table in the oldest database, but that is *NOT* the oldest table itself,
> > > > does the problem "resolve" itself?
> 
> postgres=# SELECT datname
> , age(datfrozenxid)
> , current_setting('autovacuum_freeze_max_age')
> FROM pg_database;
>   datname  |age| current_setting
> ---+---+-
>  postgres  | 202375735 | 2
>  template1 | 202345459 | 2
>  template0 | 132459914 | 2
>  feedi | 132459914 | 2
> (4 rows)

Can you show the oldest tables in 'feedi'? Or, hm, actually, could you
just post the result of all the queries from the "What is:" section in
https://postgr.es/m/20200323162303.s7ay5hjdvimtkz6u%40alap3.anarazel.de


> Since this is occurring right now, what else would be useful to
> capture?  You'd asked about a GDB -- do you want that of the main
> process or the autovac worker?

Unless you can give me gdb access directly, I don't yet have enough data
to suggest what exactly we would want to analyze with gdb in your case.


It'd be helpful if you could change log_min_messages to DEBUG1 and
reload the configuration (not restart!).
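
I.e. something like:

  ALTER SYSTEM SET log_min_messages = 'debug1';
  SELECT pg_reload_conf();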

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-23 Thread Andres Freund
rtied
> 266613-   avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
> 266614-   system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to 
> prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
> 266616-   pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped 
> frozen
> 266617-   tuples: 0 removed, 0 remain, 0 are dead but not yet removable, 
> oldest xmin: 209635934
> 266618-   buffer usage: 25 hits, 1 misses, 2 dirtied
> 266619-   avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
> 266620-   system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
> 266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to 
> prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
> 266622-   pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped 
> frozen
> 266623-   tuples: 0 removed, 0 remain, 0 are dead but not yet removable, 
> oldest xmin: 209635953
> [...]

Do you have any non-toast ones?


> > I wonder if what might be happening is that we're somehow missed/failed
> > to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> > table in the oldest database, but that is *NOT* the oldest table itself,
> > does the problem "resolve" itself?
> 
> I'll also add pg_class snapshot for next time we run the bench.  I'm not sure
> if we'll be able to catch the 1h45 interval when the system stays alive
> after the issue though.

Could you just script something to stop the benchmark once the disk is
90% full or so?

Did you see any errors / fatals around the time autovacuum stopped
working?

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-23 Thread Andres Freund
Hi,

On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > When we get into this state again, is there some other information
> > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > would be useful for folks here to help understand what is going on?
> > 
> > If it's actually stuck on a single table, and that table is not large,
> > it would be useful to get a backtrace with gdb.
> 
> FTR, we're facing a very similar issue at work (adding Michael and Kevin in
> Cc) during performance tests since a recent upgrade to pg12.
>
> What seems to be happening is that after reaching 200M transactions a first
> pass of autovacuum freeze is being run, bumping pg_database.datfrozenxid by ~800k
> (age(datfrozenxid) still being more than autovacuum_freeze_max_age
> afterwards).

If you have older transactions around that'd not be surprising. Do you
have autovacuum logging output for this case?


> After that point, all available information seems to indicate that no
> autovacuum worker is scheduled anymore:

Do you mean that this table doesn't get autovac'ed at all anymore, that
no table is getting autovac'd, or just that there's nothing further
increasing relfrozenxid for that table?

It sounds like:

> - log_autovacuum_min_duration is set to 0 and no activity is logged (while
>   having thousands of those per hour before that)

no table at all?


> - 15 min interval snapshot of pg_database and pg_class shows that
>   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and never
>   goes down

I assume you mean their age?

What is:
- datfrozenxid, age(datfrozenxid) for the oldest database
  SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY 
age(datfrozenxid) DESC LIMIT 1;
- relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
  SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE 
relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;
- Oldest backend xmin
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE 
backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
- oldest replication xmin:
  SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE 
backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
- oldest slot xmin:
  SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM 
pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY 
greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
- oldest prepared transaction (unfortunately xmin not available)
  SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY 
age(transaction) LIMIT 3;


> The same bench was run against pg11 many times and never triggered this issue.
> So far our best guess is a side effect of 2aa6e331ead7.

In that case you'd likely see DEBUG1 output, right? Did you try running
with that?


> Michael and I have been trying to reproduce this issue locally (drastically
> reducing the various freeze_age parameters) for hours, but no luck for now.

Have you considered using gdb to investigate?


> This is using a vanilla pg 12.1, with some OLTP workload.  The only possibly
> relevant configuration changes are quite aggressive autovacuum settings on 
> some
> tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum scale
> factor to 0, for both heap and toast).

That, uh, is not an insignificant set of changes for an autovac
scheduling issue.

It sounds like what might be happening is that you have something
holding back the "oldest needed transaction" horizon. Before
2aa6e331ead7, if the xmin horizon hasn't increased, there'd be repeated
vacuums unable to increase the horizon, but afterwards they'll all
immediately exit without any messages.

I wonder if what might be happening is that we somehow missed/failed
to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
table in the oldest database, but that is *NOT* the oldest table itself,
does the problem "resolve" itself?

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-20 Thread Andres Freund
Hi,

On 2020-03-20 12:42:31 -0500, Justin King wrote:
> When we get into this state again, is there some other information
> (other than what is in pg_stat_statement or pg_stat_activity) that
> would be useful for folks here to help understand what is going on?

If it's actually stuck on a single table, and that table is not large,
it would be useful to get a backtrace with gdb.

Greetings,

Andres Freund




Re: PG12 autovac issues

2020-03-19 Thread Andres Freund
Hi,

On 2020-03-19 10:23:48 -0500, Justin King wrote:
> > From a single stats snapshot we can't actually understand the actual xid
> > consumption - is it actually the xid usage that triggers the vacuums?
> 
> We have looked at this and the xid consumption averages around 1250
> xid/sec -- this is when we see the "aggressive" autovac kick off in
> the logs.  What I don't understand is why these xid's are being
> consumed at this rate on the databases with no activity (postgres,
> template1).

The xid counter is global across all databases.


> > What makes you think it is a problem that you have all these vacuums? If
> > you actually update that much, and you have indexes, you're going want a
> > lot of vacuums?

> I actually don't think there's a problem with the vacuums (I was
> mostly pointing out that they are very regular and not problematic).
> The main problem I am having is that something is causing the
> autovacuums to completely stop and require manual intervention to
> resume -- and it seems to be when the "postgres" or "template1"
> database hits the autovacuum_freeze_max_age.

Did you look at pg_stat_activity for those autovacuums to see whether
they're blocked on something?
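
E.g. something like this, while they're running:

  SELECT pid, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE backend_type = 'autovacuum worker';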


> > > What is interesting is that this happens with the 'postgres' and
> > > 'template1' databases as well and there is absolutely no activity in
> > > those databases.
> >
> > That's normal. They should be pretty darn quick in v12?
> 
> Yes, a manual VACUUM FREEZE of either database takes less than 1
> second -- which is why it's perplexing that the autovac starts and
> never seems to complete and prevents other autovacs from running.

One big difference between a manual VACUUM and autovacuum is that with
the default settings VACUUM is not throttled, but autovacuum is.

What are your vacuum_cost_delay, autovacuum_vacuum_cost_delay,
vacuum_cost_limit, autovacuum_vacuum_cost_limit, vacuum_cost_page_hit,
vacuum_cost_page_miss set to?
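
Easiest to grab them all in one go:

  SELECT name, setting, source
  FROM pg_settings
  WHERE name IN ('vacuum_cost_delay', 'autovacuum_vacuum_cost_delay',
                 'vacuum_cost_limit', 'autovacuum_vacuum_cost_limit',
                 'vacuum_cost_page_hit', 'vacuum_cost_page_miss');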

Greetings,

Andres Freund




Re: unexpected chunk number 2 (expected 0) for toast value ... in pg_toast_18536

2020-03-16 Thread Andres Freund
Hi, 

On March 16, 2020 1:22:18 PM PDT, Tom Lane  wrote:
>Andres Freund  writes:
>> On 2020-03-16 12:44:53 -0700, Andres Freund wrote:
>>> On 2020-03-15 20:11:18 -0400, Tom Lane wrote:
>>>> I wonder if we should change it to allow that when
>>>> allow_system_table_mods is true?  This isn't the first time we've
>>>> seen people need to be able to do surgery on a toast table.
>
>>> I'd be mildly in favor. But it's considerably more than just the
>>> executor check that'd need to change. We don't do the right thing for
>>> toast relations in plenty of places right now, because we just check for
>>> RELKIND_RELATION - which will break junkvars etc.
>
>> Hm, and I wonder if there could be problems with
>> HeapTupleSatisfiesToast() too? It doesn't really foresee much DML
>> being done.
>
>We've always allowed people to select from toast tables, so if there
>are planner or executor problems with that, I'd think they'd mostly be
>bugs that need fixed anyway.  Your point about HeapTupleSatisfiesToast
>is better though.

The logic to add/extract junkvars for updated/deleted tables, as well as other
parts of the modification code paths, wasn't exposed so far though.

I've tried allowing updates/deletes before (at least deletes are needed to e.g.
handle duplicate values), and I'm fairly confident that the junkvar issue is real.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: unexpected chunk number 2 (expected 0) for toast value ... in pg_toast_18536

2020-03-16 Thread Andres Freund
Hi,

On 2020-03-16 12:44:53 -0700, Andres Freund wrote:
> On 2020-03-15 20:11:18 -0400, Tom Lane wrote:
> > Unfortunately, it seems like you can't do that either, short of
> > hacking up the backend or writing some custom C code, because the
> > executor won't let you open a toast table as result relation :-(.
> > I wonder if we should change it to allow that when
> > allow_system_table_mods is true?  This isn't the first time we've
> > seen people need to be able to do surgery on a toast table.
> 
> I'd be mildly in favor. But it's considerably more than just the
> executor check that'd need to change. We don't do the right thing for toast
> relations in plenty of places right now, because we just check for
> RELKIND_RELATION - which will break junkvars etc.

Hm, and I wonder if there could be problems with
HeapTupleSatisfiesToast() too? It doesn't really foresee much DML being
done.

Greetings,

Andres Freund




Re: unexpected chunk number 2 (expected 0) for toast value ... in pg_toast_18536

2020-03-16 Thread Andres Freund
Hi,

On 2020-03-15 20:11:18 -0400, Tom Lane wrote:
> Unfortunately, it seems like you can't do that either, short of
> hacking up the backend or writing some custom C code, because the
> executor won't let you open a toast table as result relation :-(.
> I wonder if we should change it to allow that when
> allow_system_table_mods is true?  This isn't the first time we've
> seen people need to be able to do surgery on a toast table.

I'd be mildly in favor. But it's considerably more than just the
executor check that'd need to change. We don't do the right thing for toast
relations in plenty of places right now, because we just check for
RELKIND_RELATION - which will break junkvars etc.

Greetings,

Andres Freund




Re: unexpected chunk number 2 (expected 0) for toast value ... in pg_toast_18536

2020-03-16 Thread Andres Freund
Hi,

On 2020-03-15 19:23:49 +0100, Karsten Hilbert wrote:
> > /home/marc# pg_dump -p 5432 --username=gm-dbo --dbname=gnumed_v22 
> > --compress=0 --no-sync --format=custom --file=/dev/null
> > pg_dump: Dumping the contents of table »doc_obj« failed:
> > PQgetResult() failed.
> > pg_dump: Error message from server: ERROR:  unexpected chunk number 2
> > (expected 0) for toast value 99027 in pg_toast_18536
> > pg_dump: The command was: COPY blobs.doc_obj (pk, fk_doc, seq_idx,
> > comment, fk_intended_reviewer, data, filename) TO stdout;
> 
>   (to note: column "data" is of type BYTEA)
> 
> We have been able to identify the row (there may be more)
> in blobs.doc_obj which leads to the above error.
> 
>   blobs.doc_obj.pk -> 82224
> 
> We have ruled out (?) below-PG hardware problems by a
> successful run of:
> 
>   cp -rv —preserve=all /var/lib/postgresql/9.6  /tmp/

FWIW, I don't think that rules out hardware problems at all. In plenty
cases of corruption you can just end up with corrupted on-disk data
(swapped blocks, zeroed blocks, randomly different values ...).

But obviously it is not at all guaranteed that is the case. Could you
describe the "history" of the database? Replication set up, failovers,
etc?


> Now, what else can we try to address the problem short of
> doing the
> 
>   pg_dump --exclude-table-data=blobs.doc_obj
> 
>   judicious use of COPY-FROM-with-subselect from blobs.doc_obj
> 
>   restore
> 
> dance ?

A plpgsql function that returns the rows one-by-one and catches the
exception is probably your best bet.

It could roughly look something like:

CREATE OR REPLACE FUNCTION salvage(p_tblname regclass)
RETURNS SETOF text
LANGUAGE plpgsql AS
$$
DECLARE
    v_row record;
BEGIN
    FOR v_row IN EXECUTE 'SELECT * FROM '||p_tblname::text LOOP
        BEGIN
            -- casting the whole row to text forces detoasting
            RETURN NEXT v_row::text;
        EXCEPTION WHEN internal_error OR data_corrupted OR index_corrupted THEN
            -- add the pkey or something else identifying here
            RAISE NOTICE 'failed to return data';
        END;
    END LOOP;
END
$$;

should work. You can call it like
SELECT (salvaged_rec.rec).* FROM (SELECT salvaged_text::salvage_me FROM 
salvage('salvage_me') AS salvaged_text) AS salvaged_rec(rec)

Greetings,

Andres Freund




Re: Is it safe to rename an index through pg_class update?

2020-03-09 Thread Andres Freund
Hi,

On 2020-03-09 17:47:23 +0200, Kouber Saparev wrote:
> На пт, 6.03.2020 г. в 21:00 Andres Freund  написа:
> > On 2020-02-27 10:52:36 -0500, Tom Lane wrote:
> > > FWIW, I can't immediately think of a reason this would cause a problem,
> > > at least not on 9.4 and up which use MVCC catalog scans.  If you're
> > > really still on 9.3 then it's notably more risky.  In any case, I've
> > > not had any caffeine yet today, so this doesn't count for much.
> >
> > It likely could cause some problems if somebody concurrently executed
> > DDL affecting the same table. At least some "concurrently updated"
> > errors, and perhaps some worse ones.  I'd at least add an explicit LOCK
> > TABLE on the underlying table that prevents concurrent catalog
> > modifications.
> >
> 
> I am trying to escape the Access Exclusive lock over the table indeed,
> otherwise I would use the ALTER statement instead anyway, which makes a
> lock implicitly. Thanks for the responses.

You'd not have to take an ACCESS EXCLUSIVE. A lower level would
suffice, e.g. SHARE UPDATE EXCLUSIVE, which still allows data changes.


> There is nobody else doing DDLs except me - Mr. DBA, so I guess I am
> safe on this side. ;)

If autovacuum triggered a vacuum/analyze it would e.g. also try to
update pg_class.

Greetings,

Andres Freund




Re: Is it safe to rename an index through pg_class update?

2020-03-06 Thread Andres Freund
Hi,

On 2020-02-27 10:52:36 -0500, Tom Lane wrote:
> FWIW, I can't immediately think of a reason this would cause a problem,
> at least not on 9.4 and up which use MVCC catalog scans.  If you're
> really still on 9.3 then it's notably more risky.  In any case, I've
> not had any caffeine yet today, so this doesn't count for much.

It likely could cause some problems if somebody concurrently executed
DDL affecting the same table. At least some "concurrently updated"
errors, and perhaps some worse ones.  I'd at least add an explicit LOCK
TABLE on the underlying table that prevents concurrent catalog
modifications.

Greetings,

Andres Freund




Re: JIT on Windows with Postgres 12.1

2020-02-12 Thread Andres Freund
Hi,

On 2020-02-13 12:25:40 +0900, Michael Paquier wrote:
> On Wed, Feb 12, 2020 at 12:05:11PM -0800, Andres Freund wrote:
> > Yes, LLVM would work on windows. I'd not even be surprised if one could
> > make it work on windows already when using a mingw based build, instead
> > of msvc.
> 
> For MSVC, assuming that you have the proper dependencies for JIT
> compilation and LLVM available, and just for the reference if anybody
> is interested in making that happen (Tom and Andres are aware of that
> already), one would need to do roughly an update of src/tools/msvc/ to
> add a new option path to LLVM, and update install-windows.sgml to add
> the new optional dependency.  Note that including LLVM in the builds
> means adding conditionally the files part of src/backend/jit/llvm/,
> but one can take for example OpenSSL as a model of how it happens.
> That would take care of the build.

That's not really the hard part. The hard part is integrating the generation of
LLVM bitcode files into the buildsystem. At the absolute minimum
llvmjit_types.bc needs to be generated, but to be meaningfully supported
we'd need to generate all the .bc files we're generating in make-based
builds.

Greetings,

Andres Freund




Re: JIT on Windows with Postgres 12.1

2020-02-12 Thread Andres Freund
Hi,

On 2020-02-12 14:54:56 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-02-12 09:39:19 +0100, Josef Šimánek wrote:
> >> I'm not aware of any PG 12 Windows installer with JIT/LLVM enabled.
> 
> > It's not currently supported by the windows buildsystem. I don't
> > personally have the expertise to change that (I've not worked on windows
> > after ~2003). So somebody would have to step in to make that work,
> > first.
> 
> First you would need either a Windows-native LLVM, or a JIT provider
> based on Microsoft's compiler.  I don't know if the former exists
> (it would not be our job to create it).

Yes, LLVM would work on windows. I'd not even be surprised if one could
make it work on windows already when using a mingw based build, instead
of msvc.


> I'm not sure if the latter is possible, but for sure it'd be a large
> task if it is possible.

Yea, it'd be a lot of work to add a second provider. No idea if there's
any support for it around visual studio - but I also see no point going
there.

Greetings,

Andres Freund




Re: JIT on Windows with Postgres 12.1

2020-02-12 Thread Andres Freund
Hi,

On 2020-02-12 09:39:19 +0100, Josef Šimánek wrote:
> I'm not aware of any PG 12 Windows installer with JIT/LLVM enabled.

It's not currently supported by the windows buildsystem. I don't
personally have the expertise to change that (I've not worked on windows
after ~2003). So somebody would have to step in to make that work,
first.

Greetings,

Andres Freund




Re: Automatically parsing in-line composite types

2019-10-30 Thread Andres Freund
Hi,

On 2019-10-29 14:33:00 -0400, Tom Lane wrote:
> Mitar  writes:
> > I think RowDescription should be extended to provide full recursive
> > metadata about all data types. That would be the best way to do it.
> 
> [ shrug... ]  In a world where stability of the wire protocol were
> of zero value, maybe we would do that.  In the real world, don't
> hold your breath.

Hm. Wouldn't it be fairly easy to allow the client to specify how much
metadata they'd want? I.e. opt-in into getting more complete metadata?

Presumably a lot of clients/applications wouldn't want the server to do
the extra work / use bandwidth for the full details anyway, so making a
more expansive RowDescription be explicitly opt-in would be good, even
if there were zero compatibility concerns.

There's different ways we could do the opt-in. We could use the "_pq_."
startup option stuff to opt in, we could make it an optional parameter
to D messages (it'd be mildly hacky because unfortunately
describe_target is not a counted text), we could use an additional
describe_type etc...

Greetings,

Andres Freund




Re: EXPLAIN BUFFERS and I/O timing accounting questions

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-24 16:31:39 -0700, Maciek Sakrejda wrote:
> On Thu, Oct 24, 2019 at 2:25 PM Andres Freund  wrote:
> > Note that the buffer access stats do *not* count the number of distinct
> > buffers accessed, but that they are purely the number of buffer
> > accesses.
> 
> You mean, even within a single node? That is, if a node hits a block ten
> times, that counts as ten blocks hit? And if it reads a block and then
> needs it three more times, that's one read plus three hit?

Correct. It's basically the number of lookups in the buffer
pool. There are some nodes that will kind of repeatedly use the same buffer,
without increasing the count. E.g. a seqscan will keep the current page
pinned until all the tuples on the page have been returned.

Consider e.g. a nested loop indexscan - how would we determine whether
we've previously looked at a buffer within the indexscan, without
drastically increasing the resources used?


> > Do you have an example?
> 
> Sure, here's the "abridged" plan:
> 
> [{ "Plan": {
>   "Node Type": "Aggregate",
>   "Plan Rows": 1,
>   "Plan Width": 8,
>   "Total Cost": 26761745.14,
>   "Actual Rows": 1,
>   "I/O Read Time": 234129.299,
>   "I/O Write Time": 0,
>   "Local Hit Blocks": 0,
>   "Local Read Blocks": 0,
>   "Local Dirtied Blocks": 0,
>   "Local Written Blocks": 0,
>   "Shared Hit Blocks": 4847762,
>   "Shared Read Blocks": 1626312,
>   "Shared Dirtied Blocks": 541014,
>   "Shared Written Blocks": 0,
>   "Temp Read Blocks": 0,
>   "Temp Written Blocks": 4786,
>   "Plans": [
> {
>   "Node Type": "ModifyTable",
>   "Operation": "Delete",
>   "Parent Relationship": "InitPlan",
>   "Plan Rows": 13943446,
>   "Plan Width": 6,
>   "Total Cost": 25774594.63,
>   "Actual Rows": 2178416,
>   "I/O Read Time": 234129.299,
>   "I/O Write Time": 0,
>   "Local Hit Blocks": 0,
>   "Local Read Blocks": 0,
>   "Local Dirtied Blocks": 0,
>   "Local Written Blocks": 0,
>   "Shared Hit Blocks": 4847762,
>   "Shared Read Blocks": 1626312,
>   "Shared Dirtied Blocks": 541014,
>   "Shared Written Blocks": 0,
>   "Temp Read Blocks": 0,
>   "Temp Written Blocks": 0,
>   "Plans": ""
> },
...

I think this may be partially confusing due to the way the json output
looks. Which is so bad that it's imo fair to call it a bug.  Here's text
output to a similar-ish query:


Aggregate  (cost=112.50..112.51 rows=1 width=8) (actual time=35.893..35.894 
rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=6015 dirtied=15
  CTE foo
->  Delete on public.p  (cost=0.00..45.00 rows=3000 width=6) (actual 
time=0.235..28.239 rows=3000 loops=1)
  Output: p.data
  Delete on public.p
  Delete on public.c1
  Delete on public.c2
  Buffers: shared hit=6015 dirtied=15
  ->  Seq Scan on public.p  (cost=0.00..15.00 rows=1000 width=6) 
(actual time=0.161..1.375 rows=1000 loops=1)
Output: p.ctid
Buffers: shared hit=5 dirtied=5
  ->  Seq Scan on public.c1  (cost=0.00..15.00 rows=1000 width=6) 
(actual time=0.147..1.314 rows=1000 loops=1)
Output: c1.ctid
Buffers: shared hit=5 dirtied=5
  ->  Seq Scan on public.c2  (cost=0.00..15.00 rows=1000 width=6) 
(actual time=0.145..1.170 rows=1000 loops=1)
Output: c2.ctid
Buffers: shared hit=5 dirtied=5
  ->  CTE Scan on foo  (cost=0.00..60.00 rows=3000 width=0) (actual 
time=0.243..34.083 rows=3000 loops=1)
Output: foo.data
Buffers: shared hit=6015 dirtied=15
Planning Time: 0.508 ms
Execution Time: 36.512 ms

Note that the node below the Aggregate is actually the CTE, and that
the DELETEs are below that. But the json, slightly abbreviated,
looks like:

[
  {
"Plan": {
  "Node Type": "Aggregate",
  "Strategy": "Plain",
  "Shared Hit Blocks": 6015,
  "Shared Read Blocks": 0,
  "Shared Dirtied Blocks": 15,
  "Shared Written Blocks": 0,
  "Plans": [
{
  "Node Type": "ModifyTable",
  "Operation": "Delete",
  "Parent Relationship": "InitPlan",
  "Subplan Name": &q

Re: EXPLAIN BUFFERS and I/O timing accounting questions

2019-10-24 Thread Andres Freund
Hi,

On 2019-10-21 23:18:32 -0700, Maciek Sakrejda wrote:
> I ran across an EXPLAIN plan and had some questions about some of its
> details. The BUFFERS docs say
> 
> >The number of blocks shown for an upper-level node includes those used by
> all its child nodes.
> 
> I initially assumed this would be cumulative, but I realized it's probably
> not because some of the blocks affected by each child will actually
> overlap.

Note that the buffer access stats do *not* count the number of distinct
buffers accessed, but that they are purely the number of buffer
accesses.

It'd be really expensive to count the number of distinct buffers
accessed, although I guess one could make it only expensive by using
something like hyperloglog (although that will still be hard, due to
buffer replacement etc).


> But this particular plan has a Shared Hit Blocks at the root (an
> Aggregate) that is smaller than some of its children (three ModifyTables
> and a CTE Scan).

Do you have an example?  I assume what's going on is that the cost of
the CTE is actually attributed (in equal parts or something like that)
to all places using the CTE. Do the numbers add up if you just exclude
the CTE?


> This seems to contradict the documentation (since if
> children overlap fully in their buffers usage, the parent should still have
> a cost equal to the costliest child)--any idea what's up? I can send the
> whole plan (attached? inline? it's ~15kb) if that helps.

Or just relevant top-level excerpts.


> Also, a tangential question: why is the top-level structure of a JSON plan
> an array? I've only ever seen one root node with a Plan key there.

IIRC one can get multiple plans when there's a DO ALSO rule. There might
be other ways to get there too.
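
For instance, a contrived sketch (hypothetical tables) - a DO ALSO rule
yields two plan trees, and the JSON output becomes a two-element array:

CREATE TABLE t1 (id int);
CREATE TABLE t1_log (id int);

CREATE RULE t1_log_rule AS ON INSERT TO t1
    DO ALSO INSERT INTO t1_log VALUES (NEW.id);

-- the resulting JSON is an array with one element per plan tree
EXPLAIN (FORMAT JSON) INSERT INTO t1 VALUES (1);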

Greetings,

Andres Freund




Re: Segmentation fault with PG-12

2019-10-10 Thread Andres Freund
On 2019-10-10 09:05:06 +0200, Andreas Joseph Krogh wrote:
> (Tom: This mail is only viewable as text/html, so if you're reading the 
> text/plain version it will seem "hashed")

I'm totally not ok with that btw.




Re: Segmentation fault with PG-12

2019-10-09 Thread Andres Freund
On 2019-10-09 10:16:37 -0400, Tom Lane wrote:
> Andreas Joseph Krogh  writes:
> > Attached is output from "bt full". Is this helpful? 
> 
> Well, it shows that the failure is occurring while trying to evaluate
> a variable in a trigger's WHEN clause during
> "UPDATE origo_email_delivery SET folder_id=$1, created=$2\nWHERE entity_id IN 
> ($3)\nRETURNING entity_id"
> And I'd bet that the root cause is something to do with Andres' tuple slot
> work.  But (at least to my eye) it's not apparent exactly what's wrong.

It looks like this could "just" be another report of #16036, which was
already fixed in:

commit d986d4e87f61c68f52c68ebc274960dc664b7b4e
Author: Andres Freund 
Date:   2019-10-04 11:59:34 -0700

Fix crash caused by EPQ happening with a before update trigger present.


> This doesn't seem to correlate with your original report, btw,
> as that claimed the crash was during COMMIT.

That however, would be confusing, unless there's some deferred trigger
that causes another update, which then fires a before update trigger
causing the problem.

Greetings,

Andres Freund




Re: RowDescription message

2019-10-07 Thread Andres Freund
Hi,

On 2019-10-07 17:00:24 +0900, Tatsuo Ishii wrote:
> According to the manual of the RowDescription message
> https://www.postgresql.org/docs/12/protocol-message-formats.html
> 
> Specifies the number of fields in a row (can be zero).
> 
> Does 0 fields could actually happen?

Yes, e.g.:

SELECT;




Re: Postgres 12: backend crashes when creating non-deterministic collation

2019-10-04 Thread Andres Freund
On 2019-10-04 10:52:38 -0400, Tom Lane wrote:
> Thomas Kellerer  writes:
> > It also works on Windows when I specify "correct" locale names - the above 
> > seems to be an edge case.
> > Is it worth the effort to report that through the bug reporting form?
> 
> No, this thread is a sufficient report.  What *would* be a good use
> of time is to get a stack trace from the crash, if you can.

FWIW, 
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
might be helpful.




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
Hi,

On 2019-07-27 19:10:22 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Additionally we perhaps ought to just not use the send buffer when
> > internal_putbytes() is called with more data than can fit in the
> > buffer. We should fill it with as much data as fits in it (so the
> > pending data like the message header, or smaller previous messages, are
> > flushed out in the largest size), and then just call secure_write()
> > directly on the rest. It's not free to memcpy all that data around, when
> > we already have a buffer.
> 
> Maybe, but how often does a single putbytes call transfer more than
> 16K?

I don't think it's that rare. COPY produces entire rows and sends them
at once, printtup also does, walsender can send pretty large chunks? I
think with several columns after text conversion it's pretty easy to
exceed 16k, not even taking large toasted columns into account.


> (If you fill the existing buffer, but don't have a full bufferload
> left to transfer, I doubt you want to shove the fractional bufferload
> directly to the kernel.)  Perhaps this added complexity will pay for
> itself, but I don't think we should just assume that.

Yea, I'm not certain either. One way to deal with the partially filled
buffer issue would be to use sendmsg() - and have two iovs (one pointing
to the filled buffer, one to the actual data). Wonder if it'd be
worthwhile to do in more scenarios, to avoid unnecessarily copying
memory around.


> > While the receive side is statically allocated, I don't think it ends up
> > in the process image as-is - as the contents aren't initialized, it ends
> > up in .bss.
> 
> Right, but then we pay for COW when a child process first touches it,
> no?  Maybe the kernel is smart about pages that started as BSS, but
> I wouldn't bet on it.

Well, they'll not exist as pages at that point, because postmaster won't
have used the send buffer to a meaningful degree? And I think that's the
same for >4k/pagesize blocks with malloc.  I think there could be a
benefit if we started the buffer pretty small with malloc, and only went
up as needed.

Greetings,

Andres Freund




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
Hi,

On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It might be better to just use larger send sizes however. I think most
> > kernels are going to be better than us knowing how to chop up the send
> > size.

> Yeah.  The existing commentary about that is basically justifying 8K
> as being large enough to avoid performance issues; if somebody can
> show that that's not true, I wouldn't have any hesitation about
> kicking it up.

You think that unnecessary fragmentation, which I did show, isn't good
enough? That does have cost on the network level, even if it possibly
doesn't show up that much in timing.


I wonder if we ought to just query SO_SNDBUF/SO_RCVBUF or such, and use
those (although that's not quite perfect, because there's some added
overhead before data ends up in SNDBUF). Probably with some clamping, to
defend against a crazy sysadmin setting it extremely high.


Additionally we perhaps ought to just not use the send buffer when
internal_putbytes() is called with more data than can fit in the
buffer. We should fill it with as much data as fits in it (so the
pending data like the message header, or smaller previous messages, are
flushed out in the largest size), and then just call secure_write()
directly on the rest. It's not free to memcpy all that data around, when
we already have a buffer.


> (Might be worth malloc'ing it rather than having it as part of the
> static process image if we do so, but that's a trivial change.)

We already do for the send buffer, because we repalloc it in
socket_putmessage_noblock(). Oddly enough we never reduce its size
after that...

While the receive side is statically allocated, I don't think it ends up
in the process image as-is - as the contents aren't initialized, it ends
up in .bss.

Greetings,

Andres Freund




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
Hi,

On 2019-07-27 11:09:06 -0400, Tom Lane wrote:
> Artemiy Ryabinkov  writes:
> > Does it make sense to make this parameter configurable?
>
> Not without some proof that it makes a performance difference on
> common setups (which you've not provided).

I think us unnecessarily fragmenting into some smaller packets everytime
we send a full 8kB buffer, unless there's already network congestion, is
kind of evidence enough? The combination of a relatively small send
buffer + TCP_NODELAY isn't great.

I'm not quite sure what the smaller buffer is supposed to achieve, at
least these days. In blocking mode (emulated in PG code, using latches,
so we can accept interrupts) we'll always just loop back to another
send() in internal_flush(). In non-blocking mode, we'll fall out of the
loop as soon as the kernel didn't send any data. Isn't the outcome of
using such a small send buffer that we end up performing a) more
syscalls, which has gotten a lot worse in last two years due to all the
cpu vulnerability mitigations making syscalls a *lot* more epensive b)
unnecessary fragmentation?

The situation for receiving data is a bit different. For one, we don't
cause unnecessary fragmentation by using a buffer of a relatively
limited size. But more importantly, copying data into the buffer takes
time, and we could actually be responding to queries earlier in the
data. In contrast to the send case we don't loop around recv() until all
the data has been received.

I suspect we could still do with a bigger buffer, just to reduce the
number of syscalls in bulk loading cases, however.

Greetings,

Andres Freund




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why does the backend send buffer use exactly 8KB? 
> (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
> 
> 
> I had this question when I try to measure the speed of reading data. The
> bottleneck was a read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm, that network packets have size 8192 (https://pastebin.com/FD8abbiA)

Well, in most setups, you can't have that large frames. The most common
limit is 1500 +- some overheads. Using jumbo frames isn't that uncommon,
but it has enough problems that I don't think it's that widely used with
postgres.


> So, with well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes.



But the reason you're seeing 8192 sized packages isn't just that we have
an 8kb buffer, I think it's also that that we unconditionally set
TCP_NODELAY:

#ifdef  TCP_NODELAY
on = 1;
if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
   (char *) &on, sizeof(on)) < 0)
{
elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
return STATUS_ERROR;
}
#endif

With 8KB send size, we'll often unnecessarily send some smaller packets
(both for 1500 and 9000 MTUs), because 8kB doesn't neatly divide into
the MTU. Here's e.g. the ip packet sizes for a query returning maybe
18kB:

1500
1500
1500
1500
1500
1004
1500
1500
1500
1500
1500
1004
1500
414

the dips are because that's where our 8KB buffer + disabling nagle
implies a packet boundary.


I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that call) in cases we know
there's more data to send. Which we pretty much know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c


It might be better to just use larger send sizes however. I think most
kernels are going to be better than us knowing how to chop up the send
size. We're using much larger limits when sending data from the client
(no limit for !win32, 65k for windows), and I don't recall seeing any
problem reports about that.


OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packages regardless of the
send() size.


> Does it make sense to make this parameter configurable?

I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund




Re: LWLock optimization

2019-07-24 Thread Andres Freund
Hi,

On 2019-07-23 16:28:52 +, Alexander Pyhalov wrote:
> I was asked to bring up this topic on mailing lists after asking a question on 
> IRC. 
> 
> The issue is the following.
> On one of our servers (10.5), I see a lot of queries with 
> wait_event_type='LWLock', wait_event='lock_manager' 

Could you qualify this a bit more?  What does "a lot" mean, in
comparison to the total number of queries/session?


> This is a stat gathering/analyzing application with tables which have
> > 100 partitions.

Hm, what kind of partitioning scheme is this? The "new"-ish partitioning
support isn't yet always good at performing only the minimal amount of
metadata lookups (which in turn require locks). Especially in 10.

Are you using prepared statements?


> The queries itself are mostly simple (select by pkey, update by primary key 
> and so on).
> 
>  select count(*) from pg_locks shows about 40K  records (all granted) and 
> 
>  select count(*) from (select distinct 
> locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid
>  from pg_locks) 
> 
> is varying from about 1K to 25K (I haven't stored this value in monitoring 
> system, just observed manually).

I think to know what exactly is going on, we'd need a cpu profile, and
then perhaps a followup profile for also getting the callsites for lock
waits.  Could you take a perf profile?

https://wiki.postgresql.org/wiki/Profiling_with_perf

> Currently I was adviced to increase LOG2_NUM_LOCK_PARTITIONS at
> compile time. If it bothers us enough, we would likely do it, but I
> have a question, if this (or NUM_LOCK_PARTITIONS) value should be
> increased by default or set tunable?

I'm not quite sure what you're asking precisely here. Are you wondering
whether postgres should increase the default value of
LOG2_NUM_LOCK_PARTITIONS? And whether it should be
runtime configurable?  I'm fairly sure we don't want to make it runtime
configurable, that'd add overhead to some pretty central code.
Increasing the default would need a lot of benchmarks, to prove it
doesn't cause regressions in other common scenarios.

Greetings,

Andres Freund




Re: Corrupt index stopping autovacuum system wide

2019-07-17 Thread Andres Freund
Hi,

On 2019-07-17 13:27:23 -0400, Tom Lane wrote:
> My concern here is that if we have blinders on to the extent of only
> processing that one table or DB, we're unnecessarily allowing bloat to
> occur in other tables, and causing that missed vacuuming work to pile
> up so that there's more of it to be done once the breakage is cleared.

That a pretty common problem in the real world, with or without problems
like corrupted indexes. Autovacuum's scheduling is just too simplistic
to avoid that.

Common problems:

- If one table in one database has an xmin older than
  autovacuum_freeze_max_age, the autovacuum launcher continuously throws
  more workers at that database. Even though there's a lot of work to be
  done in other databases.

  That's because do_start_worker() ignores the "adl_next_worker"
  mechanism, and *always* starts more workers for the database with the
  oldest datfrozenxid (same with datminmxid), and because we explicitly
  don't start workers for other databases ("/* ignore not-at-risk DBs
  */").

  That's especially bad if that database has a longrunning transaction
  preventing cleanup from happening - the vacuuming will not be able to
  advance the horizon, and all others are starved.


- If there's a table "early in pg_class", and vacuum on it fails, it
  will prevent vacuuming other tables. It's pretty common for vacuum to
  fail often for a table, e.g. because it's sometimes exclusively locked,
  which then causes autovacuum to kill itself.  There's absolutely no
  mechanism for autovacuum workers to skip over that table for a while,
  leading to all other tables in the database not being vacuumed, unless
  there happens to be second worker in the database, while the first
  vacuum hasn't failed.

  This obviously also applies to the corrupted index case.

  The 'kill itself' behaviour is exacerbated by lazy_truncate_heap()'s
  exclusive lock - which will obviously trigger other backends to send
  cancellation requests. There's unfortunately a number of places where
  that leads us to just throw all the work done away, and not update
  pg_class.relfrozenxid/datfrozenxid


- Anti-wraparound vacuums are more impactful (don't cancel themselves
  upon lock conflicts, cause more IO, wait for cleanup locks), often
  emit scary messages ("oldest xmin is far in the past").  But we don't
  have *any* mechanism that avoids them for very common scenarios.

  E.g. for insert-mostly workloads, there'll never be enough dead tuples
  to cause a vacuum to happen before autovacuum_freeze_max_age is
  reached. That prevents vacuum_freeze_table_age from pre-empting the
  need to do an anti-wraparound vacuum, by increasing the xid horizon.

  We literally call anti-wraparound autovacuums "emergency" in the code,
  yet they're an almost inevitable part of running postgres.


- There's no meaningful in-database prioritization. Which means that
  autovacuum workers might happily vacuum a table that is just a bit over the
  thresholds, even though there are much worse tables around.

  Especially on busy and large databases that can lead to
  anti-wraparound-triggered workers effectively never getting to
  vacuuming tables above autovacuum_freeze_max_age, because tables
  earlier in pg_class are modified heavily enough that they have dead
  tuples above the thresholds by the time vacuum finishes.  To get to
  the table needing the anti-wraparound vacuum, a single worker needs to
  go through all tables preceding it in pg_class that need
  vacuuming (only skipping over ones that are concurrently vacuumed by
  somebody else, but not ones that have *recently* been vacuumed).

  I kinda forgot how bad this one was until looking at the code again.

And there are plenty more.


My impression is that these are really hard to fix unless we develop a
new scheduling approach. And that scheduling approach probably needs to
be more stateful than the current code.

IMO these are the main concerns for how work needs to be distributed:

- work in different databases needs to be scheduled in a smarter way, in
  particular anti-wraparound cannot simply cause only the database with
  the oldest datfrozenxid to be vacuumed until the wraparound
  "danger" is over (especially not if there's other database needing to
  be anti-wrap vacuumed)

- tables within a database need to be prioritized in a smarter way, so
  databases with a lot of bloat get vacuumed before ones with a lot less
  bloat, and similarly tables with the oldest horizon need to be
  vacuumed before ones with newer horizons, even if all of the tables
  are above the autovacuum thresholds.

- tables need to be prioritized across databases, to avoid problems like
  one currently vacuumed table causing unnecessary anti-wraparound
  workers to be launched for a database where they can't perform any
  work.

- there needs to be a separate rung between normal autovacuums and
  anti-wraparound vacuums. The inbetween level does 'aggressive'
  vacuuming (so it 

Re: perf tuning for 28 cores and 252GB RAM

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-18 17:13:20 +0100, Fabio Ugo Venchiarutti wrote:
> Does the backend mmap() data files when that's possible?

No. That doesn't allow us to control when data is written back to disk,
which is crucial for durability/consistency.


> I've heard the "use the page cache" suggestion before, from users and
> hackers alike, but I never quite heard a solid argument dismissing potential
> overhead-related ill effects of the seek() & read() syscalls if they're
> needed, especially on many random page fetches.

We don't issue seek() for reads anymore in 12, instead do a pread() (but
it's not a particularly meaningful performance improvement). The read
obviously has cost, especially with syscalls getting more and more
expensive due to the mitigation for intel vulnerabilities.

I'd say that a bigger factor than the overhead of the read itself is
that for many workloads we'll e.g. incur additional writes when s_b is
smaller, that the kernel has less information about when to discard
data, that the kernel pagecaches have some scalability issues (partially
due to their generality), and double buffering.


> Given that shmem-based shared_buffers are bound to be mapped into the
> backend's address space anyway, why isn't that considered always
> preferable/cheaper?

See e.g. my point in my previous email in this thread about
drop/truncate.


> I'm aware that there are other benefits in counting on the page cache (eg:
> staying hot in the face of a backend restart), however I'm considering
> performance in steady state here.

There's also the issue that using a large shared buffers setting means
that each process' page table gets bigger, unless you configure
huge_pages. Which one definitely should - but that's an additional
configuration step that requires superuser access on most operating
systems.
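
A sketch of that configuration step - the OS side (e.g. vm.nr_hugepages
on Linux) has to be set up separately, and a server restart is needed:

ALTER SYSTEM SET huge_pages = 'try';  -- or 'on' to fail hard if unavailable
SHOW huge_pages;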

Greetings,

Andres Freund




Re: perf tuning for 28 cores and 252GB RAM

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-18 12:33:30 -0400, Alvaro Herrera wrote:
> On 2019-Jun-18, Andres Freund wrote:
> 
> > On 2019-06-17 19:45:41 -0400, Jeff Janes wrote:
> > > If not, I would set the value small (say, 8GB) and let the OS do the
> > > heavy lifting of deciding what to keep in cache.
> > 
> > FWIW, in my opinion this is not a good idea in most cases. E.g. linux's
> > pagecache doesn't scale particularly gracefully to large amounts of
> > data, and it's decisions when to evict data aren't really better than
> > postgres'. And there's a significant potential for additional
> > unnecessary disk writes (because the kernel will flush dirty pagecache
> > buffers, and then we'll just re-issue many of those writes again).
> 
> One additional tuning point in this area is that you need to tweak the
> Linux flusher so that it starts writing earlier than what it does by
> default (by setting dirty_background_bytes to some reasonable value --
> as I recall it normally runs on the dirty_background_ratio setting,
> which means it scales up when you add RAM, which I'm not sure is really
> sensible since you really care about the disk I/O write rate rather than
> anything about the memory).  If you don't do that, the kernel can
> accumulate lots of memory to write before starting to write it, which is
> troublesome once it does.

I think that's less needed these days - by default postgres triggers the
backend to flush data when writes occur from bgwriter
(bgwriter_flush_after) or checkpointer (checkpoint_flush_after). And
IMO one should turn on the flushing by backends in most cases too
(e.g. backend_flush_after=2MB), unless it's a really latency/jitter
insensitive application, or storage is *REALLY* slow.

There's a few things we don't flush that we maybe should (file extension
writes, SLRUs), so it can still be sensible to tune
dirty_background_bytes. But that has the disadvantage of also affecting
temp file writes etc, which is usually not wanted.
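
As a sketch, the relevant settings can be checked and, for the backend
case, changed like this (superuser only; backend_flush_after is the one
I'd set to 2MB):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('bgwriter_flush_after', 'checkpoint_flush_after',
               'backend_flush_after', 'wal_writer_flush_after');

ALTER SYSTEM SET backend_flush_after = '2MB';
SELECT pg_reload_conf();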

Greetings,

Andres Freund




Re: perf tuning for 28 cores and 252GB RAM

2019-06-18 Thread Andres Freund
Hi,

On 2019-06-17 19:45:41 -0400, Jeff Janes wrote:
> If not, I would set the value small (say, 8GB) and let the OS do the
> heavy lifting of deciding what to keep in cache.

FWIW, in my opinion this is not a good idea in most cases. E.g. linux's
pagecache doesn't scale particularly gracefully to large amounts of
data, and its decisions about when to evict data aren't really better than
postgres'. And there's a significant potential for additional
unnecessary disk writes (because the kernel will flush dirty pagecache
buffers, and then we'll just re-issue many of those writes again).

It's a bit hard to be specific without knowing the workload, but my
guidance would be that if the data has some expected form of locality
(e.g. index lookups etc, rather than just sequentially scanning the
whole database) then sizing s_b for at least the amount of data likely
to be repeatedly accessed can be quite beneficial.
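
To gauge whether a given shared_buffers actually holds the hot working
set, the pg_buffercache extension can help; a rough sketch (assumes the
default 8kB block size, and scanning the whole buffer pool isn't free,
so don't run it in a tight loop):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase = (SELECT oid FROM pg_database
                      WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;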

If increasing s_b means that most writes can be issued by the
checkpointer rather than backends and bgwriter, the generated IO pattern
is *far* superior since 9.6 (as checkpointer writes are sorted, whereas
bgwriter/backend writes aren't to a meaningful degree).

The one big exception is if the workload frequently needs to
drop/truncate non-temporary tables. There we currently need to linearly
search shared_buffers, which, although the constants are fairly small,
obviously means that drops/truncations get noticeably slower with a larger
shared_buffers.

- Andres




Re: checkpoints taking much longer than expected

2019-06-17 Thread Andres Freund
On 2019-06-16 12:25:58 -0400, Jeff Janes wrote:
> Right, but true only because they were "checkpoint starting: immediate".
> Otherwise the reported write time includes intentional sleeps added to
> honor the checkpoint_completion_target.  A bit confusing to report it that
> way, I think.

+1

It's even worse than that, actually. We also don't sleep if the *next*
requested checkpoint is an immediate one:

 * Perform the usual duties and take a nap, unless we're behind 
schedule,
 * in which case we just try to catch up as quickly as possible.
 */
if (!(flags & CHECKPOINT_IMMEDIATE) &&
!shutdown_requested &&
!ImmediateCheckpointRequested() &&
IsCheckpointOnSchedule(progress))

/*
 * Returns true if an immediate checkpoint request is pending.  (Note that
 * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
 * there is one pending behind it.)
 */
static bool
ImmediateCheckpointRequested(void)

We ought to do better.

- Andres




Re: POSTGRES_FSM_RELATIONS CRITICAL: DB control fsm relations used: 79569 of 80000 (99%)

2019-05-27 Thread Andres Freund
Hi,

On 2019-05-24 06:57:52 +, Julie Nishimura wrote:
> PostgreSQL 8.2.15 (Greenplum Database 4.3.8.1 build 1)

Please note that a) 8.2 has *long* been unsupported. b) greenplum,
especially 8.2 based, is quite different from normal postgres, and not
developed by the postgresql development community.

Greetings,

Andres Freund




Re: pg_stat_user_tables.n_tup_ins empty for partitioned table

2019-05-22 Thread Andres Freund
Hi,

On 2019-05-22 10:08:44 +0200, Luca Ferrari wrote:
> I've got a table named "root", partitioned on a date field into years
> (e.g., "2018") and into months like "y2018m11" using range
> partitioning on PostgreSQL 11.2.
> Tuples are inserted into root with an INSERT...SELECT. I have
> performed an UPDATE due to a new column inserted in the parent table
> y2018.
> 
> Today, checking the statistics of the table, I found that no tuples
> were inserted:
> 
> testdb=> SELECT relname, seq_scan, idx_scan,
>  n_tup_ins, n_tup_del, n_tup_upd, n_tup_hot_upd,
>  n_live_tup, n_dead_tup,
>  last_vacuum, last_autovacuum,
>  last_analyze, last_autoanalyze
>  FROM pg_stat_user_tables;
> 
> -[ RECORD 12 ]---+--
> relname  | y2018m11
> seq_scan | 42172
> idx_scan |
> n_tup_ins| 0
> n_tup_del| 0
> n_tup_upd| 28191462
> n_tup_hot_upd| 86
> n_live_tup   | 14086279
> n_dead_tup   | 0
> last_vacuum  |
> last_autovacuum  | 2019-03-28 17:23:35.909943+01
> last_analyze |
> last_autoanalyze | 2019-03-28 17:25:12.773707+01
> 
> 
> I suspect the fact that n_tup_ins is 0 is due to the fact that I did
> insert the tuples into the parent y2018, but I would like to better
> understand: how can I have n_live_tup without having n_tup_ins?
> Moreover, I don't have any stats for the parent table root, nor for
> y2018.
> Some explanation/pointer is appreciated.

That clearly seems wrong. Could you try to build a small reproducer?
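
Something along these lines would be a starting point - a sketch with
made-up names, on 11:

CREATE TABLE root (d date, v int) PARTITION BY RANGE (d);
CREATE TABLE y2018 PARTITION OF root
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01')
    PARTITION BY RANGE (d);
CREATE TABLE y2018m11 PARTITION OF y2018
    FOR VALUES FROM ('2018-11-01') TO ('2018-12-01');

INSERT INTO root SELECT '2018-11-15', g FROM generate_series(1, 1000) g;
UPDATE root SET v = v + 1;

-- stats are reported asynchronously, so allow a moment before checking
SELECT relname, n_tup_ins, n_tup_upd, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('root', 'y2018', 'y2018m11');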

Greetings,

Andres Freund




Re: FATAL: SMgrRelation hashtable corrupted

2019-05-17 Thread Andres Freund
Hi,

On 2019-05-17 09:30:05 -0400, Tom Lane wrote:
> The good news is that the underlying ALTER TABLE bug is fixed in 11.3.
> The bad news is that your database is probably toast anyway --- an update
> won't undo the catalog corruption that is causing the WAL replay crash.
> I hope you have a recent backup to restore from.

Should there not be a backup, couldn't one weaken the error checks during
replay a bit (locally), to allow replay to progress? The indexes will be
toast, but it ought to allow recovering the table data completely.

Greetings,

Andres Freund




Re: Hot Standby Conflict on pg_attribute

2019-05-10 Thread Andres Freund
Hi,

On 2019-05-09 13:03:50 -0700, Erik Jones wrote:
> The question then is: Why would these user queries be waiting on an
> AccessShare lock on pg_attribute?  Thus far we've been unable to recreate
> any transacitons with the above query (and others) that show any
> pg_attribute locks.  There is no ORM in play here and these queries are
> being sent as single query transactions via this Node.js postgres adapter:
> https://github.com/brianc/node-postgres which is pretty bare bones.

Queries that access a table for the *first* time after DDL happened
(including truncating the relation), need an AccessShareLock on
pg_attribute (and pg_class, pg_index, ...) for a short time.

You can reproduce that fairly easily:

S1: CREATE TABLE foo();
S2: BEGIN; LOCK pg_attribute;
S1: SELECT * FROM foo;
S2: COMMIT;

S1 could execute the select, because it has a cached view of the way the
relation looks.

S2: ALTER TABLE foo ADD COLUMN bar INT;
S2: BEGIN; LOCK pg_attribute;
S1: SELECT * FROM foo;

Here S1 is blocked, because it needs to look at pg_attribute to figure
out the "shape" of the table, but it's currently locked.

Greetings,

Andres Freund




Re: Connecting to NOTIFY with telnet

2019-05-03 Thread Andres Freund
Hi,

On 2019-05-03 11:06:09 -0700, Igal Sapir wrote:
> Is it possible to connect to Postgres for notifications via telnet?  This
> is obviously more for learning/experimenting purposes.

No. The protocol is too complicated to make that realistically doable /
useful.


> I expected a simple way to connect and consume notifications but can not
> find any example or documentation on how to do that.

If you really wanted to go down that road, you'd have to read the
protocol specs. It'd not make sense to document how-to steps for
something as low-level as this.

Greetings,

Andres Freund




Re: multiple indexes on the same column

2019-04-12 Thread Andres Freund
Hi,

On 2019-04-12 09:51:51 -0400, Tom Lane wrote:
> Tiffany Thang  writes:
> > Can someone explain the use of creating multiple indexes on the same
> > column?
> 
> There is none, unless the indexes have different properties (e.g.
> different opclasses and/or index AMs).

Well, it can be beneficial to create a new index concurrently, and then
drop the old one concurrently. Before v12 that's the only way to
recreate an index in production, if it has e.g. become bloated.
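
A sketch of that pattern, with placeholder names (on v12 REINDEX
CONCURRENTLY does the same in one step):

-- build a replacement index without blocking writes
CREATE INDEX CONCURRENTLY my_index_new ON my_table (my_column);

-- drop the bloated one and take over its name
DROP INDEX CONCURRENTLY my_index;
ALTER INDEX my_index_new RENAME TO my_index;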

Greetings,

Andres Freund




Re: shared_buffers on Big RAM systems

2019-04-11 Thread Andres Freund
Hi,

On 2019-04-11 15:39:15 -0400, Jeff Janes wrote:
> But I don't think I would recommend starting at 25% of RAM larger server.
> Is that really good advice?  I would usually start out at 1GB even if the
> server has 128GB, and increase it only if there was evidence it needed to
> be increased.  Due to double buffering between shared_buffers and OS cache,
> 25% seems like a lot of wasted space.  You need shared_buffers as a cooling
> off tank where dirty data can wait for their corresponding WAL to get
> flushed in the background before they get written out themselves.  I think
> 1GB is enough for this, even if you have 128GB of RAM.

That runs very much contrary to my experience. If you actually get
writes into your cluster, having a small shared_buffers will create a
vastly larger amount of total writes, because every time a page is
evicted from shared buffers, it'll shortly afterwards be written out to
disk by the OS. That would not happen if the page had stayed in shared
buffers.

Due to checkpoint sorting (~9.6?) writes from checkpointer are also
vastly more efficient than either bgwriter triggered, or backend
triggered writes, because it's much more likely that the OS / IO stack
will write combine them.

I think with the exception of workloads that have a lot of truncations
(e.g. tests that create/drop schemas), which are slow due to the implied
shared buffer scan, a lot of the problems with large shared buffers have
been fixed. Far from perfect, of course (i.e. the double buffering you
mention).

Greetings,

Andres Freund




Re: assembling PGresults from multiple simultaneous queries (libpq, singlerowmode)

2019-04-07 Thread Andres Freund
Hi,

On postgres mailing lists please don't write your reply at the top of a
fully quoted email. We like the reply to be inline and trimmed to the
necessary parts.

On 2019-04-07 13:28:46 -0700, Konstantin Izmailov wrote:
> Yes, Andres, I meant "pipelining", just couldn't choose correct word. Thank
> you for the answer(s)!
> 
> I also made changes in my own copy of libpq, and they work fine. I think
> the pipelining support is needed in libpq. Btw, how can I get the patch
> code? I want to compare your approach with mine. I couldn't figure out how
> to get the patch from the link.

Hm, odd. There's a link on the page "Latest attachment" - but for
unknown reasons that's broken. I've attached it for now, but will also
inquire with the webadmin team about what's up.

Greetings,

Andres Freund
>From ba93ae02eca024997f2ce6a9c2c2987aea4a77b8 Mon Sep 17 00:00:00 2001
From: Prabakaran 
Date: Fri, 12 Jan 2018 10:09:09 +1100
Subject: [PATCH] Pipelining-batch-support-for-libpq-code-v16

---
 doc/src/sgml/libpq.sgml| 502 +
 doc/src/sgml/lobj.sgml |   4 +
 .../libpqwalreceiver/libpqwalreceiver.c|   3 +
 src/interfaces/libpq/exports.txt   |   5 +
 src/interfaces/libpq/fe-connect.c  |  28 +
 src/interfaces/libpq/fe-exec.c | 595 +++--
 src/interfaces/libpq/fe-protocol2.c|   6 +
 src/interfaces/libpq/fe-protocol3.c|  15 +-
 src/interfaces/libpq/libpq-fe.h|  24 +-
 src/interfaces/libpq/libpq-int.h   |  47 +-
 10 files changed, 1186 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 4e46451..6aae637 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -4752,6 +4752,500 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Batch mode and query pipelining
+
+  
+   libpq
+   batch mode
+  
+
+  
+   libpq
+   pipelining
+  
+
+  
+   libpq supports queueing up queries into
+   a pipeline to be executed as a batch on the server. Batching queries allows
+   applications to avoid a client/server round-trip after each query to get
+   the results before issuing the next query.
+  
+
+  
+   When to use batching
+
+   
+Much like asynchronous query mode, there is no performance disadvantage to
+using batching and pipelining. It increases client application complexity
+and extra caution is required to prevent client/server deadlocks but
+can sometimes offer considerable performance improvements.
+   
+
+   
+Batching is most useful when the server is distant, i.e. network latency
+(ping time) is high, and when many small operations are being performed in
+rapid sequence. There is usually less benefit in using batches when each
+query takes many multiples of the client/server round-trip time to execute.
+A 100-statement operation run on a server 300ms round-trip-time away would take
+30 seconds in network latency alone without batching; with batching it may spend
+as little as 0.3s waiting for results from the server.
+   
+
+   
+Use batches when your application does lots of small
+INSERT, UPDATE and
+DELETE operations that can't easily be transformed into
+operations on sets or into a
+COPY operation.
+   
+
+   
+Batching is not useful when information from one operation is required by the
+client before it knows enough to send the next operation. The client must
+introduce a synchronisation point and wait for a full client/server
+round-trip to get the results it needs. However, it's often possible to
+adjust the client design to exchange the required information server-side.
+Read-modify-write cycles are especially good candidates; for example:
+
+ BEGIN;
+ SELECT x FROM mytable WHERE id = 42 FOR UPDATE;
+ -- result: x=2
+ -- client adds 1 to x:
+ UPDATE mytable SET x = 3 WHERE id = 42;
+ COMMIT;
+
+could be much more efficiently done with:
+
+ UPDATE mytable SET x = x + 1 WHERE id = 42;
+
+   
+
+   
+
+ The batch API was introduced in PostgreSQL 10.0, but clients using the PostgreSQL 10.0 version of libpq can
+ use batches on server versions 7.4 and newer. Batching works on any server
+ that supports the v3 extended query protocol.
+
+   
+
+  
+
+  
+   Using batch mode
+
+   
+To issue batches the application must switch
+a connection into batch mode. Enter batch mode with PQenterBatchMode(conn) or test
+whether batch mode is active with PQbatchStatus(conn). In batch mode only asynchronous operations are permitted, and
+COPY is not recommended as it most likely will trigger failure in batch processing. 
+Using any synchronous command execution functions such as PQfn,
+PQexec or one of its sibling functions are error conditions.
+F

Re: assembling PGresults from multiple simultaneous queries (libpq, singlerowmode)

2019-04-07 Thread Andres Freund
Hi,

On 2019-04-07 20:57:56 +0200, Pavel Stehule wrote:
> On Sun, 7 Apr 2019 at 20:47, Konstantin Izmailov wrote:
> 
> > Hi,
> > I'm experimenting with Postgres 10 and protocol v3. I noticed that the
> > Postgres allows executing multiple queries simultaneously (I basically
> > commented out a check that prevents sending another query in libpq while
> > previous result(s) reading is not complete). Things appear like working,
> > but I just wanted to ask if anyone else tried the same (logically separate
> > results from multiple simultaneous queries)?
> >
> >
> Postgres cannot execute simultaneous queries in one session. So queries
> should be executed serially every time.

I think what Konstantin is really talking about is pipelining
(i.e. sending multiple queries without waiting for the results
inbetween, thereby reducing latency), and that is actually supported by
postgres. Some drivers make fairly extensive use of it (e.g. the pgjdbc
driver).

As for libpq: No, that's currently not supported. There is a patch that
I hope to get into v13 however: https://commitfest.postgresql.org/20/1317

Greetings,

Andres Freund




Re: CVE-2019-9193 about COPY FROM/TO PROGRAM

2019-04-04 Thread Andres Freund
Hi

On 2019-04-04 21:50:41 +0200, Magnus Hagander wrote:
> On Thu, Apr 4, 2019 at 9:45 PM Tom Lane  wrote:
> 
> > Jeremy Schneider  writes:
> > > I'm all for having clear documentation about the security model in
> > > PostgreSQL, but I personally wouldn't be in favor of adding extra
> > > wording to the docs just to pacify concerns about a CVE which may have
> > > been erroneously granted by an assigning authority, who possibly should
> > > have done better due diligence reviewing the content. Particularly if
> > > there's any possibility that the decision to assign the number can be
> > > appealed/changed, though admittedly I know very little about the CVE
> > > process.
> >
> > Just FYI, we have filed a dispute with Mitre about the CVE, and also
> > reached out to trustwave to try to find out why they filed the CVE
> > despite the earlier private discussion.
> >
> 
> The original author has also pretty much acknowledged in comments on his
> blog and on twitter that it's not actually a vulnerability. (He doesn't
> agree with the design decision, which is apparently enough for a high
> scoring CVE registration).

Btw, the xp_cmdshell thing the author references several times?
It can be enabled via tsql if you have a privileged account.

https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/xp-cmdshell-server-configuration-option?view=sql-server-2017

and it allows to execute shell code (as a specified user) even when not
a sysadmin:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/xp-cmdshell-transact-sql?view=sql-server-2017#xp_cmdshell-proxy-account

Greetings,

Andres Freund




Re: PostgreSQL Windows 2019 support ?

2019-04-03 Thread Andres Freund
Hi,

On 2019-04-03 12:43:25 -0400, Tom Lane wrote:
> david moloney  writes:
> > Is PostgreSQL 10+ supported on windows 2019 ?
> 
> It is not.  Considering that VS2019 was released yesterday,
> you should have been surprised to get any other answer.

I don't think VS2019 and Windows 2019 are the same thing... And the
latter has been out for longer than yesterday...  I don't know if
anybody has done rigorous testing on it however.

I'd be somewhat surprised if it didn't just work however.

Greetings,

Andres Freund




Re: New LLVM JIT Features

2019-04-02 Thread Andres Freund
On 2019-04-03 10:44:06 +0530, preejackie wrote:
> Hi Andres,
> 
> Thanks for the reply! Please see my comments inline.
> 
> On 03/04/19 3:20 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-04-02 00:51:51 +0530, preejackie wrote:
> > > As LLVM ORC supports compiling in multiple backend threads, it would be
> > > effective if we compile the functions speculatively before they are called
> > > by the executing function. So when we request JIT to compile a function, 
> > > JIT
> > > will immediately returns the function address for raw executable bits. 
> > > This
> > > will greatly reduce the JIT latencies in modern multi-core machines.
> > I personally think this should be approached somewhat differently -
> > putting patchpoints into code reduces the efficiency of the generated
> > code, so I don't think that's the right approach. What I think we should
>  What do you mean by patch points here? To my knowledge, LLVM symbols have
> arbitrary stub associated which resolve to function address at function
> address.

I was assuming that you'd want to improve latency by not compiling all
the functions at the start of the executor (like we currently do), but
have sub-functions compiled in the background. That'd require
patchpoints to be able to initially redirect to a function to wait for
compilation, which then can be changed to directly jump to the function.
We currently just compile all the functions reachable at the start
of execution in one go, so it's not a one-by-one-function affair.


> > do is to, if we decide it's worthwhile at plan time, generate the LLVM
> > IR time at the beginning of execution, but continue to use interpreted
> > execution initially. The generated IR would then be handed over to a
> > background [process|thread|whatnot] for optimization of code
> > generation. Then, when finished, I'd switch over from interpreted to JIT
> > compiled execution.  That approach will, in my view, yield better
> > latency behaviour because we can actually evaluate quals etc for which
> > we've not yet finished code generation.
> > 
> > 
> > > And also I'm working on designing a ORC in-place dynamic profiling 
> > > support, by
> > > this JIT will automatically able to identify the hot functions, and 
> > > compile
> > > it in higher optimization level to achieve good performance.
> > I think that's a nice concept, but at the moment the generated code is
> > so bad that it's much more likely to get big benefits by improving the
> > generated IR, compared to giving more hints to the optimizer.
> By improving the generated IR, you mean by turning pgsql queries into LLVM
> IR? If it is the case, this design doesn't handles that, it works only when
> the given program representation is in LLVM IR.

My point is that we generate IR that's hard for LLVM to optimize. And
that fixing that is going to give you way bigger wins than profile
guided optimization.

Greetings,

Andres Freund




Re: template0 is having high age of datforzenxid

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 15:35:53 -0700, AI Rumman wrote:
> I am running Postgresql 10 where I can see the template0 database is having
> longest datfrozenxid:
> ```
> db_name      age_of_datfrozenxid
> --------------------------------
> postgres     251365
> template1    234574
> template0    75366462
> db1          253097
> db2          250649
> db3          250649
> db4          14641
> db5          14214
> ```
> 
> Initially I had high age for all the databases and I ran "vacuum freeze"
> which brought down the age of other databases. But how can I do the same
> for template0?

75 million isn't that high - autovacuum will automatically look at it
once above autovacuum_freeze_max_age (defaulting to 200 million).  If
you really for some reason need to make it have a lower datfrozenxid
(why?), you'd need to update its pg_database entry to allow connections,
connect to it, vacuum it, and change datallowconn back.

Greetings,

Andres Freund
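
A minimal sketch of that sequence (it needs superuser access, and is only
worthwhile if you really do need to lower template0's datfrozenxid):

    -- temporarily allow connections to template0
    UPDATE pg_database SET datallowconn = true WHERE datname = 'template0';

    -- connect to template0 (e.g. \c template0 in psql) and run:
    VACUUM FREEZE;

    -- then disallow connections again
    UPDATE pg_database SET datallowconn = false WHERE datname = 'template0';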




Re: New LLVM JIT Features

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 00:51:51 +0530, preejackie wrote:
> As LLVM ORC supports compiling in multiple backend threads, it would be
> effective if we compile the functions speculatively before they are called
> by the executing function. So when we request JIT to compile a function, JIT
> will immediately returns the function address for raw executable bits. This
> will greatly reduce the JIT latencies in modern multi-core machines.

I personally think this should be approached somewhat differently -
putting patchpoints into code reduces the efficiency of the generated
code, so I don't think that's the right approach. What I think we should
do is to, if we decide it's worthwhile at plan time, generate the LLVM
IR time at the beginning of execution, but continue to use interpreted
execution initially. The generated IR would then be handed over to a
background [process|thread|whatnot] for optimization of code
generation. Then, when finished, I'd switch over from interpreted to JIT
compiled execution.  That approach will, in my view, yield better
latency behaviour because we can actually evaluate quals etc for which
we've not yet finished code generation.


> And also I'm working on designing a ORC in-place dynamic profiling support, by
> this JIT will automatically able to identify the hot functions, and compile
> it in higher optimization level to achieve good performance.

I think that's a nice concept, but at the moment the generated code is
so bad that it's much more likely to get big benefits by improving the
generated IR, compared to giving more hints to the optimizer.

Greetings,

Andres Freund




Re: CVE-2019-9193 about COPY FROM/TO PROGRAM

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 07:35:02 -0500, Brad Nicholson wrote:
> Michael Paquier  wrote on 04/02/2019 01:05:01 AM:
> 
> > From: Michael Paquier 
> > To: "Jonathan S. Katz" 
> > Cc: Tom Lane , Magnus Hagander
> > , Daniel Verite ,
> > pgsql-general 
> > Date: 04/02/2019 01:05 AM
> > Subject: Re: CVE-2019-9193 about COPY FROM/TO PROGRAM
> >
> > On Mon, Apr 01, 2019 at 10:04:32AM -0400, Jonathan S. Katz wrote:
> > > +1, though I’d want to see if people get noisier about it before we
> rule
> > > out an official response.
> > >
> > > A blog post from a reputable author who can speak to security should
> > > be good enough and we can make noise through our various channels.
> >
> > Need a hand?  Not sure if I am reputable enough though :)
> >
> > By the way, it could be the occasion to consider an official
> > PostgreSQL blog on the main website.  News are not really a model
> > adapted for problem analysis and for entering into technical details.
> 
> A blog post would be nice, but it seems to me have something about this
> clearly in the manual would be best, assuming it's not there already.  I
> took a quick look, and couldn't find anything.

https://www.postgresql.org/docs/devel/sql-copy.html

"Note that the command is invoked by the shell, so if you need to pass
any arguments to shell command that come from an untrusted source, you
must be careful to strip or escape any special characters that might
have a special meaning for the shell. For security reasons, it is best
to use a fixed command string, or at least avoid passing any user input
in it."

"Similarly, the command specified with PROGRAM is executed directly by
the server, not by the client application, must be executable by the
PostgreSQL user. COPY naming a file or command is only allowed to
database superusers or users who are granted one of the default roles
pg_read_server_files, pg_write_server_files, or
pg_execute_server_program, since it allows reading or writing any file
or running a program that the server has privileges to access."

Those seem reasonable to me?

Greetings,

Andres Freund
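
As a hedged illustration of the "fixed command string" guidance quoted above
(the table name and file path are made up for the example, and the caller
needs superuser or the pg_execute_server_program role):

    -- nothing user-supplied ever reaches the shell here
    COPY audit_log FROM PROGRAM 'gzip -dc /var/backups/audit.csv.gz' WITH (FORMAT csv);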




Re: PostgreSQL logical replication slot LSN values

2019-03-12 Thread Andres Freund
Hi,

(please don't send HTML-only emails to this list)

On 2019-03-12 11:08:56 +, Rashmi V Bharadwaj wrote:
> We have an application that uses the PostgreSQL logical replication API to 
> read
> the changes made to the PostgreSQL database and applies it to a different
> database (like Db2 etc). We are using logical replication slots for
> this.

Cool.


> Currently I am facing an issue where the replication slot is pointing to an
> older restart_lsn and confirmed_flush_lsn (like 10 days back) and the
> corresponding WAL file is already deleted from the pg_wal directory. Ideally
> this should not happen, right?

Well, did you consume the logical data, and if so how? When you use the
streaming interface - HIGHLY recommended - you need to send feedback
messages as to where you've received the data.


> since the slot is holding this LSN the wal file
> should not have been deleted. Now when I try to use query like
> select * from pg_logical_slot_get_changes(,, NULL)
> 

> or use the logical replication API with a start position as any newer LSN, I
> get the following error:
> 
> ERROR: requested WAL segment pg_wal/00010036 has already been
> removed
> SQL state: 58P01.

Hm, that should not happen. Did you by any chance externally (manually
or by script) delete WAL files?


> How do I get past this issue? I have not enabled log archiving. I would also
> like to know how I can modify the restart_lsn and confirmed_flush_lsn 
> positions
> of the slot?

You need to send feedback messages confirming up to where you've
consumed the data when using the streaming protocol. When using the SQL
functions the _get_ function confirms when it returns, the _peek_
function never does so. It's recommended to limit the size of the
resultset a bit using the nchanges parameter so you can call it in
smaller increments.

Greetings,

Andres Freund
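
A short illustration of the get/peek distinction and the nchanges limit via
the SQL interface (the slot name and the limit of 100 are just examples):

    -- peek: returns changes but does not advance the slot
    SELECT * FROM pg_logical_slot_peek_changes('my_slot', NULL, 100);

    -- get: returns changes and confirms them once the call returns
    SELECT * FROM pg_logical_slot_get_changes('my_slot', NULL, 100);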



Re: 9.0 standby - could not open file global/XXXXX

2019-02-25 Thread Andres Freund
Hi,

On 2019-02-25 20:06:42 +0100, Filip Rembiałkowski wrote:
> There is a large (>5T) database on PostgreSQL 9.0.23.

As Stephen said, this is long out of support.


> 2. base backup is transferred directly to new server using
> pg_start_backup + rsync + pg_stop_backup.

Do you exclude any files? Remove any? If so which?

Greetings,

Andres Freund



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-18 Thread Andres Freund
Hi,

On 2019-02-19 11:50:36 +1100, James Sewell wrote:
> >
> > Right, the first step would be for a WSL user to figure out what's
> > wrong with builds on the WSL and show us how to fix it; I heard
> > through the grapevine that if you try it, initdb doesn't work (it must
> > be something pretty subtle in the configure phase or something like
> > that, since the Ubuntu .deb apparently works, except for the issue
> > reported in this thread).
> 
> 
> That's correct - initdb doesn't work when you've built on WSL as
> *somehow* HAVE_FDATASYNC is set to 1 by configure - but it ends up not
> being included by #ifdef blocks. This causes the following PANIC

What do you mean by "not being included by #ifdef blocks"? The only
guard in issue_xlog_fsync() is #ifdef HAVE_FDATASYNC, which ought to be
independent of any includes?  I can see how this'd go wrong if configure
did *not* detect fdatasync, because then

#if defined(PLATFORM_DEFAULT_SYNC_METHOD)
#define DEFAULT_SYNC_METHOD PLATFORM_DEFAULT_SYNC_METHOD

would trigger, which we explicitly set for linux:

/*
 * Set the default wal_sync_method to fdatasync.  With recent Linux versions,
 * xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
 * perform better and (b) causes outright failures on ext4 data=journal
 * filesystems, because those don't support O_DIRECT.
 */
#define PLATFORM_DEFAULT_SYNC_METHOD	SYNC_METHOD_FDATASYNC

Greetings,

Andres Freund



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-18 Thread Andres Freund
On 2019-02-18 10:33:50 -0500, Ravi Krishna wrote:
> Are there any plans to support PG on WSL ?  Just curious.

I think if somebody wanted to start investing efforts to improve testing
of that setup, and then fix the resulting issues, nobody would seriously
object. But also most people working on PG are already busy.  Please
feel free to create a buildfarm test machine with postgres running on
WSL.

- Andres



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-17 Thread Andres Freund
Hi,

On 2019-02-17 23:29:09 +1300, Thomas Munro wrote:
> Hmm.  Well, at least ENOSPC should be treated the same way as EIO.
> Here's an experiment that seems to confirm some speculations about NFS
> on Linux from the earlier threads:

I wish we had a good way to have test scripts in the tree for something
like that, but using postgres. Unfortunately it's not easy to write
portable setup scripts for it.


> So far I still think that we should panic if fsync() returns any error
> number at all.  For sync_file_range(), it sounds like maybe you think
> we should leave the warning-spewing in there for ENOSYS, to do exactly
> what we did before on principle since that's what back-branches are
> all about?  Something like:
> 
>   ereport(errno == ENOSYS ? WARNING : data_sync_elevel(WARNING),
> 
> Perhaps for master we could skip it completely, or somehow warn just
> once, and/or switch to one of our other implementations at runtime?  I
> don't really have a strong view on that, not being a user of that
> system.  Will they ever implement it?  Are there other systems we care
> about that don't implement it?  (Android?)

I'm not sure I see much need for leaving the warning in out of
principle. Feels like we ought to call sync_file_range() once at postmaster
startup and then just force-disable the flush GUCs if it's not available?

Greetings,

Andres Freund



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-15 Thread Andres Freund



On February 15, 2019 9:44:50 AM PST, Tom Lane  wrote:
>Andres Freund  writes:
>> On February 15, 2019 9:13:10 AM PST, Tom Lane 
>wrote:
>>> I'm of the opinion that we shouldn't be panicking for
>sync_file_range
>>> failure, period.
>
>> With some flags it's strictly required, it does "eat" errors depending
>on the flags. So I'm not sure I understand? 
>
>Really?  The specification says that it starts I/O, not that it waits
>around for any to finish.

That depends on the flags you pass in. By memory I don't think it eats an error 
with our flags in recent kernels, but I'm not sure.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-15 Thread Andres Freund



On February 15, 2019 9:13:10 AM PST, Tom Lane  wrote:
>Andres Freund  writes:
>> I suspect that's because WSL has an empty implementation of
>> sync_file_range(), i.e. it unconditionally returns ENOSYS. But as
>> configure detects it, we still emit calls for it.  I guess we ought
>to
>> except ENOSYS for the cases where we do panic-on-fsync-failure?
>
>I'm of the opinion that we shouldn't be panicking for sync_file_range
>failure, period.

With some flags it's strictly required, it does "eat" errors depending on the 
flags. So I'm not sure I understand? 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: WSL (windows subsystem on linux) users will need to turn fsync off as of 11.2

2019-02-15 Thread Andres Freund
Hi,

On 2019-02-14 19:48:05 -0500, Tom Lane wrote:
> Bruce Klein  writes:
> > If you are running Postgres inside Microsoft WSL (at least on Ubuntu, maybe
> > on others too), and just picked up a software update to version 11.2, you
> > will need to go into your /etc/postgresql.conf file and set fsync=off.
> 
> Hm.  Probably this is some unexpected problem with the
> panic-on-fsync-failure change; although that still leaves some things
> unexplained, because if fsync is failing for you now, why didn't it fail
> before?  Anyway, you might try experimenting with data_sync_retry,
> instead of running with scissors by turning off fsync altogether.
> See first item in the release notes:
> 
> https://www.postgresql.org/docs/11/release-11-2.html
> 
> Also, we'd quite like to hear more details; can you find any PANIC
> messages in the server log?

I suspect that's because WSL has an empty implementation of
sync_file_range(), i.e. it unconditionally returns ENOSYS. But as
configure detects it, we still emit calls for it.  I guess we ought to
except ENOSYS for the cases where we do panic-on-fsync-failure?

You temporarily can work around it, mostly, by setting
checkpoint_flush_after = 0 and bgwriter_flush_after = 0.

Greetings,

Andres Freund
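
A sketch of that temporary workaround; both settings can also simply be put
in postgresql.conf, and a configuration reload should be enough to pick
them up:

    ALTER SYSTEM SET checkpoint_flush_after = 0;
    ALTER SYSTEM SET bgwriter_flush_after = 0;
    SELECT pg_reload_conf();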



Re: PostgreSQL logical replication depends on WAL segments?

2019-01-22 Thread Andres Freund
Hi,

On 2019-01-22 11:57:00 -0600, Jeremy Finzel wrote:
> >
> > That is news to me. Can you provide a citation for this?
> >
> 
> I can see the confusion in what I said.  To clarify, I really meant that in
> order to retrieve that data that you need in really old WAL segments, you
> need to keep your replication slot in a position that will hold that WAL in
> place.  And that is what will prevent old rows from being vacuumed
> away.

Note replication slots only prevent old *catalog* rows from being
removed, not old row versions in user created tables.

Greetings,

Andres Freund



Re: PostgreSQL logical replication depends on WAL segments?

2019-01-22 Thread Andres Freund
Hi,

On 2019-01-22 11:10:27 -0600, Jeremy Finzel wrote:
> P.S. do heed the advice of the others and get more familiar with the docs
> around WAL archiving.

Logical replication doesn't normally interact with WAL archiving in any
way, so that seems orthogonal.

Greetings,

Andres Freund



Re: PostgreSQL logical replication depends on WAL segments?

2019-01-22 Thread Andres Freund
Hi,

On 2019-01-22 14:18:12 +0100, Josef Machytka wrote:
> Hello, I already tried to ask on stackoverflow but so far without success.
> (
> https://stackoverflow.com/questions/54292816/postgresql-logical-replication-depends-on-wal-segments
> )
> 
> Could someone help me please?
> 
> 
> 
> I am successfully using logical replication between 2 PG 11 cloud VMs for
> latest data. But I tried to publish also some older tables to transfer data
> between databases and got strange error about missing WAL segment.

That ought not to happen, logical replication won't start in the past,
and the slot will prevent necessary WAL from being removed.  Are you
manually removing WAL from pg_wal/?


Greetings,

Andres Freund
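
One way to check what the slot is holding relative to the current insert
position (PostgreSQL 10 or later function names):

    SELECT slot_name, slot_type, active, restart_lsn, confirmed_flush_lsn,
           pg_current_wal_lsn() AS current_lsn
    FROM pg_replication_slots;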



Re: ERROR: found multixact XX from before relminmxid YY

2018-12-30 Thread Andres Freund
Hi,

On 2018-12-28 19:49:36 -0500, Tom Lane wrote:
> Mark Fletcher  writes:
> > Starting yesterday morning, auto vacuuming of one of our postgresql 9.6.10
> > (CentOS 7) table's started failing:
> > ERROR:  found multixact 370350365 from before relminmxid 765860874
> > CONTEXT:  automatic vacuum of table "userdb.public.subs"
>
> Ugh.
>
> > Reading the various discussions about this error, the only solution I found
> > was here:
> > https://www.postgresql.org/message-id/CAGewt-ukbL6WL8cc-G%2BiN9AVvmMQkhA9i2TKP4-6wJr6YOQkzA%40mail.gmail.com
> > But no other reports of this solving the problem. Can someone verify that
> > if I do the mentioned fix (and I assume upgrade to 9.6.11) that will fix
> > the problem? And that it doesn't indicate table corruption?
>
> Yeah, SELECT FOR UPDATE should overwrite the broken xmax value and thereby
> fix it, I expect.

Right.

> However, I don't see anything in the release notes
> suggesting that we've fixed any related bugs since 9.6.10, so if this
> just appeared then we've still got a problem :-(.  Did anything
> interesting happen since your last successful autovacuum on that table?
> Database crashes, WAL-related parameter changes, that sort of thing?

I think it's entirely conceivable that the damage happened with earlier 
versions,
and just became visible now as the global horizon increased.

Greetings,

Andres Freund
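
A sketch of the SELECT FOR UPDATE workaround Tom mentions, for the table
named in the report; it takes a row lock on every row, so run it during a
quiet period, and whether it is sufficient depends on the damage:

    BEGIN;
    SELECT 1 FROM public.subs FOR UPDATE;  -- rewrites the problematic xmax values
    COMMIT;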



Re: explain analyze cost

2018-12-12 Thread Andres Freund
On 2018-12-12 17:37:47 -0500, Ravi Krishna wrote:
> I am running explain analyze cost on a SQL which reads from two large
> tables (122mil and 37 mil).  The query is an UPDATE SQL where we use
> derives table in the from clause and then join it back to the table
> being updated.
> The explain analyze cost itself is taking forever to run. It is running
> for the last 1 hr. Does that actually run the SQL to find out the
> impact of I/O (as indicated in COSTS). If not, what can cause it to run
> this slow?

Please do not hijack other threads by replying to a message and changing
the subject.  Just send a new mail to
pgsql-general@lists.postgresql.org, or whatever list you want to send an
email to.

Thanks,

Andres



Re: amazon aroura config - seriously overcommited defaults? (May be Off Topic)

2018-12-08 Thread Andres Freund
On 2018-12-08 15:23:19 -0800, Rob Sargent wrote:
> 
> 
> > On Dec 8, 2018, at 3:12 PM, Andres Freund  wrote:
> > 
> > On 2018-12-08 12:06:23 -0800, Jeremy Schneider wrote:
> >> On RDS PostgreSQL, the default is 25% of your server memory. This seems
> >> to be pretty widely accepted as a good starting point on PostgreSQL.
> > 
> > FWIW, I think it's widely cited, but also bad advice.  25% for an OLTP
> > workload on a 1TB machine with a database size above 25% is a terrible
> > idea.
> > 
> 
> Sorry, could you please expand “database size above 25%”?  25% of what?

Memory available to postgres (i.e. 100% of the server's memory on a
server dedicated to postgres, less if it's shared duty).

Greetings,

Andres Freund



Re: backend crash on DELETE, reproducible locally

2018-11-06 Thread Andres Freund
Hi,

On 2018-11-06 17:11:40 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-11-06 16:47:20 -0500, Tom Lane wrote:
> >> Looks like somebody forgot to list
> >> RELOPT_KIND_GIST in RELOPT_KIND_INDEX, but is that the
> >> fault of commit c203d6cf8 or was it busted before?
> 
> > Looks new:
> > +   RELOPT_KIND_INDEX = 
> > RELOPT_KIND_BTREE|RELOPT_KIND_HASH|RELOPT_KIND_GIN|RELOPT_KIND_SPGIST,
> > there aren't any other "for all indexes" type options, so the whole
> > category didn't exist before.
> 
> > It also strikes me as a really bad idea, even if RELOPT_KIND_GIST
> > wouldn't have been omitted: It breaks index am extensibility.
> 
> Hm, well, that enum already knows about all index types, doesn't it?

Not quite sure what you mean.

Before c203d6cf8 there weren't reloptions across index types. But after
it, if one adds a new index AM via an extension, one can't set
recheck_on_update = false for indexes using it, even though the feature
affects such indexes. We support adding index AMs at runtime these
days, including WAL logging (even though that's a bit
voluminous). There's even a contrib index AM...

The list of AMs is supposed to be extensible at runtime, cf
add_reloption_kind().

Greetings,

Andres Freund



Re: backend crash on DELETE, reproducible locally

2018-11-06 Thread Andres Freund
Hi,

On 2018-11-06 23:11:29 +0100, Tomas Vondra wrote:
> On 11/6/18 10:54 PM, Andres Freund wrote:
> > Looks new:
> > +   RELOPT_KIND_INDEX = 
> > RELOPT_KIND_BTREE|RELOPT_KIND_HASH|RELOPT_KIND_GIN|RELOPT_KIND_SPGIST,
> > 
> > there aren't any other "for all indexes" type options, so the whole
> > category didn't exist before.
> > 
> > It also strikes me as a really bad idea, even if RELOPT_KIND_GIST
> > wouldn't have been omitted: It breaks index am extensibility.
> > 
> 
> Does it? The RELOPT_KIND_* stuff is hard-coded in reloptions.h anyway,
> so I'm not sure how this particular thing makes it less extensible?

Well, you can create new index AMs in extensions these days, but given
the relopt design above, the feature cannot be disabled for them. Yes,
there's *currently* probably no great way to have reloptions across all
potential index types, but that's not an excuse for adding something
broken.

Greetings,

Andres Freund



Re: Replication question

2018-10-22 Thread Andres Freund
Hi,

On 2018-10-22 13:53:40 +, Scot Kreienkamp wrote:
> We just moved to PG9.6 from 9.1 (yeah, not my choice to wait this long).  In 
> 9.1 I had to make the archive location (NFS in my case) available to all the 
> mirrors running PG so that they could catch up whenever they fell behind.  I 
> thought I read somewhere that in 9.6, as long as the WAL log is available on 
> disk or in the archive the replication server will provide that to the 
> replication client, and my archive NFS mount didn't have to be available to 
> all replication clients.  It doesn't seem to be operating that way though.  
> Did I completely remember that wrong or did I misunderstand something?

You can configure it that way with replication slots.  That obviously
requires enough space. It also, as the data is stored on the primary,
can't protect against losing the entire primary (nor will an NFS-served
archive if it's hosted on the primary).


> This message is intended only for the individual or entity to which it is 
> addressed.  It may contain privileged, confidential information which is 
> exempt from disclosure under applicable laws.  If you are not the intended 
> recipient, you are strictly prohibited from disseminating or distributing 
> this information (other than to the intended recipient) or copying this 
> information.  If you have received this communication in error, please notify 
> us immediately by e-mail or by telephone at the above number. Thank you.

GNGNGNG.

Greetings,

Andres Freund
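
A sketch of that setup on 9.6: create a physical slot on the primary and
point the standby at it (the slot name is an example, and retained WAL
consumes disk space on the primary until the standby catches up):

    -- on the primary:
    SELECT pg_create_physical_replication_slot('standby1_slot');

    -- on the standby, in recovery.conf:
    --   standby_mode = 'on'
    --   primary_conninfo = 'host=primary ...'
    --   primary_slot_name = 'standby1_slot'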



Re: found xmin x from before relfrozenxid y

2018-10-21 Thread Andres Freund
Hi,

On 2018-10-21 10:24:16 -0400, Tom Lane wrote:
> =?UTF-8?Q?Johannes_Gra=c3=abn?=  writes:
> > after upgrading to version 11, I see the error pattern "found xmin x
> > from before relfrozenxid y" in different databases on different hosts.
> > From https://www.postgresql.org/docs/10/static/release-10-3.html, I
> > learned that this was an error caused by pg_upgrade, which apparently
> > had been fixed in 10.3. This page also states that refreshing the
> > affected materialized view non-concurrently would fix the problem.
> > My question is now how to infer the affected materialized view from the
> > error message. Is there a way to tell which one to refresh from the xmin
> > or relfrozenxid value?
> 
> No :-(.  I wonder why in the world we didn't make that error message
> include the relation and block number the tuple was found in.

Because it was a really complicated bugfix already, I don't think the
answer is more complicated than that.


> (Well, I see the short answer: the code layer throwing the error
> doesn't know.  But that could be fixed easily enough.)

I wonder if the better approach wouldn't be to add an errcontext for
vacuum, where we continually update the block number etc. There's plenty
of different sources of corruption that'd potentially cause debug
messages or errors, and that should get most of them.

Greetings,

Andres Freund
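
Since the message doesn't identify the relation, one blunt option in the
meantime is to generate non-concurrent REFRESH statements for every
materialized view and run them all; a sketch, assuming the 10.3
release-note advice applies:

    SELECT format('REFRESH MATERIALIZED VIEW %s;', oid::regclass)
    FROM pg_class
    WHERE relkind = 'm';  -- 'm' = materialized view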



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund
On 2018-10-17 11:02:40 -0700, Adrian Klaver wrote:
> On 10/17/18 10:57 AM, Ravi Krishna wrote:
> > 
> > > 
> > > Please note that odbc_fdw is not maintained by the postgresql developers, 
> > > but a separate project.
> > 
> > 
> > Translation: You are on your own.  We are hoping this will make our 
> > migration out of DB2 quicker. Oh well.
> > 
> 
> No it means you need to take this up with the project maintainer(s) as they
> are the folks that can determine whether it is fixable or not:
> 
> https://github.com/ZhengYang/odbc_fdw

Looks like https://github.com/CartoDB/odbc_fdw is more recently maintained.



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund
Hi,

On 2018-10-17 14:12:01 -0400, Ravi Krishna wrote:
> > 
> > Come on. We can't realistically support & debug random postgres extending 
> > projects, nor do we have control over them. And you're not necessarily on 
> > your own, you could report the issue to odbcfdw's authors/github tracker.  
> > Or pay a company for support.
> > 
> 
> On a related note is fdw for Oracle and SQLServer supported by the community ?

They're not postgresql.org projects if that's what you're asking. IOW,
they probably have their own communities, it's just not this.

Greetings,

Andres Freund



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund



On October 17, 2018 10:57:37 AM PDT, Ravi Krishna  wrote:
>
>> 
>> Please note that odbc_fdw is not maintained by the postgresql
>developers, but a separate project.
>
>
>Translation: You are on your own.  We are hoping this will make our
>migration out of DB2 quicker. Oh well.

Come on. We can't realistically support & debug random postgres extending 
projects, nor do we have control over them. And you're not necessarily on your 
own, you could report the issue to odbcfdw's authors/github tracker.  Or pay a 
company for support.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: postgres server process crashes when using odbc_fdw

2018-10-17 Thread Andres Freund
Hi,

On 2018-10-17 11:17:11 -0400, Ravi Krishna wrote:
> It turned out that enabling ODBC trace was causing PG to crash.  Once 
> disabled it started working, but found another issue.
> All object names in DB2 is assumed to be upper case.  odbc_fdw sends queries 
> like this
> 
> 
> select "fld1","fld2" from "schema_name"."table_name".
> 
> So the foreign table in PG has to created in upper case within quotes.  It is 
> bit of an annoyance.

Please note that odbc_fdw is not maintained by the postgresql
developers, but a separate project.

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

On 2018-10-14 17:52:30 -0400, Ravi Krishna wrote:
> The best part in Db2 HADR is that when the standby is catching up with the 
> master after a long time, it will
> start from the last LSN and fetch it from the primary WAL directory (active 
> logs ). If not found, it will look for it
> in the archived logs and start applying from there until the current
> point.

Uh, it works precisely the same in postgres.


> No rep slot business.

You're not forced to use slots.  Their purpose is to let you force the
primary to keep the necessary resources around, which also makes it
possible to get rid of the archive in some setups.

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

On 2018-10-15 03:18:35 +0530, bhargav kamineni wrote:
> >  You probably afterwards want to drop those slots from the new standby.
> > See the following section from the docs:
> 
> >
> https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-DATA
> "It is often a good idea to also omit from the backup the files within the
> cluster's pg_replslot/ directory, so that replication slots that exist on
> the master do not become part of the backup. Otherwise, the subsequent use
> of the backup to create a standby may result in indefinite retention of WAL
> files on the standby, and possibly bloat on the master if hot standby
> feedback is enabled, because the clients that are using those replication
> slots will still be connecting to and updating the slots on the master, not
> the standby. Even if the backup is only intended for use in creating a new
> master, copying the replication slots isn't expected to be particularly
> useful, since the contents of those slots will likely be badly out of date
> by the time the new master comes on line."
> 
> Since i already synced the pg_repslot to  standby ,Is it okay if i remove
> the pg_repslot directory befor starting postgresql service  ?

You can do that, but the nicer way probably is to just remove them via
SQL once the server is started.  Something like
SELECT pg_drop_replication_slot(slot_name) FROM pg_replication_slots;

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

As I just wrote:
> On Mon, 15 Oct 2018 at 02:40, Andres Freund  wrote:
> > Please try to quote properly.

On 2018-10-15 02:45:51 +0530, bhargav kamineni wrote:
> Yeah i have used rsync , Got it now will increase the max_replication_slots
> to high enough , Thank you Andres Freund :-)

You probably afterwards want to drop those slots from the new standby.
See the following section from the docs:

https://www.postgresql.org/docs/current/static/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP-DATA
"It is often a good idea to also omit from the backup the files within the 
cluster's pg_replslot/ directory, so that replication slots that exist on the 
master do not become part of the backup. Otherwise, the subsequent use of the 
backup to create a standby may result in indefinite retention of WAL files on 
the standby, and possibly bloat on the master if hot standby feedback is 
enabled, because the clients that are using those replication slots will still 
be connecting to and updating the slots on the master, not the standby. Even if 
the backup is only intended for use in creating a new master, copying the 
replication slots isn't expected to be particularly useful, since the contents 
of those slots will likely be badly out of date by the time the new master 
comes on line."

Greetings,

Andres Freund



Re: Slot issues

2018-10-14 Thread Andres Freund
Hi,

Please try to quote properly.

On 2018-10-15 01:57:53 +0530, bhargav kamineni wrote:
> I got this on the standby,  could you please explain in detail about
>  --*but that *on the standby* haven't set max_replication_slots high enough*
> .

What is max_replication_slots set to on the new standby?

If you created the new basebackup using rsync, and didn't exclude
pg_replslot/, it'll have copied the slots from the primary.  And
thus needs a high enough max_replication_slots to work with them.

- Andres
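
A quick way to compare things: count the slots that were copied (on the
primary, since the standby may not start with too low a setting) and make
sure the standby's configuration allows at least that many before starting
it. The value below is only an example:

    -- on the primary: how many slots does the base backup contain?
    SELECT count(*) FROM pg_replication_slots;

    -- on the new standby, in postgresql.conf (set before starting it):
    --   max_replication_slots = 16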


