Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Feb-24, Tom Lane wrote:
>> Why not just drive it off max_files_per_process?  On Unix, that
>> largely exists to override the ulimit setting anyway.  With no
>> comparable knob on a Windows system, we might as well just say
>> that's what you set.

> That makes sense to me -- but if we do that, then maybe we should be
> doing the setrlimit() dance on it too, on Linux^W^W where supported.

Yeah, arguably we could try to setrlimit if max_files_per_process is
larger than the ulimit.  We should definitely not reduce the ulimit
if max_files_per_process is smaller, though, since the DBA might
intentionally be leaving daylight for purposes such as FD-hungry PL
functions.  On the whole I'm inclined to leave well enough alone on
the Unix side --- there's nothing there that the DBA can't set if
she wishes.
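
For concreteness, the "dance" would amount to something like this
(an untested sketch; the function name and placement are invented,
and it deliberately never lowers the soft limit):

#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE toward max_files_per_process, but never
 * lower it, so a DBA-provided higher ulimit keeps its daylight for
 * things like FD-hungry PL functions.
 */
static void
maybe_raise_rlimit_nofile(int max_files_per_process)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return;                 /* leave well enough alone */

    if ((rlim_t) max_files_per_process > rlim.rlim_cur)
    {
        rlim_t      target = (rlim_t) max_files_per_process;

        /* an unprivileged process cannot exceed the hard limit */
        if (target > rlim.rlim_max)
            target = rlim.rlim_max;
        rlim.rlim_cur = target;
        (void) setrlimit(RLIMIT_NOFILE, &rlim);
    }
}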

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Alvaro Herrera
On 2020-Feb-24, Tom Lane wrote:

> Alvaro Herrera  writes:

> > Then again, that would be akin to setrlimit() on Linux.  Maybe we can
> > consider that a separate GUC, in a separate patch, with a
> > platform-specific default value that just corresponds to the OS's
> > default, and the user can set to whatever suits them; then we call
> > either _setmaxstdio() or setrlimit().
> 
> Why not just drive it off max_files_per_process?  On Unix, that
> largely exists to override the ulimit setting anyway.  With no
> comparable knob on a Windows system, we might as well just say
> that's what you set.

That makes sense to me -- but if we do that, then maybe we should be
doing the setrlimit() dance on it too, on Linux^W^W where supported.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Feb-24, Tom Lane wrote:
>> I don't think there's much point in telling Windows users about
>> _setmaxstdio() here.

> Yeah, telling users to _setmaxstdio() themselves is useless, because
> they can't do it; that's something *we* should do.  I think the 512
> limit is a bit low; why not increase that a little bit?  Maybe just to
> the Linux default of 1024.

> Then again, that would be akin to setrlimit() on Linux.  Maybe we can
> consider that a separate GUC, in a separate patch, with a
> platform-specific default value that just corresponds to the OS's
> default, and the user can set to whatever suits them; then we call
> either _setmaxstdio() or setrlimit().

Why not just drive it off max_files_per_process?  On Unix, that
largely exists to override the ulimit setting anyway.  With no
comparable knob on a Windows system, we might as well just say
that's what you set.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Alvaro Herrera
On 2020-Feb-24, Tom Lane wrote:

> I wrote:

> > I thought about platform-specific messages, but it's not clear to me
> > whether our translation infrastructure will find messages that are
> > inside #ifdefs ... anyone know?
> 
> Oh, but of course it does.  So let's do
> 
> errdetail("There are too many open files on the local server."),
> #ifndef WIN32
> errhint("Raise the server's max_files_per_process and/or \"ulimit 
> -n\" limits.")
> #else
> errhint("Raise the server's max_files_per_process setting.")
> #endif

WFM.

> I don't think there's much point in telling Windows users about
> _setmaxstdio() here.

Yeah, telling users to _setmaxstdio() themselves is useless, because
they can't do it; that's something *we* should do.  I think the 512
limit is a bit low; why not increase that a little bit?  Maybe just to
the Linux default of 1024.

Then again, that would be akin to setrlimit() on Linux.  Maybe we can
consider that a separate GUC, in a separate patch, with a
platform-specific default value that just corresponds to the OS's
default, and the user can set to whatever suits them; then we call
either _setmaxstdio() or setrlimit().

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
I wrote:
> Alvaro Herrera  writes:
>> As for the platform dependencies, I see two main options: make the hint
>> platform-specific (i.e have #ifdef branches per platform) or make it
>> even more generic, such as "file descriptor limits".

> I thought about platform-specific messages, but it's not clear to me
> whether our translation infrastructure will find messages that are
> inside #ifdefs ... anyone know?

Oh, but of course it does.  So let's do

errdetail("There are too many open files on the local server."),
#ifndef WIN32
errhint("Raise the server's max_files_per_process and/or \"ulimit -n\" 
limits.")
#else
errhint("Raise the server's max_files_per_process setting.")
#endif

I don't think there's much point in telling Windows users about
_setmaxstdio() here.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
Andres Freund  writes:
> On 2020-02-24 16:14:53 -0300, Alvaro Herrera wrote:
>> I suppose we do use the C runtime.  There's a reference to
>> _setmaxstdio() being able to raise the limit from the default of 512 to
>> up to 8192 open files.  We don't currently invoke that.
>> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019

> If we ever go for that, we should also consider raising the limit on
> unix systems up to the hard limit when hitting the fd ceiling. I.e. get
> the current limit with getrlimit(RLIMIT_NOFILE) and raise rlim_cur
> [closer] to rlim_max with setrlimit.

I'm disinclined to think we should override the user's wishes in this way.
Especially given PG's proven ability to run the kernel completely out of
file descriptors ...

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Andres Freund
Hi,

On 2020-02-24 16:14:53 -0300, Alvaro Herrera wrote:
> But the C runtime does:
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/file-handling?view=vs-2019
> I suppose we do use the C runtime.  There's a reference to
> _setmaxstdio() being able to raise the limit from the default of 512 to
> up to 8192 open files.  We don't currently invoke that.
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019

If we ever go for that, we should also consider raising the limit on
unix systems up to the hard limit when hitting the fd ceiling. I.e. get
the current limit with getrlimit(RLIMIT_NOFILE) and raise rlim_cur
[closer] to rlim_max with setrlimit.

Perhaps it'd even be worthwhile to just always raise the limit, if
possible, in set_max_safe_fds(), by max_safe_fds +
NUM_RESERVED_FDS. That way PLs and other shared libs would have a more
usual number of FDs available, rather than the fairly small number they
get today; and that squeeze only shows up once the backend has been
running for a while in the right workload.

Greetings,

Andres Freund




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Feb-24, Tom Lane wrote:
>> We could also consider a HINT, along the lines of "Raise the server's
>> max_files_per_process and/or \"ulimit -n\" limits."  This again has
>> the ambiguity about which server, and it also seems dangerously
>> platform-specific.

> Maybe talk about "the local server" to distinguish from the other one.

OK by me.

> As for the platform dependencies, I see two main options: make the hint
> platform-specific (i.e have #ifdef branches per platform) or make it
> even more generic, such as "file descriptor limits".

I thought about platform-specific messages, but it's not clear to me
whether our translation infrastructure will find messages that are
inside #ifdefs ... anyone know?  If that does work, I'd be inclined
to mention ulimit -n on non-Windows and just say nothing about that
on Windows.  "File descriptor limits" seems too unhelpful here.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Alvaro Herrera
On 2020-Feb-24, Tom Lane wrote:

> We could also consider a HINT, along the lines of "Raise the server's
> max_files_per_process and/or \"ulimit -n\" limits."  This again has
> the ambiguity about which server, and it also seems dangerously
> platform-specific.

Maybe talk about "the local server" to distinguish from the other one.

As for the platform dependencies, I see two main options: make the hint
platform-specific (i.e have #ifdef branches per platform) or make it
even more generic, such as "file descriptor limits".

A quick search suggests that current Windows itself doesn't typically
have such problems:
https://stackoverflow.com/questions/31108693/increasing-no-of-file-handles-in-windows-7-64-bit
https://docs.microsoft.com/es-es/archive/blogs/markrussinovich/pushing-the-limits-of-windows-handles

But the C runtime does:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/file-handling?view=vs-2019
I suppose we do use the C runtime.  There's a reference to
_setmaxstdio() being able to raise the limit from the default of 512 to
up to 8192 open files.  We don't currently invoke that.
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=vs-2019
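
If we ever did invoke it, the call itself would be tiny; an untested
sketch (function name and placement invented here, limits per the docs
above):

#ifdef WIN32
/*
 * Raise the Windows CRT's FILE* limit.  Per the docs cited above, the
 * default is 512 and _setmaxstdio() can raise it to at most 8192;
 * it returns -1 on failure.
 */
static void
raise_crt_stdio_limit(int wanted)
{
    int         target = (wanted > 8192) ? 8192 : wanted;

    if (target > 512 && _setmaxstdio(target) == -1)
        elog(LOG, "_setmaxstdio(%d) failed", target);
}
#endif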

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
Mark Dilger  writes:
>> On Feb 20, 2020, at 8:30 PM, Tom Lane  wrote:
>> This idea doesn't fix every possible problem.  For instance, if you
>> have a plperlu function that wants to open a bunch of files, I don't
>> see any easy way to ensure we release VFDs to make that possible.
>> But it's sure an improvement on the status quo.

> I understand that you were using plperlu just as an example, but it got
> me thinking.  Could we ship a wrapper using perl's tie() mechanism to
> call a new spi function to report when a file handle is opened and when
> it is closed?  Most plperlu functions would not participate, since
> developers will not necessarily know to use the wrapper, but at least
> they could learn about the wrapper and use it as a work-around if this
> becomes a problem for them.  Perhaps the same spi function could be used
> by other procedural languages.

Hmm.  I had thought briefly about getting plperl to do that automatically
and had concluded that I didn't see a way (though there might be one;
I'm not much of a Perl expert).  But if we accept that changes in the
plperl function's source code might be needed, it gets a lot easier,
for sure.

Anyway, the point of the current patch is to provide the mechanism and
use it in a couple of places where we know there's an issue.  Improving
the PLs is something that could be added later.

> I can't see this solution working unless the backend can clean up
> properly under exceptional conditions, and decrement the counter of used
> file handles appropriately.  But that's the same requirement that
> postgres_fdw would also have, right?  Would the same mechanism work for
> both?

The hard part is to tie into whatever is responsible for closing the
kernel FD.  If you can ensure that the FD gets closed, you can put
the ReleaseExternalFD() call at the same place(s).

> Is this too convoluted to be worth the bother?

So far we've not seen field reports of PL functions running out of FDs;
and there's always the ad-hoc solution of making sure the server's
ulimit -n limit is sufficiently larger than max_files_per_process.
So I wouldn't put a lot of effort into it right now.  But it's nice
to have an idea about what to do if it does become a hot issue for
somebody.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Andres Freund
Hi,

On 2020-02-24 10:29:51 -0800, Mark Dilger wrote:
> > On Feb 20, 2020, at 8:30 PM, Tom Lane  wrote:
> > 
> > This idea doesn't fix every possible problem.  For instance, if you
> > have a plperlu function that wants to open a bunch of files, I don't
> > see any easy way to ensure we release VFDs to make that possible.
> > But it's sure an improvement on the status quo.
> 
> I understand that you were using plperlu just as an example, but it
> got me thinking.  Could we ship a wrapper using perl's tie() mechanism
> to call a new spi function to report when a file handle is opened and
> when it is closed?  Most plperlu functions would not participate,
> since developers will not necessarily know to use the wrapper, but at
> least they could learn about the wrapper and use it as a work-around
> if this becomes a problem for them.  Perhaps the same spi function
> could be used by other procedural languages.

While we're thinking a bit outside of the box: We could just dup() a
bunch of fds within fd.c to protect fd.c's fd "quota". And then close
them when actually needing the fds.

Not really suggesting that we go for that, but it does have some appeal.
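
Roughly like this (untested, names invented; parking the reserved
descriptors on /dev/null):

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

#define NUM_PARKED_FDS 10

static int  parked_fds[NUM_PARKED_FDS];
static int  num_parked = 0;

/* Park descriptors up front, while they are still plentiful. */
static void
park_fds(void)
{
    while (num_parked < NUM_PARKED_FDS)
    {
        int         fd = (num_parked > 0) ? dup(parked_fds[0]) :
            open("/dev/null", O_RDONLY);

        if (fd < 0)
            break;
        parked_fds[num_parked++] = fd;
    }
}

/* Give one back when a consumer actually needs a kernel FD. */
static bool
unpark_fd(void)
{
    if (num_parked == 0)
        return false;
    close(parked_fds[--num_parked]);
    return true;
}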



> I can't see this solution working unless the backend can clean up properly 
> under exceptional conditions, and decrement the counter of used file handles 
> appropriately.  But that's the same requirement that postgres_fdw would also 
> have, right?  Would the same mechanism work for both?

We can just throw an error, and all fdw state should get cleaned up.  We
can't generally rely on that for plperl, as it IIRC can keep global state.
So I don't think they're in the same boat.

Greetings,

Andres Freund




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Mark Dilger



> On Feb 20, 2020, at 8:30 PM, Tom Lane  wrote:
> 
> This idea doesn't fix every possible problem.  For instance, if you
> have a plperlu function that wants to open a bunch of files, I don't
> see any easy way to ensure we release VFDs to make that possible.
> But it's sure an improvement on the status quo.

I understand that you were using plperlu just as an example, but it got me 
thinking.  Could we ship a wrapper using perl's tie() mechanism to call a new 
spi function to report when a file handle is opened and when it is closed?  
Most plperlu functions would not participate, since developers will not 
necessarily know to use the wrapper, but at least they could learn about the 
wrapper and use it as a work-around if this becomes a problem for them.  
Perhaps the same spi function could be used by other procedural languages.
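
On the backend side, such a reporting function might be little more than
a thin skin over the new fd.c calls discussed in this thread; a
hypothetical sketch (function names invented here, CREATE FUNCTION glue
omitted):

#include "postgres.h"
#include "fmgr.h"
#include "storage/fd.h"

PG_FUNCTION_INFO_V1(pg_report_fd_opened);
PG_FUNCTION_INFO_V1(pg_report_fd_closed);

/* Called by the PL wrapper just before it opens a file handle. */
Datum
pg_report_fd_opened(PG_FUNCTION_ARGS)
{
    if (!AcquireExternalFD())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("too many open files")));
    PG_RETURN_VOID();
}

/* Called by the PL wrapper just after it closes the file handle. */
Datum
pg_report_fd_closed(PG_FUNCTION_ARGS)
{
    ReleaseExternalFD();
    PG_RETURN_VOID();
}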

I can't see this solution working unless the backend can clean up properly under 
exceptional conditions, and decrement the counter of used file handles 
appropriately.  But that's the same requirement that postgres_fdw would also 
have, right?  Would the same mechanism work for both?

I imagine something like <prefix>::IO::File and <prefix>::File::Temp, which
could be thin wrappers around IO::File and File::Temp that automatically do
the tie()ing for you.  (Replace <prefix> with whichever name seems best.)

Is this too convoluted to be worth the bother?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-24 Thread Tom Lane
Thomas Munro  writes:
> I suppose there may be users who have set ulimit -n high enough to
> support an FDW workload that connects to very many hosts, who will now
> need to set max_files_per_process higher to avoid the new error now
> that we're doing this accounting.  That doesn't seem to be a problem
> in itself, but I wonder if the error message should make it clearer
> that it's our limit they hit here.

I struggled with the wording of that message, actually.  The patch
proposes

+			ereport(ERROR,
+					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+					 errmsg("could not connect to server \"%s\"",
+							server->servername),
+					 errdetail("There are too many open files.")));

I wanted to say "The server has too many open files." but in context
it would seem to be talking about the remote server, so I'm not sure
how to fix that.

We could also consider a HINT, along the lines of "Raise the server's
max_files_per_process and/or \"ulimit -n\" limits."  This again has
the ambiguity about which server, and it also seems dangerously
platform-specific.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-23 Thread Thomas Munro
On Mon, Feb 24, 2020 at 7:42 PM Thomas Munro  wrote:
> On Mon, Feb 24, 2020 at 12:24 PM Tom Lane  wrote:
> > On reflection, trying to make ReserveExternalFD serve two different
> > use-cases was pretty messy.  Here's a version that splits it into two
> > functions.  I also took the trouble to fix dblink.
>
> +/*
> + * We don't want more than max_safe_fds / 3 FDs to be consumed for
> + * "external" FDs.
> + */
> +if (numExternalFDs < max_safe_fds / 3)

I suppose there may be users who have set ulimit -n high enough to
support an FDW workload that connects to very many hosts, who will now
need to set max_files_per_process higher to avoid the new error now
that we're doing this accounting.  That doesn't seem to be a problem
in itself, but I wonder if the error message should make it clearer
that it's our limit they hit here.




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-23 Thread Thomas Munro
On Mon, Feb 24, 2020 at 12:24 PM Tom Lane  wrote:
> On reflection, trying to make ReserveExternalFD serve two different
> use-cases was pretty messy.  Here's a version that splits it into two
> functions.  I also took the trouble to fix dblink.

+	/*
+	 * We don't want more than max_safe_fds / 3 FDs to be consumed for
+	 * "external" FDs.
+	 */
+	if (numExternalFDs < max_safe_fds / 3)

This looks pretty reasonable to me.

I'll have a new patch set to create a common WES at startup over on
that other thread in a day or two.




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-23 Thread Tom Lane
I wrote:
> Here's a draft patch that does it like that.

On reflection, trying to make ReserveExternalFD serve two different
use-cases was pretty messy.  Here's a version that splits it into two
functions.  I also took the trouble to fix dblink.

regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index c1155e3..a5f69fc 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -200,12 +200,27 @@ dblink_get_conn(char *conname_or_str,
 		if (connstr == NULL)
 			connstr = conname_or_str;
 		dblink_connstr_check(connstr);
+
+		/*
+		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
+		 * that a PGconn represents one long-lived FD.  (Doing this here also
+		 * ensures that VFDs are closed if needed to make room.)
+		 */
+		if (!AcquireExternalFD())
+			ereport(ERROR,
+	(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+	 errmsg("could not establish connection"),
+	 errdetail("There are too many open files.")));
+
+		/* OK to make connection */
 		conn = PQconnectdb(connstr);
+
 		if (PQstatus(conn) == CONNECTION_BAD)
 		{
 			char	   *msg = pchomp(PQerrorMessage(conn));
 
 			PQfinish(conn);
+			ReleaseExternalFD();
 			ereport(ERROR,
 	(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 	 errmsg("could not establish connection"),
@@ -282,12 +297,26 @@ dblink_connect(PG_FUNCTION_ARGS)
 
 	/* check password in connection string if not superuser */
 	dblink_connstr_check(connstr);
+
+	/*
+	 * We must obey fd.c's limit on non-virtual file descriptors.  Assume that
+	 * a PGconn represents one long-lived FD.  (Doing this here also ensures
+	 * that VFDs are closed if needed to make room.)
+	 */
+	if (!AcquireExternalFD())
+		ereport(ERROR,
+(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+ errmsg("could not establish connection"),
+ errdetail("There are too many open files.")));
+
+	/* OK to make connection */
 	conn = PQconnectdb(connstr);
 
 	if (PQstatus(conn) == CONNECTION_BAD)
 	{
 		msg = pchomp(PQerrorMessage(conn));
 		PQfinish(conn);
+		ReleaseExternalFD();
 		if (rconn)
 			pfree(rconn);
 
@@ -312,7 +341,10 @@ dblink_connect(PG_FUNCTION_ARGS)
 	else
 	{
 		if (pconn->conn)
+		{
 			PQfinish(pconn->conn);
+			ReleaseExternalFD();
+		}
 		pconn->conn = conn;
 	}
 
@@ -346,6 +378,7 @@ dblink_disconnect(PG_FUNCTION_ARGS)
 		dblink_conn_not_avail(conname);
 
 	PQfinish(conn);
+	ReleaseExternalFD();
 	if (rconn)
 	{
 		deleteConnection(conname);
@@ -780,7 +813,10 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	{
 		/* if needed, close the connection to the database */
 		if (freeconn)
+		{
 			PQfinish(conn);
+			ReleaseExternalFD();
+		}
 	}
 	PG_END_TRY();
 
@@ -1458,7 +1494,10 @@ dblink_exec(PG_FUNCTION_ARGS)
 	{
 		/* if needed, close the connection to the database */
 		if (freeconn)
+		{
 			PQfinish(conn);
+			ReleaseExternalFD();
+		}
 	}
 	PG_END_TRY();
 
@@ -2563,6 +2602,7 @@ createNewConnection(const char *name, remoteConn *rconn)
 	if (found)
 	{
 		PQfinish(rconn->conn);
+		ReleaseExternalFD();
 		pfree(rconn);
 
 		ereport(ERROR,
@@ -2604,6 +2644,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 		if (!PQconnectionUsedPassword(conn))
 		{
 			PQfinish(conn);
+			ReleaseExternalFD();
 			if (rconn)
 pfree(rconn);
 
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 29c811a..74576d7 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postgres_fdw.h"
+#include "storage/fd.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -259,10 +260,27 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		keywords[n] = values[n] = NULL;
 
-		/* verify connection parameters and make connection */
+		/* verify the set of connection parameters */
 		check_conn_params(keywords, values, user);
 
+		/*
+		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
+		 * that a PGconn represents one long-lived FD.  (Doing this here also
+		 * ensures that VFDs are closed if needed to make room.)
+		 */
+		if (!AcquireExternalFD())
+			ereport(ERROR,
+	(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+	 errmsg("could not connect to server \"%s\"",
+			server->servername),
+	 errdetail("There are too many open files.")));
+
+		/* OK to make connection */
 		conn = PQconnectdbParams(keywords, values, false);
+
+		if (!conn)
+			ReleaseExternalFD();	/* because the PG_CATCH block won't */
+
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 			ereport(ERROR,
 	(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
@@ -294,7 +312,10 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 	{
 		/* Release PGconn data structure if we managed to create one */
 		if (conn)
+		{
 			

Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-23 Thread Tom Lane
I wrote:
>> Clearly, we gotta do something about that too.  Maybe bumping up
>> NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
>> accounting for permanently-eaten FDs would be a better idea.  And
>> in any case we can't suppose that there's a fixed upper limit on
>> the number of postgres_fdw connections.  I'm liking the idea I floated
>> earlier of letting postgres_fdw forcibly close the oldest LRU entry.

> A late-night glimmer of an idea: suppose we make fd.c keep count,
> not just of the open FDs under its control, but also of open FDs
> not under its control.

Here's a draft patch that does it like that.  There are undoubtedly
more places that need to be taught to report their FD consumption;
one obvious candidate that I didn't touch is dblink.  But this is
enough to account for all the long-lived "extra" FDs that are currently
open in a default build, and it passes check-world even with ulimit -n
set to the minimum that set_max_safe_fds will allow.

One thing I noticed is that if you open enough postgres_fdw connections
to cause a failure, that tends to come out as an unintelligible-to-
the-layman "epoll_create1 failed: Too many open files" error.  That's
because after postgres_fdw has consumed the last available "external
FD" slot, it tries to use CreateWaitEventSet to wait for input from
the remote server, and now that needs to get another external FD.
I left that alone for the moment, because if you do rejigger the
WaitEventSet code to avoid dynamically creating/destroying epoll sockets,
it will stop being a problem.  If that doesn't happen, another
possibility is to reclassify CreateWaitEventSet as a caller that should
ignore "failure" from ReserveExternalFD --- but that would open us up
to problems if a lot of WaitEventSets get created, so it's not a great
answer.  It'd be okay perhaps if we added a distinction between
long-lived and short-lived WaitEventSets (ignoring "failure" only for
the latter).  But I didn't want to go there in this patch.

Thoughts?

regards, tom lane

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 29c811a..27d509c 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postgres_fdw.h"
+#include "storage/fd.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -259,10 +260,30 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		keywords[n] = values[n] = NULL;
 
-		/* verify connection parameters and make connection */
+		/* verify the set of connection parameters */
 		check_conn_params(keywords, values, user);
 
+		/*
+		 * We must obey fd.c's limit on non-virtual file descriptors.  Assume
+		 * that a PGconn represents one long-lived FD.  (Doing this here also
+		 * ensures that VFDs are closed if needed to make room.)
+		 */
+		if (!ReserveExternalFD())
+		{
+			ReleaseExternalFD();	/* yup, gotta do this here */
+			ereport(ERROR,
+	(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+	 errmsg("could not connect to server \"%s\"",
+			server->servername),
+	 errdetail("There are too many open files.")));
+		}
+
+		/* OK to make connection */
 		conn = PQconnectdbParams(keywords, values, false);
+
+		if (!conn)
+			ReleaseExternalFD();	/* because the PG_CATCH block won't */
+
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 			ereport(ERROR,
 	(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
@@ -294,7 +315,10 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 	{
 		/* Release PGconn data structure if we managed to create one */
 		if (conn)
+		{
 			PQfinish(conn);
+			ReleaseExternalFD();
+		}
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
@@ -312,6 +336,7 @@ disconnect_pg_server(ConnCacheEntry *entry)
 	{
 		PQfinish(entry->conn);
 		entry->conn = NULL;
+		ReleaseExternalFD();
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd527f2..b60e45a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -774,6 +774,7 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  * openLogFile is -1 or a kernel FD for an open log file segment.
  * openLogSegNo identifies the segment.  These variables are only used to
  * write the XLOG, and so will normally refer to the active segment.
+ * Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
@@ -785,6 +786,9 @@ static XLogSegNo openLogSegNo = 0;
  * will be just past that page. readLen indicates how much of the current
  * page has been read into readBuf, and readSource indicates where we got
  * the currently open file from.
+ * Note: we could use Reserve/ReleaseExternalFD to track consumption of
+ * this FD too; but it doesn't currently seem 

Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
I wrote:
> Clearly, we gotta do something about that too.  Maybe bumping up
> NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
> accounting for permanently-eaten FDs would be a better idea.  And
> in any case we can't suppose that there's a fixed upper limit on
> the number of postgres_fdw connections.  I'm liking the idea I floated
> earlier of letting postgres_fdw forcibly close the oldest LRU entry.

A late-night glimmer of an idea: suppose we make fd.c keep count,
not just of the open FDs under its control, but also of open FDs
not under its control.  The latter count would include the initial
FDs (stdin/stdout/stderr), and FDs allocated by OpenTransientFile
et al, and we could provide functions for other callers to report
that they just allocated or released a FD.  So postgres_fdw could
report, when it opens or closes a PGconn, that the count of external
FDs should be increased or decreased.  fd.c would then forcibly
close VFDs as needed to keep NUM_RESERVED_FDS worth of slop.  We
still need slop, to provide some daylight for code that isn't aware
of this mechanism.  But we could certainly get all these known
long-lived FDs to be counted, and that would allow fd.c to reduce
the number of open VFDs enough to ensure that some slop remains.
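
In code form, the core of that bookkeeping could be about this small
(a sketch only; it assumes a ReleaseLruFiles() helper that closes VFDs
until enough slop remains):

static int	numExternalFDs = 0;

void
ReserveExternalFD(void)
{
	/* Close VFDs if needed so that NUM_RESERVED_FDS of slop remains. */
	ReleaseLruFiles();

	numExternalFDs++;
}

void
ReleaseExternalFD(void)
{
	Assert(numExternalFDs > 0);
	numExternalFDs--;
}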

This idea doesn't fix every possible problem.  For instance, if you
have a plperlu function that wants to open a bunch of files, I don't
see any easy way to ensure we release VFDs to make that possible.
But it's sure an improvement on the status quo.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> ... like coypu, where NUM_RESERVED_FDS is
>> the only thing ensuring we have some spare fds.  I don't know the
>> history but it looks like NUM_RESERVED_FDS was deliberately or
>> accidentally tuned to be just enough to be able to run the regression
>> tests (the interesting ones being the ones that use sockets, like
>> postgres_fdw.sql), but with a new long lived kqueue() fd in the
>> picture, it might have to be increased to cover it, no?

> No.  NUM_RESERVED_FDS was set decades ago, long before any of those tests
> existed, and it has never been changed AFAIK.  It is a bit striking that
> we just started seeing it be insufficient with this patch.  Maybe that's
> just happenstance, but I wonder whether there is a plain old FD leak
> involved in addition to the design issue?  I'll take a closer look at
> exactly what's open when we hit the error.

Hmm ... looks like I'm wrong: we've been skating way too close to the edge
for a while.  Here's a breakdown of the open FDs in the backend at the time
for awhile.  Here's a breakdown of the open FDs in the backend at the time
of the failure, excluding stdin/stdout/stderr (which set_max_safe_fds
accounted for) and FDs pointing to database files:

COMMAND    PID  USER      FD   TYPE  DEVICE          SIZE/OFF  NODE      NAME
postmaste  2657 postgres    3r FIFO  0,8             0t0       20902158  pipe                 postmaster_alive_fds[0]
postmaste  2657 postgres    4u REG   0,9             0         3877      [eventpoll]          FeBeWaitSet's epoll_fd
postmaste  2657 postgres    7u unix  0x880878e21880  0t0       20902664  socket               socket for a PGconn owned by postgres_fdw
postmaste  2657 postgres    8u IPv6  20902171        0t0       UDP       localhost:40795->localhost:40795  pgStatSock
postmaste  2657 postgres    9u unix  0x880876903c00  0t0       20902605  /tmp/.s.PGSQL.5432   MyProcPort->sock
postmaste  2657 postgres   10r FIFO  0,8             0t0       20902606  pipe                 selfpipe_readfd
postmaste  2657 postgres   11w FIFO  0,8             0t0       20902606  pipe                 selfpipe_writefd
postmaste  2657 postgres  105u unix  0x880878e21180  0t0       20902647  socket               socket for a PGconn owned by postgres_fdw
postmaste  2657 postgres  118u unix  0x8804772c88c0  0t0       20902650  socket               socket for a PGconn owned by postgres_fdw

One thing to notice is that there are only nine, though NUM_RESERVED_FDS
should have allowed ten.  That's because there are 116 open FDs pointing
at database files, though max_safe_fds is 115.  I'm not sure whether
that's OK or an off-by-one error in fd.c's accounting.  One of the 116
is pointing at a WAL segment, and I think we might not be sending that
through the normal VFD path, so it might be "expected".

But anyway, what this shows is that over time we've eaten enough of
the "reserved" FDs that only three are available for random usage like
postgres_fdw's, if the process's back is against the wall FD-wise.
The postgres_fdw regression test is using all three, meaning there's
exactly no daylight in that test.

Clearly, we gotta do something about that too.  Maybe bumping up
NUM_RESERVED_FDS would be a good idea, but I feel like more-honest
accounting for permanently-eaten FDs would be a better idea.  And
in any case we can't suppose that there's a fixed upper limit on
the number of postgres_fdw connections.  I'm liking the idea I floated
earlier of letting postgres_fdw forcibly close the oldest LRU entry.

BTW, you don't need anything very exotic to provoke this error.
The results above are from a Linux box; I just did "ulimit -n 128"
before starting the postmaster.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
Thomas Munro  writes:
> One thing I've been planning to do for 13 is to get rid of all the
> temporary create/destroy WaitEventSets from the main backend loops.
> My goal was cutting down on stupid system calls, but this puts a new
> spin on it.  I have a patch set to do a bunch of that[1], but now I'm
> thinking that perhaps I need to be even more aggressive about it and
> set up the 'common' long lived WES up front at backend startup, rather
> than doing it on demand, so that there is no chance of failure due to
> lack of fds once you've started up.

+1

> That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
> = 128 system, though, it might just mean that it's postgres_fdw's
> socket() call that hits EMFILE rather than WES's kqueue() call while
> running that test.

Good point.

> I suppose there are two kinds of system: those
> where ulimit -n is higher than max_files_per_process (defaults, on
> Linux: 1024 vs 1000) so you have more allowance for sockets and the
> like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
> the only thing ensuring we have some spare fds.  I don't know the
> history but it looks like NUM_RESERVED_FDS was deliberately or
> accidentally tuned to be just enough to be able to run the regression
> tests (the interesting ones being the ones that use sockets, like
> postgres_fdw.sql), but with a new long lived kqueue() fd in the
> picture, it might have to be increased to cover it, no?

No.  NUM_RESERVED_FDS was set decades ago, long before any of those tests
existed, and it has never been changed AFAIK.  It is a bit striking that
we just started seeing it be insufficient with this patch.  Maybe that's
just happenstance, but I wonder whether there is a plain old FD leak
involved in addition to the design issue?  I'll take a closer look at
exactly what's open when we hit the error.

The point about possibly hitting EMFILE in libpq's socket() call is
an interesting one.  libpq of course can't do anything to recover
from that (and even if it could, there are lower levels such as a
possible DNS lookup that we're not going to be able to modify).
I'm speculating about having postgres_fdw ask fd.c to forcibly
free one LRU file before it attempts to open a new libpq connection.
That would prevent EMFILE (process-level exhaustion) and it would
also provide some small protection against ENFILE (system-wide
exhaustion), though of course there's no guarantee that someone
else doesn't snap up the FD you so graciously relinquished.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Thomas Munro
On Fri, Feb 21, 2020 at 8:56 AM Tom Lane  wrote:
> I wrote:
> > It seems fairly obvious now that I look at it, but: the epoll and kqueue
> > variants of CreateWaitEventSet are both *fundamentally* unsafe, because
> > they assume that they can always get a FD when they want one, which is
> > not a property that we generally want backend code to have.  The only
> > reason we've not seen this before with epoll is a lack of testing
> > under lots-of-FDs stress.
> > The fact that they'll likely leak those FDs on subsequent failures is
> > another not-very-nice property.
>
> Hmmm ... actually, there's a third problem, which is the implicit
> assumption that we can have as many concurrently-active WaitEventSets
> as we like.  We can't, if they depend on FDs --- that's a precious
> resource.  It doesn't look like we actually ever have more than about
> two per process right now, but I'm concerned about what will happen
> as the usage of the feature increases.

One thing I've been planning to do for 13 is to get rid of all the
temporary create/destroy WaitEventSets from the main backend loops.
My goal was cutting down on stupid system calls, but this puts a new
spin on it.  I have a patch set to do a bunch of that[1], but now I'm
thinking that perhaps I need to be even more aggressive about it and
set up the 'common' long lived WES up front at backend startup, rather
than doing it on demand, so that there is no chance of failure due to
lack of fds once you've started up.  I also recently figured out how
to handle some more places with the common WES.  I'll post a new patch
set over on that thread shortly.
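
The startup piece would be something like this (a sketch only, not the
actual patch set; it uses the existing WaitEventSet API so the
epoll/kqueue descriptor is acquired exactly once, while FDs are
guaranteed to be free):

static WaitEventSet *CommonWaitSet = NULL;

/* Run once at backend startup. */
static void
InitializeCommonWaitSet(void)
{
	CommonWaitSet = CreateWaitEventSet(TopMemoryContext, 2);
	AddWaitEventToSet(CommonWaitSet, WL_LATCH_SET, PGINVALID_SOCKET,
					  MyLatch, NULL);
	AddWaitEventToSet(CommonWaitSet, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
					  NULL, NULL);
}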

That wouldn't mean that the postgres_fdw.sql can't fail on a ulimit -n
= 128 system, though, it might just mean that it's postgres_fdw's
socket() call that hits EMFILE rather than WES's kqueue() call while
running that test.  I suppose there are two kinds of system: those
where ulimit -n is higher than max_files_per_process (defaults, on
Linux: 1024 vs 1000) so you have more allowance for sockets and the
like, and those where it isn't, like coypu, where NUM_RESERVED_FDS is
the only thing ensuring we have some spare fds.  I don't know the
history but it looks like NUM_RESERVED_FDS was deliberately or
accidentally tuned to be just enough to be able to run the regression
tests (the interesting ones being the ones that use sockets, like
postgres_fdw.sql), but with a new long lived kqueue() fd in the
picture, it might have to be increased to cover it, no?

About the potential for leaks, Horiguchi-san realised this hazard and
posted a patch[2] to allow WaitEventSets to be cleaned up by the
resource owner machinery.  That's useful for the temporary
WaitEventSet objects that we'd genuinely need in the patch set that's
part of: that's for creating a query-lifetime WES to manage N
connections to remote shards, and it needs to be cleaned up on
failure.  For the temporary ones created by WaitLatch(), I suspect
they don't really belong in a resource owner: instead we should get
rid of it using my WaitMyLatch() patch set, and if there are any
places where we can't for some reason (I hope not), perhaps a
try/catch block should be used to fix that.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20191206.171211.1119526746053895900.horikyota.ntt%40gmail.com




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
I wrote:
> It seems fairly obvious now that I look at it, but: the epoll and kqueue
> variants of CreateWaitEventSet are both *fundamentally* unsafe, because
> they assume that they can always get a FD when they want one, which is
> not a property that we generally want backend code to have.  The only
> reason we've not seen this before with epoll is a lack of testing
> under lots-of-FDs stress.
> The fact that they'll likely leak those FDs on subsequent failures is
> another not-very-nice property.

Hmmm ... actually, there's a third problem, which is the implicit
assumption that we can have as many concurrently-active WaitEventSets
as we like.  We can't, if they depend on FDs --- that's a precious
resource.  It doesn't look like we actually ever have more than about
two per process right now, but I'm concerned about what will happen
as the usage of the feature increases.

regards, tom lane




Re: pgsql: Add kqueue(2) support to the WaitEventSet API.

2020-02-20 Thread Tom Lane
[ redirecting to -hackers ]

I wrote:
> Rémi Zara  writes:
>> On 20 Feb 2020, at 12:15, Thomas Munro  wrote:
>>> Remi, any chance you could run gmake installcheck under
>>> contrib/postgres_fdw on that host, to see if this is repeatable?  Can
>>> you tell us about the relevant limits?  Maybe ulimit -n (for the user
>>> that runs the build farm), and also sysctl -a | grep descriptors,
>>> sysctl -a | grep maxfiles?

> I have a working NetBSD 8/ppc installation, will try to reproduce there.

Yup, it reproduces fine here.  I see

$ ulimit -a
...
nofiles   (-n descriptors) 128

which squares with the sysctl values:

proc.curproc.rlimit.descriptors.soft = 128
proc.curproc.rlimit.descriptors.hard = 1772
kern.maxfiles = 1772

and also with set_max_safe_fds' results:

2020-02-20 14:29:38.610 EST [2218] DEBUG:  max_safe_fds = 115, usable_fds = 125, already_open = 3

It seems fairly obvious now that I look at it, but: the epoll and kqueue
variants of CreateWaitEventSet are both *fundamentally* unsafe, because
they assume that they can always get a FD when they want one, which is
not a property that we generally want backend code to have.  The only
reason we've not seen this before with epoll is a lack of testing
under lots-of-FDs stress.

The fact that they'll likely leak those FDs on subsequent failures is
another not-very-nice property.

I think we ought to redesign this so that those FDs are handed out by
fd.c, which can ReleaseLruFile() and retry if it gets EMFILE or ENFILE.
fd.c could also be responsible for the resource tracking needed to
prevent leakage.
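
Sketched out, the retry loop might look about like this (untested, and
it assumes fd.c exposes its ReleaseLruFile() for the purpose):

#include <sys/epoll.h>

int
AcquireEpollFD(void)
{
	int			fd;

	for (;;)
	{
		fd = epoll_create1(EPOLL_CLOEXEC);
		if (fd >= 0)
			return fd;			/* fd.c would also track this FD */
		/* on FD exhaustion, free an LRU virtual FD and try again */
		if ((errno != EMFILE && errno != ENFILE) || !ReleaseLruFile())
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("epoll_create1 failed: %m")));
	}
}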

regards, tom lane