Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-24 Thread Vesa-Matti J Kari

Hello,

On Mon, 23 Sep 2013, Stephen Frost wrote:

 I've now committed a fix for this issue.

I cloned the 9.4devel branch and linked my authmilter and a test program
(based on Heikki's earlier design) against the libpq that comes with it.

After hours of pretty extensive stress testing using 2, 3, 4, 5, 6,
50, 100 and 500 parallel threads, I have observed no deadlocks. Everything
runs smoothly.

Many thanks to all who contributed to the fix.

Regards,
vmk
-- 

   Tietotekniikkakeskus / Helsingin yliopisto
 IT department / University of Helsinki



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-24 Thread Stephen Frost
* Vesa-Matti J Kari (vmk...@cc.helsinki.fi) wrote:
 Many thanks to all who contributed to the fix.

Great!  Thanks for the report and the testing.

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-23 Thread Stephen Frost
Vesa-Matti J Kari,

  I've now committed a fix for this issue.  If you have opportunity to,
  it'd be great to pull down the latest git (for whichever supported
  branch you'd like) and give it a try.  Otherwise, the fix should be
  out with our next round of point releases (which I expect will be
  'soon').

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
Heikki, all,

* Stephen Frost (sfr...@snowman.net) wrote:
 Very curious.  Out of time right now to look into it, but will probably
 be back at it later tonight.

Alright, I was back at this a bit today and decided to go with a hunch-
and it looks like I might have been right to try.

Leaving the locking callback hooks in place appears to prevent the
deadlocks from happening, at least with this little app.

IOW, in destroy_ssl_system(), simply arrange to not have
CRYPTO_set_locking_callback(NULL); and CRYPTO_set_id_callback(NULL);
called.  I've done this with the very simple attached patch.  Per the
comment above destroy_ssl_system(), this doesn't seem to be an
acceptable solution because libpq might get unloaded from the
application, but perhaps it points the way towards what the issue is.

My thought had been that there was an issue with regard to signal
handling, but I've been unable to confirm that, nor is it clear why that
would be the case.  In any case, I'm curious what others think of these
results and if anyone can reproduce the deadlock with this patch
applied.

Thanks!

Stephen
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
new file mode 100644
index 3bd0113..e025b49 100644
*** a/src/interfaces/libpq/fe-secure.c
--- b/src/interfaces/libpq/fe-secure.c
*** static SSL_CTX *SSL_context = NULL;
*** 102,107 
--- 102,108 
  
  #ifdef ENABLE_THREAD_SAFETY
  static long ssl_open_connections = 0;
+ static int initialized_hooks = 0;
  
  #ifndef WIN32
  static pthread_mutex_t ssl_config_mutex = PTHREAD_MUTEX_INITIALIZER;
*** init_ssl_system(PGconn *conn)
*** 947,952 
--- 948,955 
  
  		if (ssl_open_connections++ == 0)
  		{
+ 			initialized_hooks = 1;
+ 
  			/* These are only required for threaded libcrypto applications */
  			CRYPTO_set_id_callback(pq_threadidcallback);
  			CRYPTO_set_locking_callback(pq_lockingcallback);
*** destroy_ssl_system(void)
*** 1015,1021 
  	if (pq_init_crypto_lib  ssl_open_connections  0)
  		--ssl_open_connections;
  
! 	if (pq_init_crypto_lib  ssl_open_connections == 0)
  	{
  		/* No connections left, unregister libcrypto callbacks */
  		CRYPTO_set_locking_callback(NULL);
--- 1018,1024 
  	if (pq_init_crypto_lib  ssl_open_connections  0)
  		--ssl_open_connections;
  
! 	if (pq_init_crypto_lib  ssl_open_connections == 0  initialized_hooks == 0)
  	{
  		/* No connections left, unregister libcrypto callbacks */
  		CRYPTO_set_locking_callback(NULL);


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Andres Freund
On 2013-09-13 12:40:11 -0400, Stephen Frost wrote:
 Heikki, all,
 
 * Stephen Frost (sfr...@snowman.net) wrote:
  Very curious.  Out of time right now to look into it, but will probably
  be back at it later tonight.
 
 Alright, I was back at this a bit today and decided to go with a hunch-
 and it looks like I might have been right to try.
 
 Leaving the locking callback hooks in place appears to prevent the
 deadlocks from happening, at least with this little app.
 
 IOW, in destroy_ssl_system(), simply arrange to not have
 CRYPTO_set_locking_callback(NULL); and CRYPTO_set_id_callback(NULL);
 called.  I've done this with the very simple attached patch.  Per the
 comment above destroy_ssl_system(), this doesn't seem to be an
 acceptable solution because libpq might get unloaded from the
 application, but perhaps it points the way towards what the issue is.

It'd be interesting to replace the origin callbacks with one immediately
doing an abort() or similar to see whether they maybe are called after
they shouldn't be and from where.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
Andres,

On Friday, September 13, 2013, Andres Freund wrote:

 It'd be interesting to replace the origin callbacks with one immediately
 doing an abort() or similar to see whether they maybe are called after
 they shouldn't be and from where.


Good thought. Got sucked into a meeting but once I'm out I'll try having
the lock/unlock routines abort if they're called while ssl_open_connections
is zero, which should not be happening, but seems like it is.

Thanks,

Stephen


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Andres Freund
On 2013-09-13 13:15:34 -0400, Stephen Frost wrote:
 Andres,
 
 On Friday, September 13, 2013, Andres Freund wrote:
 
  It'd be interesting to replace the origin callbacks with one immediately
  doing an abort() or similar to see whether they maybe are called after
  they shouldn't be and from where.
 
 
 Good thought. Got sucked into a meeting but once I'm out I'll try having
 the lock/unlock routines abort if they're called while ssl_open_connections
 is zero, which should not be happening, but seems like it is.

Hm. close_SSL() first does pqsecure_destroy() which will unset the
callbacks, and the count and then goes on to do X509_free() and
ENGINE_finish(), ENGINE_free() if either is used.

It's not implausible that one of those actually needs locking. I doubt
engines play a role here, but, without having looked at the testcase,
X509_free() might be a possibility.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  Hm. close_SSL() first does pqsecure_destroy() which will unset the
  callbacks, and the count and then goes on to do X509_free() and
  ENGINE_finish(), ENGINE_free() if either is used.
  
  It's not implausible that one of those actually needs locking. I doubt
  engines play a role here, but, without having looked at the testcase,
  X509_free() might be a possibility.
 
 Unfortunately, while I can still easily get the deadlock to happen when
 the hooks are reset, the hooks don't appear to ever get called when
 ssl_open_connections is set to zero.  You have a good point about the
 additional SSL calls after the hooks are unloaded though, I wonder if
 holding the ssl_config_mutex lock over all of close_SSL might be more
 sensible..

I went ahead and moved the locks to be around all of close_SSL() and
haven't been able to reproduce the deadlock, so perhaps those calls are
the issue and what's happening is that another thread is dropping or
adding the hooks in a common place while the X509_free, etc, are trying
to figure out if they should be calling the locking functions or not,
but there's a race because there's no higher-level locking happening
around those.

Attached is a patch to move those and which doesn't deadlock for me.

Thoughts?

Thanks,

Stephen
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
new file mode 100644
index 3bd0113..e14c301 100644
*** a/src/interfaces/libpq/fe-secure.c
--- b/src/interfaces/libpq/fe-secure.c
*** destroy_ssl_system(void)
*** 1009,1017 
  {
  #ifdef ENABLE_THREAD_SAFETY
  	/* Mutex is created in initialize_ssl_system() */
- 	if (pthread_mutex_lock(ssl_config_mutex))
- 		return;
- 
  	if (pq_init_crypto_lib  ssl_open_connections  0)
  		--ssl_open_connections;
  
--- 1009,1014 
*** destroy_ssl_system(void)
*** 1031,1037 
  		 */
  	}
  
- 	pthread_mutex_unlock(ssl_config_mutex);
  #endif
  }
  
--- 1028,1033 
*** open_client_SSL(PGconn *conn)
*** 1537,1542 
--- 1533,1543 
  static void
  close_SSL(PGconn *conn)
  {
+ #ifdef ENABLE_THREAD_SAFETY
+ 	if (pthread_mutex_lock(ssl_config_mutex))
+ 		return;
+ #endif
+ 
  	if (conn-ssl)
  	{
  		DECLARE_SIGPIPE_INFO(spinfo);
*** close_SSL(PGconn *conn)
*** 1565,1570 
--- 1566,1575 
  		conn-engine = NULL;
  	}
  #endif
+ 
+ #ifdef ENABLE_THREAD_SAFETY
+ 	pthread_mutex_unlock(ssl_config_mutex);
+ #endif
  }
  
  /*


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2013-09-13 13:15:34 -0400, Stephen Frost wrote:
  Good thought. Got sucked into a meeting but once I'm out I'll try having
  the lock/unlock routines abort if they're called while ssl_open_connections
  is zero, which should not be happening, but seems like it is.
 
 Hm. close_SSL() first does pqsecure_destroy() which will unset the
 callbacks, and the count and then goes on to do X509_free() and
 ENGINE_finish(), ENGINE_free() if either is used.
 
 It's not implausible that one of those actually needs locking. I doubt
 engines play a role here, but, without having looked at the testcase,
 X509_free() might be a possibility.

Unfortunately, while I can still easily get the deadlock to happen when
the hooks are reset, the hooks don't appear to ever get called when
ssl_open_connections is set to zero.  You have a good point about the
additional SSL calls after the hooks are unloaded though, I wonder if
holding the ssl_config_mutex lock over all of close_SSL might be more
sensible..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Andres Freund
On 2013-09-13 14:33:25 -0400, Stephen Frost wrote:
 * Stephen Frost (sfr...@snowman.net) wrote:
  * Andres Freund (and...@2ndquadrant.com) wrote:
   Hm. close_SSL() first does pqsecure_destroy() which will unset the
   callbacks, and the count and then goes on to do X509_free() and
   ENGINE_finish(), ENGINE_free() if either is used.
   
   It's not implausible that one of those actually needs locking. I doubt
   engines play a role here, but, without having looked at the testcase,
   X509_free() might be a possibility.
  
  Unfortunately, while I can still easily get the deadlock to happen when
  the hooks are reset, the hooks don't appear to ever get called when
  ssl_open_connections is set to zero.  You have a good point about the
  additional SSL calls after the hooks are unloaded though, I wonder if
  holding the ssl_config_mutex lock over all of close_SSL might be more
  sensible..
 
 I went ahead and moved the locks to be around all of close_SSL() and
 haven't been able to reproduce the deadlock, so perhaps those calls are
 the issue and what's happening is that another thread is dropping or
 adding the hooks in a common place while the X509_free, etc, are trying
 to figure out if they should be calling the locking functions or not,
 but there's a race because there's no higher-level locking happening
 around those.
 
 Attached is a patch to move those and which doesn't deadlock for me.

It seems slightly cleaner to just move the pqsecure_destroy(); to the
end of that function, based on a boolean. But if you think otherwise, I
won't protest...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Andres Freund
On 2013-09-13 13:59:54 -0400, Stephen Frost wrote:
 Unfortunately, while I can still easily get the deadlock to happen when
 the hooks are reset, the hooks don't appear to ever get called when
 ssl_open_connections is set to zero.  You have a good point about the
 additional SSL calls after the hooks are unloaded though, I wonder if
 holding the ssl_config_mutex lock over all of close_SSL might be more
 sensible..

Hm. You might want to do the check for ssl_open_connections after a
memory barrier or such, it might not actually be up2date (too tired to
think about the actual guarantees of x86's cache coherency guarantees).

Does it burn if you move the pqsecure_destroy(); to the end of the
function, after setting a boolean in the if (conn-ssl) branch?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Heikki Linnakangas

On 13.09.2013 22:26, Heikki Linnakangas wrote:

I'm afraid the move_locks.diff patch you posted earlier is also
broken; close_SSL() is called in error scenarios from
pqsecure_open_client(), while already holding the mutex. So it will
deadlock with itself if the connection cannot be established.


Actually, I think there's a pre-existing bug there in git master. If the 
SSL_set_app_data or SSL_set_fd call in pqsecure_open_client fails for 
some reason, it will call close_SSL() with conn-ssl already set, and 
the mutex held. close_SSL() will call 
pqsecure_destroy()-destroy_SSL()-destory_ssl_system(), which will try 
to acquire the mutex again.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 Umm, with that patch, pqsecure_destroy() is never called. The if
 (conn-ssl) test that's now at the end of the close_SSL function is
 never true, because conn-ssl is set to NULL earlier.

Yeah, got ahead of myself, as Andres pointed out.

 I'm afraid the move_locks.diff patch you posted earlier is also
 broken; close_SSL() is called in error scenarios from
 pqsecure_open_client(), while already holding the mutex. So it will
 deadlock with itself if the connection cannot be established.

Sorry, I was really just tossing it up to see if it really did avoid
this particular deadlock.  I'm running some tests now with the
attached to see if I can get it to deadlock now.

Thanks,

Stephen
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
new file mode 100644
index 3bd0113..c008dcf 100644
*** a/src/interfaces/libpq/fe-secure.c
--- b/src/interfaces/libpq/fe-secure.c
*** open_client_SSL(PGconn *conn)
*** 1537,1542 
--- 1537,1544 
  static void
  close_SSL(PGconn *conn)
  {
+ 	bool destroy_needed = conn-ssl ? true : false;
+ 
  	if (conn-ssl)
  	{
  		DECLARE_SIGPIPE_INFO(spinfo);
*** close_SSL(PGconn *conn)
*** 1545,1551 
  		SSL_shutdown(conn-ssl);
  		SSL_free(conn-ssl);
  		conn-ssl = NULL;
- 		pqsecure_destroy();
  		/* We have to assume we got EPIPE */
  		REMEMBER_EPIPE(spinfo, true);
  		RESTORE_SIGPIPE(conn, spinfo);
--- 1547,1552 
*** close_SSL(PGconn *conn)
*** 1565,1570 
--- 1566,1582 
  		conn-engine = NULL;
  	}
  #endif
+ 
+ 	/*
+ 	 * This will remove our SSL locking hooks, if this is the last SSL
+ 	 * connection, which means we must wait to call it until after all
+ 	 * SSL calls have been made, otherwise we can end up with a race
+ 	 * condition and possible deadlocks.
+ 	 *
+ 	 * See comments above destroy_ssl_system().
+ 	 */
+ 	if (destroy_needed)
+ 		pqsecure_destroy();
  }
  
  /*


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Heikki Linnakangas

On 13.09.2013 22:03, Stephen Frost wrote:

* Andres Freund (and...@2ndquadrant.com) wrote:

It seems slightly cleaner to just move the pqsecure_destroy(); to the
end of that function, based on a boolean. But if you think otherwise, I
won't protest...


Hmm, agreed; I had originally been concerned that the SIGPIPE madness
needed to be around the pqsecure_destroy() call, but I can't see why
that would be.

I've run through a few times w/ the attached and haven't seen the
deadlock.  Will continue testing, of course.

Heikki, any thoughts regarding this change?  Any concerns about
back-patching it?


Umm, with that patch, pqsecure_destroy() is never called. The if 
(conn-ssl) test that's now at the end of the close_SSL function is 
never true, because conn-ssl is set to NULL earlier.


I'm afraid the move_locks.diff patch you posted earlier is also 
broken; close_SSL() is called in error scenarios from 
pqsecure_open_client(), while already holding the mutex. So it will 
deadlock with itself if the connection cannot be established.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 It seems slightly cleaner to just move the pqsecure_destroy(); to the
 end of that function, based on a boolean. But if you think otherwise, I
 won't protest...

Hmm, agreed; I had originally been concerned that the SIGPIPE madness
needed to be around the pqsecure_destroy() call, but I can't see why
that would be.

I've run through a few times w/ the attached and haven't seen the
deadlock.  Will continue testing, of course.

Heikki, any thoughts regarding this change?  Any concerns about
back-patching it?

Thanks,

Stephen
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
new file mode 100644
index 3bd0113..ca2c5e4 100644
*** a/src/interfaces/libpq/fe-secure.c
--- b/src/interfaces/libpq/fe-secure.c
*** close_SSL(PGconn *conn)
*** 1545,1551 
  		SSL_shutdown(conn-ssl);
  		SSL_free(conn-ssl);
  		conn-ssl = NULL;
- 		pqsecure_destroy();
  		/* We have to assume we got EPIPE */
  		REMEMBER_EPIPE(spinfo, true);
  		RESTORE_SIGPIPE(conn, spinfo);
--- 1545,1550 
*** close_SSL(PGconn *conn)
*** 1565,1570 
--- 1564,1572 
  		conn-engine = NULL;
  	}
  #endif
+ 
+ 	if (conn-ssl)
+ 		pqsecure_destroy();
  }
  
  /*


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Andres Freund
On 2013-09-13 15:03:31 -0400, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  It seems slightly cleaner to just move the pqsecure_destroy(); to the
  end of that function, based on a boolean. But if you think otherwise, I
  won't protest...
 
 Hmm, agreed; I had originally been concerned that the SIGPIPE madness
 needed to be around the pqsecure_destroy() call, but I can't see why
 that would be.
 
 I've run through a few times w/ the attached and haven't seen the
 deadlock.  Will continue testing, of course.

That patch looks wrong to me. Note that the if (conn-ssl) branch resets
conn-ssl to NULL.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 Actually, I think there's a pre-existing bug there in git master. If
 the SSL_set_app_data or SSL_set_fd call in pqsecure_open_client
 fails for some reason, it will call close_SSL() with conn-ssl
 already set, and the mutex held. close_SSL() will call
 pqsecure_destroy()-destroy_SSL()-destory_ssl_system(), which will
 try to acquire the mutex again.

Yeah, good point, and that one looks like my fault.  Moving it after the
unlock should address that tho, no?  Looking through the other
close_SSL() calls, I don't see any other situations where it might be
called with the lock held.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-13 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 That patch looks wrong to me. Note that the if (conn-ssl) branch resets
 conn-ssl to NULL.

huh, it figures one would overlook the simplest things.  Of course it's
not locking up now- we never remove the hooks (as my original patch was
doing :).  That's what I get for trying to go to meetings and code at
the same time.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-10 Thread Stephen Frost
Heikki,

* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 Thanks! I tested with git master.

I've run your test program against both git master and 9.2.4 on a couple
of Ubuntu 13.04 boxes and all I see are tons of these:

1: DEBUG: database connection established
1: DEBUG: about to call PQfinish()
1: DEBUG: database connection established
1: DEBUG: about to call PQfinish()
2: DEBUG: database connection established
2: DEBUG: about to call PQfinish()

But no deadlocks; even across multiple runs, they all terminate
normally.

For 9.2.4, I've got:

libpq5 9.2.4-1.pgdg12.4+1
libssl1.0.0:amd64  1.0.1c-4ubuntu8.1

With master, it's:

libpq5 as-of b34f8f409bec4921d260ac84668917f5fff04319
libssl1.0.0:amd64  1.0.1c-4ubuntu8.1

What versions of libssl are you working with..?  Is it possible this has
actually been fixed already?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-10 Thread Heikki Linnakangas

On 10.09.2013 18:10, Stephen Frost wrote:

I've run your test program against both git master and 9.2.4 on a couple
of Ubuntu 13.04 boxes and all I see are tons of these:

1: DEBUG: database connection established
1: DEBUG: about to call PQfinish()
1: DEBUG: database connection established
1: DEBUG: about to call PQfinish()
2: DEBUG: database connection established
2: DEBUG: about to call PQfinish()

But no deadlocks; even across multiple runs, they all terminate
normally.


Hmm. Are you sure you're getting an SSL connection? Run it with 
something like this to make sure:


./threaded-connect dbname=postgres sslmode=require host=localhost


For 9.2.4, I've got:

libpq5 9.2.4-1.pgdg12.4+1
libssl1.0.0:amd64  1.0.1c-4ubuntu8.1

With master, it's:

libpq5 as-of b34f8f409bec4921d260ac84668917f5fff04319
libssl1.0.0:amd64  1.0.1c-4ubuntu8.1

What versions of libssl are you working with..?  Is it possible this has
actually been fixed already?


$ dpkg -s libssl1.0.0:amd64 | grep Vers
Version: 1.0.1e-3

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-10 Thread Stephen Frost
Heikki,

* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 Hmm. Are you sure you're getting an SSL connection? Run it with
 something like this to make sure:

sslmode=require doesn't help on Unix domain connections. :)

Was able to get it to lock with both 9.2.4 and master, and with both
versions of the SSL library (1.0.1c-4ubuntu8.1 and 1.0.1e-3).

The lock that I got showed these stacks:

Thread #2:
  pthread_mutex_lock
  init_ssl_system ; fe-secure.c:917

Thread #3:
  pthread_mutex_lock
  pq_lockingcallback 
  ...
  SSL_new
  pqsecure_open_client ; fe-secure.c:275

Thread #2 waiting at 917 makes sense, he's waiting on the lock that the
other thread has on ssl_config_mutex before moving in to set up his own
SSL connection.  What's odd is how is thread #3 waiting on a lock in the
lock array.  Both threads agree that ssl_open_connections is only 1
(thread #3's ; thread #2 hasn't gotten to incrementing it yet).  Looking
at the lock array, only one of the locks is taken out and it's owner is
thread #3, meaning that SSL apparently caused a deadlock by trying to
take a lock which it's already taken.

Changing the lock type to be recursive instead masks the self-locking
issue, but then I got a case where, with the same stack traces as above,
the lock in the array was held by thread #2 instead, where thread #2 is
in init_ssl_system- well before it's even made any calls into SSL since
the previous PQdestroy happened.

I've also caught it where a thread is still holding a lock when it drops
into destroy_ssl_system() by simply trying to unlock all of the locks in
the array.  With the recursive lock type, all such attempts should
simply error out (either it's locked by someone else, or it's already
unlocked) and so I checked for *successful* unlocks:

2: DEBUG: database connection established
2: DEBUG: about to call PQfinish()
successfully unlocked mutex!

Having that happen can then cause the deadlock because the other thread
can end up waiting on that lock that we're still holding while in
destroy_ssl_system(), where we're waiting on the ssl_config_mutex lock
that the first thread has.  Even leaving that code in there, which
unlocks all the locks during destroy_ssl_system(), it's still deadlocked
on me with the same stack trace as above, with thread #2 holding a lock
in the pq_lockarray which thread #3 is trying to get (while thread #3
holds the ssl_config_mutex lock that thread #2 is waiting on).

Very curious.  Out of time right now to look into it, but will probably
be back at it later tonight.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Heikki Linnakangas

On 09.09.2013 09:34, Vesa-Matti J Kari wrote:

Basically all that the authmilter now does is to connect to PostgreSQL in
authmilt_connect() and close the connection in authmilt_close(). Based on
the authmilter debug logging it seems to me that when the hanging occurs,
the authmilter never completes PQsetdbLogin().

Based on the document

   http://www.postgresql.org/docs/9.1/interactive/libpq-threading.html

I am sure libpq should be thread safe and on startup
the authmilter verifies that the libpq is indeed thread safe.


If you think you could set up a test enviroment using:

- Sendmail (or maybe Postfix)
- authmilter
- PostgreSQL

here is an authmilter-simplied.tar.gz package for you:

   http://www.helsinki.fi/~vmkari/authmilter-simplified.tar.gz

The README file contains a rough instructions outline on how to setup
things in order to reproduce the strange hanging bug.


Please note that when running two test message sender scripts in parallel,
the bug does not occur immediately, but only after between 1 to 5 minutes
of processing. Sometimes it may take even longer.


I managed to set that up and got it running. But it works fine for me, 
does not hang.


I'd suggest poking around with gdb, to see where it hangs. Also, run 
select * from pg_stat_activity from a psql session to see what's 
happening inside the database. log_connections=on and 
log_disconnections=on would also be a good idea.


PS. You'll need to escape the strings in the queries, to avoid SQL 
injection.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Vesa-Matti J Kari

Hello,

On Mon, 9 Sep 2013, Heikki Linnakangas wrote:

 I managed to set that up and got it running.

Many thanks for taking the time.

 But it works fine for me, does not hang.

Okay. Have you tried increasing the iterations for the smtp sender
scripts? And could you please specify what is your test environment like
(i.e. OS and the related library versions)?

 I'd suggest poking around with gdb, to see where it hangs.

I have actually done that, but it only show the main listener thread from
the libmilter library:

(gdb) bt
#0  0x7fe64bdd0313 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fe64c4f7b46 in mi_listener () from /usr/lib/libmilter.so.1.0.1
#2  0x7fe64c4f8707 in smfi_main () from /usr/lib/libmilter.so.1.0.1
#3  0x00402c8f in main (argc=15, argv=0x7fffa6560e68) at
authmilter.c:699

Hmmm. The man page mentioned no threads, but Google was helpful and
suggested info threads so here goes:

(I hope alpine will not wrap these long lines)

(gdb) info threads
  Id   Target Id Frame
  9Thread 0x7fe64700c700 (LWP 14362) authmilter 0x7fe64c0b69f7 in 
do_sigwait () from /lib/x86_64-linux-gnu/libpthread.so.0
  8Thread 0x7fe64680b700 (LWP 14363) authmilter 0x7fe64bdd0313 in 
poll () from /lib/x86_64-linux-gnu/libc.so.6
  7Thread 0x7fe645809700 (LWP 14365) authmilter 0x7fe64c0b589c in 
__lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  6Thread 0x7fe645008700 (LWP 22404) authmilter 0x7fe64c0b589c in 
__lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  5Thread 0x7fe64600a700 (LWP 27263) authmilter 0x7fe64c0b589c in 
__lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  4Thread 0x7fe644807700 (LWP 27264) authmilter 0x7fe64c0b589c in 
__lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  3Thread 0x7fe62700 (LWP 27283) authmilter 0x7fe64c0b589c in 
__lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  2Thread 0x7fe62f7fe700 (LWP 27284) authmilter 0x7fe64c0b589c in 
__lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1Thread 0x7fe64c8fd740 (LWP 14361) authmilter 0x7fe64bdd0313 in 
poll () from /lib/x86_64-linux-gnu/libc.so.6

It looks like a deadlock situation of some kind...

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe62f7fe700 (LWP 27284))]
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7fe64c0b1065 in _L_lock_858 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x7fe64c0b0eba in pthread_mutex_lock () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#3  0x7fe64c2df200 in ?? () from /usr/lib/libpq.so.5
#4  0x7fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5  0x7fe64b77a915 in RSA_new_method () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x7fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x7fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x7fe64b7bc6d1 in ASN1_item_ex_d2i () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#9  0x7fe64b7bd0c4 in ASN1_item_d2i () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x7fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#11 0x7fe64b7b461a in X509_PUBKEY_get () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#12 0x7fe64b7d119a in X509_get_pubkey_parameters () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x7fe64b7d1398 in X509_verify_cert () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#14 0x7fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x7fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x7fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#17 0x7fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
#18 0x7fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
#19 0x7fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
#20 0x7fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
#21 0x00401ba5 in authmilt_connect (ctx=0xe81b60, 
hostname=0x7fe628c0 localhost, hostaddr=0x7fe62f7fdce0) at 
authmilter.c:212
#22 0x7fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
#23 0x7fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
#24 0x7fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
#25 0x7fe64c0aee9a in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#26 0x7fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#27 0x in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe62700 (LWP 27283))]
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7fe64c0b1065 in _L_lock_858 () from 

Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Stephen Frost
Vesa-Matti, Heikki,

* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 On 09.09.2013 15:36, Vesa-Matti J Kari wrote:
 If I interpret this correctly, threads #2 and #3 are waiting for the same
 lock but they make no progress.
 
 A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
 enabled in my test server. As soon as I turned it on, it hung.
 
 Attached is a small stand-alone test program to reproduce it. You
 can pass a libpq connection string as argument to it.

Interesting...  Which version of libpq were you working against?  I see
that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
before the changes that I made to add locking around our usage of
SSL_context, as otherwise we would end up in situations where we'd dump
core, but he also had it with 9.3rc1, which should have included it.  I
had tested the patch w/ a pretty good amount of concurrent threads fired
off from a little python script and didn't run into any deadlocks
there..

Vesa-Matti, was this working previously, and are you sure you were
testing with 9.3rc1's libpq?  Heikki, which are you testing against and
perhaps you might try before and after?  I'll be able to look into it
more in a few hours also.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Heikki Linnakangas

On 09.09.2013 15:36, Vesa-Matti J Kari wrote:

It looks like a deadlock situation of some kind...

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fe62f7fe700 (LWP 27284))]
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7fe64c0b1065 in _L_lock_858 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x7fe64c0b0eba in pthread_mutex_lock () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#3  0x7fe64c2df200 in ?? () from /usr/lib/libpq.so.5
#4  0x7fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5  0x7fe64b77a915 in RSA_new_method () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x7fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x7fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x7fe64b7bc6d1 in ASN1_item_ex_d2i () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#9  0x7fe64b7bd0c4 in ASN1_item_d2i () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x7fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#11 0x7fe64b7b461a in X509_PUBKEY_get () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#12 0x7fe64b7d119a in X509_get_pubkey_parameters () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x7fe64b7d1398 in X509_verify_cert () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#14 0x7fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x7fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x7fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#17 0x7fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
#18 0x7fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
#19 0x7fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
#20 0x7fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
#21 0x00401ba5 in authmilt_connect (ctx=0xe81b60, hostname=0x7fe628c0 
localhost, hostaddr=0x7fe62f7fdce0) at authmilter.c:212
#22 0x7fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
#23 0x7fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
#24 0x7fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
#25 0x7fe64c0aee9a in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#26 0x7fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#27 0x in ?? ()

(gdb) thread 3
[Switching to thread 3 (Thread 0x7fe62700 (LWP 27283))]
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x7fe64c0b589c in __lll_lock_wait () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7fe64c0b1065 in _L_lock_858 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x7fe64c0b0eba in pthread_mutex_lock () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#3  0x7fe64c2df200 in ?? () from /usr/lib/libpq.so.5
#4  0x7fe64b78a5f5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#5  0x7fe64b77a915 in RSA_new_method () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x7fe64b77d64d in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x7fe64b7b9bf2 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x7fe64b7bc6d1 in ASN1_item_ex_d2i () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#9  0x7fe64b7bd0c4 in ASN1_item_d2i () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x7fe64b77ea2f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#11 0x7fe64b7b461a in X509_PUBKEY_get () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#12 0x7fe64b7d119a in X509_get_pubkey_parameters () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x7fe64b7d1398 in X509_verify_cert () from 
/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#14 0x7fe64bac52f8 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#15 0x7fe64baa2ef3 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#16 0x7fe64baa7222 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#17 0x7fe64c2dffcb in ?? () from /usr/lib/libpq.so.5
#18 0x7fe64c2d0c5e in PQconnectPoll () from /usr/lib/libpq.so.5
#19 0x7fe64c2d1e3e in ?? () from /usr/lib/libpq.so.5
#20 0x7fe64c2d26f8 in PQsetdbLogin () from /usr/lib/libpq.so.5
#21 0x00401ba5 in authmilt_connect (ctx=0xe818e0, hostname=0x7fe6280008c0 
localhost, hostaddr=0x7fe62fffece0) at authmilter.c:212
#22 0x7fe64c4f69dc in ?? () from /usr/lib/libmilter.so.1.0.1
#23 0x7fe64c4f5f5f in mi_engine () from /usr/lib/libmilter.so.1.0.1
#24 0x7fe64c4fada6 in ?? () from /usr/lib/libmilter.so.1.0.1
#25 0x7fe64c0aee9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#26 0x7fe64bddbccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#27 0x in ?? ()

If I interpret this correctly, threads #2 and #3 are waiting for the same
lock but they make no progress.


A-ha, the deadlock happens while doing 

Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Heikki Linnakangas

On 09.09.2013 18:20, Stephen Frost wrote:

Vesa-Matti, Heikki,

* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:

On 09.09.2013 15:36, Vesa-Matti J Kari wrote:

If I interpret this correctly, threads #2 and #3 are waiting for the same
lock but they make no progress.


A-ha, the deadlock happens while doing SSL stuff. I didn't have SSL
enabled in my test server. As soon as I turned it on, it hung.

Attached is a small stand-alone test program to reproduce it. You
can pass a libpq connection string as argument to it.


Interesting...  Which version of libpq were you working against?  I see
that Vesa-Matti had the problem happen w/ 9.1.9, which should have been
before the changes that I made to add locking around our usage of
SSL_context, as otherwise we would end up in situations where we'd dump
core, but he also had it with 9.3rc1, which should have included it.  I
had tested the patch w/ a pretty good amount of concurrent threads fired
off from a little python script and didn't run into any deadlocks
there..



Vesa-Matti, was this working previously, and are you sure you were
testing with 9.3rc1's libpq?  Heikki, which are you testing against and
perhaps you might try before and after?  I'll be able to look into it
more in a few hours also.


Thanks! I tested with git master.

I added printf()s into the pq_lockingcallback function, and got a trace 
where both threads got stuck waiting for lock 10 in the pq_lockarray. It 
looks like someone is failing to release it. The backtrace for both 
threads look like this:


#0  __lll_lock_wait () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x7fad49826f3c in _L_lock_974 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x7fad49826d8b in __GI___pthread_mutex_lock 
(mutex=0x7fad3800a260) at pthread_mutex_lock.c:64
#3  0x7fad49a53f08 in pq_lockingcallback (line=175, 
file=0x7fad48fa8bb3 x_pubkey.c, n=10, mode=optimized out) at 
fe-secure.c:872
#4  pq_lockingcallback (mode=optimized out, n=10, file=0x7fad48fa8bb3 
x_pubkey.c, line=175) at fe-secure.c:868
#5  0x7fad48f396ab in X509_PUBKEY_get () from 
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#6  0x7fad48f56292 in X509_get_pubkey_parameters () from 
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#7  0x7fad48f5649c in X509_verify_cert () from 
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#8  0x7fad4924f14a in ?? () from 
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#9  0x7fad4922ce1c in ?? () from 
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#10 0x7fad492310d2 in ?? () from 
/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0
#11 0x7fad49a54c67 in open_client_SSL (conn=0x7fad380397d0) at 
fe-secure.c:1463

#12 pqsecure_open_client (conn=conn@entry=0x7fad380397d0) at fe-secure.c:306
#13 0x7fad49a44fb6 in PQconnectPoll (conn=conn@entry=0x7fad380397d0) 
at fe-connect.c:2123
#14 0x7fad49a4618e in connectDBComplete 
(conn=conn@entry=0x7fad380397d0) at fe-connect.c:1521
#15 0x7fad49a46b47 in PQconnectdb (conninfo=optimized out) at 
fe-connect.c:516

#16 0x004007b6 in test_connect (threadid=2) at threaded-connect.c:25
#17 0x0040086e in run_thread (arg=0x600e04 two) at 
threaded-connect.c:55
#18 0x7fad49824e0e in start_thread (arg=0x7fad48203700) at 
pthread_create.c:311
#19 0x7fad4955993d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Sometimes the lockup happens differently, with one thread hung up in 
SSL_init() and another waiting for the first one on ssl_config_mutex, 
for example.


A good next step might be to create a standalone program that doesn't 
use libpq at all, but just calls X509_verify_cert() concurrently in two 
threads. Or open plain SSL connections. If the deadlock can be 
reproduced with that, then we could just report the bug to the OpenSSL 
and hope that they can figure it out.


- Heikki
diff --git a/src/interfaces/libpq/fe-secure.c b/src/interfaces/libpq/fe-secure.c
index 3bd0113..3e70306 100644
--- a/src/interfaces/libpq/fe-secure.c
+++ b/src/interfaces/libpq/fe-secure.c
@@ -871,11 +871,15 @@ pq_lockingcallback(int mode, int n, const char *file, int line)
 	{
 		if (pthread_mutex_lock(pq_lockarray[n]))
 			PGTHREAD_ERROR(failed to lock mutex);
+		printf(%ld locking callback:   lock %d %s:%d\n, pthread_self(), n, file, line);
+		fflush(stdout);
 	}
 	else
 	{
 		if (pthread_mutex_unlock(pq_lockarray[n]))
 			PGTHREAD_ERROR(failed to unlock mutex);
+		printf(%ld locking callback: unlock %d %s:%d\n, pthread_self(), n, file, line);
+		fflush(stdout);
 	}
 }
 #endif   /* ENABLE_THREAD_SAFETY */


lock-trace.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Alvaro Herrera
Heikki Linnakangas wrote:

 I'll dig into that, but right now it seems like an OpenSSL or
 libcrypto bug to me. Or something in the way we use them, although I
 can't see anything obviously wrong in the libpq code at a quick
 glance.

Can you please try with ssl_renegotiation_limit=0?

[ looks ]  Uh, actually you don't even send data in those connections in
your test program, do you?  Maybe there's a problem with the mutex stuff
committed recently by Stephen.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange hanging bug in a simple milter

2013-09-09 Thread Stephen Frost
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Heikki Linnakangas wrote:
  I'll dig into that, but right now it seems like an OpenSSL or
  libcrypto bug to me. Or something in the way we use them, although I
  can't see anything obviously wrong in the libpq code at a quick
  glance.
 
 Can you please try with ssl_renegotiation_limit=0?
 
 [ looks ]  Uh, actually you don't even send data in those connections in
 your test program, do you?  Maybe there's a problem with the mutex stuff
 committed recently by Stephen.

I was wondering about that also, but it was apparently an issue even
before that change (it was reported against 9.1.9).  Also, Heikki's
analysis appears to show cases where two threads end up waiting on the
same entry in the lockarray, which I don't think my changes would have
impacted at all.

In any case, I hope to find time this afternoon/evening to try this
against libpq from before and after, just to be sure and rule out that
patch.  Assuming that pans out, I tend to agree w/ Heikki that we should
test this outside of libpq entirely and see if we can reproduce it.
Even if we're able to do that, we may need to consider ways to fix it
ourselves (perhaps be holding heavier locks or something), as we have no
idea how long it'll take an OpenSSL fix to happen..

Thanks,

Stephen


signature.asc
Description: Digital signature