Re: Bug that spans tomcat and tomcat-native

2016-07-04 Thread Mark Thomas
I've opened https://bz.apache.org/bugzilla/show_bug.cgi?id=59797 to
track this.

Nate, if you let me have your public IP, I'll get you unbanned from BZ
as well.

Mark

On 01/07/2016 15:33, Nate Clark wrote:
> On Tue, Jun 28, 2016 at 12:37 PM, Rainer Jung  wrote:
>> On 28.06.2016 at 18:06, therealnewo...@gmail.com wrote:
>>>
>>> On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung 
>>> wrote:

 On 28.06.2016 at 16:07, Mark Thomas wrote:
>
>
> On 28/06/2016 12:28, Mark Thomas wrote:
>>
>>
>> On 28/06/2016 11:34, Rainer Jung wrote:
>>>
>>>
>>> On 28.06.2016 at 11:15, Mark Thomas wrote:
>>
>>
>>
>> 
>>
 Index: src/ssl.c
 ===
 --- src/ssl.c   (revision 1750259)
 +++ src/ssl.c   (working copy)
 @@ -420,6 +420,10 @@
  return psaptr->PSATOLD;
  #elif defined(WIN32)
  return (unsigned long)GetCurrentThreadId();
 +#elif defined(DARWIN)
 +uint64_t tid;
 +pthread_threadid_np(NULL, &tid);
 +return (unsigned long)tid;
  #else
  return (unsigned long)(apr_os_thread_current());
  #endif


 I want to do some similar testing for Linux before adding what I
 suspect
 will be a very similar block using gettid().
>>>
>>>
>>>
>>> We could either add something to configure.in. Untested:
>>>
>>> Index: native/configure.in
>>> ===
>>> --- native/configure.in (revision 1750462)
>>> +++ native/configure.in (working copy)
>>> @@ -218,6 +218,9 @@
>>>  *-solaris2*)
>>>  APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>>  ;;
>>> +*linux*)
>>> +APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>>> +;;
>>>  *)
>>>  ;;
>>>  esac
>>>
>>>
>>> and then use a
>>>
>>> #ifdef TCNATIVE_LINUX
>>>
>>> or we copy some other more direct linux check from e.g. APR:
>>>
>>> #ifdef __linux__
>>>
>>> The latter looks simpler, but the version above is based on all the
>>> logic put into config.guess.
>>
>>
>>
>> I'd go with the __linux__ option as that is consistent with what we
>> already use in os/unix/system.c
>>
>> I'm not against the change to configure.in, I just think we should be
>> consistent with how we do this throughout the code base.
>
>
>
> I've confirmed that the same problem occurs with hash bucket selection
> on linux and that switching to gettid() fixes that problem.
>
> I'm going to go ahead with the 1.2.8 release shortly. We can continue to
> refine this as necessary and have a more complete fix in 1.2.9.



 I did a quick check on Solaris. apr_os_thread_current() uses pthread_self
 on
 Solaris like on Linux (actually on any Unix type OS), but unlike Linux
 where
 this returns an address which is either 32 or 64 bit aligned depending on
 address size, on Solaris you get an increasing number starting with 1 for
 the first thread and incremented by one for each following thread. Thread
 IDs do not get reused in the same process, even if the thread finished,
 but
 thread IDs are common between different processes, because they always
 start
 with 1. So Solaris should be fine as-is.
>>>
>>>
>>> Does the value have a cap? If not then Solaris will just continue to
>>> use more and more memory as threads are created over the lifetime of
>>> the server.
>>
>>
>> No cap. I think everyone agrees we are still looking at cleaning up at
>> thread death as a further improvement. Whether there's a problem without it
>> depends on the executor size and whether the executor actually destroys and
>> recreates threads (thus using more IDs than the max thread count).
>>
> 
> Mark,
> 
> The change which was added to 1.2.8 works great on 1.1.34. I also
> hacked in calling ERR_remove_thread_state on thread exit. It is
> similar to my original change but only does it on a thread's exit,
> but as Remy pointed out earlier this will only work if all IO
> operations occur on one of the original threads from the thread pool.
> 
> -nate
> 
> Here is the hack just for completeness.
> 
> diff --git a/java/org/apache/catalina/core/StandardThreadExecutor.java
> b/java/org/apache/catalina/core/StandardThreadExecutor.java
> index a30f29c..80f429d 100644
> --- a/java/org/apache/catalina/core/StandardThreadExecutor.java
> +++ b/java/org/apache/catalina/core/StandardThreadExecutor.java
> @@ -18,6 +18,7 @@
>  package org.apache.catalina.core;
> 
>  import java.util.concurrent.RejectedExecutionException;
> +import java.util.concurrent.ThreadFactory;
>  import java.util.concurrent.TimeUnit;
> 
>  

Re: Bug that spans tomcat and tomcat-native

2016-07-01 Thread Nate Clark
On Tue, Jun 28, 2016 at 12:37 PM, Rainer Jung  wrote:
> On 28.06.2016 at 18:06, therealnewo...@gmail.com wrote:
>>
>> On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung 
>> wrote:
>>>
>>> On 28.06.2016 at 16:07, Mark Thomas wrote:


 On 28/06/2016 12:28, Mark Thomas wrote:
>
>
> On 28/06/2016 11:34, Rainer Jung wrote:
>>
>>
>> On 28.06.2016 at 11:15, Mark Thomas wrote:
>
>
>
> 
>
>>> Index: src/ssl.c
>>> ===
>>> --- src/ssl.c   (revision 1750259)
>>> +++ src/ssl.c   (working copy)
>>> @@ -420,6 +420,10 @@
>>>  return psaptr->PSATOLD;
>>>  #elif defined(WIN32)
>>>  return (unsigned long)GetCurrentThreadId();
>>> +#elif defined(DARWIN)
>>> +uint64_t tid;
>>> +pthread_threadid_np(NULL, &tid);
>>> +return (unsigned long)tid;
>>>  #else
>>>  return (unsigned long)(apr_os_thread_current());
>>>  #endif
>>>
>>>
>>> I want to do some similar testing for Linux before adding what I
>>> suspect
>>> will be a very similar block using gettid().
>>
>>
>>
>> We could either add something to configure.in. Untested:
>>
>> Index: native/configure.in
>> ===
>> --- native/configure.in (revision 1750462)
>> +++ native/configure.in (working copy)
>> @@ -218,6 +218,9 @@
>>  *-solaris2*)
>>  APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>  ;;
>> +*linux*)
>> +APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>> +;;
>>  *)
>>  ;;
>>  esac
>>
>>
>> and then use a
>>
>> #ifdef TCNATIVE_LINUX
>>
>> or we copy some other more direct linux check from e.g. APR:
>>
>> #ifdef __linux__
>>
>> The latter looks simpler, but the version above is based on all the
>> logic put into config.guess.
>
>
>
> I'd go with the __linux__ option as that is consistent with what we
> already use in os/unix/system.c
>
> I'm not against the change to configure.in, I just think we should be
> consistent with how we do this throughout the code base.



 I've confirmed that the same problem occurs with hash bucket selection
 on linux and that switching to gettid() fixes that problem.

 I'm going to go ahead with the 1.2.8 release shortly. We can continue to
 refine this as necessary and have a more complete fix in 1.2.9.
>>>
>>>
>>>
>>> I did a quick check on Solaris. apr_os_thread_current() uses pthread_self
>>> on
>>> Solaris like on Linux (actually on any Unix type OS), but unlike Linux
>>> where
>>> this returns an address which is either 32 or 64 bit aligned depending on
>>> address size, on Solaris you get an increasing number starting with 1 for
>>> the first thread and incremented by one for each following thread. Thread
>>> IDs do not get reused in the same process, even if the thread finished,
>>> but
>>> thread IDs are common between different processes, because they always
>>> start
>>> with 1. So Solaris should be fine as-is.
>>
>>
>> Does the value have a cap? If not then Solaris will just continue to
>> use more and more memory as threads are created over the lifetime of
>> the server.
>
>
> No cap. I think everyone agrees we are still looking at cleaning up at
> thread death as a further improvement. Whether there's a problem without it
> depends on the executor size and whether the executor actually destroys and
> recreates threads (thus using more IDs than the max thread count).
>

Mark,

The change which was added to 1.2.8 works great on 1.1.34. I also
hacked in calling ERR_remove_thread_state on thread exit. It is
similar to my original change but only does it on a thread's exit,
but as Remy pointed out earlier this will only work if all IO
operations occur on one of the original threads from the thread pool.

-nate

Here is the hack just for completeness.

diff --git a/java/org/apache/catalina/core/StandardThreadExecutor.java
b/java/org/apache/catalina/core/StandardThreadExecutor.java
index a30f29c..80f429d 100644
--- a/java/org/apache/catalina/core/StandardThreadExecutor.java
+++ b/java/org/apache/catalina/core/StandardThreadExecutor.java
@@ -18,6 +18,7 @@
 package org.apache.catalina.core;

 import java.util.concurrent.RejectedExecutionException;
+import java.util.concurrent.ThreadFactory;
 import java.util.concurrent.TimeUnit;

 import org.apache.catalina.Executor;
@@ -27,10 +28,11 @@ import org.apache.catalina.util.LifecycleMBeanBase;
 import org.apache.tomcat.util.threads.ResizableExecutor;
 import org.apache.tomcat.util.threads.TaskQueue;
 import org.apache.tomcat.util.threads.TaskThreadFactory;
+import org.apache.tomcat.util.threads.ThreadCreator;
 import 

Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread Rainer Jung

On 28.06.2016 at 18:06, therealnewo...@gmail.com wrote:

On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung  wrote:

On 28.06.2016 at 16:07, Mark Thomas wrote:


On 28/06/2016 12:28, Mark Thomas wrote:


On 28/06/2016 11:34, Rainer Jung wrote:


On 28.06.2016 at 11:15, Mark Thomas wrote:






Index: src/ssl.c
===
--- src/ssl.c   (revision 1750259)
+++ src/ssl.c   (working copy)
@@ -420,6 +420,10 @@
 return psaptr->PSATOLD;
 #elif defined(WIN32)
 return (unsigned long)GetCurrentThreadId();
+#elif defined(DARWIN)
+uint64_t tid;
+pthread_threadid_np(NULL, &tid);
+return (unsigned long)tid;
 #else
 return (unsigned long)(apr_os_thread_current());
 #endif


I want to do some similar testing for Linux before adding what I
suspect
will be a very similar block using gettid().



We could either add something to configure.in. Untested:

Index: native/configure.in
===
--- native/configure.in (revision 1750462)
+++ native/configure.in (working copy)
@@ -218,6 +218,9 @@
 *-solaris2*)
 APR_ADDTO(TCNATIVE_LIBS, -lkstat)
 ;;
+*linux*)
+APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
+;;
 *)
 ;;
 esac


and then use a

#ifdef TCNATIVE_LINUX

or we copy some other more direct linux check from e.g. APR:

#ifdef __linux__

The latter looks simpler, but the version above is based on all the
logic put into config.guess.



I'd go with the __linux__ option as that is consistent with what we
already use in os/unix/system.c

I'm not against the change to configure.in, I just think we should be
consistent with how we do this throughout the code base.



I've confirmed that the same problem occurs with hash bucket selection
on linux and that switching to gettid() fixes that problem.

I'm going to go ahead with the 1.2.8 release shortly. We can continue to
refine this as necessary and have a more complete fix in 1.2.9.



I did a quick check on Solaris. apr_os_thread_current() uses pthread_self on
Solaris like on Linux (actually on any Unix type OS), but unlike Linux where
this returns an address which is either 32 or 64 bit aligned depending on
address size, on Solaris you get an increasing number starting with 1 for
the first thread and incremented by one for each following thread. Thread
IDs do not get reused in the same process, even if the thread finished, but
thread IDs are common between different processes, because they always start
with 1. So Solaris should be fine as-is.


Does the value have a cap? If not then Solaris will just continue to
use more and more memory as threads are created over the lifetime of
the server.


No cap. I think everyone agrees we are still looking at cleaning up at 
thread death as a further improvement. Whether there's a problem without 
it depends on the executor size and whether the executor actually 
destroys and recreates threads (thus using more IDs than the max thread 
count).


Regards,

Rainer




Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread therealneworld
On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung  wrote:
> On 28.06.2016 at 16:07, Mark Thomas wrote:
>>
>> On 28/06/2016 12:28, Mark Thomas wrote:
>>>
>>> On 28/06/2016 11:34, Rainer Jung wrote:

 On 28.06.2016 at 11:15, Mark Thomas wrote:
>>>
>>>
>>> 
>>>
> Index: src/ssl.c
> ===
> --- src/ssl.c   (revision 1750259)
> +++ src/ssl.c   (working copy)
> @@ -420,6 +420,10 @@
>  return psaptr->PSATOLD;
>  #elif defined(WIN32)
>  return (unsigned long)GetCurrentThreadId();
> +#elif defined(DARWIN)
> +uint64_t tid;
> +pthread_threadid_np(NULL, &tid);
> +return (unsigned long)tid;
>  #else
>  return (unsigned long)(apr_os_thread_current());
>  #endif
>
>
> I want to do some similar testing for Linux before adding what I
> suspect
> will be a very similar block using gettid().


 We could either add something to configure.in. Untested:

 Index: native/configure.in
 ===
 --- native/configure.in (revision 1750462)
 +++ native/configure.in (working copy)
 @@ -218,6 +218,9 @@
  *-solaris2*)
  APR_ADDTO(TCNATIVE_LIBS, -lkstat)
  ;;
 +*linux*)
 +APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
 +;;
  *)
  ;;
  esac


 and then use a

 #ifdef TCNATIVE_LINUX

 or we copy some other more direct linux check from e.g. APR:

 #ifdef __linux__

 The latter looks simpler, but the version above is based on all the
 logic put into config.guess.
>>>
>>>
>>> I'd go with the __linux__ option as that is consistent with what we
>>> already use in os/unix/system.c
>>>
>>> I'm not against the change to configure.in, I just think we should be
>>> consistent with how we do this throughout the code base.
>>
>>
>> I've confirmed that the same problem occurs with hash bucket selection
>> on linux and that switching to gettid() fixes that problem.
>>
>> I'm going to go ahead with the 1.2.8 release shortly. We can continue to
>> refine this as necessary and have a more complete fix in 1.2.9.
>
>
> I did a quick check on Solaris. apr_os_thread_current() uses pthread_self on
> Solaris like on Linux (actually on any Unix type OS), but unlike Linux where
> this returns an address which is either 32 or 64 bit aligned depending on
> address size, on Solaris you get an increasing number starting with 1 for
> the first thread and incremented by one for each following thread. Thread
> IDs do not get reused in the same process, even if the thread finished, but
> thread IDs are common between different processes, because they always start
> with 1. So Solaris should be fine as-is.

Does the value have a cap? If not then Solaris will just continue to
use more and more memory as threads are created over the lifetime of
the server.

-nate




Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread Rainer Jung

On 28.06.2016 at 16:07, Mark Thomas wrote:

On 28/06/2016 12:28, Mark Thomas wrote:

On 28/06/2016 11:34, Rainer Jung wrote:

On 28.06.2016 at 11:15, Mark Thomas wrote:





Index: src/ssl.c
===
--- src/ssl.c   (revision 1750259)
+++ src/ssl.c   (working copy)
@@ -420,6 +420,10 @@
 return psaptr->PSATOLD;
 #elif defined(WIN32)
 return (unsigned long)GetCurrentThreadId();
+#elif defined(DARWIN)
+uint64_t tid;
+pthread_threadid_np(NULL, &tid);
+return (unsigned long)tid;
 #else
 return (unsigned long)(apr_os_thread_current());
 #endif


I want to do some similar testing for Linux before adding what I suspect
will be a very similar block using gettid().


We could either add something to configure.in. Untested:

Index: native/configure.in
===
--- native/configure.in (revision 1750462)
+++ native/configure.in (working copy)
@@ -218,6 +218,9 @@
 *-solaris2*)
 APR_ADDTO(TCNATIVE_LIBS, -lkstat)
 ;;
+*linux*)
+APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
+;;
 *)
 ;;
 esac


and then use a

#ifdef TCNATIVE_LINUX

or we copy some other more direct linux check from e.g. APR:

#ifdef __linux__

The latter looks simpler, but the version above is based on all the
logic put into config.guess.


I'd go with the __linux__ option as that is consistent with what we
already use in os/unix/system.c

I'm not against the change to configure.in, I just think we should be
consistent with how we do this throughout the code base.


I've confirmed that the same problem occurs with hash bucket selection
on linux and that switching to gettid() fixes that problem.

I'm going to go ahead with the 1.2.8 release shortly. We can continue to
refine this as necessary and have a more complete fix in 1.2.9.


I did a quick check on Solaris. apr_os_thread_current() uses 
pthread_self on Solaris like on Linux (actually on any Unix type OS), 
but unlike Linux where this returns an address which is either 32 or 64
bit aligned depending on address size, on Solaris you get an increasing 
number starting with 1 for the first thread and incremented by one for 
each following thread. Thread IDs do not get reused in the same process, 
even if the thread finished, but thread IDs are common between different 
processes, because they always start with 1. So Solaris should be fine 
as-is.



I do think OS specific code to get the real thread ID is the way to go
here until we can switch to OpenSSL 1.1.0 at which point the problem
should go away.


Agreed, but cleaning up at the end of a thread would be good as well,
e.g. if the executor is configured to renew threads when a context is
reloaded we can accumulate lots of old hash entries.


Agreed.

Mark


Rainer




Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread Mark Thomas
On 28/06/2016 12:28, Mark Thomas wrote:
> On 28/06/2016 11:34, Rainer Jung wrote:
>> On 28.06.2016 at 11:15, Mark Thomas wrote:
> 
> 
> 
>>> Index: src/ssl.c
>>> ===
>>> --- src/ssl.c   (revision 1750259)
>>> +++ src/ssl.c   (working copy)
>>> @@ -420,6 +420,10 @@
>>>  return psaptr->PSATOLD;
>>>  #elif defined(WIN32)
>>>  return (unsigned long)GetCurrentThreadId();
>>> +#elif defined(DARWIN)
>>> +uint64_t tid;
>>> +pthread_threadid_np(NULL, &tid);
>>> +return (unsigned long)tid;
>>>  #else
>>>  return (unsigned long)(apr_os_thread_current());
>>>  #endif
>>>
>>>
>>> I want to do some similar testing for Linux before adding what I suspect
>>> will be a very similar block using gettid().
>>
>> We could either add something to configure.in. Untested:
>>
>> Index: native/configure.in
>> ===
>> --- native/configure.in (revision 1750462)
>> +++ native/configure.in (working copy)
>> @@ -218,6 +218,9 @@
>>  *-solaris2*)
>>  APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>  ;;
>> +*linux*)
>> +APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>> +;;
>>  *)
>>  ;;
>>  esac
>>
>>
>> and then use a
>>
>> #ifdef TCNATIVE_LINUX
>>
>> or we copy some other more direct linux check from e.g. APR:
>>
>> #ifdef __linux__
>>
>> The latter looks simpler, but the version above is based on all the
>> logic put into config.guess.
> 
> I'd go with the __linux__ option as that is consistent with what we
> already use in os/unix/system.c
> 
> I'm not against the change to configure.in, I just think we should be
> consistent with how we do this throughout the code base.

I've confirmed that the same problem occurs with hash bucket selection
on linux and that switching to gettid() fixes that problem.

I'm going to go ahead with the 1.2.8 release shortly. We can continue to
refine this as necessary and have a more complete fix in 1.2.9.

Mark


> 
>>> I do think OS specific code to get the real thread ID is the way to go
>>> here until we can switch to OpenSSL 1.1.0 at which point the problem
>>> should go away.
>>
>> Agreed, but cleaning up at the end of a thread would be good as well,
>> e.g. if the executor is configured to renew threads when a context is
>> reloaded we can accumulate lots of old hash entries.
> 
> Agreed.
> 
> Mark
> 
> 
> 





Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread Mark Thomas
On 28/06/2016 11:34, Rainer Jung wrote:
> On 28.06.2016 at 11:15, Mark Thomas wrote:



>> Index: src/ssl.c
>> ===
>> --- src/ssl.c   (revision 1750259)
>> +++ src/ssl.c   (working copy)
>> @@ -420,6 +420,10 @@
>>  return psaptr->PSATOLD;
>>  #elif defined(WIN32)
>>  return (unsigned long)GetCurrentThreadId();
>> +#elif defined(DARWIN)
>> +uint64_t tid;
>> +pthread_threadid_np(NULL, &tid);
>> +return (unsigned long)tid;
>>  #else
>>  return (unsigned long)(apr_os_thread_current());
>>  #endif
>>
>>
>> I want to do some similar testing for Linux before adding what I suspect
>> will be a very similar block using gettid().
> 
> We could either add something to configure.in. Untested:
> 
> Index: native/configure.in
> ===
> --- native/configure.in (revision 1750462)
> +++ native/configure.in (working copy)
> @@ -218,6 +218,9 @@
>  *-solaris2*)
>  APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>  ;;
> +*linux*)
> +APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
> +;;
>  *)
>  ;;
>  esac
> 
> 
> and then use a
> 
> #ifdef TCNATIVE_LINUX
> 
> or we copy some other more direct linux check from e.g. APR:
> 
> #ifdef __linux__
> 
> The latter looks simpler, but the version above is based on all the
> logic put into config.guess.

I'd go with the __linux__ option as that is consistent with what we
already use in os/unix/system.c

I'm not against the change to configure.in, I just think we should be
consistent with how we do this throughout the code base.

>> I do think OS specific code to get the real thread ID is the way to go
>> here until we can switch to OpenSSL 1.1.0 at which point the problem
>> should go away.
> 
> Agreed, but cleaning up at the end of a thread would be good as well,
> e.g. if the executor is configured to renew threads when a context is
> reloaded we can accumulate lots of old hash entries.

Agreed.

Mark





Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread Rainer Jung

On 28.06.2016 at 11:15, Mark Thomas wrote:

On 27/06/2016 23:54, Nate Clark wrote:

On Mon, Jun 27, 2016 at 5:19 PM, Mark Thomas  wrote:

On 27/06/2016 19:59, Nate Clark wrote:

On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung  wrote:





I wouldn't care too much about portability right now. If you can prove
Mark's analysis by using gettid() on Linux and checking that this fixes the
performance issue, that would be very useful feedback. If it doesn't help, we
can go down the path of using the hash statistics to get more info about
what's happening. If it helps, we can think further about how to reduce hash
collisions for every platform.


I went down the hash statistics route anyway. The short version is: with
11 threads and 8 buckets every thread ends up in the same bucket. This
was on OSX. I'm pretty sure Linux will be the same.


I can try that, but unfortunately will not be able to run the test
right away. The machine I was using is a shared resource, which
somebody else needs for the next few days.


Next up is trying to figure out a patch to fix the bucket distribution.


Here is my proposed fix for OSX:

Index: src/ssl.c
===
--- src/ssl.c   (revision 1750259)
+++ src/ssl.c   (working copy)
@@ -420,6 +420,10 @@
 return psaptr->PSATOLD;
 #elif defined(WIN32)
 return (unsigned long)GetCurrentThreadId();
+#elif defined(DARWIN)
+uint64_t tid;
+pthread_threadid_np(NULL, &tid);
+return (unsigned long)tid;
 #else
 return (unsigned long)(apr_os_thread_current());
 #endif


I want to do some similar testing for Linux before adding what I suspect
will be a very similar block using gettid().


We could either add something to configure.in. Untested:

Index: native/configure.in
===
--- native/configure.in (revision 1750462)
+++ native/configure.in (working copy)
@@ -218,6 +218,9 @@
 *-solaris2*)
 APR_ADDTO(TCNATIVE_LIBS, -lkstat)
 ;;
+*linux*)
+APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
+;;
 *)
 ;;
 esac


and then use a

#ifdef TCNATIVE_LINUX

or we copy some other more direct linux check from e.g. APR:

#ifdef __linux__

The latter looks simpler, but the version above is based on all the 
logic put into config.guess.


Finally we could check against SYS_gettid, either during an autodetect
in configure or directly in the C source file. It seems gettid() must be
called as "syscall(SYS_gettid)" and was added in kernel 2.4.11 at the 
end of 2001.
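
For illustration, a minimal sketch of what such a __linux__ branch in
ssl_thread_id() could look like (hypothetical and untested; the raw
syscall is used because glibc of that era shipped no gettid() wrapper):

#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical sketch of a __linux__ branch for ssl_thread_id(). */
static unsigned long ssl_thread_id_linux(void)
{
    return (unsigned long) syscall(SYS_gettid);
}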



Even if that does help the performance and gives the hash map better
distribution, it won't help with long-term bloat. If a system is running
long enough and threads are started and stopped, the hash map could grow
to include up to almost max pid entries, 32768 by default on Linux. With
the size of the ERR_STATE struct that will be roughly 12MB of memory, not
horrible but not ideal. I still think ERR_remove_thread_state() should
probably be invoked prior to a thread exiting. The only way I can think
of doing that, keeping everything in native, and not calling it for every
request would be to call pthread_cleanup_push(ERR_remove_thread_state,
NULL) the first time a thread is used to perform any SSL operation. That
would then require tracking per thread whether the cleanup function had
been registered.


That is definitely a secondary consideration at the moment.


Also, any solution which makes the thread id random and not a bounded
set of values would cause the max memory footprint of the hash map to
be virtually unbounded and require that ERR_remove_thread_state be
invoked to clean up the error states.


The function that generated the 'random' ID from the thread ID would
have to be deterministic so the total number of values should remain
bounded. Shouldn't it?



If the error queues are not being removed that would be a requirement.
If it is to be bounded it cannot use apr_os_thread_current(), since that
is a "random" memory address and inherently unbounded. If you tried to
reduce that value to a smaller set you would then have to worry about
collisions, which any such scheme would have to handle. The code in
err.c assumes that the error state is only ever modified by one thread
and doesn't lock state modification, just the map access. The easiest
thing to base the ID on, from a Linux perspective, would be the TID,
which is bounded and guaranteed to be unique for a running thread. I am
not as sure about other operating systems. It seems like on some
versions of OSX, pthread_getthreadid_np() might do what you want.


I do think OS specific code to get the real thread ID is the way to go
here until we can switch to OpenSSL 1.1.0 at which point the problem
should go away.


Agreed, but cleaning up at the end of a thread would be good as well, 
e.g. if the executor is configured to renew threads when a context is 

Re: Bug that spans tomcat and tomcat-native

2016-06-28 Thread Mark Thomas
On 27/06/2016 23:54, Nate Clark wrote:
> On Mon, Jun 27, 2016 at 5:19 PM, Mark Thomas  wrote:
>> On 27/06/2016 19:59, Nate Clark wrote:
>>> On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung  
>>> wrote:
>>
>> 
>>
 I wouldn't care too much about portability right now. If you can prove
 Mark's analysis by using gettid() on Linux and checking that this fixes the
 performance issue, that would be very useful feedback. If it doesn't help, we
 can go down the path of using the hash statistics to get more info about
 what's happening. If it helps, we can think further about how to reduce hash
 collisions for every platform.
>>
>> I went down the hash statistics route anyway. The short version is: with
>> 11 threads and 8 buckets every thread ends up in the same bucket. This
>> was on OSX. I'm pretty sure Linux will be the same.
>>
>>> I can try that, but unfortunately will not be able to run the test
>>> right away. The machine I was using is a shared resource, which
>>> somebody else needs for the next few days.
>>
>> Next up is trying to figure out a patch to fix the bucket distribution.

Here is my proposed fix for OSX:

Index: src/ssl.c
===
--- src/ssl.c   (revision 1750259)
+++ src/ssl.c   (working copy)
@@ -420,6 +420,10 @@
 return psaptr->PSATOLD;
 #elif defined(WIN32)
 return (unsigned long)GetCurrentThreadId();
+#elif defined(DARWIN)
+uint64_t tid;
+pthread_threadid_np(NULL, &tid);
+return (unsigned long)tid;
 #else
 return (unsigned long)(apr_os_thread_current());
 #endif


I want to do some similar testing for Linux before adding what I suspect
will be a very similar block using gettid().

>>> Even if that does help the performance and gives the hash map better
>>> distribution, it won't help with long-term bloat. If a system is running
>>> long enough and threads are started and stopped, the hash map could grow
>>> to include up to almost max pid entries, 32768 by default on Linux. With
>>> the size of the ERR_STATE struct that will be roughly 12MB of memory, not
>>> horrible but not ideal. I still think ERR_remove_thread_state() should
>>> probably be invoked prior to a thread exiting. The only way I can think
>>> of doing that, keeping everything in native, and not calling it for every
>>> request would be to call pthread_cleanup_push(ERR_remove_thread_state,
>>> NULL) the first time a thread is used to perform any SSL operation. That
>>> would then require tracking per thread whether the cleanup function had
>>> been registered.
>>
>> That is definitely a secondary consideration at the moment.
>>
>>> Also, any solution which makes the thread id random and not a bounded
>>> set of values would cause the max memory footprint of the hash map to
>>> be virtually unbounded and require that ERR_remove_thread_state be
>>> invoked to clean up the error states.
>>
>> The function that generated the 'random' ID from the thread ID would
>> have to be deterministic so the total number of values should remain
>> bounded. Shouldn't it?
>>
> 
> If the error queues are not being removed that would be a requirement.
> If it is to be bounded it cannot use apr_os_thread_current(), since that
> is a "random" memory address and inherently unbounded. If you tried to
> reduce that value to a smaller set you would then have to worry about
> collisions, which any such scheme would have to handle. The code in
> err.c assumes that the error state is only ever modified by one thread
> and doesn't lock state modification, just the map access. The easiest
> thing to base the ID on, from a Linux perspective, would be the TID,
> which is bounded and guaranteed to be unique for a running thread. I am
> not as sure about other operating systems. It seems like on some
> versions of OSX, pthread_getthreadid_np() might do what you want.

I do think OS specific code to get the real thread ID is the way to go
here until we can switch to OpenSSL 1.1.0 at which point the problem
should go away.

Mark




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Nate Clark
On Mon, Jun 27, 2016 at 5:19 PM, Mark Thomas  wrote:
> On 27/06/2016 19:59, Nate Clark wrote:
>> On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung  wrote:
>
> 
>
>>> I wouldn't care too much about portability right now. If you can prove
>>> Mark's analysis by using gettid() on Linux and checking that this fixes the
>>> performance issue, that would be very useful feedback. If it doesn't help, we
>>> can go down the path of using the hash statistics to get more info about
>>> what's happening. If it helps, we can think further about how to reduce hash
>>> collisions for every platform.
>
> I went down the hash statistics route anyway. The short version is: with
> 11 threads and 8 buckets every thread ends up in the same bucket. This
> was on OSX. I'm pretty sure Linux will be the same.
>
>> I can try that, but unfortunately will not be able to run the test
>> right away. The machine I was using is a shared resource, which
>> somebody else needs for the next few days.
>
> Next up is trying to figure out a patch to fix the bucket distribution.
>
>> Even if that does help the performance and gives the hash map better
>> distribution, it won't help with long-term bloat. If a system is running
>> long enough and threads are started and stopped, the hash map could grow
>> to include up to almost max pid entries, 32768 by default on Linux. With
>> the size of the ERR_STATE struct that will be roughly 12MB of memory, not
>> horrible but not ideal. I still think ERR_remove_thread_state() should
>> probably be invoked prior to a thread exiting. The only way I can think
>> of doing that, keeping everything in native, and not calling it for every
>> request would be to call pthread_cleanup_push(ERR_remove_thread_state,
>> NULL) the first time a thread is used to perform any SSL operation. That
>> would then require tracking per thread whether the cleanup function had
>> been registered.
>
> That is definitely a secondary consideration at the moment.
>
>> Also, any solution which makes the thread id random and not a bounded
>> set of values would cause the max memory footprint of the hash map to
>> be virtually unbounded and require that ERR_remove_thread_state be
>> invoked to clean up the error states.
>
> The function that generated the 'random' ID from the thread ID would
> have to be deterministic so the total number of values should remain
> bounded. Shouldn't it?
>

If the error queues are not being removed that would be a requirement.
If it is to be bounded it cannot use apr_os_thread_current(), since that
is a "random" memory address and inherently unbounded. If you tried to
reduce that value to a smaller set you would then have to worry about
collisions, which any such scheme would have to handle. The code in
err.c assumes that the error state is only ever modified by one thread
and doesn't lock state modification, just the map access. The easiest
thing to base the ID on, from a Linux perspective, would be the TID,
which is bounded and guaranteed to be unique for a running thread. I am
not as sure about other operating systems. It seems like on some
versions of OSX, pthread_getthreadid_np() might do what you want.

-nate




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Mark Thomas
On 27/06/2016 19:59, Nate Clark wrote:
> On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung  wrote:



>> I wouldn't care too much about portability right now. If you can prove
>> Mark's analysis by using gettid() on Linux and checking that this fixes the
>> performance issue, that would be very useful feedback. If it doesn't help, we
>> can go down the path of using the hash statistics to get more info about
>> what's happening. If it helps, we can think further about how to reduce hash
>> collisions for every platform.

I went down the hash statistics route anyway. The short version is: with
11 threads and 8 buckets every thread ends up in the same bucket. This
was on OSX. I'm pretty sure Linux will be the same.

> I can try that, but unfortunately will not be able to run the test
> right away. The machine I was using is a shared resource, which
> somebody else needs for the next few days.

Next up is trying to figure out a patch to fix the bucket distribution.

> Even if that does help the performance and gives the hash map better
> distribution, it won't help with long-term bloat. If a system is running
> long enough and threads are started and stopped, the hash map could grow
> to include up to almost max pid entries, 32768 by default on Linux. With
> the size of the ERR_STATE struct that will be roughly 12MB of memory, not
> horrible but not ideal. I still think ERR_remove_thread_state() should
> probably be invoked prior to a thread exiting. The only way I can think
> of doing that, keeping everything in native, and not calling it for every
> request would be to call pthread_cleanup_push(ERR_remove_thread_state,
> NULL) the first time a thread is used to perform any SSL operation. That
> would then require tracking per thread whether the cleanup function had
> been registered.

That is definitely a secondary consideration at the moment.

> Also, any solution which makes the thread id random and not a bounded
> set of values would cause the max memory footprint of the hash map to
> be virtually unbounded and require that ERR_remove_thread_state be
> invoked to clean up the error states.

The function that generated the 'random' ID from the thread ID would
have to be deterministic so the total number of values should remain
bounded. Shouldn't it?

Mark




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Nate Clark
On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung  wrote:
> On 27.06.2016 at 18:47, Nate Clark wrote:
>>
>> On Mon, Jun 27, 2016 at 12:17 PM,   wrote:
>>>
>>> On Mon, Jun 27, 2016 at 12:04 PM,   wrote:

 On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung 
 wrote:
>
> Great analysis. I was really wondering what could make the hash map so
> huge
> and hadn't thought about the hash function as the problem.
>
> Before OpenSSL 1.1.0 there's a callback for applications to provide
> their
> own thread IDs:
>
>
> https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html
>
> So we could probably work around the problem of the poor hashing
> function by
> passing in IDs that work for hashing (pre-hashed ID?). Of course then
> we
> lose the direct association of the OpenSSL thread ID with the real
> platform
> thread id.
>
> Currently our callback in tcnative is ssl_set_thread_id() which refers
> to
> ssl_thread_id(), which on Linux gets the ID from the APR function
> apr_os_thread_current(). So we could add some hashing formula in
> ssl_thread_id().


 I think just using the real thread id would work, since right now it isn't
 using the real thread id; instead it is using the memory address of
 errno. If this were the real thread id I think the hash algorithm and
 bucket selection they have now will work much better since the thread
 ids are basically numerically increasing and aren't aligned to powers
 of 2.
>>>
>>>
>>> Sorry, I need to read code a bit closer. I misread the
>>> OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0.
>>>
>>> I would still expect real thread ids to provide enough of a
>>> distribution in a map where the hash bucket ends up being ((ID *
>>> 13) & 0x7F).
>>
>>
>> OK, this now makes more sense: apr_os_thread_current() calls
>> pthread_self(), which returns a pthread_t, which is basically a handle
>> for pthread's internal state and not an actual ID. It might be better
>> to use gettid(), but that isn't as portable.
>
>
> I wouldn't care too much about portability right now. If you can prove
> Mark's analysis by using gettid() on Linux and checking that this fixes the
> performance issue, that would be very useful feedback. If it doesn't help, we
> can go down the path of using the hash statistics to get more info about
> what's happening. If it helps, we can think further about how to reduce hash
> collisions for every platform.

I can try that, but unfortunately will not be able to run the test
right away. The machine I was using is a shared resource, which
somebody else needs for the next few days.

Even if that does help the performance and gives the hash map better
distribution, it won't help with long-term bloat. If a system is running
long enough and threads are started and stopped, the hash map could grow
to include up to almost max pid entries, 32768 by default on Linux. With
the size of the ERR_STATE struct that will be roughly 12MB of memory, not
horrible but not ideal. I still think ERR_remove_thread_state() should
probably be invoked prior to a thread exiting. The only way I can think
of doing that, keeping everything in native, and not calling it for every
request would be to call pthread_cleanup_push(ERR_remove_thread_state,
NULL) the first time a thread is used to perform any SSL operation. That
would then require tracking per thread whether the cleanup function had
been registered.
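
For completeness, a sketch of that idea (illustrative names, untested).
Since pthread_cleanup_push() must be lexically paired with
pthread_cleanup_pop() in the same scope, a thread-local key with a
destructor is one way to get the same run-once-at-thread-exit effect:

#include <pthread.h>
#include <openssl/err.h>

static pthread_key_t err_cleanup_key;

/* runs once when each thread that armed the key exits */
static void err_cleanup_destructor(void *unused)
{
    (void) unused;
    ERR_remove_thread_state(NULL);  /* drop this thread's error queue */
}

static void err_cleanup_init(void)  /* call once, e.g. via pthread_once() */
{
    pthread_key_create(&err_cleanup_key, err_cleanup_destructor);
}

static void err_cleanup_arm(void)   /* call before a thread's first SSL op */
{
    /* any non-NULL value arms the destructor for the calling thread */
    pthread_setspecific(err_cleanup_key, (void *) 1);
}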

Also, any solution which makes the thread id random and not a bounded
set of values would cause the max memory footprint of the hash map to
be virtually unbounded and require that ERR_remove_thread_state be
invoked to clean up the error states.

-nate




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Rainer Jung

On 27.06.2016 at 18:47, Nate Clark wrote:

On Mon, Jun 27, 2016 at 12:17 PM,   wrote:

On Mon, Jun 27, 2016 at 12:04 PM,   wrote:

On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung  wrote:

Great analysis. I was really wondering what could make the hash map so huge
and hadn't thought about the hash function as the problem.

Before OpenSSL 1.1.0 there's a callback for applications to provide their
own thread IDs:

https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html

So we could probably work around the problem of the poor hashing function by
passing in IDs that work for hashing (pre-hashed ID?). Of course then we
lose the direct association of the OpenSSL thread ID with the real platform
thread id.

Currently our callback in tcnative is ssl_set_thread_id() which refers to
ssl_thread_id(), which on Linux gets the ID from the APR function
apr_os_thread_current(). So we could add some hashing formula in
ssl_thread_id().
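
An untested sketch of that idea (the multiplier is an arbitrary odd
constant chosen to spread nearby pthread_self() values, not anything
tc-native actually uses):

#include <pthread.h>
#include <openssl/crypto.h>

static void ssl_thread_id_mixed(CRYPTO_THREADID *id)
{
    unsigned long raw = (unsigned long) pthread_self();
    /* pre-hash before OpenSSL's err_state_hash multiplies by 13 */
    CRYPTO_THREADID_set_numeric(id, raw * 0x9E3779B1UL);
}

/* during init: CRYPTO_THREADID_set_callback(ssl_thread_id_mixed); */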


I think just using the real thread id would work, since right now it isn't
using the real thread id; instead it is using the memory address of
errno. If this were the real thread id I think the hash algorithm and
bucket selection they have now will work much better since the thread
ids are basically numerically increasing and aren't aligned to powers
of 2.


Sorry, I need to read code a bit closer. I misread the
OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0.

I would still expect real thread ids to provide enough of a
distribution in a map where the hash bucket ends up being ((ID *
13) & 0x7F).


OK, this now makes more sense: apr_os_thread_current() calls
pthread_self(), which returns a pthread_t, which is basically a handle
for pthread's internal state and not an actual ID. It might be better
to use gettid(), but that isn't as portable.


I wouldn't care too much about portability right now. If you can prove 
Mark's analysis by using gettid() on Linux and checking that this fixes 
the performance issue, that would be very useful feedback. If it doesn't
help, we can go down the path of using the hash statistics to get more 
info about what's happening. If it helps, we can think further about how 
to reduce hash collisions for every platform.


Regards,

Rainer




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Nate Clark
On Mon, Jun 27, 2016 at 12:17 PM,   wrote:
> On Mon, Jun 27, 2016 at 12:04 PM,   wrote:
>> On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung  
>> wrote:
>>> Great analysis. I was really wondering what could make the hash map so huge
>>> and hadn't thought about the hash function as the problem.
>>>
>>> Before OpenSSL 1.1.0 there's a callback for applications to provide their
>>> own thread IDs:
>>>
>>> https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html
>>>
>>> So we could probably work around the problem of the poor hashing function by
>>> passing in IDs that work for hashing (pre-hashed ID?). Of course then we
>>> lose the direct association of the OpenSSL thread ID with the real platform
>>> thread id.
>>>
>>> Currently our callback in tcnative is ssl_set_thread_id() which refers to
>>> ssl_thread_id(), which on Linux gets the ID from the APR function
>>> apr_os_thread_current(). So we could add some hashing formula in
>>> ssl_thread_id().
>>
>> I think just using the real thread id would work, since right now it isn't
>> using the real thread id; instead it is using the memory address of
>> errno. If this were the real thread id I think the hash algorithm and
>> bucket selection they have now will work much better since the thread
>> ids are basically numerically increasing and aren't aligned to powers
>> of 2.
>
> Sorry, I need to read code a bit closer. I misread the
> OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0.
>
> I would still expect real thread ids to provide enough of a
> distribution in a map where the hash bucket ends up being ((ID *
> 13) & 0x7F).

OK, this now makes more sense: apr_os_thread_current() calls
pthread_self(), which returns a pthread_t, which is basically a handle
for pthread's internal state and not an actual ID. It might be better
to use gettid(), but that isn't as portable.

-nate




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread therealneworld
On Mon, Jun 27, 2016 at 12:04 PM,   wrote:
> On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung  wrote:
>> Great analysis. I was really wondering what could make the hash map so huge
>> and hadn't thought about the hash function as the problem.
>>
>> Before OpenSSL 1.1.0 there's a callback for applications to provide their
>> own thread IDs:
>>
>> https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html
>>
>> So we could probably work around the problem of the poor hashing function by
>> passing in IDs that work for hashing (pre-hashed ID?). Of course then we
>> lose the direct association of the OpenSSL thread ID with the real platform
>> thread id.
>>
>> Currently our callback in tcnative is ssl_set_thread_id() which refers to
>> ssl_thread_id(), which on Linux gets the ID from the APR function
>> apr_os_thread_current(). So we could add some hashing formula in
>> ssl_thread_id().
>
> I think just using the real thread id would work, since right now it isn't
> using the real thread id; instead it is using the memory address of
> errno. If this were the real thread id I think the hash algorithm and
> bucket selection they have now will work much better since the thread
> ids are basically numerically increasing and aren't aligned to powers
> of 2.

Sorry, I need to read code a bit closer. I misread the
OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0.

I would still expect real thread ids to provide enough of a
distribution in a map where the hash bucket ends up being ((ID *
13) & 0x7F).

-nate




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Mark Thomas
On 27/06/2016 17:04, therealnewo...@gmail.com wrote:
> On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung  wrote:

Thanks for the feedback.



>> So we could probably work around the problem of the poor hashing function by
>> passing in IDs that work for hashing (pre-hashed ID?). Of course then we
>> lose the direct association of the OpenSSL thread ID with the real platform
>> thread id.
>>
>> Currently our callback in tcnative is ssl_set_thread_id() which refers to
>> ssl_thread_id(), which on Linux gets the ID from the APR function
>> apr_os_thread_current(). So we could add some hashing formula in
>> ssl_thread_id().
> 
> I think just using the real thread id would work, since right now it isn't
> using the real thread id; instead it is using the memory address of
> errno. If this were the real thread id I think the hash algorithm and
> bucket selection they have now will work much better since the thread
> ids are basically numerically increasing and aren't aligned to powers
> of 2.

Just so you aren't surprised, I'm going to go ahead and start the 1.2.8
release without this fix.

Regarding this fix, I plan to add Rainer's suggested debug code to try
and confirm my previous analysis and then code up a fix based on the
real thread id for Nate to test. Nate, if you have enough info in this
thread to work on a patch along those lines yourself please do feel free
to go ahead and don't wait for me.

Thanks,

Mark




Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread therealneworld
On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung  wrote:
> Hi Mark,
>
>
> On 27.06.2016 at 15:11, Mark Thomas wrote:
>>
>> I believe I have an explanation for what is going on that fits both the
>> reported behaviour and the proposed fix.
>>
>> Background
>> ==
>>
>> OpenSSL tracks a list of the most recent errors for each thread in a
>> hash map keyed on the thread (int_thread_hash in err.c). Reading and
>> writing to this hash map is protected by a lock. The hash map is created
>> and populated lazily.
>>
>> tc-native calls ERR_clear_error() before every call to
>> SSL_do_handshake(), SSL_read() and SSL_write(). The call to
>> ERR_clear_error() either clears the error list for the current thread or
>> inserts a new empty list into the hash map if the thread is not already
>> present.
>>
>> The performance problem was tracked down to threads waiting in
>> ERR_clear_error() to obtain the write lock for the hash map.
>>
>> The proposed solution was to call ERR_remove_thread_state() just before
>> the current Tomcat thread processing the connection is returned to the
>> thread pool. This method removes the current thread and its associated
>> error list from the hash map.
>>
>>
>> Analysis
>> 
>>
>> The proposed solution, calling ERR_remove_thread_state(), adds a call
>> that also obtains the write lock for the hash map. This indicates that
>> the problem is not delays in obtaining the lock but contention for the
>> lock because one or more operations taking place within the lock are
>> taking a long time.
>>
>> Removing unused threads from the hash map removes the bottleneck. This
>> points towards the hash map being the source of the problem.
>>
>> Testing by the OP showed that as soon as a test had been run that
>> required ~ 400 concurrent threads performance dropped significantly. It
>> did not get noticeably worse if the same 400 thread test was run
>> repeatedly.
>>
>> My testing indicated, on OSX at least, that the thread IDs used in the
>> hash map were stable and that uncontrolled growth of the hash map was
>> unlikely to be the cause.
>>
>> The manner in which thread IDs are generated varies by platform. On
>> Linux, where this problem was observed, the thread ID is derived from
>> (is normally equal to) the memory address of the per thread errno
>> variable. This means that thread IDs tend to be concentrated in a
>> relatively narrow range of values. For example, in a simple 10 thread
>> test on OSX thread IDs ranged from 123145344839680 to 123145354387455.
>>
>> Thread IDs therefore fall within a 10^7 range out of a possible range of
>> 1.8x10^19, i.e. a very small, contiguous range.
>>
>> Hash maps use hashing functions to ensure that entries are (roughly)
>> evenly distributed between the available buckets. The hash function,
>> err_state_hash, used for the thread IDs in OpenSSL is threadID * 13.
>>
>> Supposition
>> ===
>>
>> The hash function used (multiply by 13) is insufficient to distribute
>> the resulting values across multiple buckets because they will still
>> fall in a relatively narrow band. Therefore all the threads end up in a
>> single bucket which makes the performance of the hash map poor. This in
>> turn makes calls to thread_get_item() slow because it does a hash map
>> lookup. This lookup is performed with the read lock held for the hash
>> map which in turn will slow down the calls that require the write lock.
>>
>> Proposal
>> 
>>
>> The analysis and supposition above need to be checked by someone with a
>> better understanding of C than me. Assuming my work is correct, the next
>> step is to look at possible fixes. I do not believe that patching
>> OpenSSL is a viable option.
>>
>> The OpenSSL API needs to be reviewed to see if there is a way to avoid
>> the calls that require the write lock.
>>
>> If the write lock cannot be avoided then we need to see if there is a
>> better place to call ERR_remove_thread_state(). I'd like to fix this
>> entirely in tc-native but that may mean calling
>> ERR_remove_thread_state() more frequently which could create its own
>> performance problems.
>>
>> Nate - I may have some patches for you to test in the next few days.
>>
>> Mark
>
>
> Great analysis. I was really wondering what could make the hash map so huge
> and hadn't thought about the hash function as the problem.
>
> Before OpenSSL 1.1.0 there's a callback for applications to provide their
> own thread IDs:
>
> https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html
>
> So we could probably work around the problem of the poor hashing function by
> passing in IDs that work for hashing (pre-hashed ID?). Of course then we
> lose the direct association of the OpenSSL thread ID with the real platform
> thread id.
>
> Currently our callback in tcnative is ssl_set_thread_id() which refers to
> ssl_thread_id(), which on Linux gets the ID from the APR function
> apr_os_thread_current(). So we could add some 

Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Nate Clark
On Mon, Jun 27, 2016 at 9:11 AM, Mark Thomas  wrote:
> I believe I have an explanation for what is going on that fits both the
> reported behaviour and the proposed fix.
>
> Background
> ==
>
> OpenSSL tracks a list of the most recent errors for each thread in a
> hash map keyed on the thread (int_thread_hash in err.c). Reading and
> writing to this hash map is protected by a lock. The hash map is created
> and populated lazily.
>
> tc-native calls ERR_clear_error() before every call to
> SSL_do_handshake(), SSL_read() and SSL_write(). The call to
> ERR_clear_error() either clears the error list for the current thread or
> inserts a new empty list into the hash map if the thread is not already
> present.
>
> The performance problem was tracked down to threads waiting in
> ERR_clear_error() to obtain the write lock for the hash map.
>
> The proposed solution was to call ERR_remove_thread_state() just before
> the current Tomcat thread processing the connection is returned to the
> thread pool. This method removes the current thread and its associated
> error list from the hash map.
>
>
> Analysis
> 
>
> The proposed solution, calling ERR_remove_thread_state(), adds a call
> that also obtains the write lock for the hash map. This indicates that
> the problem is not delays in obtaining the lock but contention for the
> lock because one or more operations taking place within the lock are
> taking a long time.
>
> Removing unused threads from the hash map removes the bottleneck. This
> points towards the hash map being the source of the problem.
>
> Testing by the OP showed that as soon as a test had been run that
> required ~ 400 concurrent threads performance dropped significantly. It
> did not get noticeably worse if the same 400 thread test was run repeatedly.
>
> My testing indicated, on OSX at least, that the thread IDs used in the
> hash map were stable and that uncontrolled growth of the hash map was
> unlikely to be the cause.
>
> The manner in which thread IDs are generated varies by platform. On
> Linux, where this problem was observed, the thread ID is derived from
> (is normally equal to) the memory address of the per thread errno
> variable. This means that thread IDs tend to be concentrated in a
> relatively narrow range of values. For example, in a simple 10 thread
> test on OSX thread IDs ranged from 123145344839680 to 123145354387455.
>
> Thread IDs therefore fall within a 10^7 range out of a possible range of
> 1.8x10^19, i.e. a very small, contiguous range.
>
> Hash maps use hashing functions to ensure that entries are (roughly)
> evenly distributed between the available buckets. The hash function,
> err_state_hash, used for the thread IDs in OpenSSL is threadID * 13.
>
> Supposition
> ===
>
> The hash function used (multiply by 13) is insufficient to distribute
> the resulting values across multiple buckets because they will still
> fall in a relatively narrow band. Therefore all the threads end up in a
> single bucket which makes the performance of the hash map poor. This in
> turn makes calls to thread_get_item() slow because it does a hash map
> lookup. This lookup is performed with the read lock held for the hash
> map which in turn will slow down the calls that require the write lock.

This makes sense. Looking at the OpenSSL source code a bit more, I think
it is a combination of the hash function and the bucket selection in the
hash map. The bucket is chosen essentially by a low-bit mask, since pmax
and num_alloc_nodes are always powers of 2. With 400 entries pmax=128
and num_alloc_nodes=256, making the bucket selection either hash & 0x7F
or hash & 0xFF. The errno value memory locations are probably aligned,
and often that mask will result in the same bucket being chosen.
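
As an illustration (standalone C with a made-up base address and
per-thread stride, not values measured on a real system): once the
spacing between the errno locations is a multiple of 256, multiplying
by 13 does nothing to the low bits the mask keeps, and every thread
lands in the same bucket.

#include <stdio.h>

int main(void)
{
    unsigned long base   = 0x7f0f61ab9000UL; /* hypothetical first errno address */
    unsigned long stride = 0x801000UL;       /* hypothetical per-thread spacing */
    int buckets[256] = {0};
    int i, used = 0;

    for (i = 0; i < 400; i++) {
        unsigned long id   = base + (unsigned long)i * stride;
        unsigned long hash = id * 13;        /* err_state_hash() */
        buckets[hash & 0xFF]++;              /* low-bit bucket selection */
    }
    for (i = 0; i < 256; i++)
        if (buckets[i])
            used++;
    printf("400 thread IDs spread over %d of 256 buckets\n", used);
    return 0;
}

With these made-up inputs it prints "400 thread IDs spread over 1 of
256 buckets"; a stride that is not a multiple of 256 spreads the IDs
somewhat better, but they still cluster.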

> Proposal
> 
>
> The analysis and supposition above need to be checked by someone with a
> better understanding of C than me. Assuming my work is correct, the next
> step is to look at possible fixes. I do not believe that patching
> OpenSSL is a viable option.
>
> The OpenSSL API needs to be reviewed to see if there is a way to avoid
> the calls that require the write lock.
>
> If the write lock cannot be avoided then we need to see if there is a
> better place to call ERR_remove_thread_state(). I'd like to fix this
> entirely in tc-native but that may mean calling
> ERR_remove_thread_state() more frequently which could create its own
> performance problems.
>
> Nate - I may have some patches for you to test in the next few days.

Sounds good.

Thanks,
-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Rainer Jung

Hi Mark,

Am 27.06.2016 um 15:11 schrieb Mark Thomas:

I believe I have an explanation for what is going on that fits both the
reported behaviour and the proposed fix.

Background
==

OpenSSL tracks a list of the most recent errors for each thread in a
hash map keyed on the thread (int_thread_hash in err.c). Reading and
writing to this hash map is protected by a lock. The hash map is created
and populated lazily.

tc-native calls ERR_clear_error() before every call to
SSL_do_handshake(), SSL_read() and SSL_write(). The call to
ERR_clear_error() either clears the error list for the current thread or
inserts a new empty list into the hash map if the thread is not already
present.

The performance problem was tracked down to threads waiting in
ERR_clear_error() to obtain the write lock for the hash map.

The proposed solution was to call ERR_remove_thread_state() just before
the current Tomcat thread processing the connection is returned to the
thread pool. This method removes the current thread and its associated
error list from the hash map.


Analysis


The proposed solution, calling ERR_remove_thread_state(), adds a call
that also obtains the write lock for the hash map. This indicates that
the problem is not delays in obtaining the lock but contention for the
lock because one or more operations taking place within the lock are
taking a long time.

Removing unused threads from the hash map removes the bottleneck. This
points towards the hash map being the source of the problem.

Testing by the OP showed that as soon as a test had been run that
required ~ 400 concurrent threads performance dropped significantly. It
did not get noticeably worse if the same 400 thread test was run repeatedly.

My testing indicated, on OSX at least, that the thread IDs used in the
hash map were stable and that uncontrolled growth of the hash map was
unlikely to be the cause.

The manner in which thread IDs are generated varies by platform. On
Linux, where this problem was observed, the thread ID is derived from
(is normally equal to) the memory address of the per thread errno
variable. This means that thread IDs tend to be concentrated in a
relatively narrow range of values. For example, in a simple 10 thread
test on OSX thread IDs ranged from 123145344839680 to 123145354387455.

Thread IDs therefore fall within a 10^7 range within a possible range of
1.8x10^19, i.e. a very small, contiguous range.

Hash maps use hashing functions to ensure that entries are (roughly)
evenly distributed between the available buckets. The hash function,
err_state_hash, used for the thread IDs in OpenSSL is threadID * 13.

Supposition
===

The hash function used (multiply by 13) is insufficient to distribute
the resulting values across multiple buckets because they will still
fall in a relatively narrow band. Therefore all the threads end up in a
single bucket which makes the performance of the hash map poor. This in
turn makes calls to thread_get_item() slow because it does a hash map
lookup. This lookup is performed with the read lock held for the hash
map which in turn will slow down the calls that require the write lock.

Proposal


The analysis and supposition above need to be checked by someone with a
better understanding of C than me. Assuming my work is correct, the next
step is to look at possible fixes. I do not believe that patching
OpenSSL is a viable option.

The OpenSSL API needs to be reviewed to see if there is a way to avoid
the calls that require the write lock.

If the write lock cannot be avoided then we need to see if there is a
better place to call ERR_remove_thread_state(). I'd like to fix this
entirely in tc-native but that may mean calling
ERR_remove_thread_state() more frequently which could create its own
performance problems.

Nate - I may have some patches for you to test in the next few days.

Mark


Great analysis. I was really wondering what could make the hash map so
huge, and hadn't thought about the hash function as the problem.


Before OpenSSL 1.1.0 there's a callback for applications to provide 
their own thread IDs:


https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html

So we could probably work around the problem of the poor hashing 
function by passing in IDs that work for hashing (pre-hashed ID?). Of 
course then we lose the direct association of the OpenSSL thread ID 
with the real platform thread id.


Currently our callback in tcnative is ssl_set_thread_id() which refers 
to ssl_thread_id(), which on Linux gets the ID from the APR function 
apr_os_thread_current(). So we could add some hashing formula in 
ssl_thread_id().
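
A minimal sketch of what that could look like (illustrative only, not a
tested patch; it assumes the surrounding ssl.c includes plus <stdint.h>,
and the three xor/multiply steps are the well-known 64-bit finalizer
from MurmurHash3/splitmix64):

static unsigned long ssl_thread_id(void)
{
    /* Same starting point as the current Linux code path... */
    uint64_t h = (uint64_t)(uintptr_t)apr_os_thread_current();

    /* ...but mix the bits so nearby addresses spread across buckets. */
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;

    return (unsigned long)h;
}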


In OpenSSL 1.1.0 this thread ID callback is gone. We'll see whether the 
lock hash problem is gone too ...


I think we could add some logging to the hash impl to verify the 
assumption that many thread IDs result in the same hash value. The 
lhash impl already seems to maintain quite a few statistics:
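
For instance (a rough sketch only; it would have to live in a local
patch to OpenSSL's crypto/err/err.c, since int_thread_hash is static
there, and lh_num_items()/lh_node_usage_stats() are the existing lhash
statistics routines in 1.0.x):

static void dump_int_thread_hash_stats(void)
{
    if (int_thread_hash == NULL)
        return;
    fprintf(stderr, "int_thread_hash: %lu entries\n",
            lh_num_items((const _LHASH *)int_thread_hash));
    /* Prints the load per bucket; a few very long chains would confirm
     * that many thread IDs collapse onto the same hash value. */
    lh_node_usage_stats((const _LHASH *)int_thread_hash, stderr);
}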



Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Mark Thomas
I believe I have an explanation for what is going on that fits both the
reported behaviour and the proposed fix.

Background
==

OpenSSL tracks a list of the most recent errors for each thread in a
hash map keyed on the thread (int_thread_hash in err.c). Reading and
writing to this hash map is protected by a lock. The hash map is created
and populated lazily.

tc-native calls ERR_clear_error() before every call to
SSL_do_handshake(), SSL_read() and SSL_write(). The call to
ERR_clear_error() either clears the error list for the current thread or
inserts a new empty list into the hash map if the thread is not already
present.
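
In tc-native terms the hot path looks roughly like this (a sketch of
the pattern only, not the actual sslnetwork.c source):

#include <openssl/err.h>
#include <openssl/ssl.h>

static int ssl_read_cleared(SSL *ssl, void *buf, int num)
{
    /* ERR_clear_error() calls ERR_get_state(), which looks the current
     * thread up in int_thread_hash under the error-state lock. */
    ERR_clear_error();
    return SSL_read(ssl, buf, num);
}

So every read and write pays for at least one lookup in that hash map.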

The performance problem was tracked down to threads waiting in
ERR_clear_error() to obtain the write lock for the hash map.

The proposed solution was to call ERR_remove_thread_state() just before
the current Tomcat thread processing the connection is returned to the
thread pool. This method removes the current thread and its associated
error list from the hash map.


Analysis


The proposed solution, calling ERR_remove_thread_state(), adds a call
that also obtains the write lock for the hash map. This indicates that
the problem is not delays in obtaining the lock but contention for the
lock because one or more operations taking place within the lock are
taking a long time.

Removing unused threads from the hash map removes the bottleneck. This
points towards the hash map being the source of the problem.

Testing by the OP showed that as soon as a test had been run that
required ~ 400 concurrent threads performance dropped significantly. It
did not get noticeably worse if the same 400 thread test was run repeatedly.

My testing indicated, on OSX at least, that the thread IDs used in the
hash map were stable and that uncontrolled growth of the hash map was
unlikely to be the cause.

The manner in which thread IDs are generated varies by platform. On
Linux, where this problem was observed, the thread ID is derived from
(is normally equal to) the memory address of the per thread errno
variable. This means that thread IDs tend to be concentrated in a
relatively narrow range of values. For example, in a simple 10 thread
test on OSX thread IDs ranged from 123145344839680 to 123145354387455.

Thread IDs therefore fall within a 10^7 range within a possible range of
1.8x10^19, i.e. a very small, contiguous range.

Hash maps use hashing functions to ensure that entries are (roughly)
evenly distributed between the available buckets. The hash function,
err_state_hash, used for the thread IDs in OpenSSL is threadID * 13.

Supposition
===

The hash function used (multiply by 13) is insufficient to distribute
the resulting values across multiple buckets because they will still
fall in a relatively narrow band. Therefore all the threads end up in a
single bucket which makes the performance of the hash map poor. This in
turn makes calls to thread_get_item() slow because it does a hash map
lookup. This lookup is performed with the read lock held for the hash
map which in turn will slow down the calls that require the write lock.

Proposal


The analysis and supposition above need to be checked by someone with a
better understanding of C than me. Assuming my work is correct, the next
step is to look at possible fixes. I do not believe that patching
OpenSSL is a viable option.

The OpenSSL API needs to be reviewed to see if there is a way to avoid
the calls that require the write lock.

If the write lock cannot be avoided then we need to see if there is a
better place to call ERR_remove_thread_state(). I'd like to fix this
entirely in tc-native but that may mean calling
ERR_remove_thread_state() more frequently which could create its own
performance problems.

Nate - I may have some patches for you to test in the next few days.

Mark

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-27 Thread Mark Thomas
Thanks. This is all really helpful. I need to spend some time digesting
all of this information, reviewing the OpenSSL code and trying to figure
out what might be going on.

One of the things I am trying to do is to understand why the proposed
patch fixes the issue and if an alternative approach is possible.
Ideally, I'd like to fix this entirely in tc-native so we don't have to
cross the Java-native boundary an additional time, which is relatively slow.

Thanks for all your help tracking this down so far.

Mark

On 25/06/2016 00:51, therealnewo...@gmail.com wrote:
> On Fri, Jun 24, 2016 at 5:31 PM, Mark Thomas  wrote:
>> On 24/06/2016 21:52, therealnewo...@gmail.com wrote:
>>
 I'm wondering if this is a problem that builds up over time. If I
 understood your previous posts correctly, running the big tests
 immediately gave ~700MB/s whereas running the small tests then the big
 tests resulted in ~350MB/s during the big tests. Are you able to
 experiment with this a little bit? For example, if you do big tests, 1M
 (~20%) small tests, big tests, 1M small tests, big tests etc. What is
 the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little 
 tests.
>>>
>>> Sure I can try that. For the in between tests do you want me to run
>>> those for a set amount of time or number of files? Like each smaller
>>> batch like 20min and then 10min of large and then next smaller size?
>>
>> I was thinking set number of files.
>>
>> It would also be useful to know how many threads the executor has created
>> at each point as well. (JMX should tell you that. You might need to
>> adjust the executor so it doesn't stop idle threads.).
> 
> I saw your message about not stopping idle threads after I already
> started things.
> 
> 1st 100M test:
> 851.348MB/s
> Executor:
> largestPoolSize: 25
> poolSize: 25
> 1st 4k test
> Executor:
> largestPoolSize: 401
> poolSize: 401
> 2nd 100M test:
>  460.147MB/s
> Executor:
> largestPoolSize: 401
> poolSize: 79
> 2nd 4k test
> Executor:
> largestPoolSize: 414
> poolSize: 414
> 3rd 100M test:
> 429.127MB/s
> Executor:
> largestPoolSize: 414
> poolSize: 80
> 3rd 4k test:
> Executor:
> largestPoolSize: 414
> poolSize: 401
> 4th 100M test:
> 437.918MB/s
> Executor:
> largestPoolSize: 414
> poolSize: 86
> 4th 4k test:
> Executor:
> largestPoolSize: 414
> poolSize: 401
> 5th 100M test:
> 464.837MB/s
> Executor:
> largestPoolSize: 414
> poolSize: 87
> 
> It looks like the problem occurs right after the first set of 4k puts
> and doesn't get any worse, so whatever causes the issue happens early.
> This is getting stranger and I really cannot explain why calling
> ERR_remove_thread_state reliably improves performance.
> 
>> Going back to your original description, you said you saw blocking
>> during the call to ERR_clear_err(). Did you mean ERR_clear_error()?
>> Either way, could you provide the full stack trace of an example blocked
>> thread? And, ideally, the stack trace of the thread currently holding
>> the lock? I'm still trying to understand what is going on here since
>> based on my understanding of the code so far, the HashMap is bounded (to
>> the number of threads) and should reach that limit fairly quickly.
> 
> Sorry, yes I did mean ERR_clear_error().
> 
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x7f0f61ab989d in __GI___pthread_mutex_lock (mutex=0x2b49c58)
> at ../nptl/pthread_mutex_lock.c:80
> #2  0x7f0f3205f183 in int_thread_get (create=0) at err.c:446
> #3  0x7f0f3205f68d in int_thread_get_item (d=0x7f0ca89c7ce0) at err.c:491
> #4  0x7f0f32060094 in ERR_get_state () at err.c:1014
> #5  0x7f0f320602cf in ERR_clear_error () at err.c:747
> #6  0x7f0f325f3579 in ssl_socket_recv (sock=0x7f0dcc391980,
> buf=0x7f0eec067820
> "lock->199808-Source_filename->rhino_perf_https_lt_100g_a-Loop->1-Count->11089487-11089488-11089489-11089490-11089491-11089492-11089493-11089494-11089495-11089496-11089497-11089498-11089499-11089500-11"...,
> len=0x7f0ca89c7ff0) at src/sslnetwork.c:401
> #7  0x7f0f325ece99 in Java_org_apache_tomcat_jni_Socket_recvbb
> (e=<optimized out>, o=<optimized out>, sock=<optimized out>,
> offset=<optimized out>, len=<optimized out>) at src/network.c:957
> 
> I tried getting more data but the jvm tends to dump core when gdb is
> attached or is going too slowly to actually cause the lock contention. I
> can reliably see a thread waiting on this lock if I attach to a single
> thread and randomly interrupt it and look at the back trace. When I
> look at the mutex it has a different owner each time so different
> threads are getting the lock.
> 
> I will play with this a bit more on Monday.
> 
> -nate
> 
> -
> To unsubscribe, e-mail: 

Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 5:31 PM, Mark Thomas  wrote:
> On 24/06/2016 21:52, therealnewo...@gmail.com wrote:
>
>>> I'm wondering if this is a problem that builds up over time. If I
>>> understood your previous posts correctly, running the big tests
>>> immediately gave ~700MB/s whereas running the small tests then the big
>>> tests resulted in ~350MB/s during the big tests. Are you able to
>>> experiment with this a little bit? For example, if you do big tests, 1M
>>> (~20%) small tests, big tests, 1M small tests, big tests etc. What is
>>> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests.
>>
>> Sure I can try that. For the in between tests do you want me to run
>> those for a set amount of time or number of files? Like each smaller
>> batch like 20min and then 10min of large and then next smaller size?
>
> I was thinking set number of files.
>
> It would also be useful to know how many threads the executor has created
> at each point as well. (JMX should tell you that. You might need to
> adjust the executor so it doesn't stop idle threads.).

I saw your message about not stopping idle threads after I already
started things.

1st 100M test:
851.348MB/s
Executor:
largestPoolSize: 25
poolSize: 25
1st 4k test
Executor:
largestPoolSize: 401
poolSize: 401
2nd 100M test:
 460.147MB/s
Executor:
largestPoolSize: 401
poolSize: 79
2nd 4k test
Executor:
largestPoolSize: 414
poolSize: 414
3rd 100M test:
429.127MB/s
Executor:
largestPoolSize: 414
poolSize: 80
3rd 4k test:
Executor:
largestPoolSize: 414
poolSize: 401
4th 100M test:
437.918MB/s
Executor:
largestPoolSize: 414
poolSize: 86
4th 4k test:
Executor:
largestPoolSize: 414
poolSize: 401
5th 100M test:
464.837MB/s
Executor:
largestPoolSize: 414
poolSize: 87

It looks like the problem occurs right after the first set of 4k puts
and doesn't get any worse, so whatever causes the issue happens early.
This is getting stranger and I really cannot explain why calling
ERR_remove_thread_state reliably improves performance.

> Going back to your original description, you said you saw blocking
> during the call to ERR_clear_err(). Did you mean ERR_clear_error()?
> Either way, could you provide the full stack trace of an example blocked
> thread? And, ideally, the stack trace of the thread currently holding
> the lock? I'm still trying to understand what is going on here since
> based on my understanding of the code so far, the HashMap is bounded (to
> the number of threads) and should reach that limit fairly quickly.

Sorry, yes I did mean ERR_clear_error().

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x7f0f61ab989d in __GI___pthread_mutex_lock (mutex=0x2b49c58)
at ../nptl/pthread_mutex_lock.c:80
#2  0x7f0f3205f183 in int_thread_get (create=0) at err.c:446
#3  0x7f0f3205f68d in int_thread_get_item (d=0x7f0ca89c7ce0) at err.c:491
#4  0x7f0f32060094 in ERR_get_state () at err.c:1014
#5  0x7f0f320602cf in ERR_clear_error () at err.c:747
#6  0x7f0f325f3579 in ssl_socket_recv (sock=0x7f0dcc391980,
buf=0x7f0eec067820
"lock->199808-Source_filename->rhino_perf_https_lt_100g_a-Loop->1-Count->11089487-11089488-11089489-11089490-11089491-11089492-11089493-11089494-11089495-11089496-11089497-11089498-11089499-11089500-11"...,
len=0x7f0ca89c7ff0) at src/sslnetwork.c:401
#7  0x7f0f325ece99 in Java_org_apache_tomcat_jni_Socket_recvbb
(e=<optimized out>, o=<optimized out>, sock=<optimized out>,
offset=<optimized out>, len=<optimized out>) at src/network.c:957

I tried getting more data but the jvm tends to dump core when gdb is
attached or is going too slowly to actually cause the lock contention. I
can reliably see a thread waiting on this lock if I attach to a single
thread and randomly interrupt it and look at the back trace. When I
look at the mutex it has a different owner each time so different
threads are getting the lock.

I will play with this a bit more on Monday.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Christopher Schultz
Nate,

On 6/24/16 10:40 AM, therealnewo...@gmail.com wrote:
> On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat  wrote:
>> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>>
>>> Thanks.
>>>
>>> I'm going to start some local performance testing to confirm I see
>>> similar results and, assuming I do, I'll start looking at fixing this
>>> for 1.2.x/9.0.x and back-porting.
>>>
>>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
>> async, so I don't see a way besides adding the "error clear" thing after
>> each operation [and we'll remove it once OpenSSL 1.1 is there if it
>> actually fixes it]. That's assuming this issue is real [I actually never
>> noticed anything during my many ab runs and they use a lot of threads, so I
>> have a hard time believing it is significant enough ;) ].
>>
> 
> One thing about the system on which this is running is that it has a
> 10G nic. So the slow case is about 350MB/s and the fast one is 700MB/s
> so you would need a 10G interface or use loop back to even notice the
> issue assuming the CPU on the system can push that much encrypted
> data.

I believe Jean-Frederic Clere (works with Rémy) has been using 10G
interfaces for the Tomcat performance testing, including using APR. I
don't have his slides handy... perhaps they show an unexplained
performance drop when using APR/tcnative? Or perhaps they show no
performance drop (relative to httpd).

That would be interesting to see.

-chris





Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 21:52, therealnewo...@gmail.com wrote:

>> I'm wondering if this is a problem that builds up over time. If I
>> understood your previous posts correctly, running the big tests
>> immediately gave ~700MB/s whereas running the small tests then the big
>> tests resulted in ~350MB/s during the big tests. Are you able to
>> experiment with this a little bit? For example, if you do big tests, 1M
>> (~20%) small tests, big tests, 1M small tests, big tests etc. What is
>> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests.
> 
> Sure I can try that. For the in between tests do you want me to run
> those for a set amount of time or number of files? Like each smaller
> batch like 20min and then 10min of large and then next smaller size?

I was thinking set number of files.

It would also be useful to know how many threads the executor has created
at each point as well. (JMX should tell you that. You might need to
adjust the executor so it doesn't stop idle threads.).

>> What I am trying to pin down is how quickly this problem builds up.
>>
>> Also, do you see any failed requests or do they all succeed?
> 
> All successes.

OK. That rules out some possibilities.

Going back to your original description, you said you saw blocking
during the call to ERR_clear_err(). Did you mean ERR_clear_error()?
Either way, could you provide the full stack trace of an example blocked
thread? And, ideally, the stack trace of the thread currently holding
the lock? I'm still trying to understand what is going on here since
based on my understanding of the code so far, the HashMap is bounded (to
the number of threads) and should reach that limit fairly quickly.

Thanks,

Mark

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 4:52 PM,   wrote:
> On Fri, Jun 24, 2016 at 3:21 PM, Mark Thomas  wrote:
>> On 24/06/2016 20:01, therealnewo...@gmail.com wrote:
>>> On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomas  wrote:
 On 24/06/2016 18:41, Nate Clark wrote:
> On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark  wrote:
>> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
>>> On 24/06/2016 18:25, Mark Thomas wrote:

 Can you provide the settings you are using for the Executor as well 
 please?
>>
>> <Executor ... maxThreads="500" minSpareThreads="4"/>
>>
>>>
>>> And how long do the initial 5,000,000 4k requests take to process?
>>>
>>
>> 40 minutes.
>>
> Not sure this matters but I just double checked and there are actually
> 400 threads in total doing the 4k PUTs. Two clients each doing 200
> threads. The 100MB test is 24 threads in total, 12 per client machine.
>
> Sorry for the misinformation earlier.

 No problem. Thanks for the information. One last question (for now). How
 many processors / cores / threads does the server support? I'm trying to
 get a handle on what the concurrency looks like.

>>>
>>> The machine has two physical chips each with 6 cores and
>>> hyper-threading enabled, so 24 cores exposed to the OS.
>>
>> Thanks.
>>
>> 
>>
>>> If it matters the system also has 256GB of memory.
>>
>> I don't think RAM is playing a role here but it is still good to know.
>>
>> In terms of next steps, I want to see if I can come up with a theory
>> that matches what you are observing. From that we can then assess
>> whether the proposed patch can be improved.
>>
>> Apologies for the drip-feeding of questions. As I learn a bit more, a
>> few more questions come to mind.
>>
>> I'm wondering if this is a problem that builds up over time. If I
>> understood your previous posts correctly, running the big tests
>> immediately gave ~700MB/s whereas running the small tests then the big
>> tests resulted in ~350MB/s during the big tests. Are you able to
>> experiment with this a little bit? For example, if you do big tests, 1M
>> (~20%) small tests, big tests, 1M small tests, big tests etc. What is
>> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests.
>
> Sure I can try that. For the in between tests do you want me to run
> those for a set amount of time or number of files? Like each smaller
> batch like 20min and then 10min of large and then next smaller size?

Ignore that question. I misinterpreted your 1m to be 1MB.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 3:21 PM, Mark Thomas  wrote:
> On 24/06/2016 20:01, therealnewo...@gmail.com wrote:
>> On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomas  wrote:
>>> On 24/06/2016 18:41, Nate Clark wrote:
 On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark  wrote:
> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
>> On 24/06/2016 18:25, Mark Thomas wrote:
>>>
>>> Can you provide the settings you are using for the Executor as well 
>>> please?
>
> <Executor ... maxThreads="500" minSpareThreads="4"/>
>
>>
>> And how long do the initial 5,000,000 4k requests take to process?
>>
>
> 40 minutes.
>
 Not sure this matters but I just double checked and there are actually
 400 threads in total doing the 4k PUTs. Two clients each doing 200
 threads. The 100MB test is 24 threads in total, 12 per client machine.

 Sorry for the misinformation earlier.
>>>
>>> No problem. Thanks for the information. One last question (for now). How
>>> many processors / cores / threads does the server support? I'm trying to
>>> get a handle on what the concurrency looks like.
>>>
>>
>> The machine has two physical chips each with 6 cores and
>> hyper-threading enabled, so 24 cores exposed to the OS.
>
> Thanks.
>
> 
>
>> If it matters the system also has 256GB of memory.
>
> I don't think RAM is playing a role here but it is still good to know.
>
> In terms of next steps, I want to see if I can come up with a theory
> that matches what you are observing. From that we can then assess
> whether the proposed patch can be improved.
>
> Apologies for the drip-feeding of questions. As I learn a bit more, a
> few more questions come to mind.
>
> I'm wondering if this is a problem that builds up over time. If I
> understood your previous posts correctly, running the big tests
> immediately gave ~700MB/s whereas running the small tests then the big
> tests resulted in ~350MB/s during the big tests. Are you able to
> experiment with this a little bit? For example, if you do big tests, 1M
> (~20%) small tests, big tests, 1M small tests, big tests etc. What is
> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests.

Sure I can try that. For the in between tests do you want me to run
those for a set amount of time or number of files? Like each smaller
batch like 20min and then 10min of large and then next smaller size?

> What I am trying to pin down is how quickly this problem builds up.
>
> Also, do you see any failed requests or do they all succeed?

All successes.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 20:01, therealnewo...@gmail.com wrote:
> On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomas  wrote:
>> On 24/06/2016 18:41, Nate Clark wrote:
>>> On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark  wrote:
 On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
> On 24/06/2016 18:25, Mark Thomas wrote:
>>
>> Can you provide the settings you are using for the Executor as well 
>> please?

>>> <Executor ... maxThreads="500" minSpareThreads="4"/>

>
> And how long do the initial 5,000,000 4k requests take to process?
>

 40 minutes.

>>> Not sure this matters but I just double checked and there are actually
>>> 400 threads in total doing the 4k PUTs. Two clients each doing 200
>>> threads. The 100MB test is 24 threads in total, 12 per client machine.
>>>
>>> Sorry for the misinformation earlier.
>>
>> No problem. Thanks for the information. One last question (for now). How
>> many processors / cores / threads does the server support? I'm trying to
>> get a handle on what the concurrency looks like.
>>
> 
> The machine has two physical chips each with 6 cores and
> hyper-threading enabled, so 24 cores exposed to the OS.

Thanks.



> If it matters the system also has 256GB of memory.

I don't think RAM is playing a role here but it is still good to know.

In terms of next steps, I want to see if I can come up with a theory
that matches what you are observing. From that we can then assess
whether the proposed patch can be improved.

Apologies for the drip-feeding of questions. As I learn a bit more, a
few more questions come to mind.

I'm wondering if this is a problem that builds up over time. If I
understood your previous posts correctly, running the big tests
immediately gave ~700MB/s whereas running the small tests then the big
tests resulted in ~350MB/s during the big tests. Are you able to
experiment with this a little bit? For example, if you do big tests, 1M
(~20%) small tests, big tests, 1M small tests, big tests etc. What is
the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests.

What I am trying to pin down is how quickly this problem builds up.

Also, do you see any failed requests or do they all succeed?

Thanks again,

Mark



> 
> -nate
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: dev-h...@tomcat.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomas  wrote:
> On 24/06/2016 18:41, Nate Clark wrote:
>> On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark  wrote:
>>> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
 On 24/06/2016 18:25, Mark Thomas wrote:
>
> Can you provide the settings you are using for the Executor as well 
> please?
>>>
>>> <Executor ... maxThreads="500" minSpareThreads="4"/>
>>>

 And how long do the initial 5,000,000 4k requests take to process?

>>>
>>> 40 minutes.
>>>
>> Not sure this matters but I just double checked and there are actually
>> 400 threads in total doing the 4k PUTs. Two clients each doing 200
>> threads. The 100MB test is 24 threads in total, 12 per client machine.
>>
>> Sorry for the misinformation earlier.
>
> No problem. Thanks for the information. One last question (for now). How
> many processors / cores / threads does the server support? I'm trying to
> get a handle on what the concurrency looks like.
>

The machine has two physical chips each with 6 cores and
hyper-threading enabled, so 24 cores exposed to the OS.

cpuinfo for first core:
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 63
model name  : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping: 2
microcode   : 0x2e
cpu MHz : 1212.656
cache size  : 15360 KB
physical id : 0
siblings: 12
core id : 0
cpu cores   : 6
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 15
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid
dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi
flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms
invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bugs:
bogomips: 4788.98
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


If it matters the system also has 256GB of memory.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 18:41, Nate Clark wrote:
> On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark  wrote:
>> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
>>> On 24/06/2016 18:25, Mark Thomas wrote:

 Can you provide the settings you are using for the Executor as well please?
>>
>> <Executor ... maxThreads="500" minSpareThreads="4"/>
>>
>>>
>>> And how long do the initial 5,000,000 4k requests take to process?
>>>
>>
>> 40 minutes.
>>
> Not sure this matters but I just double checked and there are actually
> 400 threads in total doing the 4k PUTs. Two clients each doing 200
> threads. The 100MB test is 24 threads in total, 12 per client machine.
>
> Sorry for the misinformation earlier.

No problem. Thanks for the information. One last question (for now). How
many processors / cores / threads does the server support? I'm trying to
get a handle on what the concurrency looks like.

Thanks,

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Nate Clark
On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark  wrote:
> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
>> On 24/06/2016 18:25, Mark Thomas wrote:
>>>
>>> Can you provide the settings you are using for the Executor as well please?
>
> <Executor ... maxThreads="500" minSpareThreads="4"/>
>
>>
>> And how long do the initial 5,000,000 4k requests take to process?
>>
>
> 40 minutes.
>
Not sure this matters but I just double checked and there are actually
400 threads in total doing the 4k PUTs. Two clients each doing 200
threads. The 100MB test is 24 threads in total, 12 per client machine.

Sorry for the misinformation earlier.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Nate Clark
On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas  wrote:
> On 24/06/2016 18:25, Mark Thomas wrote:
>>
>> Can you provide the settings you are using for the Executor as well please?

<Executor ... maxThreads="500" minSpareThreads="4"/>

>
> And how long do the initial 5,000,000 4k requests take to process?
>

40 minutes.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 18:25, Mark Thomas wrote:
> On 24/06/2016 18:11, therealnewo...@gmail.com wrote:
>> On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas  wrote:
>>> On 24/06/2016 11:17, Rémy Maucherat wrote:
 2016-06-24 12:08 GMT+02:00 Mark Thomas :

> Thanks.
>
> I'm going to start some local performance testing to confirm I see
> similar results and, assuming I do, I'll start looking at fixing this
> for 1.2.x/9.0.x and back-porting.
>
 Hum, the fix that was submitted doesn't make sense IMO since writes can be
 async, so I don't see a way besides adding the "error clear" thing after
 each operation [and we'll remove it once OpenSSL 1.1 is there if it
 actually fixes it]. That's assuming this issue is real [I actually never
 noticed anything during my many ab runs and they use a lot of threads, so I
 have a hard time believing it is significant enough ;) ].
>>>
>>> I haven't been able to reproduce anything like this yet. So far I have
>>> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to
>>> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP.
>>>
>>> I'm having trouble understanding how this is happening. I could imagine
>>> that HashMap becoming a problem if there was a high churn in Threads.
>>> I'm thinking of something like bursty traffic levels and an executor
>>> aggressively halting spare threads. I need to experiment with that as well.
>>>
>>
>> I do not understand it either. Using the thread pool there is not much
>> thread churn so I am not sure why the problem gets as bad as it does.
>> I didn't look into what the hash table actually had in it. I just
>> noticed that the majority of a read thread's time was spent waiting for
>> the lock to access this hash table. Once I added the call to
>> ERR_remove_thread_state the waiting basically disappeared.
>>
>> For this test the traffic is constant. Each client thread creates one
>> connection and just keeps pushing requests for a set number of requests,
>> so we aren't even creating new connections.
> 
> Can you provide the settings you are using for the Executor as well please?

And how long do the initial 5,000,000 4k requests take to process?

Thanks,

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 18:11, therealnewo...@gmail.com wrote:
> On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas  wrote:
>> On 24/06/2016 11:17, Rémy Maucherat wrote:
>>> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>>>
 Thanks.

 I'm going to start some local performance testing to confirm I see
 similar results and, assuming I do, I'll start looking at fixing this
 for 1.2.x/9.0.x and back-porting.

>>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
>>> async, so I don't see a way besides adding the "error clear" thing after
>>> each operation [and we'll remove it once OpenSSL 1.1 is there if it
>>> actually fixes it]. That's assuming this issue is real [I actually never
>>> noticed anything during my many ab runs and they use a lot of threads, so I
>>> have a hard time believing it is significant enough ;) ].
>>
>> I haven't been able to reproduce anything like this yet. So far I have
>> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to
>> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP.
>>
>> I'm having trouble understanding how this is happening. I could imagine
>> that HashMap becoming a problem if there was a high churn in Threads.
>> I'm thinking of something like bursty traffic levels and an executor
>> aggressively halting spare threads. I need to experiment with that as well.
>>
> 
> I do not understand it either. Using the thread pool there is not much
> thread churn so I am not sure why the problem gets as bad as it does.
> I didn't look into what the hash table actually had in it. I just
> noticed that the majority of a read thread's time was spent waiting for
> the lock to access this hash table. Once I added the call to
> ERR_remove_thread_state the waiting basically disappeared.
> 
> For this test the traffic is constant. Each client thread creates one
> connection and just keeps pushing requests for a set number of requests,
> so we aren't even creating new connections.

Can you provide the settings you are using for the Executor as well please?

Thanks,

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas  wrote:
> On 24/06/2016 11:17, Rémy Maucherat wrote:
>> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>>
>>> Thanks.
>>>
>>> I'm going to start some local performance testing to confirm I see
>>> similar results and, assuming I do, I'll start looking at fixing this
>>> for 1.2.x/9.0.x and back-porting.
>>>
>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
>> async, so I don't see a way besides adding the "error clear" thing after
>> each operation [and we'll remove it once OpenSSL 1.1 is there if it
>> actually fixes it]. That's assuming this issue is real [I actually never
>> noticed anything during my many ab runs and they use a lot of threads, so I
>> have a hard time believing it is significant enough ;) ].
>
> I haven't been able to reproduce anything like this yet. So far I have
> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to
> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP.
>
> I'm having trouble understanding how this is happening. I could imagine
> that HashMap becoming a problem if there was a high churn in Threads.
> I'm thinking of something like bursty traffic levels and an executor
> aggressively halting spare threads. I need to experiment with that as well.
>

I do not understand it either. Using the thread pool there is not much
thread churn so I am not sure why the problem gets as bad as it does.
I didn't look into what the hash table actually had in it. I just
noticed that the majority of a read thread's time was spent waiting for
the lock to access this hash table. Once I added the call to
ERR_remove_thread_state the waiting basically disappeared.

For this test the traffic is constant. Each client thread creates one
connection and just keeps pushing requests for a set number of requests,
so we aren't even creating new connections.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat  wrote:
> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>
>> Thanks.
>>
>> I'm going to start some local performance testing to confirm I see
>> similar results and, assuming I do, I'll start looking at fixing this
>> for 1.2.x/9.0.x and back-porting.
>>
>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
> async, so I don't see a way besides adding the "error clear" thing after
> each operation [and we'll remove it once OpenSSL 1.1 is there if it
> actually fixes it]. That's assuming this issue is real [I actually never
> noticed anything during my many ab runs and they use a lot of threads, so I
> have a hard time believing it is significant enough ;) ].

I was not using async IO so I did not account for that in my patch. It
was more a case of seeing if I could resolve this issue for my use case.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 11:37 AM,   wrote:
> On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas  wrote:
>> On 24/06/2016 11:17, Rémy Maucherat wrote:
>>> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>>>
 Thanks.

 I'm going to start some local performance testing to confirm I see
 similar results and, assuming I do, I'll start looking at fixing this
 for 1.2.x/9.0.x and back-porting.

>>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
>>> async, so I don't see a way besides adding the "error clear" thing after
>>> each operation [and we'll remove it once OpenSSL 1.1 is there if it
>>> actually fixes it]. That's assuming this issue is real [I actually never
>>> noticed anything during my many ab runs and they use a lot of threads, so I
>>> have a hard time believing it is significant enough ;) ].
>>
>> I haven't been able to reproduce anything like this yet. So far I have
>> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to
>> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP.
>>
>> I'm having trouble understanding how this is happening. I could imagine
>> that HashMap becoming a problem if there was a high churn in Threads.
>> I'm thinking of something like bursty traffic levels and an executor
>> aggressively halting spare threads. I need to experiment with that as well.
>>
>> Nate,
>>
>> We need as much information as you can provide on how to reproduce this.
>> As a minimum we need to know:
>> - Connector configuration from server.xml
>> - Operating system
>> - How tc-native was built
>> - Exact versions for everything
>>
>> We need enough information to recreate the test and the results
>> that you obtained.
> OS: Fedora 22
> tc-native: tomcat-native-1.1.34-1.fc22.x86_64
> tomcat: tomcat-7.0.68-3.fc22.noarch
>
> This issue was seen in older versions of tomcat:
> tomcat-native-1.1.30-2.fc21 and tomcat-7.0.54-3.fc21

I forgot to give you the openssl version openssl-1.0.1k-15.fc22.x86_64

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas  wrote:
> On 24/06/2016 11:17, Rémy Maucherat wrote:
>> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>>
>>> Thanks.
>>>
>>> I'm going to start some local performance testing to confirm I see
>>> similar results and, assuming I do, I'll start looking at fixing this
>>> for 1.2.x/9.0.x and back-porting.
>>>
>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
>> async, so I don't see a way besides adding the "error clear" thing after
>> each operation [and we'll remove it once OpenSSL 1.1 is there if it
>> actually fixes it]. That's assuming this issue is real [I actually never
>> noticed anything during my many ab runs and they use a lot of threads, so I
>> have a hard time believing it is significant enough ;) ].
>
> I haven't been able to reproduce anything like this yet. So far I have
> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to
> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP.
>
> I'm having trouble understanding how this is happening. I could imagine
> that HashMap becoming a problem if there was a high churn in Threads.
> I'm thinking of something like bursty traffic levels and an executor
> aggressively halting spare threads. I need to experiment with that as well.
>
> Nate,
>
> We need as much information as you can provide on how to reproduce this.
> As a minimum we need to know:
> - Connector configuration from server.xml
> - Operating system
> - How tc-native was built
> - Exact versions for everything
>
> We need enough information to recreate the test and the results
> that you obtained.

Connector configuration:


Keepalive is enabled.

OS: Fedora 22
tc-native: tomcat-native-1.1.34-1.fc22.x86_64
tomcat: tomcat-7.0.68-3.fc22.noarch

This issue was seen in older versions of tomcat:
tomcat-native-1.1.30-2.fc21 and tomcat-7.0.54-3.fc21

All of the builds are the RPMs released by Fedora from their build machines.

The test I ran performed about 5 million 4k requests and then did the
large 100M requests and was able to see the issue immediately.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 15:40, therealnewo...@gmail.com wrote:
> On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat  wrote:
>> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>>
>>> Thanks.
>>>
>>> I'm going to start some local performance testing to confirm I see
>>> similar results and, assuming I do, I'll start looking at fixing this
>>> for 1.2.x/9.0.x and back-porting.
>>>
>>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
>> async, so I don't see a way besides adding the "error clear" thing after
>> each operation [and we'll remove it once OpenSSL 1.1 is there if it
>> actually fixes it]. That's assuming this issue is real [I actually never
>> noticed anything during my many ab runs and they use a lot of threads, so I
>> have a hard time believing it is significant enough ;) ].
>>
> 
> One thing about the system on which this is running is that it has a
> 10G nic. So the slow case is about 350MB/s and the fast one is 700MB/s
> so you would need a 10G interface or use loop back to even notice the
> issue assuming the CPU on the system can push that much encrypted
> data.

Roughly how many requests do you have to do for the problem to appear?
Is keep-alive enabled or disabled?

(plus the questions from my other mail)

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 24/06/2016 11:17, Rémy Maucherat wrote:
> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
> 
>> Thanks.
>>
>> I'm going to start some local performance testing to confirm I see
>> similar results and, assuming I do, I'll start looking at fixing this
>> for 1.2.x/9.0.x and back-porting.
>>
> Hum, the fix that was submitted doesn't make sense IMO since writes can be
> async, so I don't see a way besides adding the "error clear" thing after
> each operation [and we'll remove it once OpenSSL 1.1 is there if it
> actually fixes it]. That's assuming this issue is real [I actually never
> noticed anything during my many ab runs and they use a lot of threads, so I
> have a hard time believing it is significant enough ;) ].

I haven't been able to reproduce anything like this yet. So far I have
only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to
test with 1.1.x and Tomcat 7.0.x, the versions used by the OP.

I'm having trouble understanding how this is happening. I could imagine
that HashMap becoming a problem if there was a high churn in Threads.
I'm thinking of something like bursty traffic levels and an executor
aggressively halting spare threads. I need to experiment with that as well.

Nate,

We need as much information as you can provide on how to reproduce this.
As a minimum we need to know:
- Connector configuration from server.xml
- Operating system
- How tc-native was built
- Exact versions for everything

We need enough information to recreate the test and the results
that you obtained.

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread therealneworld
On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat  wrote:
> 2016-06-24 12:08 GMT+02:00 Mark Thomas :
>
>> Thanks.
>>
>> I'm going to start some local performance testing to confirm I see
>> similar results and, assuming I do, I'll start looking at fixing this
>> for 1.2.x/9.0.x and back-porting.
>>
>> Hum, the fix that was submitted doesn't make sense IMO since writes can be
> async, so I don't see a way besides adding the "error clear" thing after
> each operation [and we'll remove it once OpenSSL 1.1 is there if it
> actually fixes it]. That's assuming this issue is real [I actually never
> noticed anything during my many ab runs and they use a lot of threads, so I
> have a hard time believing it is significant enough ;) ].
>

One thing about the system on which this is running is that it has a
10G nic. So the slow case is about 350MB/s and the fast one is 700MB/s
so you would need a 10G interface or use loop back to even notice the
issue assuming the CPU on the system can push that much encrypted
data.

-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Rémy Maucherat
2016-06-24 12:08 GMT+02:00 Mark Thomas :

> Thanks.
>
> I'm going to start some local performance testing to confirm I see
> similar results and, assuming I do, I'll start looking at fixing this
> for 1.2.x/9.0.x and back-porting.
>
> Hum, the fix that was submitted doesn't make sense IMO since writes can be
async, so I don't see a way besides adding the "error clear" thing after
each operation [and we'll remove it once OpenSSL 1.1 is there if it
actually fixes it]. That's assuming this issue is real [I actually never
noticed anything during my many ab runs and they use a lot of threads, so I
have a hard time believing it is significant enough ;) ].

Rémy


Re: Bug that spans tomcat and tomcat-native

2016-06-24 Thread Mark Thomas
On 23/06/2016 22:31, Nate Clark wrote:
> I tried to submit the bug but it seems that I am now unable to access
> bz.apache.org. Since you made it seem like it was important for this
> to be known about, here is the info and patches.

Thanks. I'll chat with the infra folks and see if I can get whatever is
going on with BZ for you fixed.

> When performing some benchmarking I noticed that the SSL performance
> of large request reads degraded heavily when performed after millions
> of small requests. Basically the setup is in a multi-threaded
> environment, about 200 threads, performing PUT requests using SSL with
> a body of about 4KB and then using 20 threads performing PUT requests
> with a body of 100MB. If the small requests are not performed the
> speed of the large requests in MB/s is about 2x.

Ouch. That is significant.

> I tracked down the issue to ERR_clear_err() blocking on an internal
> lock which protects a hash map of the error states. It seems that the
> hash map was growing unbounded because the states in it were never
> being cleared by a thread when it had completed SSL operations.
> 
> According to OpenSSL documents ERR_remove_thread_state() or
> ERR_remove_state() for versions of OpenSSL less than 1.1.0 needs to be
> invoked prior to a thread exiting. This is not done by threads in the
> native code so the hash table keeps growing and getting larger and
> larger and more expensive to maintain.
> 
> By adding a new native call which invoked ERR_remove_thread_state and
> calling it from AprEndpoint in tomcat I was able to reduce the
> contention on the lock and the performance improved.
> 
> Because of the thread pool I could not find a simple clean way to
> invoke the cleanup before the thread dies so instead I added it to the
> end of the socket processing.
> 
> Here are the patches I used against tomcat-native 1.1.34 and tomcat70:

Thanks.

I'm going to start some local performance testing to confirm I see
similar results and, assuming I do, I'll start looking at fixing this
for 1.2.x/9.0.x and back-porting.

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Re: Bug that spans tomcat and tomcat-native

2016-06-23 Thread Nate Clark
On Thu, Jun 23, 2016 at 3:32 PM, Mark Thomas  wrote:
> On 23/06/2016 20:28, Nate Clark wrote:
>>
>> If you want details about the issue I am happy to provide them but
>> just didn't want to duplicate explaining it in a bug and in an email
>> thread.
>
> To be honest, it doesn't really matter which product it is filed under.
> It is more important to get it filed. Even more so since a tc-native
> release is planned for tomorrow. Just file it under tc-native and
> someone will take a look.
>
Hi Mark,

I tried to submit the bug but it seems that I am now unable to access
bz.apache.org. Since you made it seem like it was important for this
to be known about, here is the info and patches.

When performing some benchmarking I noticed that the SSL performance
of large request reads degraded heavily when performed after millions
of small requests. Basically the setup is in a multi-threaded
environment, about 200 threads, performing PUT requests using SSL with
a body of about 4KB and then using 20 threads performing PUT requests
with a body of 100MB. If the small requests are not performed the
speed of the large requests in MB/s is about 2x.

I tracked down the issue to ERR_clear_err() blocking on an internal
lock which protects a hash map of the error states. It seems that the
hash map was growing unbounded because the states in it were never
being cleared by a thread when it had completed SSL operations.

According to the OpenSSL documentation, for OpenSSL versions below
1.1.0, ERR_remove_thread_state() (or ERR_remove_state() before 1.0.0)
must be invoked prior to a thread exiting. This is not done by threads
in the native code, so the hash table keeps growing and becomes ever
more expensive to maintain.
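
For reference, a minimal standalone sketch of the cleanup OpenSSL
expects from each thread (not tc-native code; the version guards
mirror the patch below):

    /* cc err_cleanup.c -lcrypto -lpthread */
    #include <openssl/err.h>
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        /* Any use of the ERR_* API allocates this thread's error state
         * and inserts it into a process-wide hash protected by a lock. */
        ERR_clear_error();

        /* Without the call below, the entry stays in the hash forever,
         * so the table grows with every short-lived thread. */
    #if OPENSSL_VERSION_NUMBER < 0x10000000L
        ERR_remove_state(0);            /* pre-1.0.0 API */
    #elif OPENSSL_VERSION_NUMBER < 0x10100000L
        ERR_remove_thread_state(NULL);  /* 1.0.x API; unnecessary in 1.1.0+ */
    #endif
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        printf("worker released its OpenSSL error state\n");
        return 0;
    }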

By adding a new native call that invokes ERR_remove_thread_state() and
calling it from AprEndpoint in Tomcat, I was able to reduce the
contention on the lock, and performance improved.

Because of the thread pool I could not find a simple, clean way to
invoke the cleanup before a thread dies, so instead I added it to the
end of socket processing.
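
Roughly, the hook has this shape; SSL.threadCleanup() is the new
native call from the patch below, while the wrapper class and its
fields are illustrative, since the real change lands inside
AprEndpoint's socket processor:

    import org.apache.tomcat.jni.SSL;

    // Illustrative wrapper, not the actual AprEndpoint code: run the
    // existing socket processing, then release this worker thread's
    // OpenSSL error state via the new native call.
    public class SslCleanupTask implements Runnable {

        private final Runnable socketProcessing; // the existing work
        private final boolean sslEnabled;

        public SslCleanupTask(Runnable socketProcessing, boolean sslEnabled) {
            this.socketProcessing = socketProcessing;
            this.sslEnabled = sslEnabled;
        }

        @Override
        public void run() {
            try {
                socketProcessing.run();
            } finally {
                if (sslEnabled) {
                    SSL.threadCleanup(); // added by the tc-native patch
                }
            }
        }
    }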

Here are the patches I used against tomcat-native 1.1.34 and tomcat70:

From dd4a8dcdfbb863f8e8ba5ed24070e60200421082 Mon Sep 17 00:00:00 2001
From: Nate Clark 
Date: Thu, 23 Jun 2016 16:51:04 -0400
Subject: [PATCH] Add threadCleanup function to free memory footprint of a
 thread

OpenSSL allocates memory for an error queue for each thread which uses
the SSL functions. Prior to version 1.1.0 the memory needs to be
released manually by invoking either ERR_remove_state or
ERR_remove_thread_state. If this is not done the memory footprint
grows larger, which slows down the use of the error queues. By adding
the new function threadCleanup, threads can now release this memory
prior to thread exit.
---
 native/src/ssl.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/native/src/ssl.c b/native/src/ssl.c
index ded7b22..5875a68 100644
--- a/native/src/ssl.c
+++ b/native/src/ssl.c
@@ -272,6 +272,14 @@ TCN_IMPLEMENT_CALL(jstring, SSL, versionString)(TCN_STDARGS)
     return AJP_TO_JSTRING(OPENSSL_VERSION_TEXT);
 }

+static void ssl_err_remove(void) {
+#if (OPENSSL_VERSION_NUMBER < 0x10000000L) || defined(OPENSSL_USE_DEPRECATED)
+    ERR_remove_state(0);
+#elif (OPENSSL_VERSION_NUMBER < 0x10100000L)
+    ERR_remove_thread_state(NULL);
+#endif
+}
+
 /*
  *  the various processing hooks
  */
@@ -310,11 +318,7 @@ static apr_status_t ssl_init_cleanup(void *data)
 #if OPENSSL_VERSION_NUMBER >= 0x00907001
     CRYPTO_cleanup_all_ex_data();
 #endif
-#if (OPENSSL_VERSION_NUMBER < 0x10100000L) || defined(OPENSSL_USE_DEPRECATED)
-    ERR_remove_state(0);
-#else
-    ERR_remove_thread_state(NULL);
-#endif
+    ssl_err_remove();

 /* Don't call ERR_free_strings here; ERR_load_*_strings only
  * actually load the error strings once per process due to static
@@ -1105,6 +1109,11 @@ TCN_IMPLEMENT_CALL(jboolean, SSL, hasOp)(TCN_STDARGS, jint op)
     return op == (op & supported_ssl_opts) ? JNI_TRUE : JNI_FALSE;
 }

+TCN_IMPLEMENT_CALL(void, SSL, threadCleanup)(TCN_STDARGS) {
+    UNREFERENCED(o);
+    ssl_err_remove();
+}
+
 #else
 /* OpenSSL is not supported.
  * Create empty stubs.
@@ -1231,4 +1240,9 @@ TCN_IMPLEMENT_CALL(jboolean, SSL, hasOp)(TCN_STDARGS, jint op)
     UNREFERENCED(op);
     return JNI_FALSE;
 }
+
+TCN_IMPLEMENT_CALL(void, SSL, threadCleanup)(TCN_STDARGS) {
+    UNREFERENCED_STDARGS;
+}
+
 #endif
-- 
2.5.5

From c5c1756153279e87f11594c0e053d5301ed121e7 Mon Sep 17 00:00:00 2001
From: Nate Clark 
Date: Thu, 23 Jun 2016 16:57:18 -0400
Subject: [PATCH] Invoke SSL.threadCleanup when done with socket processing

When a thread is done processing an SSL socket it should invoke
SSL.threadCleanup() to free the memory associated with that thread in
the SSL library. If this is not done the memory footprint bloats and
each subsequent invocation becomes slightly slower as there are more
per-thread error states to maintain.

Re: Bug that spans tomcat and tomcat-native

2016-06-23 Thread Mark Thomas
On 23/06/2016 20:28, Nate Clark wrote:
> Hello,
> 
> I encountered a performance issue when using the APR protocol. I
> tracked it down to a missing piece of the integration with OpenSSL.
> I was able to fix the performance issue, but it required a change to
> both tomcat-native and Tomcat. I was going to file a bug with my
> patches, but tomcat-native and Tomcat are separate projects, so I was
> wondering about the best way to handle that. File an issue in
> tomcat-native and one in Tomcat referencing the other bug? Or just
> use this mailing list to discuss the issue?
> 
> If you want details about the issue I am happy to provide them, but I
> just didn't want to duplicate explaining it in a bug and in an email
> thread.

To be honest, it doesn't really matter which product it is filed under.
It is more important to get it filed. Even more so since a tc-native
release is planned for tomorrow. Just file it under tc-native and
someone will take a look.

Mark


-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



Bug that spans tomcat and tomcat-native

2016-06-23 Thread Nate Clark
Hello,

I encountered a performance issue when using the APR protocol. I
tracked it down to a missing piece of the integration with OpenSSL. I
was able to fix the performance issue, but it required a change to
both tomcat-native and Tomcat. I was going to file a bug with my
patches, but tomcat-native and Tomcat are separate projects, so I was
wondering about the best way to handle that. File an issue in
tomcat-native and one in Tomcat referencing the other bug? Or just use
this mailing list to discuss the issue?

If you want details about the issue I am happy to provide them, but I
just didn't want to duplicate explaining it in a bug and in an email
thread.

Thanks,
-nate

-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org