Re: Bug that spans tomcat and tomcat-native
I've opened https://bz.apache.org/bugzilla/show_bug.cgi?id=59797 to track this.

Nate, if you let me have your public IP, I'll get you unbanned from BZ as well.

Mark

On 01/07/2016 15:33, Nate Clark wrote:
> On Tue, Jun 28, 2016 at 12:37 PM, Rainer Jung wrote:
>> Am 28.06.2016 um 18:06 schrieb therealnewo...@gmail.com:
>>> On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung wrote:
>>>> Am 28.06.2016 um 16:07 schrieb Mark Thomas:
>>>>> On 28/06/2016 12:28, Mark Thomas wrote:
>>>>>> On 28/06/2016 11:34, Rainer Jung wrote:
>>>>>>> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>>>>>>>
>>>>>>>> Index: src/ssl.c
>>>>>>>> ===================================================================
>>>>>>>> --- src/ssl.c    (revision 1750259)
>>>>>>>> +++ src/ssl.c    (working copy)
>>>>>>>> @@ -420,6 +420,10 @@
>>>>>>>>      return psaptr->PSATOLD;
>>>>>>>>  #elif defined(WIN32)
>>>>>>>>      return (unsigned long)GetCurrentThreadId();
>>>>>>>> +#elif defined(DARWIN)
>>>>>>>> +    uint64_t tid;
>>>>>>>> +    pthread_threadid_np(NULL, &tid);
>>>>>>>> +    return (unsigned long)tid;
>>>>>>>>  #else
>>>>>>>>      return (unsigned long)(apr_os_thread_current());
>>>>>>>>  #endif
>>>>>>>>
>>>>>>>> I want to do some similar testing for Linux before adding what I
>>>>>>>> suspect will be a very similar block using gettid().
>>>>>>>
>>>>>>> We could either add something to configure.in. Untested:
>>>>>>>
>>>>>>> Index: native/configure.in
>>>>>>> ===================================================================
>>>>>>> --- native/configure.in    (revision 1750462)
>>>>>>> +++ native/configure.in    (working copy)
>>>>>>> @@ -218,6 +218,9 @@
>>>>>>>      *-solaris2*)
>>>>>>>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>>>>>>          ;;
>>>>>>> +    *linux*)
>>>>>>> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>>>>>>> +        ;;
>>>>>>>      *)
>>>>>>>          ;;
>>>>>>>      esac
>>>>>>>
>>>>>>> and then use a
>>>>>>>
>>>>>>> #ifdef TCNATIVE_LINUX
>>>>>>>
>>>>>>> or we copy some other more direct linux check from e.g. APR:
>>>>>>>
>>>>>>> #ifdef __linux__
>>>>>>>
>>>>>>> The latter looks simpler, but the version above is based on all the
>>>>>>> logic put into config.guess.
>>>>>>
>>>>>> I'd go with the __linux__ option as that is consistent with what we
>>>>>> already use in os/unix/system.c
>>>>>>
>>>>>> I'm not against the change to configure.in, I just think we should be
>>>>>> consistent with how we do this throughout the code base.
>>>>>
>>>>> I've confirmed that the same problem occurs with hash bucket selection
>>>>> on linux and that switching to gettid() fixes that problem.
>>>>>
>>>>> I'm going to go ahead with the 1.2.8 release shortly. We can continue
>>>>> to refine this as necessary and have a more complete fix in 1.2.9.
>>>>
>>>> I did a quick check on Solaris. apr_os_thread_current() uses
>>>> pthread_self on Solaris like on Linux (actually on any Unix type OS),
>>>> but unlike Linux where this returns an address which is either 32 or
>>>> 64 bit aligned depending on address size, on Solaris you get an
>>>> increasing number starting with 1 for the first thread and incremented
>>>> by one for each following thread. Thread IDs do not get reused in the
>>>> same process, even if the thread finished, but thread IDs are common
>>>> between different processes, because they always start with 1. So
>>>> Solaris should be fine as-is.
>>>
>>> Does the value have a cap? If not then Solaris will just continue to
>>> use more and more memory as threads are created over the lifetime of
>>> the server.
>>
>> No cap. I think everyone agrees we are still looking at cleaning up at
>> thread death as a further improvement. Whether there's a problem without
>> it depends on the executor size and whether the executor actually
>> destroys and recreates threads (thus using more IDs than the max thread
>> count).
>
> Mark,
>
> The change which was added to 1.2.8 works great on 1.1.34. I also hacked
> in calling ERR_remove_thread_state on thread exit. It is similar to my
> original change but only does it on a thread's exit, but as Remy pointed
> out earlier this will only work if all IO operations occur on one of the
> original threads from the thread pool.
>
> -nate
>
> Here is the hack just for completeness.
>
> diff --git a/java/org/apache/catalina/core/StandardThreadExecutor.java
> b/java/org/apache/catalina/core/StandardThreadExecutor.java
> index a30f29c..80f429d 100644
> --- a/java/org/apache/catalina/core/StandardThreadExecutor.java
> +++ b/java/org/apache/catalina/core/StandardThreadExecutor.java
> @@ -18,6 +18,7 @@
>  package org.apache.catalina.core;
>
>  import java.util.concurrent.RejectedExecutionException;
> +import java.util.concurrent.ThreadFactory;
>  import java.util.concurrent.TimeUnit;
>
Re: Bug that spans tomcat and tomcat-native
On Tue, Jun 28, 2016 at 12:37 PM, Rainer Jung wrote:
> Am 28.06.2016 um 18:06 schrieb therealnewo...@gmail.com:
>> On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung wrote:
>>> Am 28.06.2016 um 16:07 schrieb Mark Thomas:
>>>> On 28/06/2016 12:28, Mark Thomas wrote:
>>>>> On 28/06/2016 11:34, Rainer Jung wrote:
>>>>>> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>>>>>>
>>>>>>> Index: src/ssl.c
>>>>>>> ===================================================================
>>>>>>> --- src/ssl.c    (revision 1750259)
>>>>>>> +++ src/ssl.c    (working copy)
>>>>>>> @@ -420,6 +420,10 @@
>>>>>>>      return psaptr->PSATOLD;
>>>>>>>  #elif defined(WIN32)
>>>>>>>      return (unsigned long)GetCurrentThreadId();
>>>>>>> +#elif defined(DARWIN)
>>>>>>> +    uint64_t tid;
>>>>>>> +    pthread_threadid_np(NULL, &tid);
>>>>>>> +    return (unsigned long)tid;
>>>>>>>  #else
>>>>>>>      return (unsigned long)(apr_os_thread_current());
>>>>>>>  #endif
>>>>>>>
>>>>>>> I want to do some similar testing for Linux before adding what I
>>>>>>> suspect will be a very similar block using gettid().
>>>>>>
>>>>>> We could either add something to configure.in. Untested:
>>>>>>
>>>>>> Index: native/configure.in
>>>>>> ===================================================================
>>>>>> --- native/configure.in    (revision 1750462)
>>>>>> +++ native/configure.in    (working copy)
>>>>>> @@ -218,6 +218,9 @@
>>>>>>      *-solaris2*)
>>>>>>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>>>>>          ;;
>>>>>> +    *linux*)
>>>>>> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>>>>>> +        ;;
>>>>>>      *)
>>>>>>          ;;
>>>>>>      esac
>>>>>>
>>>>>> and then use a
>>>>>>
>>>>>> #ifdef TCNATIVE_LINUX
>>>>>>
>>>>>> or we copy some other more direct linux check from e.g. APR:
>>>>>>
>>>>>> #ifdef __linux__
>>>>>>
>>>>>> The latter looks simpler, but the version above is based on all the
>>>>>> logic put into config.guess.
>>>>>
>>>>> I'd go with the __linux__ option as that is consistent with what we
>>>>> already use in os/unix/system.c
>>>>>
>>>>> I'm not against the change to configure.in, I just think we should be
>>>>> consistent with how we do this throughout the code base.
>>>>
>>>> I've confirmed that the same problem occurs with hash bucket selection
>>>> on linux and that switching to gettid() fixes that problem.
>>>>
>>>> I'm going to go ahead with the 1.2.8 release shortly. We can continue
>>>> to refine this as necessary and have a more complete fix in 1.2.9.
>>>
>>> I did a quick check on Solaris. apr_os_thread_current() uses
>>> pthread_self on Solaris like on Linux (actually on any Unix type OS),
>>> but unlike Linux where this returns an address which is either 32 or
>>> 64 bit aligned depending on address size, on Solaris you get an
>>> increasing number starting with 1 for the first thread and incremented
>>> by one for each following thread. Thread IDs do not get reused in the
>>> same process, even if the thread finished, but thread IDs are common
>>> between different processes, because they always start with 1. So
>>> Solaris should be fine as-is.
>>
>> Does the value have a cap? If not then Solaris will just continue to
>> use more and more memory as threads are created over the lifetime of
>> the server.
>
> No cap. I think everyone agrees we are still looking at cleaning up at
> thread death as a further improvement. Whether there's a problem without
> it depends on the executor size and whether the executor actually
> destroys and recreates threads (thus using more IDs than the max thread
> count).

Mark,

The change which was added to 1.2.8 works great on 1.1.34. I also hacked
in calling ERR_remove_thread_state on thread exit. It is similar to my
original change but only does it on a thread's exit, but as Remy pointed
out earlier this will only work if all IO operations occur on one of the
original threads from the thread pool.

-nate

Here is the hack just for completeness.

diff --git a/java/org/apache/catalina/core/StandardThreadExecutor.java
b/java/org/apache/catalina/core/StandardThreadExecutor.java
index a30f29c..80f429d 100644
--- a/java/org/apache/catalina/core/StandardThreadExecutor.java
+++ b/java/org/apache/catalina/core/StandardThreadExecutor.java
@@ -18,6 +18,7 @@
 package org.apache.catalina.core;

 import java.util.concurrent.RejectedExecutionException;
+import java.util.concurrent.ThreadFactory;
 import java.util.concurrent.TimeUnit;

 import org.apache.catalina.Executor;
@@ -27,10 +28,11 @@
 import org.apache.catalina.util.LifecycleMBeanBase;
 import org.apache.tomcat.util.threads.ResizableExecutor;
 import org.apache.tomcat.util.threads.TaskQueue;
 import org.apache.tomcat.util.threads.TaskThreadFactory;
+import org.apache.tomcat.util.threads.ThreadCreator;
 import
Re: Bug that spans tomcat and tomcat-native
Am 28.06.2016 um 18:06 schrieb therealnewo...@gmail.com:
> On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung wrote:
>> Am 28.06.2016 um 16:07 schrieb Mark Thomas:
>>> On 28/06/2016 12:28, Mark Thomas wrote:
>>>> On 28/06/2016 11:34, Rainer Jung wrote:
>>>>> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>>>>>
>>>>>> Index: src/ssl.c
>>>>>> ===================================================================
>>>>>> --- src/ssl.c    (revision 1750259)
>>>>>> +++ src/ssl.c    (working copy)
>>>>>> @@ -420,6 +420,10 @@
>>>>>>      return psaptr->PSATOLD;
>>>>>>  #elif defined(WIN32)
>>>>>>      return (unsigned long)GetCurrentThreadId();
>>>>>> +#elif defined(DARWIN)
>>>>>> +    uint64_t tid;
>>>>>> +    pthread_threadid_np(NULL, &tid);
>>>>>> +    return (unsigned long)tid;
>>>>>>  #else
>>>>>>      return (unsigned long)(apr_os_thread_current());
>>>>>>  #endif
>>>>>>
>>>>>> I want to do some similar testing for Linux before adding what I
>>>>>> suspect will be a very similar block using gettid().
>>>>>
>>>>> We could either add something to configure.in. Untested:
>>>>>
>>>>> Index: native/configure.in
>>>>> ===================================================================
>>>>> --- native/configure.in    (revision 1750462)
>>>>> +++ native/configure.in    (working copy)
>>>>> @@ -218,6 +218,9 @@
>>>>>      *-solaris2*)
>>>>>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>>>>          ;;
>>>>> +    *linux*)
>>>>> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>>>>> +        ;;
>>>>>      *)
>>>>>          ;;
>>>>>      esac
>>>>>
>>>>> and then use a
>>>>>
>>>>> #ifdef TCNATIVE_LINUX
>>>>>
>>>>> or we copy some other more direct linux check from e.g. APR:
>>>>>
>>>>> #ifdef __linux__
>>>>>
>>>>> The latter looks simpler, but the version above is based on all the
>>>>> logic put into config.guess.
>>>>
>>>> I'd go with the __linux__ option as that is consistent with what we
>>>> already use in os/unix/system.c
>>>>
>>>> I'm not against the change to configure.in, I just think we should be
>>>> consistent with how we do this throughout the code base.
>>>
>>> I've confirmed that the same problem occurs with hash bucket selection
>>> on linux and that switching to gettid() fixes that problem.
>>>
>>> I'm going to go ahead with the 1.2.8 release shortly. We can continue
>>> to refine this as necessary and have a more complete fix in 1.2.9.
>>
>> I did a quick check on Solaris. apr_os_thread_current() uses
>> pthread_self on Solaris like on Linux (actually on any Unix type OS),
>> but unlike Linux where this returns an address which is either 32 or
>> 64 bit aligned depending on address size, on Solaris you get an
>> increasing number starting with 1 for the first thread and incremented
>> by one for each following thread. Thread IDs do not get reused in the
>> same process, even if the thread finished, but thread IDs are common
>> between different processes, because they always start with 1. So
>> Solaris should be fine as-is.
>
> Does the value have a cap? If not then Solaris will just continue to
> use more and more memory as threads are created over the lifetime of
> the server.

No cap. I think everyone agrees we are still looking at cleaning up at
thread death as a further improvement. Whether there's a problem without
it depends on the executor size and whether the executor actually
destroys and recreates threads (thus using more IDs than the max thread
count).

Regards,

Rainer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org
Re: Bug that spans tomcat and tomcat-native
On Tue, Jun 28, 2016 at 11:51 AM, Rainer Jung wrote:
> Am 28.06.2016 um 16:07 schrieb Mark Thomas:
>> On 28/06/2016 12:28, Mark Thomas wrote:
>>> On 28/06/2016 11:34, Rainer Jung wrote:
>>>> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>>>>
>>>>> Index: src/ssl.c
>>>>> ===================================================================
>>>>> --- src/ssl.c    (revision 1750259)
>>>>> +++ src/ssl.c    (working copy)
>>>>> @@ -420,6 +420,10 @@
>>>>>      return psaptr->PSATOLD;
>>>>>  #elif defined(WIN32)
>>>>>      return (unsigned long)GetCurrentThreadId();
>>>>> +#elif defined(DARWIN)
>>>>> +    uint64_t tid;
>>>>> +    pthread_threadid_np(NULL, &tid);
>>>>> +    return (unsigned long)tid;
>>>>>  #else
>>>>>      return (unsigned long)(apr_os_thread_current());
>>>>>  #endif
>>>>>
>>>>> I want to do some similar testing for Linux before adding what I
>>>>> suspect will be a very similar block using gettid().
>>>>
>>>> We could either add something to configure.in. Untested:
>>>>
>>>> Index: native/configure.in
>>>> ===================================================================
>>>> --- native/configure.in    (revision 1750462)
>>>> +++ native/configure.in    (working copy)
>>>> @@ -218,6 +218,9 @@
>>>>      *-solaris2*)
>>>>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>>>          ;;
>>>> +    *linux*)
>>>> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>>>> +        ;;
>>>>      *)
>>>>          ;;
>>>>      esac
>>>>
>>>> and then use a
>>>>
>>>> #ifdef TCNATIVE_LINUX
>>>>
>>>> or we copy some other more direct linux check from e.g. APR:
>>>>
>>>> #ifdef __linux__
>>>>
>>>> The latter looks simpler, but the version above is based on all the
>>>> logic put into config.guess.
>>>
>>> I'd go with the __linux__ option as that is consistent with what we
>>> already use in os/unix/system.c
>>>
>>> I'm not against the change to configure.in, I just think we should be
>>> consistent with how we do this throughout the code base.
>>
>> I've confirmed that the same problem occurs with hash bucket selection
>> on linux and that switching to gettid() fixes that problem.
>>
>> I'm going to go ahead with the 1.2.8 release shortly. We can continue to
>> refine this as necessary and have a more complete fix in 1.2.9.
>
> I did a quick check on Solaris. apr_os_thread_current() uses pthread_self
> on Solaris like on Linux (actually on any Unix type OS), but unlike Linux
> where this returns an address which is either 32 or 64 bit aligned
> depending on address size, on Solaris you get an increasing number
> starting with 1 for the first thread and incremented by one for each
> following thread. Thread IDs do not get reused in the same process, even
> if the thread finished, but thread IDs are common between different
> processes, because they always start with 1. So Solaris should be fine
> as-is.

Does the value have a cap? If not then Solaris will just continue to
use more and more memory as threads are created over the lifetime of
the server.

-nate
Re: Bug that spans tomcat and tomcat-native
Am 28.06.2016 um 16:07 schrieb Mark Thomas:
> On 28/06/2016 12:28, Mark Thomas wrote:
>> On 28/06/2016 11:34, Rainer Jung wrote:
>>> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>>>
>>>> Index: src/ssl.c
>>>> ===================================================================
>>>> --- src/ssl.c    (revision 1750259)
>>>> +++ src/ssl.c    (working copy)
>>>> @@ -420,6 +420,10 @@
>>>>      return psaptr->PSATOLD;
>>>>  #elif defined(WIN32)
>>>>      return (unsigned long)GetCurrentThreadId();
>>>> +#elif defined(DARWIN)
>>>> +    uint64_t tid;
>>>> +    pthread_threadid_np(NULL, &tid);
>>>> +    return (unsigned long)tid;
>>>>  #else
>>>>      return (unsigned long)(apr_os_thread_current());
>>>>  #endif
>>>>
>>>> I want to do some similar testing for Linux before adding what I
>>>> suspect will be a very similar block using gettid().
>>>
>>> We could either add something to configure.in. Untested:
>>>
>>> Index: native/configure.in
>>> ===================================================================
>>> --- native/configure.in    (revision 1750462)
>>> +++ native/configure.in    (working copy)
>>> @@ -218,6 +218,9 @@
>>>      *-solaris2*)
>>>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>>          ;;
>>> +    *linux*)
>>> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>>> +        ;;
>>>      *)
>>>          ;;
>>>      esac
>>>
>>> and then use a
>>>
>>> #ifdef TCNATIVE_LINUX
>>>
>>> or we copy some other more direct linux check from e.g. APR:
>>>
>>> #ifdef __linux__
>>>
>>> The latter looks simpler, but the version above is based on all the
>>> logic put into config.guess.
>>
>> I'd go with the __linux__ option as that is consistent with what we
>> already use in os/unix/system.c
>>
>> I'm not against the change to configure.in, I just think we should be
>> consistent with how we do this throughout the code base.
>
> I've confirmed that the same problem occurs with hash bucket selection
> on linux and that switching to gettid() fixes that problem.
>
> I'm going to go ahead with the 1.2.8 release shortly. We can continue to
> refine this as necessary and have a more complete fix in 1.2.9.

I did a quick check on Solaris. apr_os_thread_current() uses pthread_self
on Solaris like on Linux (actually on any Unix type OS), but unlike Linux
where this returns an address which is either 32 or 64 bit aligned
depending on address size, on Solaris you get an increasing number
starting with 1 for the first thread and incremented by one for each
following thread. Thread IDs do not get reused in the same process, even
if the thread finished, but thread IDs are common between different
processes, because they always start with 1. So Solaris should be fine
as-is.

>>> I do think OS specific code to get the real thread ID is the way to go
>>> here until we can switch to OpenSSL 1.1.0 at which point the problem
>>> should go away.
>>
>> Agreed, but cleaning up at the end of a thread would be good as well,
>> e.g. if the executor is configured to renew threads when a context is
>> reloaded we can accumulate lots of old hash entries.
>
> Agreed.
>
> Mark

Rainer
Re: Bug that spans tomcat and tomcat-native
On 28/06/2016 12:28, Mark Thomas wrote:
> On 28/06/2016 11:34, Rainer Jung wrote:
>> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>>
>>> Index: src/ssl.c
>>> ===================================================================
>>> --- src/ssl.c    (revision 1750259)
>>> +++ src/ssl.c    (working copy)
>>> @@ -420,6 +420,10 @@
>>>      return psaptr->PSATOLD;
>>>  #elif defined(WIN32)
>>>      return (unsigned long)GetCurrentThreadId();
>>> +#elif defined(DARWIN)
>>> +    uint64_t tid;
>>> +    pthread_threadid_np(NULL, &tid);
>>> +    return (unsigned long)tid;
>>>  #else
>>>      return (unsigned long)(apr_os_thread_current());
>>>  #endif
>>>
>>> I want to do some similar testing for Linux before adding what I suspect
>>> will be a very similar block using gettid().
>>
>> We could either add something to configure.in. Untested:
>>
>> Index: native/configure.in
>> ===================================================================
>> --- native/configure.in    (revision 1750462)
>> +++ native/configure.in    (working copy)
>> @@ -218,6 +218,9 @@
>>      *-solaris2*)
>>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>>          ;;
>> +    *linux*)
>> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
>> +        ;;
>>      *)
>>          ;;
>>      esac
>>
>> and then use a
>>
>> #ifdef TCNATIVE_LINUX
>>
>> or we copy some other more direct linux check from e.g. APR:
>>
>> #ifdef __linux__
>>
>> The latter looks simpler, but the version above is based on all the
>> logic put into config.guess.
>
> I'd go with the __linux__ option as that is consistent with what we
> already use in os/unix/system.c
>
> I'm not against the change to configure.in, I just think we should be
> consistent with how we do this throughout the code base.

I've confirmed that the same problem occurs with hash bucket selection
on linux and that switching to gettid() fixes that problem.

I'm going to go ahead with the 1.2.8 release shortly. We can continue to
refine this as necessary and have a more complete fix in 1.2.9.

Mark

>>> I do think OS specific code to get the real thread ID is the way to go
>>> here until we can switch to OpenSSL 1.1.0 at which point the problem
>>> should go away.
>>
>> Agreed, but cleaning up at the end of a thread would be good as well,
>> e.g. if the executor is configured to renew threads when a context is
>> reloaded we can accumulate lots of old hash entries.
>
> Agreed.
>
> Mark
Re: Bug that spans tomcat and tomcat-native
On 28/06/2016 11:34, Rainer Jung wrote:
> Am 28.06.2016 um 11:15 schrieb Mark Thomas:
>>
>> Index: src/ssl.c
>> ===================================================================
>> --- src/ssl.c    (revision 1750259)
>> +++ src/ssl.c    (working copy)
>> @@ -420,6 +420,10 @@
>>      return psaptr->PSATOLD;
>>  #elif defined(WIN32)
>>      return (unsigned long)GetCurrentThreadId();
>> +#elif defined(DARWIN)
>> +    uint64_t tid;
>> +    pthread_threadid_np(NULL, &tid);
>> +    return (unsigned long)tid;
>>  #else
>>      return (unsigned long)(apr_os_thread_current());
>>  #endif
>>
>> I want to do some similar testing for Linux before adding what I suspect
>> will be a very similar block using gettid().
>
> We could either add something to configure.in. Untested:
>
> Index: native/configure.in
> ===================================================================
> --- native/configure.in    (revision 1750462)
> +++ native/configure.in    (working copy)
> @@ -218,6 +218,9 @@
>      *-solaris2*)
>          APR_ADDTO(TCNATIVE_LIBS, -lkstat)
>          ;;
> +    *linux*)
> +        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
> +        ;;
>      *)
>          ;;
>      esac
>
> and then use a
>
> #ifdef TCNATIVE_LINUX
>
> or we copy some other more direct linux check from e.g. APR:
>
> #ifdef __linux__
>
> The latter looks simpler, but the version above is based on all the
> logic put into config.guess.

I'd go with the __linux__ option as that is consistent with what we
already use in os/unix/system.c

I'm not against the change to configure.in, I just think we should be
consistent with how we do this throughout the code base.

>> I do think OS specific code to get the real thread ID is the way to go
>> here until we can switch to OpenSSL 1.1.0 at which point the problem
>> should go away.
>
> Agreed, but cleaning up at the end of a thread would be good as well,
> e.g. if the executor is configured to renew threads when a context is
> reloaded we can accumulate lots of old hash entries.

Agreed.

Mark
Re: Bug that spans tomcat and tomcat-native
Am 28.06.2016 um 11:15 schrieb Mark Thomas:
> On 27/06/2016 23:54, Nate Clark wrote:
>> On Mon, Jun 27, 2016 at 5:19 PM, Mark Thomas wrote:
>>> On 27/06/2016 19:59, Nate Clark wrote:
>>>> On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung wrote:
>>>>>
>>>>> I wouldn't care too much about portability right now. If you can
>>>>> prove Mark's analysis by using gettid() on Linux and checking that
>>>>> this fixes the performance issue, that would be very useful feedback.
>>>>> If it doesn't help, we can go down the path of using the hash
>>>>> statistics to get more info about what's happening. If it helps, we
>>>>> can think further about how to reduce hash collisions for every
>>>>> platform.
>>>
>>> I went down the hash statistics route any way. The short version is
>>> with 11 threads and 8 buckets every thread ends up in the same bucket.
>>> This was on OSX. I'm pretty sure Linux will be the same.
>>>
>>>> I can try that, but unfortunately will not be able to run the test
>>>> right away. The machine I was using is a shared resource, which
>>>> somebody else needs for the next few days.
>>>
>>> Next up is trying to figure out a patch to fix the bucket distribution.
>>> Here is my proposed fix for OSX:
>>>
>>> Index: src/ssl.c
>>> ===================================================================
>>> --- src/ssl.c    (revision 1750259)
>>> +++ src/ssl.c    (working copy)
>>> @@ -420,6 +420,10 @@
>>>      return psaptr->PSATOLD;
>>>  #elif defined(WIN32)
>>>      return (unsigned long)GetCurrentThreadId();
>>> +#elif defined(DARWIN)
>>> +    uint64_t tid;
>>> +    pthread_threadid_np(NULL, &tid);
>>> +    return (unsigned long)tid;
>>>  #else
>>>      return (unsigned long)(apr_os_thread_current());
>>>  #endif
>>>
>>> I want to do some similar testing for Linux before adding what I
>>> suspect will be a very similar block using gettid().

We could either add something to configure.in. Untested:

Index: native/configure.in
===================================================================
--- native/configure.in    (revision 1750462)
+++ native/configure.in    (working copy)
@@ -218,6 +218,9 @@
     *-solaris2*)
         APR_ADDTO(TCNATIVE_LIBS, -lkstat)
         ;;
+    *linux*)
+        APR_ADDTO(CFLAGS, -DTCNATIVE_LINUX)
+        ;;
     *)
         ;;
     esac

and then use a

#ifdef TCNATIVE_LINUX

or we copy some other more direct linux check from e.g. APR:

#ifdef __linux__

The latter looks simpler, but the version above is based on all the
logic put into config.guess.

Finally we could check against SYS_gettid, either during an autodetect
in configure or directly in the c source file. It seems gettid() must be
called as "syscall(SYS_gettid)" and was added in kernel 2.4.11 at the
end of 2001.

>> Even if that does help the performance and makes it so the hash map
>> has better distribution, it won't help with long term bloat. If a
>> system is running long enough and threads are started and stopped, the
>> hash map could grow to include up to almost max pid entries, 32768 by
>> default on Linux. With the size of the ERR_STATE struct that will be
>> roughly 12MB of memory, not horrible but not ideal. I still think
>> ERR_remove_thread_state() should probably be invoked prior to a thread
>> exiting. The only way I can think of doing that and keeping everything
>> in native and not call it for every request would be to call
>> pthread_cleanup_push(ERR_remove_thread_state, NULL) the first time a
>> thread is used to perform any ssl operation. That would then require
>> you to track per thread if the cleanup function had been registered.
>
> That is definitely a secondary consideration at the moment.
>
>> Also, any solution which makes the thread id random and not a bounded
>> set of values would cause the max memory footprint of the hash map to
>> be virtually unbounded and require that ERR_remove_thread_state be
>> invoked to clean up the error states.
>
> The function that generated the 'random' ID from the thread ID would
> have to be deterministic so the total number of values should remain
> bounded. Shouldn't it?

>> If the error queues are not being removed that would be a requirement.
>> If it is to be bounded it can not use apr_os_thread_current(), since
>> that is a "random" memory address and inherently unbounded. If you
>> tried to reduce that value to a smaller set you would then have to
>> worry about collisions, which is something any algorithm would have to
>> worry about. The code in err.c assumes that the error state is only
>> ever modified by one thread and doesn't lock modifying the state, just
>> the map access. The easiest thing to base the ID on, from a Linux
>> perspective, would be the TID, which is bounded and guaranteed to be
>> unique for a running thread. I am not as sure about other operating
>> systems. It seems like on some versions of OS X pthread_getthreadid_np()
>> might do what you want.
>
> I do think OS specific code to get the real thread ID is the way to go
> here until we can switch to OpenSSL 1.1.0 at which point the problem
> should go away.

Agreed, but cleaning up at the end of a thread would be good as well,
e.g. if the executor is configured to renew threads when a context is
reloaded we can accumulate lots of old hash entries.
Re: Bug that spans tomcat and tomcat-native
On 27/06/2016 23:54, Nate Clark wrote:
> On Mon, Jun 27, 2016 at 5:19 PM, Mark Thomas wrote:
>> On 27/06/2016 19:59, Nate Clark wrote:
>>> On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung wrote:
>>
>>>> I wouldn't care too much about portability right now. If you can
>>>> prove Mark's analysis by using gettid() on Linux and checking that
>>>> this fixes the performance issue, that would be very useful feedback.
>>>> If it doesn't help, we can go down the path of using the hash
>>>> statistics to get more info about what's happening. If it helps, we
>>>> can think further about how to reduce hash collisions for every
>>>> platform.
>>
>> I went down the hash statistics route any way. The short version is with
>> 11 threads and 8 buckets every thread ends up in the same bucket. This
>> was on OSX. I'm pretty sure Linux will be the same.
>>
>>> I can try that, but unfortunately will not be able to run the test
>>> right away. The machine I was using is a shared resource, which
>>> somebody else needs for the next few days.
>>
>> Next up is trying to figure out a patch to fix the bucket distribution.
>> Here is my proposed fix for OSX:
>>
>> Index: src/ssl.c
>> ===================================================================
>> --- src/ssl.c    (revision 1750259)
>> +++ src/ssl.c    (working copy)
>> @@ -420,6 +420,10 @@
>>      return psaptr->PSATOLD;
>>  #elif defined(WIN32)
>>      return (unsigned long)GetCurrentThreadId();
>> +#elif defined(DARWIN)
>> +    uint64_t tid;
>> +    pthread_threadid_np(NULL, &tid);
>> +    return (unsigned long)tid;
>>  #else
>>      return (unsigned long)(apr_os_thread_current());
>>  #endif
>>
>> I want to do some similar testing for Linux before adding what I
>> suspect will be a very similar block using gettid().
>
>>> Even if that does help the performance and makes it so the hash map
>>> has better distribution, it won't help with long term bloat. If a
>>> system is running long enough and threads are started and stopped, the
>>> hash map could grow to include up to almost max pid entries, 32768 by
>>> default on Linux. With the size of the ERR_STATE struct that will be
>>> roughly 12MB of memory, not horrible but not ideal. I still think
>>> ERR_remove_thread_state() should probably be invoked prior to a thread
>>> exiting. The only way I can think of doing that and keeping everything
>>> in native and not call it for every request would be to call
>>> pthread_cleanup_push(ERR_remove_thread_state, NULL) the first time a
>>> thread is used to perform any ssl operation. That would then require
>>> you to track per thread if the cleanup function had been registered.
>>
>> That is definitely a secondary consideration at the moment.
>>
>>> Also, any solution which makes the thread id random and not a bounded
>>> set of values would cause the max memory footprint of the hash map to
>>> be virtually unbounded and require that ERR_remove_thread_state be
>>> invoked to clean up the error states.
>>
>> The function that generated the 'random' ID from the thread ID would
>> have to be deterministic so the total number of values should remain
>> bounded. Shouldn't it?
>
> If the error queues are not being removed that would be a requirement.
> If it is to be bounded it can not use apr_os_thread_current(), since
> that is a "random" memory address and inherently unbounded. If you
> tried to reduce that value to a smaller set you would then have to
> worry about collisions, which is something any algorithm would have to
> worry about. The code in err.c assumes that the error state is only
> ever modified by one thread and doesn't lock modifying the state, just
> the map access. The easiest thing to base the ID on, from a Linux
> perspective, would be the TID, which is bounded and guaranteed to be
> unique for a running thread. I am not as sure about other operating
> systems. It seems like on some versions of OS X pthread_getthreadid_np()
> might do what you want.

I do think OS specific code to get the real thread ID is the way to go
here until we can switch to OpenSSL 1.1.0 at which point the problem
should go away.

Mark
Re: Bug that spans tomcat and tomcat-native
On Mon, Jun 27, 2016 at 5:19 PM, Mark Thomaswrote: > On 27/06/2016 19:59, Nate Clark wrote: >> On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung wrote: > > > >>> I wouldn't care too much about portability right now. If you can prove >>> Mark's analysis by using gettid() on Linux and checking that this fixes the >>> performance issue, that would very useful feedback. If it doesn't help, we >>> can go down the path of using the hash statistics to get more info about >>> what's happening. If it helps, we can think further about how to reduce hash >>> collisions fo every platform. > > I went down the hash statistics route any way. The short version is with > 11 threads and 8 buckets every thread ends up in the same bucket. This > was on OSX. I'm pretty sure Linux will be the same. > >> I can try that, but unfortunately will not be able to run the test >> right away. The machine I was using is a shared resource, which >> somebody else needs for the next few days. > > Next up is trying to figure out a patch to fix the bucket distribution. > >> Even if that does help the performance and makes it so the hash map >> has better distribution. It won't help with long term bloat. If a >> system is running long enough and threads are started and stopped the >> hash map could grow to include up to almost max pid entries, 32768 by >> default on Linux. With the size of the ERR_STATE struct that will be >> roughly 12MB of memory, not horrible but not ideal. I still think >> ERR_remove_thread_state() should probably be invoked prior to a thread >> exiting. The only way I can think of doing that and keeping everything >> in native and not call it for every request would be to call >> pthread_cleanup_push(ERR_remove_thread_state, NULL) the first time a >> thread is used to perform any ssl operation. That would then require >> you to track per thread if the cleanup function had been registered. > > That is definitely a secondary consideration at the moment. 
> >> Also, any solution which makes the thread id random and not a bounded >> set of values would cause the max memory footprint of the hash map to >> be virtually unbounded and require that ERR_remove_thread_state be >> invoked to clean up the error states. > > The function that generated the 'random' ID from the thread ID would > have to be deterministic so the total number of values should remain > bounded. Shouldn't it? > If the error queues are not being removed, that would be a requirement. If it is to be bounded it cannot use apr_os_thread_current(), since that is a "random" memory address and inherently unbounded. If you tried to reduce that value to a smaller set you would then have to worry about collisions, which any algorithm would have to handle. The code in err.c assumes that the error state is only ever modified by one thread and doesn't lock modifications to the state, just the map access. The easiest thing to base the ID on, from a Linux perspective, would be the TID, which is bounded and guaranteed to be unique for a running thread. I am not as sure about other operating systems. It seems like on some versions of OS X pthread_getthreadid_np() might do what you want. -nate - To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org For additional commands, e-mail: dev-h...@tomcat.apache.org
Re: Bug that spans tomcat and tomcat-native
On 27/06/2016 19:59, Nate Clark wrote: > On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung wrote: >> I wouldn't care too much about portability right now. If you can prove >> Mark's analysis by using gettid() on Linux and checking that this fixes the >> performance issue, that would be very useful feedback. If it doesn't help, we >> can go down the path of using the hash statistics to get more info about >> what's happening. If it helps, we can think further about how to reduce hash >> collisions for every platform. I went down the hash statistics route anyway. The short version is with 11 threads and 8 buckets every thread ends up in the same bucket. This was on OSX. I'm pretty sure Linux will be the same. > I can try that, but unfortunately will not be able to run the test > right away. The machine I was using is a shared resource, which > somebody else needs for the next few days. Next up is trying to figure out a patch to fix the bucket distribution. > Even if that does help the performance and makes it so the hash map > has better distribution, it won't help with long term bloat. If a > system is running long enough and threads are started and stopped the > hash map could grow to include up to almost max pid entries, 32768 by > default on Linux. With the size of the ERR_STATE struct that will be > roughly 12MB of memory, not horrible but not ideal. I still think > ERR_remove_thread_state() should probably be invoked prior to a thread > exiting. The only way I can think of doing that and keeping everything > in native and not calling it for every request would be to call > pthread_cleanup_push(ERR_remove_thread_state, NULL) the first time a > thread is used to perform any ssl operation. That would then require > you to track per thread if the cleanup function had been registered. That is definitely a secondary consideration at the moment. 
> Also, any solution which makes the thread id random and not a bounded > set of values would cause the max memory footprint of the hash map to > be virtually unbounded and require that ERR_remove_thread_state be > invoked to clean up the error states. The function that generated the 'random' ID from the thread ID would have to be deterministic so the total number of values should remain bounded. Shouldn't it? Mark
Re: Bug that spans tomcat and tomcat-native
On Mon, Jun 27, 2016 at 2:13 PM, Rainer Jung wrote: > Am 27.06.2016 um 18:47 schrieb Nate Clark: >> >> On Mon, Jun 27, 2016 at 12:17 PM, wrote: >>> >>> On Mon, Jun 27, 2016 at 12:04 PM, wrote: On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung wrote: > > Great analysis. I was really wondering what could make the hash map so > huge > and hadn't thought about the hash function as the problem. > > Before OpenSSL 1.1.0 there's a callback for applications to provide > their > own thread IDs: > > > https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html > > So we could probably work around the problem of the poor hashing > function by > passing in IDs that work for hashing (pre-hashed ID?). Of course then > we > lose the direct association of the OpenSSL thread ID with the real > platform > thread id. > > Currently our callback in tcnative is ssl_set_thread_id() which refers > to > ssl_thread_id(), which on Linux gets the ID from the APR function > apr_os_thread_current(). So we could add some hashing formula in > ssl_thread_id(). I think just using the real thread id would work, since right now it isn't using the real thread id; instead it is using the address location of errno. If this was the real thread id I think the hash algorithm and bucket selection they have now will work much better since the thread ids are basically numerically increasing and aren't aligned to powers of 2. >>> >>> >>> Sorry, I need to read code a bit closer. I misread the >>> OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0. >>> >>> I would still expect real thread ids to provide enough of a >>> distribution in a map where the hash bucket ends up being ((ID * >>> 13) & 0x7F). >> >> >> OK, this now makes more sense: apr_os_thread_current() calls >> pthread_self() which returns a pthread_t which is basically a handle >> for pthread's internal state and not an actual ID. It might be better >> to use gettid(), but that isn't as portable. 
> > > I wouldn't care too much about portability right now. If you can prove > Mark's analysis by using gettid() on Linux and checking that this fixes the > performance issue, that would be very useful feedback. If it doesn't help, we > can go down the path of using the hash statistics to get more info about > what's happening. If it helps, we can think further about how to reduce hash > collisions for every platform. I can try that, but unfortunately will not be able to run the test right away. The machine I was using is a shared resource, which somebody else needs for the next few days. Even if that does help the performance and makes it so the hash map has better distribution, it won't help with long term bloat. If a system is running long enough and threads are started and stopped the hash map could grow to include up to almost max pid entries, 32768 by default on Linux. With the size of the ERR_STATE struct that will be roughly 12MB of memory, not horrible but not ideal. I still think ERR_remove_thread_state() should probably be invoked prior to a thread exiting. The only way I can think of doing that and keeping everything in native and not calling it for every request would be to call pthread_cleanup_push(ERR_remove_thread_state, NULL) the first time a thread is used to perform any ssl operation. That would then require you to track per thread if the cleanup function had been registered. Also, any solution which makes the thread id random and not a bounded set of values would cause the max memory footprint of the hash map to be virtually unbounded and require that ERR_remove_thread_state be invoked to clean up the error states. -nate
Re: Bug that spans tomcat and tomcat-native
Am 27.06.2016 um 18:47 schrieb Nate Clark: On Mon, Jun 27, 2016 at 12:17 PM, wrote: On Mon, Jun 27, 2016 at 12:04 PM, wrote: On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung wrote: Great analysis. I was really wondering what could make the hash map so huge and hadn't thought about the hash function as the problem. Before OpenSSL 1.1.0 there's a callback for applications to provide their own thread IDs: https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html So we could probably work around the problem of the poor hashing function by passing in IDs that work for hashing (pre-hashed ID?). Of course then we lose the direct association of the OpenSSL thread ID with the real platform thread id. Currently our callback in tcnative is ssl_set_thread_id() which refers to ssl_thread_id(), which on Linux gets the ID from the APR function apr_os_thread_current(). So we could add some hashing formula in ssl_thread_id(). I think just using the real thread id would work, since right now it isn't using the real thread id; instead it is using the address location of errno. If this was the real thread id I think the hash algorithm and bucket selection they have now will work much better since the thread ids are basically numerically increasing and aren't aligned to powers of 2. Sorry, I need to read code a bit closer. I misread the OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0. I would still expect real thread ids to provide enough of a distribution in a map where the hash bucket ends up being ((ID * 13) & 0x7F). OK, this now makes more sense: apr_os_thread_current() calls pthread_self() which returns a pthread_t which is basically a handle for pthread's internal state and not an actual ID. It might be better to use gettid(), but that isn't as portable. I wouldn't care too much about portability right now. If you can prove Mark's analysis by using gettid() on Linux and checking that this fixes the performance issue, that would be very useful feedback. 
If it doesn't help, we can go down the path of using the hash statistics to get more info about what's happening. If it helps, we can think further about how to reduce hash collisions for every platform. Regards, Rainer
Re: Bug that spans tomcat and tomcat-native
On Mon, Jun 27, 2016 at 12:17 PM, wrote: > On Mon, Jun 27, 2016 at 12:04 PM, wrote: >> On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung >> wrote: >>> Great analysis. I was really wondering what could make the hash map so huge >>> and hadn't thought about the hash function as the problem. >>> >>> Before OpenSSL 1.1.0 there's a callback for applications to provide their >>> own thread IDs: >>> >>> https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html >>> >>> So we could probably work around the problem of the poor hashing function by >>> passing in IDs that work for hashing (pre-hashed ID?). Of course then we >>> lose the direct association of the OpenSSL thread ID with the real platform >>> thread id. >>> >>> Currently our callback in tcnative is ssl_set_thread_id() which refers to >>> ssl_thread_id(), which on Linux gets the ID from the APR function >>> apr_os_thread_current(). So we could add some hashing formula in >>> ssl_thread_id(). >> >> I think just using the real thread id would work, since right now it isn't >> using the real thread id; instead it is using the address location of >> errno. If this was the real thread id I think the hash algorithm and >> bucket selection they have now will work much better since the thread >> ids are basically numerically increasing and aren't aligned to powers >> of 2. > > Sorry, I need to read code a bit closer. I misread the > OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0. > > I would still expect real thread ids to provide enough of a > distribution in a map where the hash bucket ends up being ((ID * > 13) & 0x7F). OK, this now makes more sense: apr_os_thread_current() calls pthread_self() which returns a pthread_t which is basically a handle for pthread's internal state and not an actual ID. It might be better to use gettid(), but that isn't as portable. -nate
Re: Bug that spans tomcat and tomcat-native
On Mon, Jun 27, 2016 at 12:04 PM, wrote: > On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung wrote: >> Great analysis. I was really wondering what could make the hash map so huge >> and hadn't thought about the hash function as the problem. >> >> Before OpenSSL 1.1.0 there's a callback for applications to provide their >> own thread IDs: >> >> https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html >> >> So we could probably work around the problem of the poor hashing function by >> passing in IDs that work for hashing (pre-hashed ID?). Of course then we >> lose the direct association of the OpenSSL thread ID with the real platform >> thread id. >> >> Currently our callback in tcnative is ssl_set_thread_id() which refers to >> ssl_thread_id(), which on Linux gets the ID from the APR function >> apr_os_thread_current(). So we could add some hashing formula in >> ssl_thread_id(). > > I think just using the real thread id would work, since right now it isn't > using the real thread id; instead it is using the address location of > errno. If this was the real thread id I think the hash algorithm and > bucket selection they have now will work much better since the thread > ids are basically numerically increasing and aren't aligned to powers > of 2. Sorry, I need to read code a bit closer. I misread the OPENSSL_VERSION_NUMBER < 0x1010L as version < 1.0.1 and not 1.1.0. I would still expect real thread ids to provide enough of a distribution in a map where the hash bucket ends up being ((ID * 13) & 0x7F). -nate
Re: Bug that spans tomcat and tomcat-native
On 27/06/2016 17:04, therealnewo...@gmail.com wrote: > On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung wrote: Thanks for the feedback. >> So we could probably work around the problem of the poor hashing function by >> passing in IDs that work for hashing (pre-hashed ID?). Of course then we >> lose the direct association of the OpenSSL thread ID with the real platform >> thread id. >> >> Currently our callback in tcnative is ssl_set_thread_id() which refers to >> ssl_thread_id(), which on Linux gets the ID from the APR function >> apr_os_thread_current(). So we could add some hashing formula in >> ssl_thread_id(). > > I think just using the real thread id would work, since right now it isn't > using the real thread id; instead it is using the address location of > errno. If this was the real thread id I think the hash algorithm and > bucket selection they have now will work much better since the thread > ids are basically numerically increasing and aren't aligned to powers > of 2. Just so you aren't surprised, I'm going to go ahead and start the 1.2.8 release without this fix. Regarding this fix, I plan to add Rainer's suggested debug code to try and confirm my previous analysis and then code up a fix based on the real thread id for Nate to test. Nate, if you have enough info in this thread to work on a patch along those lines yourself please do feel free to go ahead and don't wait for me. Thanks, Mark
Re: Bug that spans tomcat and tomcat-native
On Mon, Jun 27, 2016 at 11:54 AM, Rainer Jung wrote: > Hi Mark, > > > Am 27.06.2016 um 15:11 schrieb Mark Thomas: >> >> I believe I have an explanation for what is going on that fits both the >> reported behaviour and the proposed fix. >> >> Background >> == >> >> OpenSSL tracks a list of the most recent errors for each thread in a >> hash map keyed on the thread (int_thread_hash in err.c). Reading and >> writing to this hash map is protected by a lock. The hash map is created >> and populated lazily. >> >> tc-native calls ERR_clear_error() before every call to >> SSL_do_handshake(), SSL_read() and SSL_write(). The call to >> ERR_clear_error() either clears the error list for the current thread or >> inserts a new empty list into the hash map if the thread is not already >> present. >> >> The performance problem was tracked down to threads waiting in >> ERR_clear_error() to obtain the write lock for the hash map. >> >> The proposed solution was to call ERR_remove_thread_state() just before >> the current Tomcat thread processing the connection is returned to the >> thread pool. This method removes the current thread and its associated >> error list from the hash map. >> >> >> Analysis >> >> >> The proposed solution, calling ERR_remove_thread_state(), adds a call >> that also obtains the write lock for the hash map. This indicates that >> the problem is not delays in obtaining the lock but contention for the >> lock because one or more operations taking place within the lock are >> taking a long time. >> >> Removing unused threads from the hash map removes the bottleneck. This >> points towards the hash map being the source of the problem. >> >> Testing by the OP showed that as soon as a test had been run that >> required ~ 400 concurrent threads performance dropped significantly. It >> did not get noticeably worse if the same 400 thread test was run >> repeatedly. 
>> >> My testing indicated, on OSX at least, that the thread IDs used in the >> hash map were stable and that uncontrolled growth of the hash map was >> unlikely to be the cause. >> >> The manner in which thread IDs are generated varies by platform. On >> Linux, where this problem was observed, the thread ID is derived from >> (is normally equal to) the memory address of the per thread errno >> variable. This means that thread IDs tend to be concentrated in a >> relatively narrow range of values. For example, in a simple 10 thread >> test on OSX thread IDs ranged from 123145344839680 to 123145354387455. >> >> Thread IDs therefore fall within a 10^7 range within a possible range of >> 1.8x10^19. i.e. a very small, contiguous range. >> >> Hash maps use hashing functions to ensure that entries are (roughly) >> evenly distributed between the available buckets. The hash function, >> err_state_hash, used for the thread IDs in OpenSSL is threadID * 13. >> >> Supposition >> === >> >> The hash function used (multiply by 13) is insufficient to distribute >> the resulting values across multiple buckets because they will still >> fall in a relatively narrow band. Therefore all the threads end up in a >> single bucket which makes the performance of the hash map poor. This in >> turn makes calls to thread_get_item() slow because it does a hash map >> lookup. This lookup is performed with the read lock held for the hash >> map which in turn will slow down the calls that require the write lock. >> >> Proposal >> >> >> The analysis and supposition above need to be checked by someone with a >> better understanding of C than me. Assuming my work is correct, the next >> step is to look at possible fixes. I do not believe that patching >> OpenSSL is a viable option. >> >> The OpenSSL API needs to be reviewed to see if there is a way to avoid >> the calls that require the write lock. 
>> >> If the write lock cannot be avoided then we need to see if there is a >> better place to call ERR_remove_thread_state(). I'd like to fix this >> entirely in tc-native but that may mean calling >> ERR_remove_thread_state() more frequently which could create its own >> performance problems. >> >> Nate - I may have some patches for you to test in the next few days. >> >> Mark > > > Great analysis. I was really wondering what could make the hash map so huge > and hadn't thought about the hash function as the problem. > > Before OpenSSL 1.1.0 there's a callback for applications to provide their > own thread IDs: > > https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html > > So we could probably work around the problem of the poor hashing function by > passing in IDs that work for hashing (pre-hashed ID?). Of course then we > lose the direct association of the OpenSSL thread ID with the real platform > thread id. > > Currently our callback in tcnative is ssl_set_thread_id() which refers to > ssl_thread_id(), which on Linux gets the ID from the APR function > apr_os_thread_current(). So we could add some
Re: Bug that spans tomcat and tomcat-native
On Mon, Jun 27, 2016 at 9:11 AM, Mark Thomas wrote: > I believe I have an explanation for what is going on that fits both the > reported behaviour and the proposed fix. > > Background > == > > OpenSSL tracks a list of the most recent errors for each thread in a > hash map keyed on the thread (int_thread_hash in err.c). Reading and > writing to this hash map is protected by a lock. The hash map is created > and populated lazily. > > tc-native calls ERR_clear_error() before every call to > SSL_do_handshake(), SSL_read() and SSL_write(). The call to > ERR_clear_error() either clears the error list for the current thread or > inserts a new empty list into the hash map if the thread is not already > present. > > The performance problem was tracked down to threads waiting in > ERR_clear_error() to obtain the write lock for the hash map. > > The proposed solution was to call ERR_remove_thread_state() just before > the current Tomcat thread processing the connection is returned to the > thread pool. This method removes the current thread and its associated > error list from the hash map. > > > Analysis > > > The proposed solution, calling ERR_remove_thread_state(), adds a call > that also obtains the write lock for the hash map. This indicates that > the problem is not delays in obtaining the lock but contention for the > lock because one or more operations taking place within the lock are > taking a long time. > > Removing unused threads from the hash map removes the bottleneck. This > points towards the hash map being the source of the problem. > > Testing by the OP showed that as soon as a test had been run that > required ~ 400 concurrent threads performance dropped significantly. It > did not get noticeably worse if the same 400 thread test was run repeatedly. > > My testing indicated, on OSX at least, that the thread IDs used in the > hash map were stable and that uncontrolled growth of the hash map was > unlikely to be the cause. 
> > The manner in which thread IDs are generated varies by platform. On > Linux, where this problem was observed, the thread ID is derived from > (is normally equal to) the memory address of the per thread errno > variable. This means that thread IDs tend to be concentrated in a > relatively narrow range of values. For example, in a simple 10 thread > test on OSX thread IDs ranged from 123145344839680 to 123145354387455. > > Thread IDs therefore fall within a 10^7 range within a possible range of > 1.8x10^19. i.e. a very small, contiguous range. > > Hash maps use hashing functions to ensure that entries are (roughly) > evenly distributed between the available buckets. The hash function, > err_state_hash, used for the thread IDs in OpenSSL is threadID * 13. > > Supposition > === > > The hash function used (multiply by 13) is insufficient to distribute > the resulting values across multiple buckets because they will still > fall in a relatively narrow band. Therefore all the threads end up in a > single bucket which makes the performance of the hash map poor. This in > turn makes calls to thread_get_item() slow because it does a hash map > lookup. This lookup is performed with the read lock held for the hash > map which in turn will slow down the calls that require the write lock. This makes sense. Looking at openssl source code a bit more, I think it is a combination of the hash function and bucket selection in the hash map. The bucket which is chosen is basically done by a low bit mask since pmax and num_alloc_nodes are always powers of 2. With 400 entries pmax=128 and num_alloc_nodes=256, making the bucket selection either hash & 0x7F or hash & 0xFF. The errno value memory locations are probably aligned and often that mask will result in the same bucket being chosen. > Proposal > > The analysis and supposition above need to be checked by someone with a > better understanding of C than me. Assuming my work is correct, the next > step is to look at possible fixes. 
I do not believe that patching > OpenSSL is a viable option. > > The OpenSSL API needs to be reviewed to see if there is a way to avoid > the calls that require the write lock. > > If the write lock cannot be avoided then we need to see if there is a > better place to call ERR_remove_thread_state(). I'd like to fix this > entirely in tc-native but that may mean calling > ERR_remove_thread_state() more frequently which could create its own > performance problems. > > Nate - I may have some patches for you to test in the next few days. Sounds good. Thanks, -nate
Re: Bug that spans tomcat and tomcat-native
Hi Mark, Am 27.06.2016 um 15:11 schrieb Mark Thomas: I believe I have an explanation for what is going on that fits both the reported behaviour and the proposed fix. Background == OpenSSL tracks a list of the most recent errors for each thread in a hash map keyed on the thread (int_thread_hash in err.c). Reading and writing to this hash map is protected by a lock. The hash map is created and populated lazily. tc-native calls ERR_clear_error() before every call to SSL_do_handshake(), SSL_read() and SSL_write(). The call to ERR_clear_error() either clears the error list for the current thread or inserts a new empty list into the hash map if the thread is not already present. The performance problem was tracked down to threads waiting in ERR_clear_error() to obtain the write lock for the hash map. The proposed solution was to call ERR_remove_thread_state() just before the current Tomcat thread processing the connection is returned to the thread pool. This method removes the current thread and its associated error list from the hash map. Analysis The proposed solution, calling ERR_remove_thread_state(), adds a call that also obtains the write lock for the hash map. This indicates that the problem is not delays in obtaining the lock but contention for the lock because one or more operations taking place within the lock are taking a long time. Removing unused threads from the hash map removes the bottleneck. This points towards the hash map being the source of the problem. Testing by the OP showed that as soon as a test had been run that required ~ 400 concurrent threads performance dropped significantly. It did not get noticeably worse if the same 400 thread test was run repeatedly. My testing indicated, on OSX at least, that the thread IDs used in the hash map were stable and that uncontrolled growth of the hash map was unlikely to be the cause. The manner in which thread IDs are generated varies by platform. 
On Linux, where this problem was observed, the thread ID is derived from (is normally equal to) the memory address of the per thread errno variable. This means that thread IDs tend to be concentrated in a relatively narrow range of values. For example, in a simple 10 thread test on OSX thread IDs ranged from 123145344839680 to 123145354387455. Thread IDs therefore fall within a 10^7 range within a possible range of 1.8x10^19. i.e. a very small, contiguous range. Hash maps use hashing functions to ensure that entries are (roughly) evenly distributed between the available buckets. The hash function, err_state_hash, used for the thread IDs in OpenSSL is threadID * 13. Supposition === The hash function used (multiply by 13) is insufficient to distribute the resulting values across multiple buckets because they will still fall in a relatively narrow band. Therefore all the threads end up in a single bucket which makes the performance of the hash map poor. This in turn makes calls to thread_get_item() slow because it does a hash map lookup. This lookup is performed with the read lock held for the hash map which in turn will slow down the calls that require the write lock. Proposal The analysis and supposition above need to be checked by someone with a better understanding of C than me. Assuming my work is correct, the next step is to look at possible fixes. I do not believe that patching OpenSSL is a viable option. The OpenSSL API needs to be reviewed to see if there is a way to avoid the calls that require the write lock. If the write lock cannot be avoided then we need to see if there is a better place to call ERR_remove_thread_state(). I'd like to fix this entirely in tc-native but that may mean calling ERR_remove_thread_state() more frequently which could create its own performance problems. Nate - I may have some patches for you to test in the next few days. Mark Great analysis. 
I was really wondering what could make the hash map so huge and hadn't thought about the hash function as the problem. Before OpenSSL 1.1.0 there's a callback for applications to provide their own thread IDs: https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html So we could probably work around the problem of the poor hashing function by passing in IDs that work for hashing (pre-hashed ID?). Of course then we lose the direct association of the OpenSSL thread ID with the real platform thread id. Currently our callback in tcnative is ssl_set_thread_id() which refers to ssl_thread_id(), which on Linux gets the ID from the APR function apr_os_thread_current(). So we could add some hashing formula in ssl_thread_id(). In OpenSSL 1.1.0 this thread ID callback is gone. We'll see whether the lock hash problem is gone too ... I think we could add some logging to the hash impl to verify the assumption that many thread IDs result in the same hash value. The lhash impl already seems to maintain quite a few statistics:
Re: Bug that spans tomcat and tomcat-native
I believe I have an explanation for what is going on that fits both the reported behaviour and the proposed fix. Background == OpenSSL tracks a list of the most recent errors for each thread in a hash map keyed on the thread (int_thread_hash in err.c). Reading and writing to this hash map is protected by a lock. The hash map is created and populated lazily. tc-native calls ERR_clear_error() before every call to SSL_do_handshake(), SSL_read() and SSL_write(). The call to ERR_clear_error() either clears the error list for the current thread or inserts a new empty list into the hash map if the thread is not already present. The performance problem was tracked down to threads waiting in ERR_clear_error() to obtain the write lock for the hash map. The proposed solution was to call ERR_remove_thread_state() just before the current Tomcat thread processing the connection is returned to the thread pool. This method removes the current thread and its associated error list from the hash map. Analysis The proposed solution, calling ERR_remove_thread_state(), adds a call that also obtains the write lock for the hash map. This indicates that the problem is not delays in obtaining the lock but contention for the lock because one or more operations taking place within the lock are taking a long time. Removing unused threads from the hash map removes the bottleneck. This points towards the hash map being the source of the problem. Testing by the OP showed that as soon as a test had been run that required ~ 400 concurrent threads performance dropped significantly. It did not get noticeably worse if the same 400 thread test was run repeatedly. My testing indicated, on OSX at least, that the thread IDs used in the hash map were stable and that uncontrolled growth of the hash map was unlikely to be the cause. The manner in which thread IDs are generated varies by platform. 
On Linux, where this problem was observed, the thread ID is derived from (is normally equal to) the memory address of the per thread errno variable. This means that thread IDs tend to be concentrated in a relatively narrow range of values. For example, in a simple 10 thread test on OSX thread IDs ranged from 123145344839680 to 123145354387455. Thread IDs therefore fall within a 10^7 range within a possible range of 1.8x10^19. i.e. a very small, contiguous range. Hash maps use hashing functions to ensure that entries are (roughly) evenly distributed between the available buckets. The hash function, err_state_hash, used for the thread IDs in OpenSSL is threadID * 13. Supposition === The hash function used (multiply by 13) is insufficient to distribute the resulting values across multiple buckets because they will still fall in a relatively narrow band. Therefore all the threads end up in a single bucket which makes the performance of the hash map poor. This in turn makes calls to thread_get_item() slow because it does a hash map lookup. This lookup is performed with the read lock held for the hash map which in turn will slow down the calls that require the write lock. Proposal The analysis and supposition above need to be checked by someone with a better understanding of C than me. Assuming my work is correct, the next step is to look at possible fixes. I do not believe that patching OpenSSL is a viable option. The OpenSSL API needs to be reviewed to see if there is a way to avoid the calls that require the write lock. If the write lock cannot be avoided then we need to see if there is a better place to call ERR_remove_thread_state(). I'd like to fix this entirely in tc-native but that may mean calling ERR_remove_thread_state() more frequently which could create its own performance problems. Nate - I may have some patches for you to test in the next few days. 
Mark - To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org For additional commands, e-mail: dev-h...@tomcat.apache.org
Re: Bug that spans tomcat and tomcat-native
Thanks. This is all really helpful. I need to spend some time digesting all of this information, reviewing the OpenSSL code and trying to figure out what might be going on. One of the things I am trying to do is to understand why the proposed patch fixes the issue and if an alternative approach is possible. Ideally, I'd like to fix this entirely in tc-native so we don't have to cross the Java-native boundary an additional time, which is relatively slow. Thanks for all your help tracking this down so far. Mark On 25/06/2016 00:51, therealnewo...@gmail.com wrote: > On Fri, Jun 24, 2016 at 5:31 PM, Mark Thomas wrote: >> On 24/06/2016 21:52, therealnewo...@gmail.com wrote: >> I'm wondering if this is a problem that builds up over time. If I understood your previous posts correctly, running the big tests immediately gave ~700MB/s whereas running the small tests then the big tests resulted in ~350MB/s during the big tests. Are you able to experiment with this a little bit? For example, if you do big tests, 1M (~20%) small tests, big tests, 1M small tests, big tests etc. What is the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests. >>> >>> Sure I can try that. For the in between tests do you want me to run >>> those for a set amount of time or number of files? Like each smaller >>> batch like 20min and then 10min of large and then next smaller size? >> >> I was thinking set number of files. >> >> It would also be useful to know how many threads the executor has created >> at each point as well. (JMX should tell you that. You might need to >> adjust the executor so it doesn't stop idle threads.). > > I saw your message about not stopping idle threads after I already > started things. 
> > 1st 100M test: > 851.348MB/s > Executor: > largestPoolSize: 25 > poolSize: 25 > 1st 4k test > Executor: > largestPoolSize: 401 > poolSize: 401 > 2nd 100M test: > 460.147MB/s > Executor: > largestPoolSize: 401 > poolSize: 79 > 2nd 4k test > Executor: > largestPoolSize: 414 > poolSize: 414 > 3rd 100M test: > 429.127MB/s > Executor: > largestPoolSize: 414 > poolSize: 80 > 3rd 4k test: > Executor: > largestPoolSize: 414 > poolSize: 401 > 4th 100M test: > 437.918MB/s > Executor: > largestPoolSize: 414 > poolSize: 86 > 4th 4k test: > Executor: > largestPoolSize: 414 > poolSize: 401 > 5th 100M test: > 464.837MB/s > Executor: > largestPoolSize: 414 > poolSize: 87 > > It looks like the problem occurs right after the first set of 4k puts > and doesn't get any worse, so whatever causes the issue happens early. > This is getting stranger and I really cannot explain why calling > ERR_remove_thread_state reliably improves performance. > >> Going back to your original description, you said you saw blocking >> during the call to ERR_clear_err(). Did you mean ERR_clear_error()? >> Either way, could you provide the full stack trace of an example blocked >> thread? And, ideally, the stack trace of the thread currently holding >> the lock? I'm still trying to understand what is going on here since >> based on my understanding of the code so far, the HashMap is bounded (to >> the number of threads) and should reach that limit fairly quickly. > > Sorry, yes I did mean ERR_clear_error(). 
> > #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 > #1 0x7f0f61ab989d in __GI___pthread_mutex_lock (mutex=0x2b49c58) > at ../nptl/pthread_mutex_lock.c:80 > #2 0x7f0f3205f183 in int_thread_get (create=0) at err.c:446 > #3 0x7f0f3205f68d in int_thread_get_item (d=0x7f0ca89c7ce0) at err.c:491 > #4 0x7f0f32060094 in ERR_get_state () at err.c:1014 > #5 0x7f0f320602cf in ERR_clear_error () at err.c:747 > #6 0x7f0f325f3579 in ssl_socket_recv (sock=0x7f0dcc391980, > buf=0x7f0eec067820 > "lock->199808-Source_filename->rhino_perf_https_lt_100g_a-Loop->1-Count->11089487-11089488-11089489-11089490-11089491-11089492-11089493-11089494-11089495-11089496-11089497-11089498-11089499-11089500-11"..., > len=0x7f0ca89c7ff0) at src/sslnetwork.c:401 > #7 0x7f0f325ece99 in Java_org_apache_tomcat_jni_Socket_recvbb > (e=, o=, sock=, > offset=, len=) at src/network.c:957 > > I tried getting more data but the jvm tends to dump core when gdb is > attached or is going too slowly to actually cause the lock contention. I > can reliably see a thread waiting on this lock if I attach to a single > thread and randomly interrupt it and look at the back trace. When I > look at the mutex it has a different owner each time, so different > threads are getting the lock. > > I will play with this a bit more on Monday. > > -nate
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 5:31 PM, Mark Thomas wrote: > On 24/06/2016 21:52, therealnewo...@gmail.com wrote: > >>> I'm wondering if this is a problem that builds up over time. If I >>> understood your previous posts correctly, running the big tests >>> immediately gave ~700MB/s whereas running the small tests then the big >>> tests resulted in ~350MB/s during the big tests. Are you able to >>> experiment with this a little bit? For example, if you do big tests, 1M >>> (~20%) small tests, big tests, 1M small tests, big tests etc. What is >>> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests. >> >> Sure I can try that. For the in between tests do you want me to run >> those for a set amount of time or number of files? Like each smaller >> batch like 20min and then 10min of large and then next smaller size? > > I was thinking set number of files. > > It would also be useful to know how many threads the executor has created > at each point as well. (JMX should tell you that. You might need to > adjust the executor so it doesn't stop idle threads.).

I saw your message about not stopping idle threads after I already started things.

1st 100M test: 851.348MB/s
Executor:
  largestPoolSize: 25
  poolSize: 25
1st 4k test
Executor:
  largestPoolSize: 401
  poolSize: 401
2nd 100M test: 460.147MB/s
Executor:
  largestPoolSize: 401
  poolSize: 79
2nd 4k test
Executor:
  largestPoolSize: 414
  poolSize: 414
3rd 100M test: 429.127MB/s
Executor:
  largestPoolSize: 414
  poolSize: 80
3rd 4k test:
Executor:
  largestPoolSize: 414
  poolSize: 401
4th 100M test: 437.918MB/s
Executor:
  largestPoolSize: 414
  poolSize: 86
4th 4k test:
Executor:
  largestPoolSize: 414
  poolSize: 401
5th 100M test: 464.837MB/s
Executor:
  largestPoolSize: 414
  poolSize: 87

It looks like the problem occurs right after the first set of 4k puts and doesn't get any worse, so whatever causes the issue happens early.
This is getting stranger and I really cannot explain why calling ERR_remove_thread_state reliably improves performance.

> Going back to your original description, you said you saw blocking > during the call to ERR_clear_err(). Did you mean ERR_clear_error()? > Either way, could you provide the full stack trace of an example blocked > thread? And, ideally, the stack trace of the thread currently holding > the lock? I'm still trying to understand what is going on here since > based on my understanding of the code so far, the HashMap is bounded (to > the number of threads) and should reach that limit fairly quickly.

Sorry, yes I did mean ERR_clear_error().

#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x7f0f61ab989d in __GI___pthread_mutex_lock (mutex=0x2b49c58) at ../nptl/pthread_mutex_lock.c:80
#2  0x7f0f3205f183 in int_thread_get (create=0) at err.c:446
#3  0x7f0f3205f68d in int_thread_get_item (d=0x7f0ca89c7ce0) at err.c:491
#4  0x7f0f32060094 in ERR_get_state () at err.c:1014
#5  0x7f0f320602cf in ERR_clear_error () at err.c:747
#6  0x7f0f325f3579 in ssl_socket_recv (sock=0x7f0dcc391980, buf=0x7f0eec067820 "lock->199808-Source_filename->rhino_perf_https_lt_100g_a-Loop->1-Count->11089487-11089488-11089489-11089490-11089491-11089492-11089493-11089494-11089495-11089496-11089497-11089498-11089499-11089500-11"..., len=0x7f0ca89c7ff0) at src/sslnetwork.c:401
#7  0x7f0f325ece99 in Java_org_apache_tomcat_jni_Socket_recvbb (e=, o=, sock=, offset=, len=) at src/network.c:957

I tried getting more data but the jvm tends to dump core when gdb is attached or is going too slowly to actually cause the lock contention. I can reliably see a thread waiting on this lock if I attach to a single thread and randomly interrupt it and look at the back trace. When I look at the mutex it has a different owner each time, so different threads are getting the lock.

I will play with this a bit more on Monday.
-nate
Re: Bug that spans tomcat and tomcat-native
Nate, On 6/24/16 10:40 AM, therealnewo...@gmail.com wrote: > On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat wrote: >> 2016-06-24 12:08 GMT+02:00 Mark Thomas : >> >>> Thanks. >>> >>> I'm going to start some local performance testing to confirm I see >>> similar results and, assuming I do, I'll start looking at fixing this >>> for 1.2.x/9.0.x and back-porting. >>> >>> Hum, the fix that was submitted doesn't make sense IMO since writes can be >> async, so I don't see a way besides adding the "error clear" thing after >> each operation [and we'll remove it once OpenSSL 1.1 is there if it >> actually fixes it]. That's assuming this issue is real [I actually never >> noticed anything during my many ab runs and they use a lot of threads, so I >> have a hard time believing it is significant enough ;) ]. >> > > One thing about the system on which this is running is that it has a > 10G nic. So the slow case is about 350MB/s and the fast one is 700MB/s > so you would need a 10G interface or use loop back to even notice the > issue assuming the CPU on the system can push that much encrypted > data. I believe Jean-Frederic Clere (works with Rémy) has been using 10G interfaces for the Tomcat performance testing, including using APR. I don't have his slides handy... perhaps they show an unexplained performance drop when using APR/tcnative? Or perhaps they show no performance drop (relative to httpd). That would be interesting to see. -chris
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 21:52, therealnewo...@gmail.com wrote: >> I'm wondering if this is a problem that builds up over time. If I >> understood your previous posts correctly, running the big tests >> immediately gave ~700MB/s whereas running the small tests then the big >> tests resulted in ~350MB/s during the big tests. Are you able to >> experiment with this a little bit? For example, if you do big tests, 1M >> (~20%) small tests, big tests, 1M small tests, big tests etc. What is >> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests. > > Sure I can try that. For the in between tests do you want me to run > those for a set amount of time or number of files? Like each smaller > batch like 20min and then 10min of large and then next smaller size? I was thinking a set number of files. It would also be useful to know how many threads the executor has created at each point as well. (JMX should tell you that. You might need to adjust the executor so it doesn't stop idle threads.). >> What I am trying to pin down is how quickly does this problem build up. >> >> Also, do you see any failed requests or do they all succeed? > > All successes. OK. That rules out some possibilities. Going back to your original description, you said you saw blocking during the call to ERR_clear_err(). Did you mean ERR_clear_error()? Either way, could you provide the full stack trace of an example blocked thread? And, ideally, the stack trace of the thread currently holding the lock? I'm still trying to understand what is going on here since based on my understanding of the code so far, the HashMap is bounded (to the number of threads) and should reach that limit fairly quickly. Thanks, Mark
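[Editor's note] The executor adjustment Mark mentions ("so it doesn't stop idle threads") is controlled by the Executor element in server.xml. A sketch of such a configuration follows; `maxThreads` and `minSpareThreads` match the values quoted elsewhere in this thread, while the `name`, `namePrefix`, and `maxIdleTime` values are illustrative assumptions, not taken from the thread.

```xml
<!-- Sketch: raise maxIdleTime (milliseconds) so pool threads are not
     stopped between test runs, keeping JMX poolSize/largestPoolSize
     readings comparable across tests. The one-hour value is an
     assumption for illustration only. -->
<Executor name="tomcatThreadPool" namePrefix="catalina-exec-"
          maxThreads="500" minSpareThreads="4"
          maxIdleTime="3600000"/>
```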
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 4:52 PM,wrote: > On Fri, Jun 24, 2016 at 3:21 PM, Mark Thomas wrote: >> On 24/06/2016 20:01, therealnewo...@gmail.com wrote: >>> On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomas wrote: On 24/06/2016 18:41, Nate Clark wrote: > On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark wrote: >> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: >>> On 24/06/2016 18:25, Mark Thomas wrote: Can you provide the settings you are using for the Executor as well please? >> >> > maxThreads="500" minSpareThreads="4"/> >> >>> >>> And how long do the initial 5,000,000 4k requests take to process? >>> >> >> 40 minutes. >> > Not sure this matters but I just double checked and there are actually > 400 threads in total doing the 4k PUTs. Two clients each doing 200 > threads. the 100MB test is 24 threads total 12 per client machine. > > Sorry for misinformation earlier. No problem. Thanks for the information. One last question (for now). How many processors / cores / threads does the server support? I'm trying to get a handle on what the concurrency looks like. >>> >>> The machine has two physical chips each with 6 cores and >>> hyper-threading enabled, so 24 cores exposed to the OS. >> >> Thanks. >> >> >> >>> If it matters the system also has 256GB of memory. >> >> I don't think RAM is playing a role here but it is still good to know. >> >> In terms of next steps, I want to see if I can come up with a theory >> that matches what you are observing. From that we can then assess >> whether the proposed patch can be improved. >> >> Apologies for the drip-feeding of questions. As I learn a bit more, a >> few more questions come to mind. >> >> I'm wondering if this is a problem that builds up over time. If I >> understood your previous posts correctly, running the big tests >> immediately gave ~700MB/s whereas running the small tests then the big >> tests resulting in ~350MB/s during the big tests. Are you able to >> experiment with this a little bit? 
For example, if you do big tests, 1M >> (~20%) small tests, big tests, 1M small tests, big tests etc. What is >> the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests. > > Sure I can try that. For the in between tests do you want me to run > those for a set amount of time or number of files? Like each smaller > batch like 20min and then 10min of large and then next smaller size? Ignore that question. I misinterpreted your 1m to be 1MB. -nate
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 3:21 PM, Mark Thomaswrote: > On 24/06/2016 20:01, therealnewo...@gmail.com wrote: >> On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomas wrote: >>> On 24/06/2016 18:41, Nate Clark wrote: On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark wrote: > On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: >> On 24/06/2016 18:25, Mark Thomas wrote: >>> >>> Can you provide the settings you are using for the Executor as well >>> please? > > maxThreads="500" minSpareThreads="4"/> > >> >> And how long do the initial 5,000,000 4k requests take to process? >> > > 40 minutes. > Not sure this matters but I just double checked and there are actually 400 threads in total doing the 4k PUTs. Two clients each doing 200 threads. the 100MB test is 24 threads total 12 per client machine. Sorry for misinformation earlier. >>> >>> No problem. Thanks for the information. One last question (for now). How >>> many processors / cores / threads does the server support? I'm trying to >>> get a handle on what the concurrency looks like. >>> >> >> The machine has two physical chips each with 6 cores and >> hyper-threading enabled, so 24 cores exposed to the OS. > > Thanks. > > > >> If it matters the system also has 256GB of memory. > > I don't think RAM is playing a role here but it is still good to know. > > In terms of next steps, I want to see if I can come up with a theory > that matches what you are observing. From that we can then assess > whether the proposed patch can be improved. > > Apologies for the drip-feeding of questions. As I learn a bit more, a > few more questions come to mind. > > I'm wondering if this is a problem that builds up over time. If I > understood your previous posts correctly, running the big tests > immediately gave ~700MB/s whereas running the small tests then the big > tests resulting in ~350MB/s during the big tests. Are you able to > experiment with this a little bit? 
For example, if you do big tests, 1M > (~20%) small tests, big tests, 1M small tests, big tests etc. What is > the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests. Sure I can try that. For the in between tests do you want me to run those for a set amount of time or number of files? Like each smaller batch like 20min and then 10min of large and then next smaller size? > What I am trying to pin down is how quickly does this problem build up. > > Also, do you see any failed requests or do they all succeed? All successes. -nate
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 20:01, therealnewo...@gmail.com wrote: > On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomaswrote: >> On 24/06/2016 18:41, Nate Clark wrote: >>> On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark wrote: On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: > On 24/06/2016 18:25, Mark Thomas wrote: >> >> Can you provide the settings you are using for the Executor as well >> please? >>> maxThreads="500" minSpareThreads="4"/> > > And how long do the initial 5,000,000 4k requests take to process? > 40 minutes. >>> Not sure this matters but I just double checked and there are actually >>> 400 threads in total doing the 4k PUTs. Two clients each doing 200 >>> threads. the 100MB test is 24 threads total 12 per client machine. >>> >>> Sorry for misinformation earlier. >> >> No problem. Thanks for the information. One last question (for now). How >> many processors / cores / threads does the server support? I'm trying to >> get a handle on what the concurrency looks like. >> > > The machine has two physical chips each with 6 cores and > hyper-threading enabled, so 24 cores exposed to the OS. Thanks. > If it matters the system also has 256GB of memory. I don't think RAM is playing a role here but it is still good to know. In terms of next steps, I want to see if I can come up with a theory that matches what you are observing. From that we can then assess whether the proposed patch can be improved. Apologies for the drip-feeding of questions. As I learn a bit more, a few more questions come to mind. I'm wondering if this is a problem that builds up over time. If I understood your previous posts correctly, running the big tests immediately gave ~700MB/s whereas running the small tests then the big tests resulting in ~350MB/s during the big tests. Are you able to experiment with this a little bit? For example, if you do big tests, 1M (~20%) small tests, big tests, 1M small tests, big tests etc. What is the data rate for the big tests after 0, 1M, 2M, 3M, 4M and 5M little tests. 
What I am trying to pin down is how quickly this problem builds up. Also, do you see any failed requests or do they all succeed? Thanks again, Mark
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 2:07 PM, Mark Thomaswrote: > On 24/06/2016 18:41, Nate Clark wrote: >> On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark wrote: >>> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: On 24/06/2016 18:25, Mark Thomas wrote: > > Can you provide the settings you are using for the Executor as well > please? >>> >>> >> maxThreads="500" minSpareThreads="4"/> >>> And how long do the initial 5,000,000 4k requests take to process? >>> >>> 40 minutes. >>> >> Not sure this matters but I just double checked and there are actually >> 400 threads in total doing the 4k PUTs. Two clients each doing 200 >> threads. the 100MB test is 24 threads total 12 per client machine. >> >> Sorry for misinformation earlier. > > No problem. Thanks for the information. One last question (for now). How > many processors / cores / threads does the server support? I'm trying to > get a handle on what the concurrency looks like. > The machine has two physical chips each with 6 cores and hyper-threading enabled, so 24 cores exposed to the OS. 
cpuinfo for first core:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x2e
cpu MHz         : 1212.656
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bugs            :
bogomips        : 4788.98
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

If it matters the system also has 256GB of memory. -nate
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 18:41, Nate Clark wrote: > On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark wrote: >> On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: >>> On 24/06/2016 18:25, Mark Thomas wrote: Can you provide the settings you are using for the Executor as well please? >> >> > maxThreads="500" minSpareThreads="4"/> >> >>> >>> And how long do the initial 5,000,000 4k requests take to process? >>> >> >> 40 minutes. >> > Not sure this matters but I just double checked and there are actually > 400 threads in total doing the 4k PUTs. Two clients each doing 200 > threads. the 100MB test is 24 threads total 12 per client machine. > > Sorry for misinformation earlier. No problem. Thanks for the information. One last question (for now). How many processors / cores / threads does the server support? I'm trying to get a handle on what the concurrency looks like. Thanks, Mark
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 1:37 PM, Nate Clark wrote: > On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: >> On 24/06/2016 18:25, Mark Thomas wrote: >>> >>> Can you provide the settings you are using for the Executor as well please? > > maxThreads="500" minSpareThreads="4"/> > >> >> And how long do the initial 5,000,000 4k requests take to process? >> > > 40 minutes. > Not sure this matters but I just double checked and there are actually 400 threads in total doing the 4k PUTs. Two clients each doing 200 threads. The 100MB test is 24 threads total, 12 per client machine. Sorry for the misinformation earlier. -nate
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 1:27 PM, Mark Thomas wrote: > On 24/06/2016 18:25, Mark Thomas wrote: >> >> Can you provide the settings you are using for the Executor as well please? > And how long do the initial 5,000,000 4k requests take to process? > 40 minutes. -nate
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 18:25, Mark Thomas wrote: > On 24/06/2016 18:11, therealnewo...@gmail.com wrote: >> On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomaswrote: >>> On 24/06/2016 11:17, Rémy Maucherat wrote: 2016-06-24 12:08 GMT+02:00 Mark Thomas : > Thanks. > > I'm going to start some local performance testing to confirm I see > similar results and, assuming I do, I'll start looking at fixing this > for 1.2.x/9.0.x and back-porting. > Hum, the fix that was submitted doesn't make sense IMO since writes can be async, so I don't see a way besides adding the "error clear" thing after each operation [and we'll remove it once OpenSSL 1.1 is there if it actually fixes it]. That's assuming this issue is real [I actually never noticed anything during my many ab runs and they use a lot of threads, so I have a hard time believing it is significant enough ;) ]. >>> >>> I haven't been able to reproduce anything like this yet. So far I have >>> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to >>> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP. >>> >>> I'm having trouble understanding how this is happening. I could imagine >>> that HashMap becoming a problem if there was a high churn in Threads. >>> I'm thinking of something like bursty traffic levels and an executor >>> aggressively halting spare threads. I need to experiment with that as well. >>> >> >> I do not understand it either. Using the thread pool there is not much >> thread churn so I am not sure why the problem gets as bad as it does. >> I didn't look into what the hash table actually had in it. I just >> noticed that the majority of a read threads time was spent waiting for >> the lock to access this hash table. Once I added the call to >> ERR_remove_thread_state the waiting basically disappeared. >> >> For this test the traffic is constant. 
Each client thread creates one >> connection and just keeps pushing requests for set number of requests, >> so we aren't even creating new connections. > Can you provide the settings you are using for the Executor as well please? And how long do the initial 5,000,000 4k requests take to process? Thanks, Mark
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 18:11, therealnewo...@gmail.com wrote: > On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomaswrote: >> On 24/06/2016 11:17, Rémy Maucherat wrote: >>> 2016-06-24 12:08 GMT+02:00 Mark Thomas : >>> Thanks. I'm going to start some local performance testing to confirm I see similar results and, assuming I do, I'll start looking at fixing this for 1.2.x/9.0.x and back-porting. >>> Hum, the fix that was submitted doesn't make sense IMO since writes can be >>> async, so I don't see a way besides adding the "error clear" thing after >>> each operation [and we'll remove it once OpenSSL 1.1 is there if it >>> actually fixes it]. That's assuming this issue is real [I actually never >>> noticed anything during my many ab runs and they use a lot of threads, so I >>> have a hard time believing it is significant enough ;) ]. >> >> I haven't been able to reproduce anything like this yet. So far I have >> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to >> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP. >> >> I'm having trouble understanding how this is happening. I could imagine >> that HashMap becoming a problem if there was a high churn in Threads. >> I'm thinking of something like bursty traffic levels and an executor >> aggressively halting spare threads. I need to experiment with that as well. >> > > I do not understand it either. Using the thread pool there is not much > thread churn so I am not sure why the problem gets as bad as it does. > I didn't look into what the hash table actually had in it. I just > noticed that the majority of a read threads time was spent waiting for > the lock to access this hash table. Once I added the call to > ERR_remove_thread_state the waiting basically disappeared. > > For this test the traffic is constant. Each client thread creates one > connection and just keeps pushing requests for set number of requests, > so we aren't even creating new connections. 
Can you provide the settings you are using for the Executor as well please? Thanks, Mark
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomaswrote: > On 24/06/2016 11:17, Rémy Maucherat wrote: >> 2016-06-24 12:08 GMT+02:00 Mark Thomas : >> >>> Thanks. >>> >>> I'm going to start some local performance testing to confirm I see >>> similar results and, assuming I do, I'll start looking at fixing this >>> for 1.2.x/9.0.x and back-porting. >>> >> Hum, the fix that was submitted doesn't make sense IMO since writes can be >> async, so I don't see a way besides adding the "error clear" thing after >> each operation [and we'll remove it once OpenSSL 1.1 is there if it >> actually fixes it]. That's assuming this issue is real [I actually never >> noticed anything during my many ab runs and they use a lot of threads, so I >> have a hard time believing it is significant enough ;) ]. > > I haven't been able to reproduce anything like this yet. So far I have > only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to > test with 1.1.x and Tomcat 7.0.x, the versions used by the OP. > > I'm having trouble understanding how this is happening. I could imagine > that HashMap becoming a problem if there was a high churn in Threads. > I'm thinking of something like bursty traffic levels and an executor > aggressively halting spare threads. I need to experiment with that as well. > I do not understand it either. Using the thread pool there is not much thread churn so I am not sure why the problem gets as bad as it does. I didn't look into what the hash table actually had in it. I just noticed that the majority of a read threads time was spent waiting for the lock to access this hash table. Once I added the call to ERR_remove_thread_state the waiting basically disappeared. For this test the traffic is constant. Each client thread creates one connection and just keeps pushing requests for set number of requests, so we aren't even creating new connections. 
-nate
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat wrote: > 2016-06-24 12:08 GMT+02:00 Mark Thomas : > >> Thanks. >> >> I'm going to start some local performance testing to confirm I see >> similar results and, assuming I do, I'll start looking at fixing this >> for 1.2.x/9.0.x and back-porting. >> >> Hum, the fix that was submitted doesn't make sense IMO since writes can be > async, so I don't see a way besides adding the "error clear" thing after > each operation [and we'll remove it once OpenSSL 1.1 is there if it > actually fixes it]. That's assuming this issue is real [I actually never > noticed anything during my many ab runs and they use a lot of threads, so I > have a hard time believing it is significant enough ;) ]. I was not using async IO so I did not account for that in my patch. It was more a case of seeing if I could resolve this issue for my use case. -nate
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 11:37 AM,wrote: > On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas wrote: >> On 24/06/2016 11:17, Rémy Maucherat wrote: >>> 2016-06-24 12:08 GMT+02:00 Mark Thomas : >>> Thanks. I'm going to start some local performance testing to confirm I see similar results and, assuming I do, I'll start looking at fixing this for 1.2.x/9.0.x and back-porting. >>> Hum, the fix that was submitted doesn't make sense IMO since writes can be >>> async, so I don't see a way besides adding the "error clear" thing after >>> each operation [and we'll remove it once OpenSSL 1.1 is there if it >>> actually fixes it]. That's assuming this issue is real [I actually never >>> noticed anything during my many ab runs and they use a lot of threads, so I >>> have a hard time believing it is significant enough ;) ]. >> >> I haven't been able to reproduce anything like this yet. So far I have >> only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to >> test with 1.1.x and Tomcat 7.0.x, the versions used by the OP. >> >> I'm having trouble understanding how this is happening. I could imagine >> that HashMap becoming a problem if there was a high churn in Threads. >> I'm thinking of something like bursty traffic levels and an executor >> aggressively halting spare threads. I need to experiment with that as well. >> >> Nate, >> >> We need as much information as you can provide on how to reproduce this. >> As a minimum we need to know: >> - Connector configuration from server.xml >> - Operating system >> - How tc-native was built >> - Exact versions for everything >> >> We need enough information to recreate the test and the results >> that you obtained. 
> OS: Fedora 22 > tc-native: tomcat-native-1.1.34-1.fc22.x86_64 > tomcat: tomcat-7.0.68-3.fc22.noarch > > This issue was seen in older versions of tomcat: > tomcat-native-1.1.30-2.fc21 and tomcat-7.0.54-3.fc21

I forgot to give you the OpenSSL version: openssl-1.0.1k-15.fc22.x86_64

-nate
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 11:18 AM, Mark Thomas wrote: > On 24/06/2016 11:17, Rémy Maucherat wrote: >> 2016-06-24 12:08 GMT+02:00 Mark Thomas : >> >>> Thanks. >>> >>> I'm going to start some local performance testing to confirm I see >>> similar results and, assuming I do, I'll start looking at fixing this >>> for 1.2.x/9.0.x and back-porting. >>> >> Hum, the fix that was submitted doesn't make sense IMO since writes can be >> async, so I don't see a way besides adding the "error clear" thing after >> each operation [and we'll remove it once OpenSSL 1.1 is there if it >> actually fixes it]. That's assuming this issue is real [I actually never >> noticed anything during my many ab runs and they use a lot of threads, so I >> have a hard time believing it is significant enough ;) ]. > > I haven't been able to reproduce anything like this yet. So far I have > only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to > test with 1.1.x and Tomcat 7.0.x, the versions used by the OP. > > I'm having trouble understanding how this is happening. I could imagine > that HashMap becoming a problem if there was a high churn in Threads. > I'm thinking of something like bursty traffic levels and an executor > aggressively halting spare threads. I need to experiment with that as well. > > Nate, > > We need as much information as you can provide on how to reproduce this. > As a minimum we need to know: > - Connector configuration from server.xml > - Operating system > - How tc-native was built > - Exact versions for everything > > We need enough information to recreate the test and the results > that you obtained.

Connector configuration: Keepalive is enabled.

OS: Fedora 22
tc-native: tomcat-native-1.1.34-1.fc22.x86_64
tomcat: tomcat-7.0.68-3.fc22.noarch

This issue was seen in older versions of tomcat: tomcat-native-1.1.30-2.fc21 and tomcat-7.0.54-3.fc21

All of the builds are the RPMs released by Fedora from their build machines.
The test I ran performed about 5 million 4k requests and then did the large 100M requests and was able to see the issue immediately. -nate
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 15:40, therealnewo...@gmail.com wrote: > On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat wrote: >> 2016-06-24 12:08 GMT+02:00 Mark Thomas : >> >>> Thanks. >>> >>> I'm going to start some local performance testing to confirm I see >>> similar results and, assuming I do, I'll start looking at fixing this >>> for 1.2.x/9.0.x and back-porting. >>> >>> Hum, the fix that was submitted doesn't make sense IMO since writes can be >> async, so I don't see a way besides adding the "error clear" thing after >> each operation [and we'll remove it once OpenSSL 1.1 is there if it >> actually fixes it]. That's assuming this issue is real [I actually never >> noticed anything during my many ab runs and they use a lot of threads, so I >> have a hard time believing it is significant enough ;) ]. >> > > One thing about the system on which this is running is that it has a > 10G NIC. So the slow case is about 350MB/s and the fast one is 700MB/s > so you would need a 10G interface or use loopback to even notice the > issue assuming the CPU on the system can push that much encrypted > data.

Roughly how many requests do you have to do for the problem to appear?

Is keep-alive enabled or disabled?

(plus the questions from my other mail)

Mark
Re: Bug that spans tomcat and tomcat-native
On 24/06/2016 11:17, Rémy Maucherat wrote: > 2016-06-24 12:08 GMT+02:00 Mark Thomas: > >> Thanks. >> >> I'm going to start some local performance testing to confirm I see >> similar results and, assuming I do, I'll start looking at fixing this >> for 1.2.x/9.0.x and back-porting. >> > Hum, the fix that was submitted doesn't make sense IMO since writes can be > async, so I don't see a way besides adding the "error clear" thing after > each operation [and we'll remove it once OpenSSL 1.1 is there if it > actually fixes it]. That's assuming this issue is real [I actually never > noticed anything during my many ab runs and they use a lot of threads, so I > have a hard time believing it is significant enough ;) ]. I haven't been able to reproduce anything like this yet. So far I have only been testing with tc-native 1.2.x and Tomcat 9.0.x. I might need to test with 1.1.x and Tomcat 7.0.x, the versions used by the OP. I'm having trouble understanding how this is happening. I could imagine that HashMap becoming a problem if there was a high churn in Threads. I'm thinking of something like bursty traffic levels and an executor aggressively halting spare threads. I need to experiment with that as well. Nate, We need as much information as you can provide on how to reproduce this. As a minimum we need to know: - Connector configuration from server.xml - Operating system - How tc-native was built - Exact versions for everything We need enough information to recreate the test and the results that you obtained. Mark
Re: Bug that spans tomcat and tomcat-native
On Fri, Jun 24, 2016 at 6:17 AM, Rémy Maucherat wrote: > 2016-06-24 12:08 GMT+02:00 Mark Thomas : > >> Thanks. >> >> I'm going to start some local performance testing to confirm I see >> similar results and, assuming I do, I'll start looking at fixing this >> for 1.2.x/9.0.x and back-porting. >> >> Hum, the fix that was submitted doesn't make sense IMO since writes can be > async, so I don't see a way besides adding the "error clear" thing after > each operation [and we'll remove it once OpenSSL 1.1 is there if it > actually fixes it]. That's assuming this issue is real [I actually never > noticed anything during my many ab runs and they use a lot of threads, so I > have a hard time believing it is significant enough ;) ].

One thing about the system on which this is running is that it has a 10G NIC. So the slow case is about 350MB/s and the fast one is 700MB/s, so you would need a 10G interface or use loopback to even notice the issue, assuming the CPU on the system can push that much encrypted data.

-nate
Re: Bug that spans tomcat and tomcat-native
2016-06-24 12:08 GMT+02:00 Mark Thomas: > Thanks. > > I'm going to start some local performance testing to confirm I see > similar results and, assuming I do, I'll start looking at fixing this > for 1.2.x/9.0.x and back-porting. > > Hum, the fix that was submitted doesn't make sense IMO since writes can be async, so I don't see a way besides adding the "error clear" thing after each operation [and we'll remove it once OpenSSL 1.1 is there if it actually fixes it]. That's assuming this issue is real [I actually never noticed anything during my many ab runs and they use a lot of threads, so I have a hard time believing it is significant enough ;) ]. Rémy
Re: Bug that spans tomcat and tomcat-native
On 23/06/2016 22:31, Nate Clark wrote: > I tried to submit the bug but it seems that I am now unable to access > bz.apache.org. Since you made it seem like it was important for this > to be known about here is the info and patches.

Thanks. I'll chat with the infra folks and see if I can get whatever is going on with BZ for you fixed.

> When performing some benchmarking I noticed that the SSL performance > of large request reads degraded heavily when performed after millions > of small requests. Basically the setup is in a multi-threaded > environment, about 200 threads, performing PUT requests using SSL with > a body of about 4KB and then using 20 threads performing PUT requests > with a body of 100MB. If the small requests are not performed the > speed of the large requests in MB/s is about 2x.

Ouch. That is significant.

> I tracked down the issue to ERR_clear_error() blocking on an internal > lock which protects a hash map of the error states. It seems that the > hash map was growing unbounded because the states in it were never > being cleared by a thread when it had completed SSL operations. > > According to OpenSSL documentation, ERR_remove_thread_state() or > ERR_remove_state() (for versions of OpenSSL less than 1.1.0) needs to be > invoked prior to a thread exiting. This is not done by threads in the > native code, so the hash table keeps growing and getting larger and > larger and more expensive to maintain. > > By adding a new native call which invoked ERR_remove_thread_state and > calling it from AprEndpoint in tomcat I was able to reduce the > contention on the lock and the performance improved. > > Because of the thread pool I could not find a simple clean way to > invoke the cleanup before the thread dies so instead I added it to the > end of the socket processing. > > Here are the patches I used against tomcat-native 1.1.34 and tomcat70:

Thanks.
I'm going to start some local performance testing to confirm I see similar results and, assuming I do, I'll start looking at fixing this for 1.2.x/9.0.x and back-porting. Mark
Re: Bug that spans tomcat and tomcat-native
On Thu, Jun 23, 2016 at 3:32 PM, Mark Thomas wrote: > On 23/06/2016 20:28, Nate Clark wrote: >> >> If you want details about the issue I am happy to provide them but >> just didn't want to duplicate explaining it in a bug and in an email >> thread. > > To be honest, it doesn't really matter which product it is filed under. > It is more important to get it filed. Even more so since a tc-native > release is planned for tomorrow. Just file it under tc-native and > someone will take a look.

Hi Mark,

I tried to submit the bug but it seems that I am now unable to access bz.apache.org. Since you made it seem like it was important for this to be known about, here is the info and patches.

When performing some benchmarking I noticed that the SSL performance of large request reads degraded heavily when performed after millions of small requests. Basically the setup is in a multi-threaded environment, about 200 threads, performing PUT requests using SSL with a body of about 4KB and then using 20 threads performing PUT requests with a body of 100MB. If the small requests are not performed the speed of the large requests in MB/s is about 2x.

I tracked down the issue to ERR_clear_error() blocking on an internal lock which protects a hash map of the error states. It seems that the hash map was growing unbounded because the states in it were never being cleared by a thread when it had completed SSL operations.

According to OpenSSL documentation, ERR_remove_thread_state() or ERR_remove_state() (for versions of OpenSSL less than 1.1.0) needs to be invoked prior to a thread exiting. This is not done by threads in the native code, so the hash table keeps growing and getting larger and larger and more expensive to maintain.

By adding a new native call which invokes ERR_remove_thread_state and calling it from AprEndpoint in Tomcat I was able to reduce the contention on the lock and the performance improved.
Because of the thread pool I could not find a simple clean way to invoke the cleanup before the thread dies, so instead I added it to the end of the socket processing.

Here are the patches I used against tomcat-native 1.1.34 and tomcat70:

From dd4a8dcdfbb863f8e8ba5ed24070e60200421082 Mon Sep 17 00:00:00 2001
From: Nate Clark
Date: Thu, 23 Jun 2016 16:51:04 -0400
Subject: [PATCH] Add threadCleanup function to free memory footprint of a thread

OpenSSL allocates memory for an error queue for each thread which uses
the SSL functions. Prior to version 1.1.0 the memory needs to be
released manually by invoking either ERR_remove_state or
ERR_remove_thread_state. If this is not done the memory footprint grows
larger, which slows down the use of the error queues. By adding the new
threadCleanup function, threads can now release this memory prior to
thread exit.
---
 native/src/ssl.c | 24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/native/src/ssl.c b/native/src/ssl.c
index ded7b22..5875a68 100644
--- a/native/src/ssl.c
+++ b/native/src/ssl.c
@@ -272,6 +272,14 @@ TCN_IMPLEMENT_CALL(jstring, SSL, versionString)(TCN_STDARGS)
     return AJP_TO_JSTRING(OPENSSL_VERSION_TEXT);
 }
 
+static void ssl_err_remove(void) {
+#if (OPENSSL_VERSION_NUMBER < 0x1000L) || defined(OPENSSL_USE_DEPRECATED)
+    ERR_remove_state(0);
+#elif (OPENSSL_VERSION_NUMBER < 0x1010L)
+    ERR_remove_thread_state(NULL);
+#endif
+}
+
 /*
  * the various processing hooks
  */
@@ -310,11 +318,7 @@ static apr_status_t ssl_init_cleanup(void *data)
 #if OPENSSL_VERSION_NUMBER >= 0x00907001
     CRYPTO_cleanup_all_ex_data();
 #endif
-#if (OPENSSL_VERSION_NUMBER < 0x1010L) || defined(OPENSSL_USE_DEPRECATED)
-    ERR_remove_state(0);
-#else
-    ERR_remove_thread_state(NULL);
-#endif
+    ssl_err_remove();
 
     /* Don't call ERR_free_strings here; ERR_load_*_strings only
      * actually load the error strings once per process due to static
@@ -1105,6 +1109,11 @@ TCN_IMPLEMENT_CALL(jboolean, SSL, hasOp)(TCN_STDARGS, jint op)
     return op == (op & supported_ssl_opts) ? JNI_TRUE : JNI_FALSE;
 }
 
+TCN_IMPLEMENT_CALL(void, SSL, threadCleanup)(TCN_STDARGS) {
+    UNREFERENCED(o);
+    ssl_err_remove();
+}
+
 #else
 /* OpenSSL is not supported.
  * Create empty stubs.
@@ -1231,4 +1240,9 @@ TCN_IMPLEMENT_CALL(jboolean, SSL, hasOp)(TCN_STDARGS, jint op)
     UNREFERENCED(op);
     return JNI_FALSE;
 }
+
+TCN_IMPLEMENT_CALL(void, SSL, threadCleanup)(TCN_STDARGS) {
+    UNREFERENCED_STDARGS;
+}
+
 #endif
-- 
2.5.5

From c5c1756153279e87f11594c0e053d5301ed121e7 Mon Sep 17 00:00:00 2001
From: Nate Clark
Date: Thu, 23 Jun 2016 16:57:18 -0400
Subject: [PATCH] Invoke SSL.threadCleanup when done with socket processing

When a thread is done processing an SSL socket it should invoke
SSL.threadCleanup() to free memory associated with that thread in the
SSL library. If this is not done the memory footprint bloats and each
subsequent invocation will become slightly slower as there
Re: Bug that spans tomcat and tomcat-native
On 23/06/2016 20:28, Nate Clark wrote: > Hello, > > I encountered a performance issue when using the APR protocol. I > tracked it down to basically missing a piece of the integration with > OpenSSL. I was able to fix the performance issue but it required a > change to both tomcat-native and tomcat. I was going to file a bug > with my patches but tomcat-native and tomcat are separate projects so > I was wondering the best way to handle that. File an issue in > tomcat-native and one in tomcat referencing the other bug? Or just use > this mailing list to discuss the issue? > > If you want details about the issue I am happy to provide them but > just didn't want to duplicate explaining it in a bug and in an email > thread.

To be honest, it doesn't really matter which product it is filed under. It is more important to get it filed. Even more so since a tc-native release is planned for tomorrow. Just file it under tc-native and someone will take a look.

Mark
Bug that spans tomcat and tomcat-native
Hello,

I encountered a performance issue when using the APR protocol. I tracked it down to basically missing a piece of the integration with OpenSSL. I was able to fix the performance issue but it required a change to both tomcat-native and tomcat. I was going to file a bug with my patches but tomcat-native and tomcat are separate projects, so I was wondering the best way to handle that. File an issue in tomcat-native and one in tomcat referencing the other bug? Or just use this mailing list to discuss the issue?

If you want details about the issue I am happy to provide them but just didn't want to duplicate explaining it in a bug and in an email thread.

Thanks,
-nate