Re: Our webapp is running very slowly on one particular customer box

2020-10-29 Thread Christopher Schultz

James,

On 10/28/20 16:40, James H. H. Lampert wrote:

First, thanks once again, Mr. Schultz, for getting back to me.

I noticed something rather promising: it seems that maxThreads for the 
Port 443 connector was set at 150 for System "A" (problem box), but 400 
for System "J" (box that's quite happy).


I've restarted Tomcat with the maxThreads bumped up to 400, and so far, 
it seems much happier than it was. That could have been the problem all 
along.


Hmm. That doesn't sound very satisfying to me, honestly. Allowing *more* 
concurrent load leads to *less* GC and/or fewer page faults? Seems fishy.


My colleagues and I also observed that yesterday, when we did *not* shut 
down and restart, the slowdown and the nearly-full "tenured-SOA" portion 
of the heap eventually resolved themselves, which suggests that it wasn't a 
memory leak in any even remotely conventional sense of the term.


That's a Good Thing, but also not very satisfying when you just want it 
to stop sucking and let your users get work done :)


The page-faulting is a virtual memory term: on an AS/400, the entire 
combined total of main storage and disk is addressable (the concept is 
called "Single-Level Store"), and virtual storage paging is built into 
the OS at a very low level; a "page fault" is when a process tries 
to access something that's been paged out to disk.


Yes, this is the common definition of a page-fault, not just an AS/400 
thing. Good to know for sure that AS/400 doesn't re-define that term, 
though :)


How long has the process on System J been running? How about System A 
(before you restarted the JVM)?


As to the private memory pool, it's not that the subsystem is restricted 
to its private pool; rather, everything else is kept *out* of that 
private pool. It still has full access to the "Machine" and "Base" 
shared pools.



Okay, so it's like a guaranteed-minimum memory space?

-chris




Re: Our webapp is running very slowly on one particular customer box

2020-10-28 Thread James H. H. Lampert

First, thanks once again, Mr. Schultz, for getting back to me.

I noticed something rather promising: it seems that maxThreads for the 
Port 443 connector was set at 150 for System "A" (problem box), but 400 
for System "J" (box that's quite happy).


I've restarted Tomcat with the maxThreads bumped up to 400, and so far, 
it seems much happier than it was. That could have been the problem all 
along.
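
(For reference, the setting in question is the maxThreads attribute on the 
HTTPS connector in conf/server.xml, along these lines; the other attributes 
shown here are only placeholders, not copied from either system:

    <Connector port="443" protocol="HTTP/1.1"
               SSLEnabled="true" scheme="https" secure="true"
               maxThreads="400"
               ... keystore/certificate attributes as before ... />
)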


My colleagues and I also observed that yesterday, when we did *not* shut 
down and restart, the slowdown and the nearly-full "tenured-SOA" portion 
of the heap eventually resolved themselves, which suggests that it wasn't a 
memory leak in any even remotely conventional sense of the term.


The page-faulting is a virtual memory term: on an AS/400, the entire 
combined total of main storage and disk is addressable (the concept is 
called "Single-Level Store"), and virtual storage paging is built into 
the OS at a very low level; a "page fault" is when a process tries 
to access something that's been paged out to disk.


As to the private memory pool, it's not that the subsystem is restricted 
to its private pool; rather, everything else is kept *out* of that 
private pool. It still has full access to the "Machine" and "Base" 
shared pools.


--
JHHL




Re: Our webapp is running very slowly on one particular customer box

2020-10-28 Thread Christopher Schultz

James,

On 10/27/20 16:20, James H. H. Lampert wrote:

This is related to my query (thanks, Mr. Gregg) about "Tenured SOA."

It seems that on one of our customer installations, our webapp gets into 
a state of running very slowly, and the dedicated subsystem it's running 
in is showing massive levels of page-faulting.


I've compared the GC stats of the "problem" system with one that's 
actually got more users connected, but isn't experiencing performance 
issues. It seems that they're both going to GC about every 30-50 
seconds, but GC on the "problem" system appears to be somewhat less 
effective.


Also, I've looked at the threads on both. On the system that is behaving 
normally, the "GC Slave" threads (7 of them) are showing total CPU (at 
this moment) of around 150 seconds each, and Aux I/O of mostly zero, 
with one showing 1 and one showing 3. Conversely, on the "problem" 
system, I'm seeing 15(!) GC Slave threads, each with total CPU under 6 
seconds each, but Aux I/O ranging from 5800 to over 8000.


I'm not sure what to make of this. In both cases, Tomcat's JVM is 
running in a subsystem of its own, with a private memory pool of around 
7G, and they're both running with -Xms4096m -Xmx5120m.


If you expect the service to be long-running, definitely set Xms=Xmx. 
There's no reason to artificially restrict the heap "early" in the 
process's lifetime only to completely re-size and re-organize the heap 
over time. You may as well allocate the maximum right up front and leave 
it that way.
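
For example, assuming the JVM options are passed the usual way via 
CATALINA_OPTS in bin/setenv.sh (the IBM i launcher may do this 
differently), matching -Xms to your current -Xmx would look like:

    CATALINA_OPTS="$CATALINA_OPTS -Xms5120m -Xmx5120m"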


The problem system certainly appears to be thrashing its GC. Are there 
any environmental differences that you notice about the two systems? For 
a JVM with a maximum heap of ~5GiB, I think that a 7GiB private memory 
space (this is an AS/400 thing, isn't it?) isn't large enough. The heap 
space is just the "Java heap", and there are other things that need 
memory, sometimes comparable in size to the heap. It's surprising how 
much "native" memory a JVM needs. Is the kernel+userspace running in 
that "subsystem" as well? Or just the JVM process?


I'm guessing that your comment about page-faulting and "Aux I/O rang[es] 
from 5800 - 8000 [sec]" means that you are actually paging the heap to 
the disk. What happens if you shrink your max-heap to 2GiB and change 
nothing else? This should make sure that your heap + native memory fits 
into physical memory and that thrashing should stop.
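
Concretely, that means dropping both heap flags (since -Xms can't exceed 
-Xmx), e.g.:

    -Xms2048m -Xmx2048m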


Maybe you *do* need a 5GiB heap, though. In that case, if the 
heap-shrink works but you get OOMEs under load, then I think that simply 
increasing the memory allocated to the "subsystem" should help a lot.


How much (real) memory does the system report is being used by the JVM 
process? I think you'll find it much larger than 5GiB.


-chris




Our webapp is running very slowly on one particular customer box

2020-10-27 Thread James H. H. Lampert

This is related to my query (thanks, Mr. Gregg) about "Tenured SOA."

It seems that on one of our customer installations, our webapp gets into 
a state of running very slowly, and the dedicated subsystem it's running 
in is showing massive levels of page-faulting.


I've compared the GC stats of the "problem" system with one that's 
actually got more users connected, but isn't experiencing performance 
issues. It seems that they're both going to GC about every 30-50 
seconds, but GC on the "problem" system appears to be somewhat less 
effective.


Also, I've looked at the threads on both. On the system that is behaving 
normally, the "GC Slave" threads (7 of them) are showing total CPU (at 
this moment) of around 150 seconds each, and Aux I/O of mostly zero, 
with one showing 1 and one showing 3. Conversely, on the "problem" 
system, I'm seeing 15(!) GC Slave threads, each with total CPU under 6 
seconds each, but Aux I/O ranging from 5800 to over 8000.


I'm not sure what to make of this. In both cases, Tomcat's JVM is 
running in a subsystem of its own, with a private memory pool of around 
7G, and they're both running with -Xms4096m -Xmx5120m.


--
JHHL
