Re: [naviserver-devel] A selection of messages appearing in our error logs

2023-11-08 Thread Brian Fenton
Hi Gustaf

thank you so much for such a comprehensive reply. I feel like I just increased 
my IQ by a few percentage points! A couple of follow-up questions, if I may:

1. "sockAccept accepted 2 connections" - this was very useful to know, I will 
start keeping an eye out for values > 10. It would be very useful to be able to 
configure the threshold at which it appears in the log, as I mentioned we see a 
huge amount of these (over 2000 in the last few hours on our test server, 
almost all of them for 2 connections).
Also regarding configuring a second driver thread - just to be clear are you 
referring to this mail you sent back in 2016 
https://sourceforge.net/p/naviserver/mailman/message/35502664/  i.e. enable 
reuseport and set driverthreads = 2? It's unclear to me if this has benefit 
when running Naviserver in a Docker container - definitely beyond my level of 
comprehension. See 
https://github.com/moby/moby/issues/7536#issuecomment-1039443599 for discussion 
of SO_REUSEPORT and Docker.

2. "ns_cache create entry collision cache util_memoize key 
'__util_memoize_installed_p" - again very helpful. We actually have been 
partitioning our caches, and try not to add anything new to util-memoize cache. 
In fact, this particular message about  '__util_memoize_installed_p' is the 
only cache collision notice we are seeing. Your comment helped me to track down 
to issue to the "util_memoize_initialized_p" and "apm_package_installed_p" 
procs, and I can see the new improved versions in later OpenACS versions, so I 
will try those out.

3. "exiting: exceeded max connections per thread" - actually we used to have 
"connsperthread" set to 1, but I reduced to 1000 when we moved to Linux, as 
I noticed that was the value in latest OpenACS config, even though the default 
is still 1 
https://bitbucket.org/naviserver/naviserver/src/main/openacs-config.tcl  Is 
1000 the recommended value for OpenACS?

thanks again
Brian

Re: [naviserver-devel] test ns_config-7.4.1, Stack around the variable 'filter' was corrupted

2023-11-08 Thread Gustaf Neumann

On 07.11.23 22:56, Andrew Piskorski wrote:

On the current NaviServer head, that problem has gone away!  The
ns_config-7.4.1 test now runs fine for me on Windows.


You see: time heals all wounds! -g


Re: [naviserver-devel] A selection of messages appearing in our error logs

2023-11-08 Thread Gustaf Neumann

Hi Brian,


In general, we try to follow the guideline [1] for logging severities: 
"Notice" is informational; "warning" is something unexpected or unwanted 
(hitting some limits, data that looks like an attack, ...); "error" is 
something that should be looked at by an engineer.



    [1] https://openacs.org/xowiki/logging-conventions
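

For illustration, a minimal sketch of how these severities map onto ns_log 
calls (the messages here are made up):

    # informational: normal operation, nothing to do
    ns_log notice "sockAccept accepted 2 connections"
    # unexpected/unwanted: hitting a limit, suspicious input, ...
    ns_log warning "request exceeds configured upload limit"
    # needs attention from an engineer
    ns_log error "could not obtain a database handle"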


below, I comment inline.



On 07.11.23 18:28, Brian Fenton wrote:

Hi

with the goal of improving the signal-to-noise ratio in our error logs, and 
also to increase my understanding of NaviServer internals, I have a few 
questions about various messages we are seeing regularly in our logs.


 1. [06/Nov/2023:14:58:08][77320.7f7b9a7fd640][-driver:nsssl:0-]
Notice: ... sockAccept accepted 2 connections
this is by far the most common entry in our logs - we see a LOT of
these. Are they merely informative or indicative of some tuning
action we could take? I don't see any way to disable them. I would
love to remove these.



This indicates a stress situation on the server, where multiple requests 
come in at the same time. The value of 2 is not a concern; if you see values 
like 10 or 20, you should consider a change in your configuration (e.g., 
configuring a second driver thread). It would be possible to define a 
configuration value for the network driver to set the threshold at which the 
notice is written to the log file.
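

For example, a second driver thread plus SO_REUSEPORT can be enabled in the 
driver section of the configuration file. A minimal sketch, assuming the 
nsssl driver from the log lines above (the section path and values depend on 
your setup):

    ns_section ns/server/$server/module/nsssl
    # allow several listen sockets on the same port (SO_REUSEPORT),
    # a prerequisite for multiple driver threads
    ns_param reuseport true
    # let two driver threads accept incoming connections
    ns_param driverthreads 2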




2.
[06/Nov/2023:14:57:28][77320.7f7b63fff640][-conn:openacs:default:6:6292-]
Notice: ns_cache create entry collision cache util_memoize key
'__util_memoize_installed_p', no timeout
We also see quite a few of these, always for
'__util_memoize_installed_p', and always with 'no timeout'. From
looking at cache.c, I see that this means that NaviServer failed
to create the cache entry. I checked the 'util_memoize' cache
utilization in case the cache was full, but it's quite low. What
other cause could there be for this?

This means that concurrent operations are being executed to obtain the same 
cache value, and that the second request (here 
conn:openacs:default:6:6292) has to wait until the request holding the lock 
has finished. The problem can become serious when the lock-holding request 
"hangs", i.e., takes a very long time to finish. In that case, more and more 
caching requests for the same entry will pile up, potentially until a restart, 
since no timeout was specified. The warning was introduced to give the 
developer a hint as to why the server is suddenly hanging (before the message 
was introduced, the cause was very hard to determine).
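

To avoid such unbounded pile-ups, a timeout can be passed when computing the 
cached value. A minimal sketch (the cache name, key, and the helper proc 
compute_expensive_value are made up):

    # wait at most 2 seconds for a concurrent computation of the same
    # entry; on timeout, the call raises an error instead of queuing forever
    set value [ns_cache_eval -timeout 2.0 -- mycache mykey {
        compute_expensive_value
    }]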



This message also tells me something else:
* It looks as if the installation is based on a very old version of 
openacs-core (including, in particular, acs-tcl). For at least 9 
years, the code has performed this test just at startup. Upgrading to a 
recent version will improve performance and increase security.



* The message also tells me that the application puts stress on the 
util-memoize cache. Old versions of OpenACS used this "util-memoize" 
cache as a kitchen-sink cache for everything. I've seen instances with 
300K+ entries in this cache. The size itself is not the problem, but 
there were/are many operations in OpenACS that obtain all keys from the 
cache. This means that a lock is acquired, a list with 300K+ entries is 
built and handed to the interpreter, and only then is the cache released. 
In the meantime, *all* caching requests for this cache are halted, 
causing degraded performance and limited concurrency in high-load 
situations. I would not be surprised if you see high lock times on 
this cache.
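

You can check this e.g. via the cache and mutex statistics (a sketch; the 
exact mutex names and output fields depend on the NaviServer version):

    # usage statistics of the cache (entries, hit rate, ...)
    ns_cache_stats util_memoize
    # mutex statistics: the entry for the util_memoize cache shows the
    # number of lock operations, busy locks, and total wait time
    foreach lock [ns_info locks] {
        if {[string match *util_memoize* [lindex $lock 0]]} {
            ns_log notice $lock
        }
    }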



There are caches other than the util-memoize cache where cache collisions 
might happen. See e.g. [2] for more background. In this situation, cache 
partitioning is the instrument for scaling.



   [2] https://openacs.org/xowiki/cache-size-tuning
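

One way to take load off the kitchen-sink cache is to give an application its 
own cache. A minimal sketch (the cache name, size, and helper proc are made 
up):

    # create a dedicated cache at server startup (size in bytes)
    ns_cache_create mypackage_cache [expr {10 * 1024 * 1024}]

    # ... later, at request time, memoize into the dedicated cache
    set value [ns_cache_eval mypackage_cache $key {
        compute_value $key
    }]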



 3. These 5 appear usually as a group:
[07/Nov/2023:13:46:07][38.7f352cff9640][-conn:openacs:default:3:4088-]
Notice: exiting: exceeded max connections per thread
[07/Nov/2023:13:46:07][38.7f3525b84640][-driver:nsssl:0-] Notice:
NsEnsureRunningConnectionThreads wantCreate 1 waiting 1 idle 0
current 2
[07/Nov/2023:13:46:07][38.7f352d7fa640][-conn:openacs:default:5:0-]
Notice: start update interpreter openacs to epoch 1, concurrent 1
[07/Nov/2023:13:46:08][38.7f352d7fa640][-conn:openacs:default:5:0-]
Notice: update interpreter openacs to epoch 1 done, trace none,
time 0.373006 secs concurrent 1
[07/Nov/2023:13:46:08][38.7f352d7fa640][-conn:openacs:default:5:0-]
Notice: thread initialized (0.388878 secs)
We see these pretty regularly, over 50 per day on a local
development server. "maxconnections" is set to the default value,
which is 100, I believe. "maxthreads" and "minthreads" are also
set to default values. What could be causing these so regularly
on a system with 1 logged-in user?
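

The recycling itself is driven by the connsperthread parameter: a connection 
thread exits after serving that many requests, and the subsequent log lines 
show the replacement thread being created and initialized. A sketch of the 
relevant server parameters (the values here are illustrative, not a 
recommendation):

    ns_section ns/server/$server
    # a connection thread exits after serving this many requests
    ns_param connsperthread 1000
    # lower/upper bounds for the number of connection threads
    ns_param minthreads 5
    ns_param maxthreads 100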