[Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

2023-05-25 Thread Michael Smith
Hi Peter & Joshua,

Thanks for getting back to me; here are some replies to your posts.

> memory allocation for app & solr

ArchivesSpace 3.2.0 with a 35G heap
Solr 8.11.1 with a 512M heap

> The plugins that you are using probably aren't the culprit, but they can 
> add/override index functionality, so listing those out may help as well.

These are the plugins that are enabled on our Dev / Test / Prod ArchivesSpace.


  *   nla_staff_spreadsheet_importer 3.1 running a fork of https://github.com/hudmol/nla_staff_spreadsheet_importer
  *   nla_accession_events 0.2 running a fork of https://github.com/hudmol/accession_events
  *   as_reftracker 1.0
  *   nla_accessions_summary_reports 3.0 running a fork of https://github.com/hudmol/accessions_summary_reports
  *   archivesspace_local - NLA custom plugin only for various AppConfig and other customisation (locales/en.yml)
  *   nla_accession_reports running a fork of https://github.com/hudmol/nla_accession_reports/
  *   as_spreadsheet_bulk_updater 1.5.2


> Couple of things that sprang to mind to check (if you haven't already). Have 
> you noticed this same behavior in an instance that is not in use? IE have you 
> set up a clone of your production instance, let it do its initial full index, 
> and then just let it sit? Do you see errors in the app log that have any 
> bearing on the problem or pop up around or just before the app goes 
> unresponsive or OOM?

We were seeing identical issues in our dev and test instances of ArchivesSpace,
though a little less frequently (fewer concurrent editing users).

I have found the cause of our troubles: we upgraded to MySQL 8 in July/August
last year, and at the time our DBA / Systems Administrators added 
=UTC to the connection string in our config.

I’ve confirmed the behaviour in our test instance by starting and then stopping
edits to a record (in this case around 2023-05-26 13:27:41 AEDT). While editing,
the active_edit table correctly incremented the timestamp by 10 seconds
(INTERVAL_PERIOD in frontend/app/assets/javascripts/update_monitor.js). However,
after the edits were stopped, a new row with a timestamp 10 hours in the future
was added to the table. As the frontend and backend continued to sync their
copies of the active_edit table, the number of rows kept increasing by one, each
new row a further 10 hours ahead. Only a minute or so after exiting the record
the table had the following rows, and it would continue to add new rows with
timestamps further into the future. Active edits are meant to expire when their
timestamp is less than the current time minus 30 seconds (EXPIRE_SECONDS in
backend/app/model/active_edit.rb), but the expiry couldn’t keep pace with the
rows being created.

'17212459','/repositories/2/accessions/5133','mismith','2023-05-28 05:27:41'
'17212458','/repositories/2/accessions/5133','mismith','2023-05-27 19:27:41'
'17212457','/repositories/2/accessions/5133','mismith','2023-05-27 09:27:41'
'17212456','/repositories/2/accessions/5133','mismith','2023-05-26 23:27:41'
'17212455','/repositories/2/accessions/5133','mismith','2023-05-26 13:27:41'
'17212455','/repositories/2/accessions/5133','mismith','2023-05-26 03:27:41'
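
To illustrate why those future-dated rows never expire: the 30-second cleanup
effectively reduces to something like the query below (just a sketch; I’m
assuming the timestamp column shown above is literally called `timestamp`, so
check it against the actual schema). A row stamped hours or years in the future
never satisfies the condition, so the table only grows.

-- sketch of the EXPIRE_SECONDS (30 second) expiry condition
DELETE FROM active_edit WHERE `timestamp` < NOW() - INTERVAL 30 SECOND;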

I’ve confirmed that our production instance has just over 15 million records in
the active_edit table, with timestamps as far in the future as 2150. We’re
planning maintenance next week to stop ArchivesSpace and clear the table, and
we’ve also updated the time zone in our connection string to the correct value,
=Australia/Sydney (which now matches our server time zone). We’re also going to
take ArchivesSpace up to v3.3.1 at the same time.
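
For the clear-out itself, with ArchivesSpace stopped, something along these
lines should be enough (again only a sketch with the column name assumed; as
far as I can tell active_edit only holds transient edit locks, so emptying it
while the application is down is safe):

-- remove the runaway future-dated rows
DELETE FROM active_edit WHERE `timestamp` > NOW();
-- or simply empty the table
TRUNCATE TABLE active_edit;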

These queries helped when it came to working out system / global time zone 
settings.

SELECT @@GLOBAL.time_zone, @@SESSION.time_zone;
SELECT @@system_time_zone;
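
Comparing the session’s idea of the current time against UTC also makes any
offset obvious:

SELECT NOW(), UTC_TIMESTAMP();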

Hopefully I’ve been able to describe that behaviour clearly enough, but feel
free to ask me any further questions.


Michael Smith  |  Software Developer
02 6262 1029  |  mism...@nla.gov.au  |  National 
Library of Australia
The National Library of Australia acknowledges Australia’s First Nations 
Peoples – the First Australians – as the Traditional Owners and Custodians of 
this land and gives respect to the Elders – past and present – and through them 
to all Australian Aboriginal and Torres Strait Islander people.


Re: [Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

2023-05-24 Thread Peter Heiner

Michael Smith wrote on 2023-05-23 23:52:45:
> Our team has been facing recurring issues with our ArchivesSpace setup since
> October last year, which we've been unable to fully resolve despite
> concerted efforts.

Could pretty much be a description of our team up until very recently, when a
team member was finally able to hook the application up with Datadog's tracing
facilities.

We're currently running 3.1.1 with a standalone Solr 7.7.3 and MariaDB 10.4.24
on Ubuntu 20.04/22.04 servers.

> The primary problem involves intermittent system slowdowns and shutdowns,
> requiring frequent reboots to regain functionality. This occurs on average
> 3-4 times weekly but can sometimes be more frequent. This issue is affecting
> multiple teams across our organization.

Using the tracing facilities mentioned above, we've found that the object
resolver in ArchivesSpace does not deduplicate the object tree properly. As a
result, a resource we had with over 1100 event links produced a 130MB+ JSON
object that was subsequently parsed into 1.3GB of Ruby data, and due to a quirk
of rendering all of this was done twice. We reported this on GitHub
(https://github.com/archivesspace/archivesspace/issues/2993) 3 weeks ago.
The events were not very important to our archivists, so we ended up deleting
them.

We've also found that search is suboptimal for us. Searches take exponentially
longer with every added term, and for every search thousands of requests are
made to populate the 'Found in' column of the results. We're on an old version
of Solr and are using a fairly old schema, so we want to upgrade both before we
report this issue.

We've also noticed that database queries trying to update the archivesspace
software agent's system_mtime are failing, and we've found that the row has not
been updated since we switched from 2.8.1 to 3.1.1. Possibly linked to this...

> The most common symptom of our problem that we are seeing now looks to be a
> connection pool leak, where what appear to be indexer threads are holding
> connections in a closed wait state and preventing them from being used for
> other requests. This leads to the main page timing out and staff seeing 504
> errors; when unresponsive in this manner we usually restart the application.

...our main problem: users are unable to save records because the updates time
out waiting for locks. Looking at the database processlist we've observed 2-3
instances of identical update queries in different sessions, and at the tracing
level the queries retry several times before failing on their LIMIT 1 clause,
as there are no rows to update. We don't fully understand this problem yet;
seeing your message, that might be because we don't see the indexer threads in
the traces, as they run on a different host.
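
For reference, the processlist check is nothing exotic, along the lines of:

SHOW FULL PROCESSLIST;
-- or, filtered to sessions that are actually doing something:
SELECT id, user, time, state, info
  FROM information_schema.PROCESSLIST
 WHERE command <> 'Sleep';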

> Some of the things we’ve attempted so far,
> 
>   *   changed default config settings for indexer records per thread, thread
>   count and solr timeout to 10, 2 & 300


>   *   modified archivesspace.sh to increase memory available
>   (ASPACE_JAVA_XMX="-Xmx35g")

We're on 56GB of heap now. We have ~3.2 million objects in the database across
~30 repositories; I believe this to be one of the larger installations of AS
out there.

>   *   disabled both PUI and PUI indexer

We've actually been thinking of doing this; we currently have the indexer on a
separate host. Does disabling the indexers impact the visibility of changes in
any way for you?

> Any advice with further diagnosis / troubleshooting would be appreciated. If
> you need additional information about our setup or the issues we're
> encountering, please let us know.

Our colleague has written a trivial plugin that enables Datadog tracing and
telemetry and it has been, excuse the phrasing, instrumental. He also made it
public, the brilliant bloke (use the log-scope branch for now):
https://gitlab.developers.cam.ac.uk/lib/dev/ams/aspace-datadog

Hope that helps,
p


Re: [Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

2023-05-24 Thread Joshua D. Shaw
Hi Michael

These aren't answers, but I think it might help the group if we knew a bit more 
about how your instance is structured - both from a tech perspective (memory 
allocation to the app and Solr) and things like how many repos and how many 
objects (resources, AOs, etc) are in the DB. The structure of your resources 
may also be useful. IE are they wide or deep or both? Wide meaning a lot of 
siblings at each level, but not a lot of levels in the hierarchy and deep 
meaning a lot of levels in the hierarchy, but not as many siblings at each 
level.

The plugins that you are using probably aren't the culprit, but they can 
add/override index functionality, so listing those out may help as well.

It might also be good to know how many edits are made concurrently on average.

Couple of things that sprang to mind to check (if you haven't already). Have 
you noticed this same behavior in an instance that is not in use? IE have you 
set up a clone of your production instance, let it do its initial full index, 
and then just let it sit? Do you see errors in the app log that have any 
bearing on the problem or pop up around or just before the app goes 
unresponsive or OOM?

In case it helps for comparison, Dartmouth is running 3.3.1 (skipped 3.2.0) and 
allocating 4GB each to the app and Solr - everything running in containers. We 
have 5 repos, though only one is utilized much. That repo has about 15k 
resources and 670k AOs with 30k top containers and 15k agents. We have 
relatively few events or subjects. The resources tend to be wide with max 4 
levels of hierarchy. Our largest resource has 10s of thousands of AOs in the 
hierarchy. We also run a huge number of plugins. We have relatively few editors 
- less than 5 at any one time.

Full index typically takes about 24 hours. We have not seen memory issues in 
any of our instances, though I have occasionally seen indexer timeouts during a 
full index. We have stock settings for the indexer (4, 1, 25) - though I had to 
raise the solr timeout a huge amount to 7200 for 3.3.1 to avoid solr timeouts. 
We do run the PUI, so much of the full index time is the PUI index churning 
away. Staff side indexing takes about 6-8 hours.

Best,
Joshua



[Archivesspace_Users_Group] Diagnosing issues with ArchivesSpace

2023-05-23 Thread Michael Smith
Hello,

Our team has been facing recurring issues with our ArchivesSpace setup since 
October last year, which we've been unable to fully resolve despite concerted 
efforts.

We’re currently running v3.2 on Red Hat Enterprise Linux Server 7.9 (Maipo) and 
we do have a few custom plugins developed by Hudmol. These don’t appear to be 
causing the issues that we’re seeing but we haven’t ruled that out yet.

The primary problem involves intermittent system slowdowns and shutdowns, 
requiring frequent reboots to regain functionality. This occurs on average 3-4 
times weekly but can sometimes be more frequent. This issue is affecting 
multiple teams across our organization.

The most common symptom of our problem that we are seeing now looks to be a
connection pool leak, where what appear to be indexer threads are holding
connections in a closed wait state and preventing them from being used for
other requests. This leads to the main page timing out and staff seeing 504
errors; when unresponsive in this manner we usually restart the application. If
the application hits an OOM, it will restart itself.

Some of the things we’ve attempted so far,


  *   changed default config settings for indexer records per thread, thread 
count and solr timeout to 10, 2 & 300
  *   modified archivesspace.sh to increase memory available 
(ASPACE_JAVA_XMX="-Xmx35g")
  *   disabled both PUI and PUI indexer
  *   application logging to a circular log
  *   changed the garbage collection policies 
(ASPACE_GC_OPTS="-XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC 
-XX:NewRatio=1 -XX:+ExitOnOutOfMemoryError -XX:+UseGCOverheadLimit")
  *   checked top_containers with empty relationships (0 results; a rough SQL 
sketch of this check follows the list)
  *   checked for duplicate event relationships (0 results)
  *   checked for empty indexer state files per record type (0 empty state 
files)
  *   nightly restarts of the system
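
As flagged in the list above, here is a rough sketch of the top_containers
check, reading it as a search for top containers with no rows in the container
link relationship table (the table and column names are from memory, so verify
them against the schema before relying on this):

-- top containers with no linking relationship rows
SELECT tc.id
  FROM top_container tc
  LEFT JOIN top_container_link_rlshp r ON r.top_container_id = tc.id
 WHERE r.id IS NULL;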

Any advice with further diagnosis / troubleshooting would be appreciated. If 
you need additional information about our setup or the issues we're 
encountering, please let us know.

Regards,

Michael Smith  |  Software Developer
02 6262 1029  |  mism...@nla.gov.au  |  National 
Library of Australia
The National Library of Australia acknowledges Australia’s First Nations 
Peoples – the First Australians – as the Traditional Owners and Custodians of 
this land and gives respect to the Elders – past and present – and through them 
to all Australian Aboriginal and Torres Strait Islander people.
___
Archivesspace_Users_Group mailing list
Archivesspace_Users_Group@lyralists.lyrasis.org
http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group