Hi,

A couple of weeks ago, I ran into an unusual problem with Solr on which I could 
find no previous discussion.

I have a 4-node Solr cluster with 2 collections, ‘A’ and ‘B’.  Each collection 
has 1 shard and 3 replicas.  Both collections are updated by a delta-import 
that pulls from a Postgres database every 5 minutes.  Collection ‘A’ is very 
small (~1.5k documents, ~7 MB) and no queries are run against it.  Collection 
‘B’ holds ~90k documents (~500 MB) and has a heavy query load during certain 
parts of the day.  There is an auto hard commit every 15 seconds.  Both 
collections also run a nightly full import during low query load without issue.
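
(For reference, the 5-minute delta is kicked off with something equivalent to 
the following; a minimal Python sketch assuming the standard DataImportHandler 
endpoint, with a placeholder host name:)

    import requests

    SOLR = "http://solr-node-1:8983/solr"  # placeholder host; any node works

    def trigger_delta(collection):
        # Kick off an asynchronous delta-import via the DataImportHandler.
        # clean=false keeps the existing index and applies the delta's
        # adds/deletes rather than wiping the collection first.
        resp = requests.get(
            f"{SOLR}/{collection}/dataimport",
            params={"command": "delta-import", "clean": "false"},
            timeout=10,
        )
        resp.raise_for_status()

    # A cron-style scheduler runs this every 5 minutes for both collections.
    trigger_delta("A")
    trigger_delta("B")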

There was a large delta on Collection ‘B’ that caused nearly every document to 
be updated.  This occurred while the query load was high.  Collection ‘B’ has 2 
different entity types, ‘1’ and ‘2’, which occur in a ~1:3 ratio.  The delta 
included both “adds” and “deletes”.

Looking at the logs, the data import process completed for entity ‘1’ but not 
for entity ‘2’.  There were no errors, exceptions, or warnings in the log, and 
the telemetry did not show that any of the cluster nodes ran out of heap or 
disk space.  A full import (or a large delta) usually completes well within 20 
minutes, but this particular import ran for at least an hour.
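
(For anyone who wants to check this on their own cluster, the import progress 
can be polled through the DataImportHandler status command; a minimal sketch, 
same placeholder host as above:)

    import time
    import requests

    SOLR = "http://solr-node-1:8983/solr"  # placeholder host

    def watch_import(collection, poll_seconds=30):
        # The DIH status command reports 'busy' while an import is running
        # and 'idle' once it finishes; statusMessages carries the running
        # counts of rows fetched and documents processed.
        while True:
            resp = requests.get(
                f"{SOLR}/{collection}/dataimport",
                params={"command": "status", "wt": "json"},
                timeout=10,
            )
            data = resp.json()
            print(data.get("status"), data.get("statusMessages", {}))
            if data.get("status") != "busy":
                break
            time.sleep(poll_seconds)

    watch_import("B")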

More concerning, soon after the data import began to process entity ‘2’, all 
of the nodes in the cluster began continuously sending a high volume of 
/update add requests that contained up to 200 document ids each.  This flood 
of adds lasted for at least 15 minutes and appears to have spiked CPU and GC 
on the cluster nodes, leading to a high volume of query timeouts.  Typically, 
an /update add message contains only 1 (or rarely 2) documents.
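
(If anyone wants to measure the same thing, a rough log scan like the one 
below gives the distribution of ids per add entry.  It assumes the default 
LogUpdateProcessor line format; the file name and regex are illustrative and 
may need adjusting:)

    import re
    from collections import Counter

    # Match the 'add=[id1 (ver), id2 (ver), ...]' block that Solr's
    # LogUpdateProcessor writes for each /update request.
    ADD_RE = re.compile(r"path=/update .*?\{add=\[([^\]]*)\]")

    sizes = Counter()
    with open("solr.log") as fh:
        for line in fh:
            m = ADD_RE.search(line)
            if m:
                n_ids = m.group(1).count(",") + 1 if m.group(1) else 0
                sizes[n_ids] += 1

    # Normally almost everything lands in the 1-2 id buckets; during the
    # incident, entries with up to ~200 ids dominated.
    for n, count in sorted(sizes.items()):
        print(f"{n:>4} ids per add: {count} log entries")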

The cluster was restarted in a rolling fashion (one node at a time), but this 
did not appear to resolve all of the issues.  Only after all of the replicas 
were deleted and then re-added (through the Admin console) did the flood of 
/update requests subside.
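
(The Admin console operations correspond to the Collections API 
DELETEREPLICA/ADDREPLICA actions; a sketch, where the replica and node names 
are placeholders for the real ones:)

    import requests

    SOLR = "http://solr-node-1:8983/solr"  # placeholder host
    ADMIN = f"{SOLR}/admin/collections"

    def replace_replica(collection, shard, replica, node):
        # Drop the replica, then re-add one on the same node; the new core
        # does a full index recovery from the shard leader.
        requests.get(ADMIN, params={
            "action": "DELETEREPLICA",
            "collection": collection,
            "shard": shard,
            "replica": replica,      # e.g. core_node2 (placeholder)
        }, timeout=120).raise_for_status()
        requests.get(ADMIN, params={
            "action": "ADDREPLICA",
            "collection": collection,
            "shard": shard,
            "node": node,            # e.g. host2:8983_solr (placeholder)
        }, timeout=120).raise_for_status()

    replace_replica("B", "shard1", "core_node2", "host2:8983_solr")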

Has anyone ever observed this kind of behavior?  Is there a known issue or a 
procedure to follow for getting a cluster out of this state?

I was able to reproduce the /update “adds” flood by starting a large delta, 
putting the cluster under heavy load, and then forcing a second delta 
immediately after the first one finished.  This is obviously not exactly the 
same event, because in this case the large deltas ran to completion for both 
entity ‘1’ and entity ‘2’.  Here, forcing a commit seemed to reduce the volume 
of the large /update add messages, but did not completely eliminate them.  
Deleting and re-adding the replicas fixed this issue as well.
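
(The reproduction amounts to roughly the following; same placeholder host, and 
the polling interval is arbitrary:)

    import time
    import requests

    SOLR = "http://solr-node-1:8983/solr"  # placeholder host
    COLLECTION = "B"

    def delta():
        requests.get(f"{SOLR}/{COLLECTION}/dataimport",
                     params={"command": "delta-import", "clean": "false"},
                     timeout=10).raise_for_status()

    def import_busy():
        resp = requests.get(f"{SOLR}/{COLLECTION}/dataimport",
                            params={"command": "status", "wt": "json"},
                            timeout=10)
        return resp.json().get("status") == "busy"

    # Start a large delta and, the moment it finishes (with the cluster
    # under heavy query load), immediately force a second one.
    delta()
    time.sleep(5)            # give the handler a moment to report busy
    while import_busy():
        time.sleep(5)
    delta()

    # An explicit commit reduced, but did not eliminate, the flood.
    requests.get(f"{SOLR}/{COLLECTION}/update",
                 params={"commit": "true"}, timeout=60).raise_for_status()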

Any insight into this would be very helpful.  Thanks!

Sincerely,
Perrin Bignoli