You need more shards. And, I’m pretty certain, more hardware.

You say you have 13 billion documents and 6 shards. Lucene has a hard upper
limit of just under 2^31 (about 2.14 billion) documents per index, which in
Solr terms means per shard. 13 billion documents across 6 shards works out to
roughly 2.17 billion per shard, so I don’t quite know how you’re running at
all unless that 13B is a rounded-up number. If you keep adding documents, your
installation will, at best, shortly stop accepting new documents for indexing.
At worst you’ll start seeing weird errors and possibly corrupt indexes, and
you’ll have to re-index everything from scratch.

You’ve backed yourself into a pretty tight corner here. You either have to
re-index into a properly-sized cluster or use SPLITSHARD. The latter will
roughly double the index-on-disk size: it creates two child indexes per
replica and keeps the old one around for safety’s sake, which you have to
clean up later. I strongly recommend you stop ingesting more data while you
do this.
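
For reference, the split itself is a Collections API call. A minimal sketch,
assuming a collection named "mycollection" and starting with shard1
(substitute your own host, collection and shard names):

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1&async=split1

With an index this size, run it with the async parameter as above and poll
progress via action=REQUESTSTATUS&requestid=split1. Once the two new
sub-shards are active, the old parent shard is marked inactive and can be
cleaned up with action=DELETESHARD.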

You say you have 6 VMs with 2 Solr nodes running on each. If those VMs are
co-located with anything else, the physical hardware is going to be stressed.
VMs themselves aren’t bad, but somewhere there’s physical hardware that has to
run them…

In fact, I urge you to stop ingesting data immediately. You have a cluster
that’s mis-configured, and you need to fix that before Bad Things Happen.

Best,
Erick

> On Jul 4, 2020, at 5:09 AM, Mad have <madhava.a.re...@gmail.com> wrote:
> 
> Hi Eric,
> 
> There are a total of 6 VMs in the Solr cluster and 2 nodes are running on 
> each VM. The total number of shards is 6, with 3 replicas. I can see the 
> index size is more than 220GB on each node for the collection where we are 
> facing the performance issue.
> 
> The more documents we add to the collection, the slower the indexing 
> becomes, and I also have the same impression that the size of the collection 
> is causing this issue. I would appreciate it if you could suggest any 
> solution for this.
> 
> 
> Regards,
> Madhava 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 23:30, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> Oops, I transposed that. If your index is a terabyte and your RAM is 128G, 
>> _that’s_ a red flag.
>> 
>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>> the collection you’re hosting per physical machine. Nor how large the 
>>> indexes are on disk. Those are the numbers that count. The latter is 
>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>> 128G of memory is a terabyte, that’s a red flag.
>>> 
>>> Short form, though, is yes. Subject to the questions above, this is what I’d 
>>> be looking at first.
>>> 
>>> And, as I said, if you’ve been steadily increasing the total number of 
>>> documents, you’ll reach a tipping point sometime.
>>> 
>>> Best,
>>> Erick
>>> 
>>>>> On Jul 3, 2020, at 5:32 PM, Mad have <madhava.a.re...@gmail.com> wrote:
>>>> 
>>>> Hi Eric,
>>>> 
>>>> The collection has almost 13 billion documents, each around 5KB in size, 
>>>> and all of the roughly 150 fields are indexed. Do you think the number of 
>>>> documents in the collection is causing this issue? Appreciate your 
>>>> response.
>>>> 
>>>> Regards,
>>>> Madhava 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 3 Jul 2020, at 12:42, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>> 
>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>> just have too much data on too little hardware. Check your swapping:
>>>>> how much of your I/O is happening just because Lucene can’t hold all
>>>>> the parts of the index it needs in memory at once? Lucene uses
>>>>> MMapDirectory to hold the index, and you may well be swapping; see:
>>>>> 
>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
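>>>>> 
>>>>> (A quick way to check, assuming the nodes run on Linux:
>>>>> 
>>>>>  free -h      # memory left over for the OS page cache
>>>>>  vmstat 1 5   # non-zero si/so columns mean active swapping
>>>>> 
>>>>> Then compare the aggregate on-disk index size per box with whatever is
>>>>> left after the JVM heaps take their share.)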
>>>>> 
>>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>>> 
>>>>> "From last 2-3 weeks we have been noticing either slow indexing or 
>>>>> timeout errors while indexing”
>>>>> 
>>>>> So have you been continually adding more documents to your
>>>>> collections over those 2-3 weeks and before? If so, you may have just
>>>>> put so much data on the same boxes that you’ve gone over
>>>>> the capacity of your hardware. As Toke says, adding physical
>>>>> memory for the OS to use to hold relevant parts of the index may
>>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>>> 
>>>>> All that said, if you’re going to keep adding documents, you need to
>>>>> seriously think about adding new machines and moving some of
>>>>> your replicas to them.
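>>>>> 
>>>>> (For illustration, that’s also a Collections API call. A sketch with
>>>>> made-up replica and node names, yours will differ:
>>>>> 
>>>>>  http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=mycollection&replica=core_node5&targetNode=newhost:8983_solr
>>>>> 
>>>>> ADDREPLICA on the new machine followed by DELETEREPLICA on the old one
>>>>> accomplishes the same thing if you want finer control.)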
>>>>> 
>>>>> Best,
>>>>> Erick
>>>>> 
>>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen <t...@kb.dk> wrote:
>>>>>> 
>>>>>>> On Thu, 2020-07-02 at 11:16 +0000, Kommu, Vinodh K. wrote:
>>>>>>> We are performing QA performance testing on a couple of collections
>>>>>>> which hold 2 billion and 3.5 billion docs respectively.
>>>>>> 
>>>>>> How many shards?
>>>>>> 
>>>>>>> 1.  Our performance team noticed that read operations greatly
>>>>>>> outnumber write operations, at around a 100:1 ratio. Is this expected
>>>>>>> during indexing, or are the Solr nodes doing any other operations
>>>>>>> like syncing?
>>>>>> 
>>>>>> Are you saying that there are 100 times more read operations when you
>>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>>> might be filled with the data that the writers are flushing.
>>>>>> 
>>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>>> but such a massive difference in IO-utilization does indicate that you
>>>>>> are starved for cache.
>>>>>> 
>>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>>>> check: How many replicas is each physical box handling? If they are
>>>>>> sharing resources, fewer replicas would probably be better.
>>>>>> 
>>>>>>> 3.  Our client timeout is set to 2 minutes; can it be increased
>>>>>>> further? Would that help or create any other problems?
>>>>>> 
>>>>>> It does not hurt the server to increase the client timeout as the
>>>>>> initiated query will keep running until it is finished, independent of
>>>>>> whether or not there is a client to receive the result.
>>>>>> 
>>>>>> If you want a better max time for query processing, you should look at 
>>>>>> 
>>>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>>>> but due to its inherent limitations it might not help in your
>>>>>> situation.
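>>>>>> 
>>>>>> (For illustration, timeAllowed is just another request parameter, in
>>>>>> milliseconds, e.g. on a hypothetical collection:
>>>>>> 
>>>>>>  http://localhost:8983/solr/mycollection/select?q=*:*&timeAllowed=60000
>>>>>> 
>>>>>> Queries that hit the limit return partial results, flagged with
>>>>>> partialResults=true in the response header.)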
>>>>>> 
>>>>>>> 4.  When we created an empty collection and loaded the same data
>>>>>>> file, it loaded fine without any issues, so would having more
>>>>>>> documents in a collection create such problems?
>>>>>> 
>>>>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>>>>> leading to excessive IO-activity, which might be what you are seeing. I
>>>>>> can see from an earlier post that you were using streaming expressions
>>>>>> for another collection: This is one of the things that are affected by
>>>>>> the Solr 7 DocValues issue.
>>>>>> 
>>>>>> More info about DocValues and streaming:
>>>>>> https://issues.apache.org/jira/browse/SOLR-13013
>>>>>> 
>>>>>> Fairly in-depth info on the problem with Solr 7 docValues:
>>>>>> https://issues.apache.org/jira/browse/LUCENE-8374
>>>>>> 
>>>>>> If this is your problem, upgrading to Solr 8 and indexing the
>>>>>> collection from scratch should fix it. 
>>>>>> 
>>>>>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>>>>>> or you can ensure that there are values defined for all DocValues-
>>>>>> fields in all your documents.
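>>>>>> 
>>>>>> (The last option simply means no document may omit a value for any
>>>>>> docValues field. One way to guarantee that is a default in the schema;
>>>>>> a sketch with a hypothetical numeric field:
>>>>>> 
>>>>>>  <field name="price" type="plong" indexed="true" stored="true"
>>>>>>         docValues="true" default="0"/>
>>>>>> 
>>>>>> Whether a sentinel default like 0 is acceptable depends on how you
>>>>>> sort and facet on the field.)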
>>>>>> 
>>>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>>>   at java.net.SocketInputStream.socketRead0(Native Method) 
>>>>>> ...
>>>>>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>>>>>> timeout expired: 600000/600000 ms
>>>>>> 
>>>>>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>>>>>> should be able to change it in solr.xml.
>>>>>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
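>>>>>> 
>>>>>> A sketch of what that could look like in solr.xml, assuming
>>>>>> distribUpdateSoTimeout is indeed the setting in play (the value is in
>>>>>> milliseconds):
>>>>>> 
>>>>>>  <solrcloud>
>>>>>>    <int name="distribUpdateSoTimeout">1200000</int>
>>>>>>  </solrcloud>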
>>>>>> 
>>>>>> BUT if an update takes > 10 minutes to be processed, it indicates that
>>>>>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>>>>>> 
>>>>>> - Toke Eskildsen, Royal Danish Library
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> 
