Hi Dominique,

Our issue seems similar to the one discussed here:
https://github.com/eclipse/jetty.project/issues/4105

Could you share your views on this?
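For reference, the 1000/3000 figures discussed further down in this thread come from a hard-coded semaphore in SolrJ's Http2SolrClient. Below is a minimal sketch of that throttling pattern; the class and method names are illustrative, not Solr's actual API:

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch of the throttling pattern in Http2SolrClient:
// a non-fair counting semaphore caps in-flight async requests, and
// callers block once the cap is reached. Names are hypothetical.
public class RequestThrottle {
    private static final int MAX_OUTSTANDING_REQUESTS = 1000; // hard-coded in 8.3.1

    private final Semaphore available =
            new Semaphore(MAX_OUTSTANDING_REQUESTS, false); // non-fair, as in Solr

    // Called before an async request is queued; blocks when 1000 are in flight.
    public void beforeRequest() {
        available.acquireUninterruptibly();
    }

    // Called from the response/failure listener to free a slot.
    public void afterRequest() {
        available.release();
    }

    // Number of requests currently holding a permit.
    public int inFlight() {
        return MAX_OUTSTANDING_REQUESTS - available.availablePermits();
    }
}
```

Since the constant is private and final in 8.3.1, raising the limit appears to require rebuilding SolrJ (or reducing the request rate); I could not find a configuration hook for it in that version.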

Thanks,
Mohandoss.

On Tue, Aug 11, 2020 at 7:06 AM Doss <itsmed...@gmail.com> wrote:

> Hi Dominique,
>
> Thanks for the response.
>
> I don't think I would use JVM version 14. OpenJDK 11 is, in my opinion,
> the best choice for an LTS version.
>
> >> We will try changing it.
>
> You changed a lot of default values. Any specific reasons? It seems very
> aggressive!
>
> >> Our product team wants data to be reflected in Near Real Time.
>  mergePolicyFactory, mergeScheduler - these are based on our oldest SOLR
> cluster, where tweaking these parameters gave good results.
>
> You have to analyze GC on all nodes!
>
> >> I checked the other nodes' GC and found no issues. I shared the GC log
> of the node which gets into trouble most frequently.
>
> Your heap is very big. Given the full GC frequency, I don't think you
> really need such a big heap for indexing only. Maybe you will once you
> start performing queries.
>
> >> Heap sizing is based on the select requests we are expecting; we expect
> around 10 to 15 million per day. We have plans to increase CPU before
> routing select traffic.
>
> Did you check your network performance?
>
> >> We did check the sar reports but were unable to find an issue; we use a
> 10 Gbps connection. Is there any SOLR metrics API which will give
> network-related information? Please suggest other ways to dig into this
> further.
>
> Did you check the Zookeeper logs?
>
> >> We never looked at the Zookeeper logs; we will check and share. Is
> there any particular information to watch out for?
>
> Regards,
> Doss
>
>
> On Monday, August 10, 2020, Dominique Bejean <dominique.bej...@eolya.fr>
> wrote:
>
>> Doss,
>>
>> See below.
>>
>> Dominique
>>
>>
>> On Mon, Aug 10, 2020 at 5:41 PM, Doss <itsmed...@gmail.com> wrote:
>>
>>> Hi Dominique,
>>>
>>> Thanks for your response. Please find the details below; let me know if
>>> I missed anything.
>>>
>>>
>>> *- hardware architecture and sizing*
>>> >> CentOS 7, VMs, 4 CPUs, 66GB RAM, 16GB heap, 250GB SSD
>>>
>>>
>>> *- JVM version / settings    *
>>> >> Red Hat, Inc. OpenJDK 64-Bit Server VM, version "14.0.1 14.0.1+7",
>>> >> default settings, including GC
>>>
>>
>> I don't think I would use JVM version 14. OpenJDK 11 is, in my opinion,
>> the best choice for an LTS version.
>>
>>
>>>
>>> *- Solr settings    *
>>> >> softCommit: 15000 (15 sec), autoCommit: 300000 (5 mins)
>>>
>>> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>>>   <int name="maxMergeAtOnce">30</int>
>>>   <int name="maxMergeAtOnceExplicit">100</int>
>>>   <double name="segmentsPerTier">30.0</double>
>>> </mergePolicyFactory>
>>>
>>> <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>>   <int name="maxMergeCount">18</int>
>>>   <int name="maxThreadCount">6</int>
>>> </mergeScheduler>
>>>
>>
>> You changed a lot of default values. Any specific reasons? It seems very
>> aggressive!
>>
>>
>>>
>>>
>>> *- collections and queries information   *
>>> >> One collection with 4 shards and 3 replicas; 3.5 million records; 150
>>> fields, mostly integers; average doc size is 350KB. 0.5 million
>>> inserts/updates spread across the whole day (peak time being 6PM to
>>> 10PM); selects not yet started. Once daily we do a delta import of
>>> certain multivalued fields with a good amount of data.
>>>
>>> *- gc logs or gceasy results*
>>>
>>> The GCeasy report says GC health is good; one server's GC report:
>>> https://drive.google.com/file/d/1C2SqEn0iMbUOXnTNlYi46Gq9kF_CmWss/view?usp=sharing
>>> CPU Load Pattern:
>>> https://drive.google.com/file/d/1rjRMWv5ritf5QxgbFxDa0kPzVlXdbySe/view?usp=sharing
>>>
>>>
>> You have to analyze GC on all nodes!
>> Your heap is very big. Given the full GC frequency, I don't think you
>> really need such a big heap for indexing only. Maybe you will once you
>> start performing queries.
>>
>> Did you check your network performance?
>> Did you check the Zookeeper logs?
>>
>>
>>>
>>> Thanks,
>>> Doss.
>>>
>>>
>>>
>>> On Mon, Aug 10, 2020 at 7:39 PM Dominique Bejean <
>>> dominique.bej...@eolya.fr> wrote:
>>>
>>>> Hi Doss,
>>>>
>>>> Seeing a lot of TIMED_WAITING connections is common in high-TCP-traffic
>>>> infrastructures, as in a LAMP stack when the Apache server can no longer
>>>> connect to the MySQL/MariaDB database.
>>>> In this case, tweaking net.ipv4.tcp_tw_reuse is a possible solution (but
>>>> never net.ipv4.tcp_tw_recycle, as you suggested in your previous post).
>>>> This is well explained in this great article:
>>>> https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux
>>>>
>>>> However, in general and more specifically in your case, I would
>>>> investigate the root cause of your issue rather than try to find a
>>>> workaround.
>>>>
>>>> Can you provide more information about your use case (we know: 3-node
>>>> SOLR (8.3.1, NRT) + 3-node Zookeeper ensemble)?
>>>>
>>>>    - hardware architecture and sizing
>>>>    - JVM version / settings
>>>>    - Solr settings
>>>>    - collections and queries information
>>>>    - gc logs or gceasy results
>>>>
>>>> Regards
>>>>
>>>> Dominique
>>>>
>>>>
>>>>
>>>> On Mon, Aug 10, 2020 at 3:43 PM, Doss <itsmed...@gmail.com> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > In the Solr 8.3.1 source I see the following, which I assume could be
>>>> > the reason for the issue "Max requests queued per destination 3000
>>>> > exceeded for HttpDestination":
>>>> >
>>>> > solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java:
>>>> >    private static final int MAX_OUTSTANDING_REQUESTS = 1000;
>>>> > solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java:
>>>> >      available = new Semaphore(MAX_OUTSTANDING_REQUESTS, false);
>>>> > solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java:
>>>> >      return MAX_OUTSTANDING_REQUESTS * 3;
>>>> >
>>>> > How can I increase this?
>>>> >
>>>> > On Mon, Aug 10, 2020 at 12:01 AM Doss <itsmed...@gmail.com> wrote:
>>>> >
>>>> > > Hi,
>>>> > >
>>>> > > We are running a 3-node SOLR (8.3.1, NRT) + 3-node Zookeeper
>>>> > > ensemble, and now and then we are facing "Max requests queued per
>>>> > > destination 3000 exceeded for HttpDestination".
>>>> > >
>>>> > > After a restart everything works fine until the problem occurs again.
>>>> > > Once the problem has occurred we see very many TIMED_WAITING threads:
>>>> > >
>>>> > > Server 1:
>>>> > >    *7722*  threads are in TIMED_WAITING
>>>> > > ("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@151d5f2f")
>>>> > > Server 2:
>>>> > >    *4046*  threads are in TIMED_WAITING
>>>> > > ("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@1e0205c3")
>>>> > > Server 3:
>>>> > >    *4210*  threads are in TIMED_WAITING
>>>> > > ("lock":"java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5ee792c0")
>>>> > >
>>>> > > Please suggest whether net.ipv4.tcp_tw_reuse=1 will help, or tell us
>>>> > > how we can increase the 3000 limit.
>>>> > >
>>>> > > Sorry, since I didn't get any response to my previous query, I am
>>>> > > creating this as a new thread.
>>>> > >
>>>> > > Thanks,
>>>> > > Mohandoss.
>>>> > >
>>>> >
>>>>
>>>
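As an aside, the TIMED_WAITING counts quoted above can be reproduced from any thread dump. A minimal sketch that counts them in the text of a `jstack` output (this helper is ours, not part of Solr):

```java
import java.util.Arrays;

// Count TIMED_WAITING threads in the text of a "jstack <pid>" dump by
// matching the per-thread state line jstack prints for each thread.
public class ThreadStateCount {
    public static long countTimedWaiting(String dump) {
        return Arrays.stream(dump.split("\n"))
                .filter(line -> line.contains("java.lang.Thread.State: TIMED_WAITING"))
                .count();
    }

    public static void main(String[] args) {
        // A tiny sample in jstack's format; real dumps have one such
        // state line per thread.
        String sample =
                "\"qtp1-42\" #42 daemon\n"
              + "   java.lang.Thread.State: TIMED_WAITING (parking)\n"
              + "\"qtp1-43\" #43 daemon\n"
              + "   java.lang.Thread.State: RUNNABLE\n";
        System.out.println(countTimedWaiting(sample)); // prints 1
    }
}
```

Running this against periodic dumps from each node would show whether the thread counts climb steadily toward the failure, or spike suddenly.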
