Re: tlog keeps growing

2021-02-16 Thread mmb1234
Erik, Looks like we're also running into this issue. https://www.mail-archive.com/solr-user@lucene.apache.org/msg153798.html Is there anything we can do to remedy this besides a node restart, which causes leader re-election on the good shards and in turn makes them non-operational as well?

Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-16 Thread mmb1234
> Are yours growing always, on all nodes, forever? Or is it one or two who ends up in a bad state? Randomly on some of the shards and some of the followers in the collection. Then whichever tlog was open on the follower back when it was the leader, that one doesn't stop growing. And that shard had

Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-16 Thread mmb1234
Looks like the problem is related to tlog rotation on the follower shard. We did the following for a specific shard:
0. Start Solr Cloud.
1. solr-0 (leader), solr-1, solr-2.
2. Rebalance to make solr-1 the preferred leader.
3. solr-0, solr-1 (leader), solr-2.
The tlog file on solr-0 kept on growing
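For reference, one way to do the rebalance in step 2 is via the Collections API, roughly as sketched below; the collection, shard, and replica (core_node) names are placeholders, not taken from this thread:

  # mark the replica that should become leader as preferredLeader
  $ curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=TestCollection&shard=shard1&replica=core_node2&property=preferredLeader&property.value=true'
  # then ask Solr to move leadership onto the preferredLeader replicas
  $ curl 'http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=TestCollection'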

Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-14 Thread mmb1234
We found that for the shard that does not get a leader, the tlog replay did not complete for hours (we don't see the "log replay finished", "creating leader registration node", "I am the new leader", etc. log messages). Also not sure why the tlogs are tens of GBs (anywhere from 30 to 40GB).
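A rough way to watch this from the node itself is sketched below; the data and log paths are assumptions and will differ per install:

  # per-core tlog size on disk
  $ du -sh /var/solr/data/*/data/tlog
  # look for the "log replay finished" / replay progress messages
  $ grep -i 'log replay' /var/solr/logs/solr.log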

Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-13 Thread mmb1234
By tracing the output in the log files we see the following sequence. The Overseer role list has POD-1, POD-2, POD-3 in that order. POD-3 has 2 shard leaders. POD-3 restarts. A) Logs for the shard whose leader moves successfully from POD-3 to POD-1. On POD-1: o.a.s.c.ShardLeaderElectionContext

Down Replica is elected as Leader (solr v8.7.0)

2021-02-10 Thread mmb1234
Hello, On reboot of one of the Solr nodes in the cluster, we often see a collection's shards with 1. a LEADER replica in DOWN state, and/or 2. a shard with no LEADER. Output from /solr/admin/collections?action=CLUSTERSTATUS is below. Even after 5 to 10 minutes, the collection often does not recover.
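For reference, the per-replica state can be pulled out of CLUSTERSTATUS like this; the collection name is a placeholder and the jq filter assumes the usual cluster/collections/shards/replicas JSON layout:

  $ curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=TestCollection' \
      | jq '.cluster.collections[].shards[].replicas[] | {core, state, leader}'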

Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread mmb1234
> Does this happen on a warm searcher (are subsequent requests with no intervening updates _ever_ fast?)? Subsequent response times are very fast if the searcher remains open. As a control test, I faceted on the same field that I used in the q param. 1. Start solr 2. Execute q=resultId:x=0 =>

Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread mmb1234
Ok. I'll try that. Meanwhile, a query on resultId gets a sub-second response, but the immediately following faceting query takes 40+ seconds. The core has 185 million docs and a 63GB index. curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x=0' {

Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread mmb1234
Hello, I am seeing very slow responses from JSON faceting against a single core (though the core is a shard leader in a collection). Fields processId and resultId are non-multivalued, indexed, docValues strings (not text). Soft Commit = 5sec (openSearcher=true) and Hard Commit = 10sec because new
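A minimal JSON Facet request along the lines described above would look roughly like this; the core name follows the curl command elsewhere in this thread, while the facet label "byProcess" and the value "x" are placeholders:

  $ curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query' -d '
  {
    "query": "resultId:x",
    "limit": 0,
    "facet": {
      "byProcess": { "type": "terms", "field": "processId", "limit": 10 }
    }
  }'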

Dynamic schema failure for child docs not using "_childDocuments_" key

2020-05-05 Thread mmb1234
I am running into an exception where creating child docs fails unless the field already exists in the schema (stacktrace is at the bottom of this post). My Solr is v8.5.1 running in standard/non-cloud mode. $> curl -X POST -H 'Content-Type: application/json'
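A minimal example of the kind of request that triggers this; the core name "mycore", the child key "comments", and the field names are placeholders:

  $> curl -X POST -H 'Content-Type: application/json' \
     'http://localhost:8983/solr/mycore/update?commit=true' \
     --data-binary '[
       { "id": "parent-1",
         "comments": [ { "id": "child-1", "comment_text": "a child doc under a custom key" } ] }
     ]'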

Using Deep Paging with Graph Query Parser

2019-12-08 Thread mmb1234
Is there a way to combine deep paging's cursor feature with the graph query parser? Background: I have a hierarchical data structure that is split into N different flat JSON docs and updated (inserted) into Solr with from/to fields. Using the from/to join syntax, a graph query is needed since
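For reference, the combination would look roughly like the request below; the core name, from/to field names, and root query are placeholders, and cursorMark requires the sort to include the uniqueKey field:

  $ curl 'http://localhost:8983/solr/mycore/select' \
      --data-urlencode 'q={!graph from=from_id to=to_id}id:root-doc' \
      --data-urlencode 'sort=id asc' \
      --data-urlencode 'cursorMark=*' \
      --data-urlencode 'rows=100'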

Re: Solr core corrupted for version 7.4.0, please help!

2018-08-24 Thread mmb1234
Thank you for https://issues.apache.org/jira/browse/SOLR-12691. I see it's marked as minor. Can we bump up the priority, please? The example of 2 cores ingesting + transientCacheSize==1 was provided for reproduction reference only, and is not what runs in production. Production setup on AWS

Re: Solr core corrupted for version 7.4.0, please help!

2018-08-22 Thread mmb1234
> Having 100+ cores on a Solr node and a transient cache size of 1 The original post clarified the current state: "we have about 75 cores with "transientCacheSize" set to 32". If transientCacheSize is increased to match the current core count, we'll only defer the issue. It's going to hit hundreds of cores per

Re: Solr core corrupted for version 7.4.0, please help!

2018-08-22 Thread mmb1234
> The problem here is that you may have M requests queued up for the _same_ core, each with a new update request. With transientCacheSize == 1, as soon as the update request for Core B is received, Core B encounters data corruption, not Core A. Both Core A and Core B are receiving update requests.

Re: Is commit for SOLR-11444 for SolrCloudClient#sendRequest(SolrRequest, List) ok?

2018-06-05 Thread mmb1234
In the below-mentioned git commit, I see that SolrCloudClient has been changed to generate Solr core URLs differently than before. In the previous version, Solr URLs were computed using "url = coreNodeProps.getCoreUrl()", which concatenated the "base_url" + "core" name from the clusterstate for a tenant's

Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-19 Thread mmb1234
FYI. This issue went away after solrconfig.xml was tuned. The "Hard commits blocked | non-solrcloud v6.6.2" thread has the details. http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-19 Thread mmb1234
The below solrconfig.xml settings resolved the TIMED_WAIT in ConcurrentMergeScheduler.doStall(). Thanks to Shawn and Erik for their pointers. ... 30 100 30.0 18 6 300 ... ${solr.autoCommit.maxTime:3}

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-11 Thread mmb1234
> https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_256 I had uploaded the output at the above link. The OS has no swap configured. There are other processes on the host, but they account for <1GB of memory or <5% CPU cumulatively, and none run inside the Docker container, as `top` shows. The Solr JVM heap is at

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-10 Thread mmb1234
Hi Shawn, Erik > updates should slow down but not deadlock. The net effect is the same. As the CLOSE_WAITs increase, the JVM ultimately stops accepting new socket requests, at which point `kill ` is the only option. This means that if the replication handler is invoked, which sets the deletion policy, the

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-09 Thread mmb1234
Ran /solr/58f449cec94a2c75-core-248/admin/luke at 7:05pm PST. It showed "lastModified: 2018-02-10T02:25:08.231Z", indicating commits had been blocked for about 41 minutes. Hard commit is set to 10 seconds in solrconfig.xml. Other cores are also now blocked. The https://jstack.review analysis of the thread dump says
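For anyone reproducing the check, the Luke call above looks like the following; host and port are assumptions, and numTerms=0 just keeps the response small:

  $ curl 'http://localhost:8983/solr/58f449cec94a2c75-core-248/admin/luke?show=index&numTerms=0&wt=json'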

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-09 Thread mmb1234
Shawn, Eric, were you able to look at the thread dump? https://github.com/mohsinbeg/datadump/blob/master/threadDump-7pjql_1.zip Or is there additional data I can provide?

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-08 Thread mmb1234
> Setting openSearcher to false on autoSoftCommit makes no sense. That was my mistake in my solrconfig.xml. Thank you for identifying it. I have corrected it. I then removed my custom element from my solrconfig.xml, and both the hard commit and /solr/admin/core hang issues seemed to go away for a

Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-08 Thread mmb1234
> If you issue a manual commit > (http://blah/solr/core/update?commit=true) what happens? That call never returned to the client browser. So I also tried a core reload and captured it in the thread dump. That too never returned. "qtp310656974-1022" #1022 prio=5 os_prio=0
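For reference, thread dumps like the one quoted above can be captured with something along these lines; the pgrep pattern for finding the Solr PID is an assumption:

  $ jcmd $(pgrep -f start.jar) Thread.print > threaddump.txt
  # jstack -F can sometimes force a dump when jcmd does not respond
  $ jstack -F $(pgrep -f start.jar) > threaddump.txt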

Hard commits blocked | non-solrcloud v6.6.2

2018-02-07 Thread mmb1234
I am seeing that, after some time, hard commits in all my Solr cores stop, and each one's searcher has an "opened at" date from hours ago, even though they continue to ingest data successfully (index size increasing continuously).
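One way to check a searcher's "opened at" time from the command line is the mbeans endpoint sketched below; the core name is a placeholder, and this assumes the searcher stats (openedAt, registeredAt) are reported under the CORE category, as in the admin UI's Plugins/Stats page:

  $ curl 'http://localhost:8983/solr/mycore/admin/mbeans?cat=CORE&stats=true&wt=json'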

Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-07 Thread mmb1234
> Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169 Looks like it is the issue. (I've redacted IP addresses below for security reasons.) solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c 8425 CLOSE_WAIT - 92 ESTABLISHED - 1

Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-05 Thread mmb1234
Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169 I have noticed that when the number of HTTP requests/sec is increased, CLOSE_WAITs increase linearly until Solr stops accepting socket connections. Netstat output is $ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
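A small variation on that pipeline shows which remote peers the CLOSE_WAIT sockets belong to (IPv4 only; column positions assume Linux netstat output):

  $ netstat -ptan | awk '/CLOSE_WAIT/ {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn | head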

Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread mmb1234
> You said that you're running Solr 6.2.2, but there is no 6.2.2 version. > but the JVM argument list includes "-Xmx512m" which is a 512MB heap My typos. They're 6.6.2 and -Xmx30g, respectively. > many open connections causes is a large number of open file handles, solr [ /opt/solr/server/logs
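A quick way to count the Solr process's open file handles against its limit (the pgrep pattern for the Solr PID is an assumption):

  $ ls /proc/$(pgrep -f start.jar)/fd | wc -l
  $ grep 'open files' /proc/$(pgrep -f start.jar)/limits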

9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread mmb1234
Hello, In our Solr non-cloud env., we are seeing lots of CLOSE_WAITs, causing the JVM to stop "working" within 3 minutes of Solr starting. solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep 10.xxx.xxx.xxx | wc -l 9453 The only option then is `kill -9`, because even `jcmd Thread.print` is