Re: tlog keeps growing

2021-02-16 Thread mmb1234
Erik,

Looks like we're also running into this issue.
https://www.mail-archive.com/solr-user@lucene.apache.org/msg153798.html

Is there anything we can do to remedy this besides a node restart, which
triggers leader re-election on the healthy shards and causes them to also
become inoperable?





--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-16 Thread mmb1234
> Are yours growing always, on all nodes, forever?  Or is it one or two who
ends up in a bad state?

Randomly, on some of the shards and some of the followers in the collection.
Whichever tlog was open on a follower while it was still the leader is the one
that never stops growing. That shard had active ingestion at a high rate.

If we now add more shards or do a cluster rebalance, the collection becomes
unusable and causes a production query and ingest outage. Recovery is a very
painful manual restoration, twice a day.




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-16 Thread mmb1234
Looks like the problem is related to tlog rotation on the follower shard.

We did the following for a specific shard.

0. start solr cloud
1. solr-0 (leader), solr-1, solr-2
2. rebalance to make solr-1 as preferred leader
3. solr-0, solr-1 (leader), solr-2

The tlog file on solr-0 kept growing indefinitely (100s of GBs) until we shut
down the cluster and manually dropped all shards.

The only way to "restart" tlog rotation on solr-0 (the follower) was to issue
/admin/cores?action=RELOAD&core=x at least twice while the tlog size was still
small (in MBs).
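
For reference, the reload we issue is the standard CoreAdmin RELOAD call,
roughly as below (the host and core name are placeholders, not the real ones):

curl 'http://solr-0:8983/solr/admin/cores?action=RELOAD&core=mydata_0_e000-xxxx_replica_t1'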

Also, if a rebalance is issued to select solr-0 as the leader, leader election
never completes.
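
For anyone reproducing this: a preferred-leader rebalance of the kind described
above is normally done with the stock Collections API calls, roughly as below
(host, collection, shard and replica names are placeholders):

curl 'http://solr-0:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=mydata&shard=0_e000-xxxx&replica=core_node3&property=preferredLeader&property.value=true'
curl 'http://solr-0:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mydata'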

Tlog directory sizes after step (3) above:

solr-0
2140856 ./data2/mydata_0_e000-/tlog
2140712 ./data2/mydata_0_e000-/tlog/tlog.021

solr-1 (leader)
35268   ./data2/mydata_0_e000-/tlog
35264   ./data2/mydata_0_e000-/tlog/tlog.055

solr-2
35256   ./data2/mydata_0_e000-/tlog
35252   ./data2/mydata_0_e000-/tlog/tlog.054



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-14 Thread mmb1234


We found that for the shard that does not get a leader, the tlog replay did
not complete for hours (we don't see the "log replay finished", "creating
leader registration node", "I am the new leader", etc. log messages).

Also, we're not sure why the tlogs are tens of GBs (anywhere from 30 to 40 GB).

The collection's shards have 3 replicas each, of TLOG type, and a 10 sec hard
commit.
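
For context, a collection with that layout is typically created along these
lines (the name, shard count and config set below are placeholders, not our
actual ones):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mydata&numShards=8&tlogReplicas=3&collection.configName=mydata_conf'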

The situation is resulting in two outages per day. The current workaround is
to stop ingestion, issue a collection rebalance and/or node restarts, and
repeat until the shards are online (a couple of hours of manual recovery each
day).

Any suggestions or workarounds?

Not sure if we're running into these issues:
https://issues.apache.org/jira/browse/SOLR-13486
https://issues.apache.org/jira/browse/SOLR-14679


Partial log output from both nodes (sorted by time asc):

myapp-data-solr-0
2021-02-12 19:36:05.965 INFO (zkCallback-14-thread-51) [c:mydata
s:0_8000-9fff r:core_node3 x:mydata_0_8000-9fff_replica_t1]
o.a.s.c.ShardLeaderElectionContext Replaying tlog before become new leader


myapp-data-solr-0 
2021-02-12 19:36:05.966 WARN 
(recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr
x:mydata_0_8000-9fff_replica_t1 c:mydata s:0_8000-9fff
r:core_node3) [c:mydata s:0_8000-9fff r:core_node3
x:mydata_0_8000-9fff_replica_t1] o.a.s.u.UpdateLog Starting log
replay
tlog{file=/opt/solr/volumes/data1/mydata_0_8000-9fff/tlog/tlog.0003525
refcount=2}  active=false starting pos=0 inSortedOrder=true


myapp-data-solr-0 
2021-02-12 22:13:03.084 INFO 
(recoveryExecutor-96-thread-1-processing-n:myapp-data-solr-0.myapp-data-solr-headless:8983_solr
x:mydata_0_8000-9fff_replica_t1 c:mydata s:0_8000-9fff
r:core_node3) [c:mydata s:0_8000-9fff r:core_node3
x:mydata_0_8000-9fff_replica_t1] o.a.s.u.UpdateLog log replay status
tlog{file=/opt/solr/volumes/data1/mydata_0_8000-9fff/tlog/tlog.0003525
refcount=3} active=false starting pos=0 current pos=27101174167 current
size=33357447222 % read=81.0


myapp-data-solr-0
2021-02-12 22:13:06.602 ERROR (indexFetcher-4092-thread-1) [ ]
o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: mydata slice: 0_8000-9fff saw
state=DocCollection(mydata//collections/mydata/state.json/750)={
"pullReplicas":"0", "replicationFactor":"0", "shards":{
"0_8000-9fff":{ "range":null, "state":"active", "replicas":{
"core_node3":{ "core":"mydata_0_8000-9fff_replica_t1",
"base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr;,
"node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{
"core":"mydata_0_8000-9fff_replica_t2",
"base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr;,
"node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false",
"property.preferredleader":"true"}, "core_node6":{
"core":"mydata_0_8000-9fff_replica_t4",
"base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr;,
"node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false"}}},


myapp-data-solr-0
2021-02-12 22:45:51.600 ERROR (indexFetcher-4092-thread-1) [ ]
o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: mydata slice: 0_8000-9fff saw
state=DocCollection(mydata//collections/mydata/state.json/754)={
"pullReplicas":"0", "replicationFactor":"0", "shards":{
"0_8000-9fff":{ "range":null, "state":"active", "replicas":{
"core_node3":{ "core":"mydata_0_8000-9fff_replica_t1",
"base_url":"http://myapp-data-solr-0.myapp-data-solr-headless:8983/solr;,
"node_name":"myapp-data-solr-0.myapp-data-solr-headless:8983_solr",
"state":"active", "type":"TLOG", "force_set_state":"false"}, "core_node5":{
"core":"mydata_0_8000-9fff_replica_t2",
"base_url":"http://myapp-data-solr-1.myapp-data-solr-headless:8983/solr;,
"node_name":"myapp-data-solr-1.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false",
"property.preferredleader":"true"}, "core_node6":{
"core":"mydata_0_8000-9fff_replica_t4",
"base_url":"http://myapp-data-solr-2.myapp-data-solr-headless:8983/solr;,
"node_name":"myapp-data-solr-2.myapp-data-solr-headless:8983_solr",
"state":"down", "type":"TLOG", "force_set_state":"false"}}},...


myapp-data-solr-1
2021-02-12 22:45:56.600 ERROR (indexFetcher-4092-thread-1) [ ]
o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: mydata slice: 0_8000-9fff saw

Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-13 Thread mmb1234


By tracing the output in the log files we see the following sequence.

Overseer role list has POD-1, POD-2, POD-3 in that order
POD-3 has 2 shard leaders.
POD-3 restarts.

A) Logs for the shard whose leader moves successfully from POD-3 to POD-1
On POD-1: o.a.s.c.ShardLeaderElectionContext Replaying tlog before become
new leader
On POD-1: o.a.s.u.UpdateLog Starting log replay
On POD-1: o.a.s.u.UpdateLog Log replay finished.
On POD-1: o.a.s.c.SolrCore  Registered new searcher autowarm time: 0 ms
On POD-1: o.a.s.c.ShardLeaderElectionContextBase Creating leader
registration node ... after winning as ...
On POD-1: o.a.s.c.ShardLeaderElectionContext I am the new leader:


B) Logs for the shard whose leader does not move from POD-3 to POD-1
On POD-1: o.a.s.c.ShardLeaderElectionContext Replaying tlog before become
new leader
On POD-1: o.a.s.u.UpdateLog Starting log replay...
On POD-1: o.a.s.h.ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms

< POD-3 is back up at this time >
On POD-3: o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms

It was odd to see no INFO, WARN or ERROR log message after "Starting log
replay" on POD-1 for the shard which did not get its leader elected.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Down Replica is elected as Leader (solr v8.7.0)

2021-02-10 Thread mmb1234
Hello,

On reboot of one of the Solr nodes in the cluster, we often see a
collection's shards with:
1. a LEADER replica in DOWN state, and/or
2. a shard with no LEADER

Output from /solr/admin/collections?action=CLUSTERSTATUS is below.

Even after 5 to 10 minutes, the collection often does not recover. It is
unclear why this is happening and what we can try to prevent or remedy it.

PS: perReplicaState=true in Solr v8.8.0 didn't work well because, after a
rebalance, all replicas somehow get a "leader:true" status even though
states.json looked OK.
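
For reference, perReplicaState is passed at collection-creation time via the
standard flag; a sketch with placeholder names and counts:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=datacore&numShards=8&tlogReplicas=3&perReplicaState=true'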

{
  "responseHeader": {
"status": 0,
"QTime": 2
  },
  "cluster": {
"collections": {
  "datacore": {
"pullReplicas": "0",
"replicationFactor": "0",
"shards": {
  "__": {
"range": null,
"state": "active",
"replicas": {
  "core_node1": {
"core": "datacore____replica_t187",
"base_url": "http://solr-0.solr-headless:8983/solr;,
"node_name": "solr-0.solr-headless:8983_solr",
"state": "down",
"type": "TLOG",
"force_set_state": "false",
"property.preferredleader": "true",
"leader": "true"
  },
  "core_node2": {
"core": "datacore____replica_t188",
"base_url": "http://solr-1.solr-headless:8983/solr;,
"node_name": "solr-1.solr-headless:8983_solr",
"state": "active",
"type": "TLOG",
"force_set_state": "false"
  },
  "core_node3": {
"core": "datacore____replica_t189",
"base_url": "http://solr-2.solr-headless:8983/solr;,
"node_name": "solr-2.solr-headless:8983_solr",
"state": "active",
"type": "TLOG",
"force_set_state": "false"
  }
}
  },
  "__j": {
"range": null,
"state": "active",
"replicas": {
  "core_node19": {
"core": "datacore___j_replica_t187",
"base_url": "http://solr-0.solr-headless:8983/solr;,
"node_name": "solr-0.solr-headless:8983_solr",
"state": "down",
"type": "TLOG",
"force_set_state": "false",
"property.preferredleader": "true"
  },
  "core_node20": {
"core": "datacore___j_replica_t188",
"base_url": "http://solr-1.solr-headless:8983/solr;,
"node_name": "solr-1.solr-headless:8983_solr",
"state": "active",
"type": "TLOG",
"force_set_state": "false"
  },
  "core_node21": {
"core": "datacore___j_replica_t189",
"base_url": "http://solr-2.solr-headless:8983/solr;,
"node_name": "solr-2.solr-headless:8983_solr",
"state": "active",
"type": "TLOG",
"force_set_state": "false"
  }
}
  },
  "__": {
"range": null,
"state": "active",
"replicas": {
  "core_node4": {
"core": "datacore____replica_t91",
"base_url": "http://solr-0...



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread mmb1234
> Does this happen on a warm searcher (are subsequent requests with no
intervening updates _ever_ fast?)?

Subsequent response times are very fast if the searcher remains open. As a
control test, I faceted on the same field that I used in the q param.

1. Start solr

2. Execute q=resultId:x=0
=>  500ms

3. Execute q=resultId:x=0
=> 40,000ms

4. Execute q=resultId:x=0
=>  150ms

5. Execute q=processId:x=0
=>   2,500ms

6. Execute q=processId:x=0
=> 200ms


curl
'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x=0'
-d '
json.facet={
categories:{
  "type": "terms",
  "field" : "processId",
  "limit" : 1
}
}'



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread mmb1234
OK, I'll try that. Meanwhile, the query on resultId gets a sub-second
response, but the immediately following faceting query takes 40+ secs. The
core has 185 million docs and a 63 GB index.

curl
'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x=0'
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":558,
"params":{
  "q":"resultId:x",
  "cache":"false",
  "rows":"0"}},
  "response":{"numFound":943,"start":0,"numFoundExact":true,"docs":[]
  }}


curl
'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x=0'
-d '
json.facet={
categories:{
  "type": "terms",
  "field" : "resultId",
  "limit" : 1
}
}'
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":43834,
"params":{
  "q":"resultId:x",
  "json.facet":"{\ncategories:{\n  \"type\": \"terms\",\n 
\"field\" : \"resultId\",\n  \"limit\" : 1\n}\n}",
  "cache":"false",
  "rows":"0"}},
  "response":{"numFound":943,"start":0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":943,
"categories":{
  "buckets":[{
  "val":"x",
  "count":943}]}}}



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread mmb1234
Hello,

I am seeing very slow responses from JSON faceting against a single core
(though the core is a shard leader in a collection).

Fields processId and resultId are non-multivalued, indexed, docValues string
fields (not text).

Soft commit = 5 sec (openSearcher=true) and hard commit = 10 sec, because new
docs are constantly being indexed, 95% new and 5% overwritten (overwrite=true;
no atomic updates). Caches are not considered useful due to the commit
frequency.
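
For reference, the commit settings described above correspond to a
solrconfig.xml stanza roughly like this (openSearcher=false on the hard commit
is an assumption; only the soft commit is described as opening a searcher):

<autoCommit>
  <maxTime>10000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>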

Solr is v8.7.0 on openjdk11.

Is there any way to improve json facet QTime?

## query only
curl
'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x=0'
-d '
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":552,
"params":{
  "q":"processId:-xxx-xxx-xxx-x",
  "cache":"false",
  "rows":"0"}},
  "response":{"numFound":231311,"start":0,"numFoundExact":true,"docs":[]
  }}

## json facet takes 46secs
curl
'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x=0'
-d '
json.facet={
categories:{
  "type": "terms",
  "field" : "resultId",
  "limit" : 1
}
}'
{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":46972,
"params":{
  "q":"processId:-xxx-xxx-xxx-x",
  "json.facet":"{categories:{  \"type\": \"terms\", 
\"field\" : \"resultId\",  \"limit\" : 1}}",
  "rows":"0"}},
  "response":{"numFound":231311,"start":0,"numFoundExact":true,"docs":[]
  },
  "facets":{
"count":231311,
"categories":{
  "buckets":[{
  "val":"x",
  "count":943}]}}}


## visualvm CPU sampling: almost all time is spent in Lucene:

org.apache.lucene.util.PriorityQueue.downHeap() 23,009 ms
org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next()
13,268 ms
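
Given that profile, one documented knob that may be worth testing is the
JSON Facet API's method hint on the terms facet, e.g. dvhash, which hashes
only the values present in the matched docs instead of walking the whole
terms dictionary; whether it helps at this cardinality is an open question.
A sketch, mirroring the request above:

curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0' -d '
json.facet={
  categories:{
    "type"   : "terms",
    "field"  : "resultId",
    "limit"  : 1,
    "method" : "dvhash"
  }
}'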



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Dynamic schema failure for child docs not using "_childDocuments_" key

2020-05-05 Thread mmb1234
I am running into an exception where creating child docs fails unless the
field already exists in the schema (the stack trace is at the bottom of this
post). My Solr is v8.5.1, running in standard/non-cloud mode.

$> curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/mycore/update' --data-binary '[{
  "id": "3dae27db6ee43e878b9d0e8e",
  "phone": "+1 (123) 456-7890",
  "myChildDocuments": [{
"id": "3baf27db6ee43387849d0e8e",
 "enabled": false
   }]
}]'

{
  "responseHeader":{
"status":400,
"QTime":285},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"ERROR: [doc=3baf27db6ee43387849d0e8e] unknown field 'enabled'",
"code":400}}


However, using the "_childDocuments_" key, it succeeds and the child doc
fields get created in the managed-schema.

$> curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/mycore/update' --data-binary '[{
  "id": "6dae27db6ee43e878b9d0e8e",
  "phone": "+1 (123) 456-7890",
  "_childDocuments_": [{
"id": "6baf27db6ee43387849d0e8e",
 "enabled": false
   }]
}]'

{
  "responseHeader":{
"status":0,
"QTime":285}}


== stacktrace ==
2020-05-06 01:01:26.762 ERROR (qtp1569435561-19) [   x:standalone]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR:
[doc=3baf27db6ee43387849d0e8e] unknown field 'enabled'
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:100)
at
org.apache.solr.update.AddUpdateCommand.lambda$null$0(AddUpdateCommand.java:224)
at
java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at
java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1631)
at
java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
at
java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
at
java.base/java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
at
java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
at 
java.base/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:282)
at
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
at
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1284)
at
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1277)
at
org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:975)
at
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.NestedUpdateProcessorFactory$NestedUpdateProcessor.processAdd(NestedUpdateProcessorFactory.java:79)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
at 
org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
at
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at
org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:477)
   

Using Deep Paging with Graph Query Parser

2019-12-08 Thread mmb1234


Is there a way to combine deep paging's cursor feature with the graph query
parser?

Background:
I have a hierarchical data structure that is split into N different flat JSON
docs and updated (inserted) into Solr with from/to fields. Using the from/to
join syntax, a graph query is needed, since some queries need parents (given
certain child filters) and other queries need children (given certain parent
filters).

The graph query parser (though not distributed) looks ideal. However, I need
pagination to iterate over the results. Hints for custom code are OK, since
the current Solr install already has lots of admin, core and collection
handlers running.
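
To make the ask concrete, the request shape in question is roughly the
following (field names, filter and uniqueKey below are placeholders; whether
the graph parser's results stay stable enough for cursorMark is exactly the
open question):

curl 'http://localhost:8983/solr/mycore/select' \
  --data-urlencode 'q={!graph from=parent_id to=id}enabled:true' \
  --data-urlencode 'sort=id asc' \
  --data-urlencode 'rows=100' \
  --data-urlencode 'cursorMark=*'

Each response returns a nextCursorMark, which would be fed back in as
cursorMark on the next request.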

-M



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr core corrupted for version 7.4.0, please help!

2018-08-24 Thread mmb1234


Thank you for https://issues.apache.org/jira/browse/SOLR-12691.

I see it's marked as minor. Can we bump up the priority please ?

The example of 2 cores ingesting + transientCacheSize == 1 was provided for
reproduction reference only, and is not running in production.

The production setup on AWS uses the biggest non-bare-metal EC2 instances
available, and there are quite a few of them.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr core corrupted for version 7.4.0, please help!

2018-08-22 Thread mmb1234


> Having 100+ cores on a Solr node and a transient cache size of 1

The original post clarified the current state: "we have about 75 cores with
'transientCacheSize' set to 32". If transientCacheSize is increased to match
the current number of cores, we'll only defer the issue; it's going to hit
100s of cores per Solr instance shortly.

The majority of the cores do not get ingest requests, so it would be a waste
to spend memory resources on them; that is why the solr.xml config for all
cores is:

transient=true
loadOnStartup=false
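
In a stock core-discovery setup these flags live in each core's
core.properties (legacy solr.xml core definitions accept the same attributes);
a minimal sketch, with a placeholder core name:

name=tenant_core_001
transient=true
loadOnStartup=false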

The JVM heap is 30 GB on 64-bit, inside a Solr Docker container running in
Kubernetes on AWS.

Is there anything else that can be done to avoid the core corruption, or any
pointers to what could be causing the index segment loss?





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr core corrupted for version 7.4.0, please help!

2018-08-22 Thread mmb1234
>  The problem here is that you may have M requests queued up for the _same_
core, each with a new update request.

With transientCacheSize == 1, as soon as the update request for Core B is
received, Core B encounters data corruption, not Core A. Both Core A and Core
B are receiving update requests.

I am presuming this happens on core close, after the ref count is decremented
for Core B in order to process the request for Core A.

In production, the number of cores per Solr instance is 100+.
The transient cache is sized so that the JVM heap and OS file cache are
aligned to a manageable number of open cores we can serve (isOpen != false).

So when a random update request comes in, corruption is seen. Very painful,
since a restore then needs to be invoked. This is happening 5 to 10 times a
day.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Is commit for SOLR-11444 for SolrCloudClient#sendRequest(SolrRequest, List) ok?

2018-06-05 Thread mmb1234
In the git commit mentioned below, I see CloudSolrClient has been changed to
generate Solr core URLs differently than before.

In the previous version, Solr URLs were computed using "url =
coreNodeProps.getCoreUrl()".
This concatenated "base_url" + "core" name from the clusterstate for a
tenant's shards,
re: http://<host>:<port>/solr/<core_name>

Now, the Solr URL is computed using "url =
ZkCoreNodeProps.getCoreUrl(nodeProps.getStr(ZkStateReader.BASE_URL_PROP),
joinedInputCollections)".
This concatenates "base_url" + "collection" name from the clusterstate for a
tenant's shards,
re: http://<host>:<port>/solr/<collection_name>

This change broke a custom client which previously worked for me. Can someone
provide guidance?

Thank you.

> commit da71587196eccefd8b58923dc162916c808762a9
> Author: David Smiley 
> Date:   Thu Oct 19 00:02:24 2017 -0400
>
>SOLR-11444: Improve consistency of collection alias handling and
> collection list references.
>Other refactorings of nearby code too.
>
>(cherry picked from commit e001f35)
>
> diff --git
> a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
> b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
> index 6a208f48c0..f63eedd742 100644
> ---
> a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
> +++
> b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
>...
>-  if (!liveNodes.contains(coreNodeProps.getNodeName())
>-  || Replica.State.getState(coreNodeProps.getState()) !=
Replica.State.ACTIVE) continue;
>-  if (nodes.put(node, nodeProps) == null) {
>-if (!sendToLeaders || coreNodeProps.isLeader()) {
>-  String url;
>-  if (reqParams.get(UpdateParams.COLLECTION) == null) {
>-url =
ZkCoreNodeProps.getCoreUrl(nodeProps.getStr(ZkStateReader.BASE_URL_PROP),
collection);
>-  } else {
>-url = coreNodeProps.getCoreUrl();
> -  }
> -  urlList2.add(url);

+  if (!liveNodes.contains(node) // Must be a live node to continue
+  || Replica.State.getState(coreNodeProps.getState()) !=
Replica.State.ACTIVE) // Must be an ACTIVE replica to continue
+continue;
+  if (seenNodes.add(node)) { // if we haven't yet collected a URL
to this node...
+String url =
ZkCoreNodeProps.getCoreUrl(nodeProps.getStr(ZkStateReader.BASE_URL_PROP),
joinedInputCollections);
+if (sendToLeaders && coreNodeProps.isLeader()) {
+  theUrlList.add(url); // put leaders here eagerly (if
sendToLeader mode)




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-19 Thread mmb1234

FYI. This issue went away after solrconfig.xml was tuned.

"Hard commits blocked | non-solrcloud v6.6.2" thread has the details.
http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-19 Thread mmb1234

The below solrconfig.xml settings resolved the TIMED_WAIT in
ConcurrentMergeScheduler.doStall(). Thanks to Shawn and Erik for their
pointers.

  
...

   30
   100
   30.0


   18
   6

300
...
  

  

  ${solr.autoCommit.maxTime:3}
  false


  ${solr.autoSoftCommit.maxTime:1}
  ${solr.autoSoftCommit.maxDocs:-1}
  true

  



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-11 Thread mmb1234

> https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_256

I had uploaded the output at the above link.

The OS has no swap configured. There are other processes on the host, but
they use <1 GB and <5% CPU cumulatively, and none run inside the Docker
container, as `top` shows. The Solr JVM heap is 30 GB; JVM monitoring shows
the heap never went over 10 GB.

Also, there were no queries at that time. A few hours earlier I did run quite
a few heavy-hitting distributed paging and JSON facet queries.

The machine is 32vCPU and 240GB memory.

Please let me know if vmstat and/or iostat output would be helpful. I didn't
save the GC logs at that time but can capture them if you like.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-10 Thread mmb1234
Hi Shawn, Erik

> updates should slow down but not deadlock. 
The net effect is the same. As the CLOSE_WAITs increase, the JVM ultimately
stops accepting new socket requests, at which point `kill <pid>` is the only
option.

This means that if the replication handler is invoked (which sets the
deletion policy), the number of blocked threads rises even faster and the
system fails even faster.

Each Solr POST is a blocking call, hence the CLOSE_WAITs. Also, each gzipped
POST body is a JSON array of 100 JSON objects (1 JSON doc = 1 Solr doc).

All custom AbstractSolrEventListener listeners were disabled so that no
post-commit events are processed. Those threads are in WAITING state, which is
OK.

I then ran /solr/58f449cec94a2c75-core-256/admin/luke at 10:30pm PST.

It showed "lastModified: 2018-02-11T04:46:54.540Z", indicating commits had
been blocked for about 2 hours.
Hard commit is set to 10 secs in solrconfig.xml.

Other cores are also blocked for a while.

Thread dump and top output are from that condition are at
https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_256

netstat CLOSE_WAIT are correlated with DirectUpdateHandler2 /
UpdateRequestProcessor.processAdd() requests.

solr [ /tmp ]$ sudo netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq
-c; TZ=PST8PDT date;
   7728 CLOSE_WAIT -
 65 ESTABLISHED -
  1 FIN_WAIT2 -
  1 Foreign Address
  6 LISTEN -
 36 TIME_WAIT -
  1 established)
Sat Feb 10 22:27:07 PST 2018

http://fastthread.io shows 6,700 threads in TIMED_WAIT

https://jstack.review shows 6584 threads with this stack
at java.lang.Object.wait(Native Method)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
at
org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
...
at
org.apache.solr.update.DirectUpdateHandler2.allowDuplicateUpdate(DirectUpdateHandler2.java:276)
...
at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
...


Only `top` available on Photon OS is
https://github.com/vmware/photon/blob/1.0/SPECS/procps-ng/procps-ng.spec.
Those screenshots are attached.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-09 Thread mmb1234
Ran /solr/58f449cec94a2c75-core-248/admin/luke at 7:05pm PST.

It showed "lastModified: 2018-02-10T02:25:08.231Z", indicating commits had
been blocked for about 41 mins.
Hard commit is set to 10 secs in solrconfig.xml.

Other cores are also now blocked.

https://jstack.review analysis of the thread dump says "Potential Deadlock"
for three cores.

0x00027afca8e0 
ConcurrentMergeScheduler 
Possible deadlock
Held by commitScheduler-26-thread-1 awaiting notification , holding [
0x00027afca8e0 ]

0x00027fd19080 
ConcurrentMergeScheduler 
Possible deadlock   
Held by commitScheduler-19-thread-1 awaiting notification , holding [
0x00027fd19080 0x0002615ada20 ]

0x000280da09c8 
ConcurrentMergeScheduler 
Possible deadlock   
Held by commitScheduler-24-thread-1 awaiting notification , holding [
0x000280da09c8 0x00028e0cbf10 ]

Thread dump and top output from that condition are available at
https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_248/solr58f449cec94a2c75_core_248


848 thread(s) have this kind of stack:
---
at java.lang.Object.wait(Native Method)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
at
org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
...
 at
org.apache.solr.update.DirectUpdateHandler2.allowDuplicateUpdate(DirectUpdateHandler2.java:276)
...



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-09 Thread mmb1234

Shawn, Eric, 

Were you able to look at the thread dump ?
https://github.com/mohsinbeg/datadump/blob/master/threadDump-7pjql_1.zip

Or is there additional data I can provide?





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-08 Thread mmb1234

> Setting openSearcher to false on autoSoftCommit makes no sense.

That was my mistake in my solrconfig.xml. Thank you for identifying it. I
have corrected it.

I then removed my custom element from my solrconfig.xml, and both the hard
commit and /solr/admin/core hang issues seemed to go away for a couple of
hours.

Then I increased the ingestion load and the issue came back. After taking a
heap dump, I now see lots of these, and commits are hanging again. The full
thread dump is at
https://github.com/mohsinbeg/datadump/blob/master/threadDump-7pjql_1.zip

"commitScheduler-24-thread-1" #187 prio=5 os_prio=0 tid=0x7f7c9401d000
nid=0x369 runnable [0x7f7c1a664000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00026b4317f0> (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
at
org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:176)
at
org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:121)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:618)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Hard commits blocked | non-solrcloud v6.6.2

2018-02-08 Thread mmb1234
> If you issue a manual commit 
> (http://blah/solr/core/update?commit=true) what happens? 

That call never returned back to client browser.

So I also tried a core reload and did capture in the thread dump. That too
never returned.

"qtp310656974-1022" #1022 prio=5 os_prio=0 tid=0x7ef25401 nid=0x6bf
runnable [0x7f0138394000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000271e49b00> (a
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(AbstractQueuedSynchronizer.java:934)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1247)
at
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1115)
at
org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:176)
at
org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:227)
at org.apache.solr.core.SolrCore.reload(SolrCore.java:644)
- locked <0x000271f08ac8> (a java.lang.Object)
at
org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1181)
at
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$2(CoreAdminOperation.java:111)
at
org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$95/234250762.execute(Unknown
Source)
at
org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)
at
org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
at
org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:748)
at
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:729)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:510)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.jav\
a:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Hard commits blocked | non-solrcloud v6.6.2

2018-02-07 Thread mmb1234

I am seeing that after some time, hard commits in all my Solr cores stop, and
each core's searcher has an "opened at" date from hours ago, even though the
cores continue to ingest data successfully (index size increasing
continuously).

http://localhost:8983/solr/#/x-core/plugins?type=core=searcher
"openedAt: 2018-02-08T01:52:24.950Z"

Is there something I am doing incorrectly in my config ?

I setup my solrconfig.xml without  for my "bulk indexing" use
cases.

solrConfig.xml:-
  
none
200

  1
  5

  

  

  ${solr.autoCommit.maxTime:1}
  true


  ${solr.autoSoftCommit.maxTime:-1}
  ${solr.autoSoftCommit.maxDocs:-1}
  false

  

Thread dump:-
"commitScheduler-20-thread-1" #391 prio=5 os_prio=0 tid=0x7ef194011000
nid=0x43a in Object.wait() [0x7ec99533f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doStall(ConcurrentMergeScheduler.java:616)
- eliminated <0x00027005a0f0> (a
org.apache.lucene.index.ConcurrentMergeScheduler)
at
org.apache.lucene.index.ConcurrentMergeScheduler.maybeStall(ConcurrentMergeScheduler.java:602)
- locked <0x00027005a0f0> (a
org.apache.lucene.index.ConcurrentMergeScheduler)
at
org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:524)
- locked <0x00027005a0f0> (a
org.apache.lucene.index.ConcurrentMergeScheduler)
at
org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2083)
at
org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:487)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
at
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
at
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1980)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2189)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1926)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:675)
- locked <0x00026f20bb88> (a java.lang.Object)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-07 Thread mmb1234
> Maybe this is the issue:
https://github.com/eclipse/jetty.project/issues/2169

Looks like it is the issue. (I've redacted IP addresses below for security
reasons.)

solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq
-c
   8425 CLOSE_WAIT -
 92 ESTABLISHED -
  1 FIN_WAIT2 -
  1 Foreign Address
  6 LISTEN -
712 TIME_WAIT -
  1 established)

solr [ /opt/solr ]$ echo "run -b
org.eclipse.jetty.server:context=HTTP/1.1@63e2203c,id=0,type=serverconnector
dump " | java
 -jar jmxterm-1.0.0-uber.jar -l localhost:18983 -v silent -n > jettyJmx.out

solr [ /opt/solr ]$ netstat -anp | grep CLOSE_WAIT | head -n1
tcp        1      0 10.xxx.x.xx:8983        10.xxx.x.xxx:53873
CLOSE_WAIT  1/java

solr [ /opt/solr ]$ grep "10.xxx.x.xxx:53873" jettyJmx.out
 |   |   +-
SelectionKey@5ee4ef9f{i=0}->SelectChannelEndPoint@69feb476{/10.xxx.x.xxx:53873<->8983,Open,in,out,-,-,1712/5000,HttpConnection@c93d7fa}{io=0/0,kio=0,kro=1}

solr [ /opt/solr ]$ cat jettyJmx.out | grep 8983 | grep
SelectChannelEndPoint | grep "{io=0/0,kio=0,kro=1}" | wc -l
8220





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-05 Thread mmb1234
Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169

I have noticed that when the number of HTTP requests/sec is increased,
CLOSE_WAITs increase linearly until Solr stops accepting socket connections.
The netstat output is:
$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
   9784 CLOSE_WAIT 1/java
  1 ESTABLISHED -
  3 ESTABLISHED 1/java
  1 Foreign Address
  2 LISTEN -
  4 LISTEN 1/java
 27 TIME_WAIT -
  1 established)


The custom client code is:
RequestConfig config = RequestConfig.custom()
.setConnectionRequestTimeout(4000)
.setConnectTimeout(4000)
.setSocketTimeout(4000)
.build();

ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor();
PoolingNHttpClientConnectionManager cmAsync =
    new PoolingNHttpClientConnectionManager(ioReactor);
cmAsync.setMaxTotal(1);
cmAsync.setDefaultMaxPerRoute(1000);

asynClient = HttpAsyncClients.custom()
.setDefaultRequestConfig(config)
.setConnectionManager(cmAsync).build();
asynClient.start();

executor = Executors.newScheduledThreadPool(1);
idleConnectionFuture = executor.scheduleAtFixedRate(() -> {
cmAsync.closeExpiredConnections();
cmAsync.closeIdleConnections(1, TimeUnit.SECONDS);
}, 1, 1, TimeUnit.SECONDS);


Also, /solr/admin/cores takes a very long time to respond (QTime = 300 secs+),
so curl with a timeout also causes an additional CLOSE_WAIT (as expected):
curl -m 5 'http://<host>:<port>/solr/admin/cores'

The custom handler QTime is ~150 ms or lower, even under max load, for the
active cores.
 
Note there are 100s of Solr cores on each Solr JVM, but only a few need to be
open at any given time in each JVM, to avoid heap memory bloat. solr.xml
therefore has this setting (because we're not running SolrCloud):
  ${transientCacheSize:30}
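
In a standard solr.xml that property sits directly under the <solr> element,
roughly:

<solr>
  <int name="transientCacheSize">${transientCacheSize:30}</int>
</solr>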

Also, the updateLog was removed from solrconfig.xml because we saw 1000s of
threads BLOCKED on VersionBucket even with:
  ${solr.ulog.numVersionBuckets:655360}
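
For reference, the stanza being referred to is the stock updateLog config,
roughly:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:655360}</int>
</updateLog>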

The side effect is lots of merges [we'll tackle that when Solr stops dying
:-) ].



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: 9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread mmb1234
> You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> but the JVM argument list includes "-Xmx512m" which is a 512MB heap

My typos. They're 6.6.2 and -Xmx30g respectively.

> many open connections causes is a large number of open file handles,

solr [ /opt/solr/server/logs ]$ sysctl -a | grep vm.max_map_count
vm.max_map_count = 262144

The only thing I notice right before the Solr shutdown messages in solr.log
is that the /update QTime goes from ~500 ms to ~25.

There is an automated health check that issues a kill on the process due to
an HTTP connection timeout.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


9000+ CLOSE_WAIT connections in solr v6.2.2 causing it to "die"

2018-02-02 Thread mmb1234
Hello,

In our Solr non-cloud env, we are seeing lots of CLOSE_WAITs, causing the JVM
to stop "working" within 3 mins of Solr start.

solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep
10.xxx.xxx.xxx | wc -l
9453

The only option then is `kill -9`, because even `jcmd <pid> Thread.print` is
unable to connect to the JVM. The problem can be reproduced at will.

Any suggestions what could be causing this or the fix ?

Details of the system are as follows; it has been set up for "bulk indexing".

-

Solr / server:
v6.2.2 non-solrcloud in a docker with kubernetes
java: 1.8.0_151 25.151-b12 HotSpot 64bit | Oracle
jvm: heap 30GB
os: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64
GNU/Linux
os memory: 230GB | no swap configured
os cpu: 32vCPU
jvm:  "-XX:+UseLargePages",
"-XX:LargePageSizeInBytes=2m",
"-Xms512m",
"-Xmx512m",
"-XX:NewRatio=3",
"-XX:SurvivorRatio=4",
"-XX:TargetSurvivorRatio=90",
"-XX:MaxTenuringThreshold=8",
"-XX:+UseConcMarkSweepGC",
"-XX:+UseParNewGC",
"-XX:ConcGCThreads=4",
"-XX:ParallelGCThreads=4",
"-XX:+CMSScavengeBeforeRemark",
"-XX:PretenureSizeThreshold=64m",
"-XX:+UseCMSInitiatingOccupancyOnly",
"-XX:CMSInitiatingOccupancyFraction=50",
"-XX:CMSMaxAbortablePrecleanTime=6000",
"-XX:+CMSParallelRemarkEnabled",
"-XX:+ParallelRefProcEnabled",

non-cloud solr.xml:
transientCacheSize = 30
shareSchema = true
Also only 4 cores are POSTed to.

Client / java8 app:
An AsyncHTTPClient POST-ing gzip payloads.
PoolingNHttpClientConnectionManager maxtotal=10,000 and maxperroute=1000)
ConnectionRequestTimeout = ConnectTimeout = SocketTimeout = 4000 (4 secs)

Gzip payloads:
About 800 JSON messages like this:
[
  {id:"abcdefx", datetimestamp:"xx", key1:"xx", key2:"z", ...},
  ...
]

POST rate:
Each of the 4 Solr cores receives ~32 payloads per second from the custom Java
app (plugin handler metrics in Solr report the same).
Approx ~102,000 docs per sec in total (32 payloads x 800 docs x 4 Solr cores).

Document uniqueness:
No doc or id is ever repeated or concurrently sent.
No atomic updates are needed (overwrite=false in AddUpdateCommand is set in
the Solr handler).

Solrconfig.xml:
For the bulk indexing requirement, updateLog and soft commit were minimized /
removed.
  
none
200

  1
  6

  
  

  ${solr.autoCommit.maxTime:1}
  true


  ${solr.autoSoftCommit.maxTime:-1}
  ${solr.autoSoftCommit.maxDocs:-1}
  false

  

-M



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html