RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread Will Martin
The easiest, and coarsest, measure of response time [not service time in a 
distributed system] can be picked up in your localhost_access.log file.
You're using Tomcat, right? Look up AccessLogValve in the docs and server.xml. 
You can add configuration to report the payload and the time to service the 
request without touching any code.
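For example, a valve entry along these lines in server.xml adds the payload size and service time to each log line (a sketch; the className and the %B/%D pattern codes are from Tomcat's AccessLogValve docs — %B is bytes sent, %D is time to service the request in milliseconds — but the directory/prefix values here are assumptions to adjust for your install):

```xml
<!-- Inside the <Host> element of server.xml. -->
<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs"
       prefix="localhost_access_log." suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %B %D" />
```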

Queueing theory is what Otis was talking about when he said you've saturated 
your environment. In AWS, people just auto-scale up and don't worry about where 
the load comes from; it's dumb if it happens more than 2 times. Capacity 
planning is tough; let's hope it doesn't disappear altogether.

G'luck


-Original Message-
From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of 
synch.

Good point about ZK logs, I do see the following exceptions intermittently in 
the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client 
/xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection 
from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new 
session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 
0x14949db9da40037, likely client has closed socket
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)

For queuing theory, I don't know of any way to see how fast the requests are 
being served by SolrCloud, or whether a queue builds up when the service rate 
is slower than the rate of requests from the incoming multiple threads.
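One rough way to answer this from the access log alone is back-of-the-envelope queueing arithmetic: compare the request arrival rate against the aggregate service rate implied by the logged response times. A minimal sketch, with all input numbers hypothetical (read the real ones out of localhost_access.log, e.g. the %D field of Tomcat's AccessLogValve):

```java
// Sketch: back-of-the-envelope saturation check from access-log numbers.
// All inputs below are hypothetical; substitute measured values.
public class QueueCheck {

    // Utilization rho = arrival rate / total service rate. If rho >= 1 the
    // request queue grows without bound; values near 1 mean saturation.
    static double utilization(double arrivalsPerSec, double meanServiceMs,
                              int concurrentWorkers) {
        double perWorkerRate = 1000.0 / meanServiceMs; // requests/sec per worker
        return arrivalsPerSec / (perWorkerRate * concurrentWorkers);
    }

    public static void main(String[] args) {
        // e.g. the 75 indexing threads producing ~10 requests/sec in total,
        // 40 ms mean service time, 8 request-handling threads (all assumed):
        double rho = utilization(10.0, 40.0, 8);
        System.out.println("utilization = " + rho); // well under 1.0 here
    }
}
```

Sub-second response times in the log do not by themselves rule out saturation; it is the arrival rate relative to the service rate that decides whether a queue forms.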

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote:

 2 naïve comments, of course.



 -  Queuing theory

 -  Zookeeper logs.



 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 
 replicas out of synch.



 Please find the clusterstate.json attached.

 Also, in this case at least, the Shard1 replicas are out of sync, as can 
 be seen below.

 Shard 1 replica 1 *does not* return a result with distrib=false.

 Query:
 http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true



 Result:

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">1</int>
     <lst name="params">
       <str name="q">*:*</str>
       <str name="shards.info">true</str>
       <str name="distrib">false</str>
       <str name="debug">track</str>
       <str name="wt">xml</str>
       <str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str>
     </lst>
   </lst>
   <result name="response" numFound="0" start="0"/>
   <lst name="debug"/>
 </response>



 Shard1 replica 2 *does* return the result with distrib=false.

 Query:
 http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

 Result:

 <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">1</int>
     <lst name="params">
       <str name="q">*:*</str>
       <str name="shards.info">true</str>
       <str name="distrib">false</str>
       <str name="debug">track</str>
       <str name="wt">xml</str>
       <str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str>
     </lst>
   </lst>
   <result name="response" numFound="1" start="0">
     <doc>
       <str name="thingURL">http://www.xyz.com</str>
       <str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str>
       <long name="_version_">1483135330558148608</long>
     </doc>
   </result>
   <lst name="debug"/>
 </response>
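The manual comparison above can be scripted: hit each replica of the same shard with distrib=false and compare numFound. A hedged sketch (hostnames and the collection name are the anonymized ones from this thread; the regex scrape of numFound is a convenience for the XML responses shown, not an official SolrJ API):

```java
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Query each replica of a shard with distrib=false and report its numFound.
// Replicas of the same shard should all report the same number.
public class ReplicaCheck {
    private static final Pattern NUM_FOUND = Pattern.compile("numFound=\"(\\d+)\"");

    // Extract numFound from a Solr XML response body; -1 if not present.
    static long numFound(String xmlBody) {
        Matcher m = NUM_FOUND.matcher(xmlBody);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) { // no replica URLs given: just demo the parser
            System.out.println(numFound(
                "<result name=\"response\" numFound=\"1\" start=\"0\"/>"));
            return;
        }
        String query = "/solr/dyCollection1/select/?q=*:*"
                + "&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29"
                + "&wt=xml&distrib=false";
        // Pass replica base URLs as arguments,
        // e.g. http://server2.mydomain.com:8082 http://server3.mydomain.com:8082
        for (String base : args) {
            try (Scanner s = new Scanner(new URL(base + query).openStream(),
                    "UTF-8").useDelimiter("\\A")) {
                System.out.println(base + " -> numFound=" + numFound(s.next()));
            }
        }
    }
}
```

Run with the two replica base URLs from the queries above to see the mismatch directly.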



 On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar  
 shalinman...@gmail.com wrote:

 On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote:

  One is not smaller than the other, because the numDocs is the same for 
  both replicas, and essentially they seem to be disjoint sets.
 

 That is strange. Can we see your clusterstate.json? With that, please 
 also

Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread S.L
Will,

I think in one of your other emails (which I am not able to find) you had
asked if I was indexing directly from MapReduce jobs. Yes, I am indexing
directly from the map task, and that is done using SolrJ with a
CloudSolrServer initialized with the ZK ensemble URLs. Do I need to use
something like MapReduceIndexerTool, which I suppose writes to HDFS and
is in a subsequent step moved to the Solr index? If so, why?

I don't use any soft commits, and autocommit every 15 seconds; the snippet
from the configuration can be seen below.

 <autoSoftCommit>
   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
 </autoSoftCommit>

 <autoCommit>
   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
   <openSearcher>true</openSearcher>
 </autoCommit>

I looked at the localhost_access.log file; all the GET and POST requests
have a sub-second response time.





Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread Michael Della Bitta
We index directly from mappers using SolrJ. It does work, but you pay 
the price of having to instantiate all those sockets vs. the way 
MapReduceIndexerTool works, where you're writing to an 
EmbeddedSolrServer directly in the Reduce task.


You don't *need* to use MapReduceIndexerTool, but it's more efficient, 
and if you don't, you then have to make sure to appropriately tune your 
Hadoop implementation to match what your Solr installation is capable of.



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread S.L
I'm using Apache Hadoop and Solr; do I need to switch to Cloudera?


Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread S.L
Yeah, I get that not using the MapReduceIndexerTool could be more resource
intensive, but the way this issue is manifesting, resulting in disjoint
SolrCloud replicas, perplexes me.

While you were tuning your SolrCloud environment to cater to the Hadoop
indexing requirements, did you ever face the issue of disjoint replicas?

Is the MapReduceIndexerTool Cloudera-distro specific? I am using Apache Solr
and Hadoop.

Thanks




Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-28 Thread Michael Della Bitta
No, you do not, although you might consider it, because you'd be getting a 
sort of integrated stack.


But really, the decision to switch to running Solr in HDFS should not be 
taken lightly. Unless you are on a team familiar with running a Hadoop 
stack, or you're willing to devote a lot of effort toward becoming 
proficient with one, I would recommend against it.



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Erick Erickson
OK, clarify a bit more what you're doing with Hadoop. Are you using
the MapReduceIndexerTool? Or are your Hadoop jobs writing directly to
SolrCloud?

How are you measuring "out of sync"? Are you sure that you've
committed? Does "out of sync" mean reporting different result counts?
Different order? Different numbers of deleted docs? Completely
different search results? How do you know? Do you measure with
distrib=false on each one?

Details matter a lot here ;)
Erick

On Sun, Oct 26, 2014 at 9:59 PM, S.L simpleliving...@gmail.com wrote:
 Folks,

 I have posted previously about this. I am using SolrCloud 4.10.1 and have
 a sharded collection with 6 nodes, 3 shards, and a replication factor of 2.

 I am indexing Solr using a Hadoop job. I have 15 Map fetch tasks that
 can each have up to 5 threads, so the load on the indexing side can get
 as high as 75 concurrent threads.

 I am facing an issue where the replicas of a particular shard(s) are
 consistently getting out of sync. Initially I thought this was because I
 was using a custom component, but I did a fresh install, removed the
 custom component, and reindexed using the Hadoop job; I still see the same
 behavior.

 I do not see any exceptions in my catalina.out, like OOM or any other
 exceptions. I suspect this could be because of the multi-threaded
 indexing nature of the Hadoop job. I use CloudSolrServer from my Java code
 to index, and initialize the CloudSolrServer using a 3-node ZK ensemble.

 Does anyone know of any known issues with highly multi-threaded indexing
 and SolrCloud?

 Can someone help? This issue has been slowing things down on my end for a
 while now.

 Thanks and much appreciated!


Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Otis Gospodnetic
Hi,

You may simply be overwhelming your cluster-nodes. Have you checked
various metrics to see if that is the case?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/





Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Thanks Otis,

I have checked the logs, in my case the default catalina.out, and I don't
see any OOMs or any other exceptions.

What other metrics do you suggest?




RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Markus Jelsma
It is an ancient issue. One of the major contributors to the issue was resolved
some versions ago, but we are still seeing it sometimes too; there is nothing to
see in the logs. We ignore it and just reindex.



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Michael Della Bitta

I'm curious, could you elaborate on the issue and the partial fix?

Thanks!



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Markus,

I would like to ignore it too, but what's happening is that there is a
lot of discrepancy between the replicas; queries like
q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which
replica the request goes to, because of the huge discrepancy between
the replicas.

Thank you for confirming that it is a known issue; I was thinking I was the
only one facing this due to my setup.





RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Markus Jelsma
https://issues.apache.org/jira/browse/SOLR-4260 resolved
https://issues.apache.org/jira/browse/SOLR-4924 open

 
 


RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Markus Jelsma
Hi - if there is a very large discrepancy, you could consider purging the
smallest replica; it will then resync from the leader.
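On Solr 4.10 the purge-and-resync can also go through the Collections API rather than manual file deletion: DELETEREPLICA drops the bad replica, and a subsequent ADDREPLICA pulls a full copy from the shard leader. A sketch (Python; the base URL and the collection, shard, and replica names are placeholders) that builds the two admin calls:

```python
from urllib.parse import urlencode

def delete_replica_url(solr_base, collection, shard, replica):
    # DELETEREPLICA removes one replica of a shard ("replica" is the
    # core_node name shown in clusterstate.json).
    params = {"action": "DELETEREPLICA", "collection": collection,
              "shard": shard, "replica": replica}
    return solr_base.rstrip("/") + "/admin/collections?" + urlencode(params)

def add_replica_url(solr_base, collection, shard, node):
    # ADDREPLICA creates a fresh replica on the given node; the new core
    # does a full replication from the current shard leader.
    params = {"action": "ADDREPLICA", "collection": collection,
              "shard": shard, "node": node}
    return solr_base.rstrip("/") + "/admin/collections?" + urlencode(params)
```

Issue the first URL, wait for the cluster state to settle, then issue the second; the replacement replica starts from the leader's index rather than its own diverged copy.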
 
 


Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
One is not smaller than the other: numDocs is the same for both
replicas, yet they essentially seem to be disjoint sets.

Also, manually purging the replicas is not an option, because this index is
reindexed frequently and we need everything to be automated.

What other options do I have now?

1. Turn off replication completely in SolrCloud.
2. Use the traditional master-slave replication model.
3. Introduce a replica-aware field in the index, to figure out which
replica the request should go to from the client.
4. Try a distribution like Helios to see if it has any different behavior.

Just thinking out loud here...
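Whether the two replicas really hold disjoint document sets can be checked directly: page through each replica with distrib=false and fl=id, collect the id sets, and diff them. A sketch of the comparison step (the HTTP fetch itself is omitted; the field names are from the queries in this thread):

```python
def replica_discrepancy(ids_a, ids_b):
    # ids_a / ids_b: sets of document ids collected from each replica
    # (query each core with distrib=false and fl=id, paging via start/rows).
    return {
        "only_in_a": ids_a - ids_b,
        "only_in_b": ids_b - ids_a,
        # equal numDocs but different members, as described above
        "same_count_different_docs": len(ids_a) == len(ids_b) and ids_a != ids_b,
    }
```

If "only_in_a" and "only_in_b" are both large while the counts match, the replicas diverged during indexing rather than one simply lagging the other.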



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Please find the clusterstate.json attached.

Also, in this case at least the Shard1 replicas are out of sync, as can be
seen below.


*Shard 1 replica 1 *does not* return a result with distrib=false.*
*Query :*
http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

*Result :*

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response>


*Shard1 replica 2 *does* return the result with distrib=false.*
*Query:*
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

*Result:*

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response>
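The manual comparison above can be automated by parsing numFound out of each replica's XML response. A minimal sketch using Python's standard XML parser (assumes the default wt=xml response shape shown in these results):

```python
import xml.etree.ElementTree as ET

def num_found(solr_xml):
    # Pull numFound from a Solr XML <response>; comparing this value
    # across replicas queried with distrib=false flags the divergence.
    root = ET.fromstring(solr_xml)
    return int(root.find("result").get("numFound"))
```

Run the same distrib=false query against each replica of a shard and compare the returned values; any mismatch for the same filter query means the replicas have diverged.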

On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote:

  One is not smaller than the other, because the numDocs is same for both
  replicas and essentially they seem to be disjoint sets.
 

 That is strange. Can we see your clusterstate.json? With that, please also
 specify the two replicas which are out of sync.

 

RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Will Martin
2 naïve comments, of course.

 

- Queuing theory
- Zookeeper logs

 


Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread S.L
Good point about ZK logs , I do see the following exceptions intermittently
in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for
client /xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection
from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish
new session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x14949db9da40037, likely client has closed socket
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)

For queuing theory, I don't know of any way to see how fast the requests
are being served by SolrCloud, or whether a queue builds up when the
service rate is slower than the rate of requests from the incoming multiple
threads.
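One coarse way to answer the service-rate question (in the spirit of the access-log suggestion earlier in this thread) is Tomcat's access log: if the AccessLogValve pattern ends with %D (microseconds spent servicing the request), a short script can summarize request latency. This is only a sketch under that assumption; adjust LINE_RE to whatever pattern= your valve actually uses.

```python
import math
import re

# Assumes an AccessLogValve pattern like '%h %t "%r" %s %b %D',
# i.e. the last field is the service time in microseconds.
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ (?P<micros>\d+)$')

def latency_percentiles(lines, percentiles=(50, 95, 99)):
    """Parse access-log lines; return latency percentiles in milliseconds."""
    micros = sorted(
        int(m.group("micros")) for m in map(LINE_RE.search, lines) if m
    )
    if not micros:
        return {}
    # Nearest-rank percentile: index = ceil(p/100 * n) - 1.
    return {
        p: micros[max(0, math.ceil(p / 100.0 * len(micros)) - 1)] / 1000.0
        for p in percentiles
    }

if __name__ == "__main__":
    sample = [
        '10.0.0.1 [27/Oct/2014:21:25:01 -0400] "GET /solr/select?q=*:* HTTP/1.1" 200 512 1500',
        '10.0.0.1 [27/Oct/2014:21:25:02 -0400] "POST /solr/update HTTP/1.1" 200 64 250000',
    ]
    print(latency_percentiles(sample))
```

If the update-request percentiles climb steadily while indexing threads are running, that is the queueing symptom: requests arriving faster than they are serviced.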

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin wmartin...@gmail.com wrote:

 2 naïve comments, of course.



 -  Queuing theory

 -  Zookeeper logs.



 From: S.L [mailto:simpleliving...@gmail.com]
 Sent: Monday, October 27, 2014 1:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas
 out of synch.



 Please find the clusterstate.json attached.

 Also in this case at least the Shard1 replicas are out of sync, as can be
 seen below.

 Shard 1 replica 1 *does not* return a result with distrib=false.

 Query:
 http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true



 Result:

 <response><lst name="responseHeader"><int name="status">0</int><int
 name="QTime">1</int><lst name="params"><str name="q">*:*</str><str
 name="shards.info">true</str><str name="distrib">false</str><str
 name="debug">track</str><str name="wt">xml</str><str
 name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result
 name="response" numFound="0" start="0"/><lst name="debug"/></response>



 Shard1 replica 2 *does* return the result with distrib=false.

 Query:
 http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

 Result:

 <response><lst name="responseHeader"><int name="status">0</int><int
 name="QTime">1</int><lst name="params"><str name="q">*:*</str><str
 name="shards.info">true</str><str name="distrib">false</str><str
 name="debug">track</str><str name="wt">xml</str><str
 name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result
 name="response" numFound="1" start="0"><doc><str
 name="thingURL">http://www.xyz.com</str><str
 name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long
 name="_version_">1483135330558148608</long></doc></result><lst
 name="debug"/></response>
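 This per-replica distrib=false comparison can also be scripted instead of
 run by hand. A rough sketch using only Python's standard library (the
 replica URLs and the id below are the placeholders from this thread, not
 real endpoints):

```python
import urllib.parse
import xml.etree.ElementTree as ET

def replica_query_url(base, doc_id):
    """Build a non-distributed query against a single replica core."""
    params = urllib.parse.urlencode({
        "q": "*:*",
        "fq": "(id:%s)" % doc_id,
        "wt": "xml",
        "distrib": "false",  # ask only this replica, no fan-out
    })
    return "%s/select/?%s" % (base.rstrip("/"), params)

def num_found(xml_text):
    """Extract numFound from a Solr XML response like the ones quoted above."""
    result = ET.fromstring(xml_text).find(".//result[@name='response']")
    return int(result.get("numFound"))

# Placeholder replicas from this thread; fetch each URL (e.g. with
# urllib.request.urlopen) and compare num_found() -- a mismatch for the
# same id means the replicas have diverged.
REPLICAS = [
    "http://server3.mydomain.com:8082/solr/dyCollection1",
    "http://server2.mydomain.com:8082/solr/dyCollection1",
]
```

 Run periodically against a sample of recently indexed ids, this gives an
 automated divergence check without manual queries.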



 On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 On Mon, Oct 27, 2014 at 9:40 PM, S.L simpleliving...@gmail.com wrote:

  One is not smaller than the other, because numDocs is the same for both
  replicas, and essentially they seem to be disjoint sets.
 

 That is strange. Can we see your clusterstate.json? With that, please also
 specify the two replicas which are out of sync.

 
  Also, manually purging the replicas is not an option, because this is a
  frequently reindexed index and we need everything to be automated.
 
  What other options do I have now?
 
  1. Turn off replication completely in SolrCloud.
  2. Use the traditional master-slave replication model.
  3. Introduce a replica-aware field in the index, to figure out which
  replica the request should go to from the client.
  4. Try a distribution like Helios to see if it has any different
  behavior.
 
  Just thinking out loud here ..
 
  On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma 
  markus.jel...@openindex.io
  wrote:
 
   Hi - if there is a very large discrepancy, you could consider purging
   the smallest replica; it will then resync from the leader.
  
  
   -Original message-
From:S.L simpleliving...@gmail.com
Sent: Monday 27th October 2014 16:41

RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Will Martin
Erick Erickson has a comment on a thread out there that says there's a lot of 
pinging between SolrCloud and ZK, and if a timeout occurs (which could be 
fallback behavior on that exception) ZK will mark the node down, and SolrCloud 
won't use it until the node gets back online.
Fwiw.
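If ZK session expirations really are marking nodes down during heavy indexing, one knob worth checking on the Solr side in 4.x is zkClientTimeout in solr.xml. A sketch of the relevant fragment (the value is illustrative, not a recommendation):

```xml
<solr>
  <solrcloud>
    <!-- How long Solr may go without heartbeating ZooKeeper before its
         session expires and the node is marked down. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```

Note that ZooKeeper caps the negotiated session timeout at maxSessionTimeout (20 x tickTime by default), so raising only the Solr side may have no effect; the "negotiated timeout" line in the ZK log above shows what was actually granted.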

