RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

Will Martin Mon, 27 Oct 2014 23:08:10 -0700

The easiest and coarsest measure of response time [not service time in a 
distributed system] can be picked up in your localhost_access.log file.
You're using Tomcat, right? Look up AccessLogValve in the docs and server.xml. 
You can add configuration to report the payload and time to service the request 
without touching any code.
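
As a rough sketch of how such an access log could be summarized once the valve reports timing: the script below assumes the AccessLogValve pattern has been extended to end in `%b %D` (bytes sent, then time taken to process the request, which is milliseconds on Tomcat 7; check the valve documentation for your version). The sample log lines, the `/update` filter and the function names are illustrative assumptions, not something taken from this thread.

```python
import math
import re
import statistics

# Assumed access-log layout: the common pattern with "%b %D" appended, e.g.
#   %h %l %u %t "%r" %s %b %D
# %b = bytes sent, %D = time taken to process the request (assumed milliseconds).
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) (?P<millis>\d+)$'
)


def parse_line(line):
    """Return (path, status, bytes_sent, millis), or None if the line does not match."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    sent = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
    return m.group("path"), int(m.group("status")), sent, int(m.group("millis"))


def summarize(lines, path_fragment="/update"):
    """Summarize response times for requests whose path contains path_fragment."""
    times = sorted(
        rec[3]
        for rec in map(parse_line, lines)
        if rec is not None and path_fragment in rec[0]
    )
    if not times:
        return None
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(times)) - 1)
    return {
        "count": len(times),
        "mean_ms": statistics.mean(times),
        "p95_ms": times[p95_index],
        "max_ms": times[-1],
    }


# Hypothetical sample lines, only to show the expected shape of the input.
SAMPLE = [
    '10.0.0.5 - - [27/Oct/2014:07:00:06 -0700] "POST /solr/dyCollection1/update?wt=javabin HTTP/1.1" 200 41 12',
    '10.0.0.6 - - [27/Oct/2014:07:00:07 -0700] "POST /solr/dyCollection1/update?wt=javabin HTTP/1.1" 200 41 30',
    '10.0.0.7 - - [27/Oct/2014:07:00:08 -0700] "POST /solr/dyCollection1/update?wt=javabin HTTP/1.1" 200 41 250',
    '10.0.0.8 - - [27/Oct/2014:07:00:09 -0700] "GET /solr/dyCollection1/select/?q=*:* HTTP/1.1" 200 512 5',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In practice you would pass the lines of the real log file instead of `SAMPLE`. The numbers are response times as seen by that Tomcat instance, so they include any time a request spends waiting on other nodes.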


Queueing theory is what Otis was talking about when he said you've saturated 
your environment. In AWS people just auto-scale up and don't worry about where 
the load comes from; it's dumb if it happens more than 2 times. Capacity 
planning is tough; let's hope it doesn't disappear altogether.
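
To make the saturation point concrete, here is a minimal sketch of the basic queueing relations. It assumes a single-queue M/M/1 model, which is a strong simplification of a SolrCloud cluster, and all the rates in the usage example are made-up numbers rather than measurements from this thread.

```python
def utilization(arrival_rate_per_s, mean_service_time_s, servers=1):
    """Fraction of capacity in use: rho = lambda * S / c."""
    return arrival_rate_per_s * mean_service_time_s / servers


def requests_in_flight(arrival_rate_per_s, mean_response_time_s):
    """Little's Law: L = lambda * W (mean number of requests in the system)."""
    return arrival_rate_per_s * mean_response_time_s


def mm1_response_time(arrival_rate_per_s, mean_service_time_s):
    """Mean response time under the M/M/1 assumption: W = S / (1 - rho).

    Returns infinity when rho >= 1, i.e. the queue grows without bound.
    """
    rho = utilization(arrival_rate_per_s, mean_service_time_s)
    if rho >= 1.0:
        return float("inf")
    return mean_service_time_s / (1.0 - rho)


def closed_system_response_time(concurrent_clients, throughput_per_s):
    """Little's Law rearranged for a fixed number of client threads
    with no think time: R = N / X."""
    return concurrent_clients / throughput_per_s


if __name__ == "__main__":
    # Hypothetical numbers: 40 updates/s arriving, 20 ms of service time each.
    print(utilization(40, 0.02))
    print(mm1_response_time(40, 0.02))
    # 75 indexing threads sustaining a hypothetical 300 updates/s overall.
    print(closed_system_response_time(75, 300))
```

The useful part is the shape of `mm1_response_time`: as utilization approaches 1, response time climbs steeply, which is what a saturated node looks like from the client side.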

G'luck


-----Original Message-----
From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of 
synch.

Good point about the ZK logs. I do see the following exceptions intermittently 
in the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client 
/xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection 
from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new 
session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 10000 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 
0x14949db9da40037, likely client has closed socket
        at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:744)

As for queuing theory, I don't know of any way to see how fast requests are 
being served by SolrCloud, or whether a queue builds up when the service 
rate is slower than the rate of requests from the multiple incoming threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin <wmartin...@gmail.com> wrote:

> 2 naïve comments, of course.
>
>
>
> -          Queuing theory
>
> -          Zookeeper logs.
>
>
>
> From: S.L [mailto:simpleliving...@gmail.com]
> Sent: Monday, October 27, 2014 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 
> replicas out of synch.
>
>
>
> Please find the clusterstate.json attached.
>
> Also, in this case at least the Shard1 replicas are out of sync, as can 
> be seen below.
>
> Shard 1 replica 1 *does not* return a result with distrib=false.
>
> Query:
> http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
>
>
>
> Result :
>
> <response><lst name="responseHeader"><int name="status">0</int><int 
> name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="
> shards.info">true</str><str name="distrib">false</str><str 
> name="debug">track</str><str name="wt">xml</str><str 
> name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><
> result name="response" numFound="0" start="0"/><lst 
> name="debug"/></response>
>
>
>
> Shard1 replica 2 *does* return the result with distrib=false.
>
> Query:
> http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
>
> Result:
>
> <response><lst name="responseHeader"><int name="status">0</int><int 
> name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="
> shards.info">true</str><str name="distrib">false</str><str 
> name="debug">track</str><str name="wt">xml</str><str 
> name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><
> result name="response" numFound="1" start="0"><doc><str 
> name="thingURL"> http://www.xyz.com</str><str 
> name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long
> name="_version_">1483135330558148608</long></doc></result><lst
> name="debug"/></response>
>
>
>
> On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar < 
> shalinman...@gmail.com> wrote:
>
> On Mon, Oct 27, 2014 at 9:40 PM, S.L <simpleliving...@gmail.com> wrote:
>
> > One is not smaller than the other, because the numDocs is same for 
> > both "replicas" and essentially they seem to be disjoint sets.
> >
>
> That is strange. Can we see your clusterstate.json? With that, please 
> also specify the two replicas which are out of sync.
>
> >
> > Also, manually purging the replicas is not an option, because this is a 
> > "frequently" indexed index and we need everything to be automated.
> >
> > What other options do I have now?
> >
> > 1. Turn off replication completely in SolrCloud.
> > 2. Use the traditional master-slave replication model.
> > 3. Introduce a "replica"-aware field in the index, to figure out 
> > which "replica" the request should go to from the client.
> > 4. Try a distribution like Helios to see if it behaves any differently.
> >
> > Just thinking out loud here ......
> >
> > On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma < 
> > markus.jel...@openindex.io>
> > wrote:
> >
> > > Hi - if there is a very large discrepancy, you could consider purging the
> > > smallest replica; it will then resync from the leader.
> > >
> > >
> > > -----Original message-----
> > > > From:S.L <simpleliving...@gmail.com>
> > > > Sent: Monday 27th October 2014 16:41
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
> > replicas
> > > out of synch.
> > > >
> > > > Markus,
> > > >
> > > > I would like to ignore it too, but what's happening is that there is a
> > > > lot of discrepancy between the replicas. Queries like
> > > > q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on which
> > > > replica the request goes to, because of the huge amount of discrepancy
> > > > between the replicas.
> > > >
> > > > Thank you for confirming that it is a known issue; I was thinking I was the
> > > > only one facing this due to my setup.
> > > >
> > > > On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma <
> > > markus.jel...@openindex.io>
> > > > wrote:
> > > >
> > > > > It is an ancient issue. One of the major contributors to the issue was
> > > > > resolved some versions ago, but we are still seeing it sometimes too;
> > > > > there is nothing to see in the logs. We ignore it and just reindex.
> > > > >
> > > > > -----Original message-----
> > > > > > From:S.L <simpleliving...@gmail.com>
> > > > > > Sent: Monday 27th October 2014 16:25
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Re: Heavy Multi-threaded indexing and SolrCloud 
> > > > > > 4.10.1
> > > replicas
> > > > > out of synch.
> > > > > >
> > > > > > Thanks Otis,
> > > > > >
> > > > > > I have checked the logs, in my case the default catalina.out, and I don't
> > > > > > see any OOMs or any other exceptions.
> > > > > >
> > > > > > What other metrics do you suggest?
> > > > > >
> > > > > > On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic < 
> > > > > > otis.gospodne...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > You may simply be overwhelming your cluster nodes. Have you checked
> > > > > > > various metrics to see if that is the case?
> > > > > > >
> > > > > > > Otis
> > > > > > > --
> > > > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Oct 26, 2014, at 9:59 PM, S.L <simpleliving...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > > I have posted previously about this. I am using SolrCloud 4.10.1 and have
> > > > > > > > > a sharded collection with 6 nodes, 3 shards and a replication factor of 2.
> > > > > > > >
> > > > > > > > > I am indexing into Solr using a Hadoop job. I have 15 map fetch tasks
> > > > > > > > > that can each have up to 5 threads, so the load on the indexing side can
> > > > > > > > > get as high as 75 concurrent threads.
> > > > > > > >
> > > > > > > > > I am facing an issue where the replicas of a particular shard(s) are
> > > > > > > > > consistently getting out of sync. Initially I thought this was because I
> > > > > > > > > was using a custom component, but I did a fresh install, removed the
> > > > > > > > > custom component and reindexed using the Hadoop job, and I still see the
> > > > > > > > > same behavior.
> > > > > > > >
> > > > > > > > > I do not see any exceptions in my catalina.out, like OOM or any other
> > > > > > > > > exceptions. I suspect this could be because of the multi-threaded
> > > > > > > > > indexing nature of the Hadoop job. I use CloudSolrServer from my Java
> > > > > > > > > code to index, and initialize the CloudSolrServer using a 3-node ZK
> > > > > > > > > ensemble.
> > > > > > > >
> > > > > > > > > Does anyone know of any known issues with highly multi-threaded
> > > > > > > > > indexing and SolrCloud?
> > > > > > > >
> > > > > > > > > Can someone help? This issue has been slowing things down on my end
> > > > > > > > > for a while now.
> > > > > > > >
> > > > > > > > Thanks and much appreciated!
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
>
>
>
