Re: Slow forwarding requests to collection leader

2014-10-30 Thread Matt Hilt
Thanks for the info Daniel. I will go forth and make a better client.




Re: Slow forwarding requests to collection leader

2014-10-30 Thread Erick Erickson
Matt:

You might want to look at SolrJ, in particular with the use of CloudSolrServer.
The big benefit here is that it'll route the docs to the correct leader for each
shard rather than relying on the nodes to communicate with each other.

Here's a SolrJ example. NOTE: it uses ConcurrentUpdateSolrServer, which
you should replace with CloudSolrServer. Other than making the c'tor work, that
should be the only change you need as far as instantiating the right
SolrServer.

This one connects to a DB and also parses files with Tika, but you should be
able to remove all that without too much trouble.

https://lucidworks.com/blog/indexing-with-solrj/
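
For illustration, a minimal CloudSolrServer sketch (SolrJ 4.x class names; the
ZooKeeper host string, collection name and field names below are placeholders,
not anything from Matt's setup):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Point the client at ZooKeeper rather than at any single Solr node;
        // it reads the cluster state and sends each document to the right leader.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collectionA");
        server.connect();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "example");
        server.add(doc);
        server.commit();

        server.shutdown();
    }
}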

Best,
Erick



Re: Slow forwarding requests to collection leader

2014-10-30 Thread CP Mishra
+1 for CloudSolrServer

CloudSolrServer also has built-in fault tolerance (i.e. if the shard leader is
not reachable, it sends the add to a replica) and much better error reporting
than ConcurrentUpdateSolrServer. The only downside is the lack of batching. As
long as you are adding documents in decent-sized batches (you can also use
multiple threads to add), you will get good indexing performance.
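
To make the batching point concrete, a rough sketch (batch size, field names and
the ZooKeeper host string are made up for the example):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collectionA");

        // Buffer documents and send them in batches instead of one add() per doc.
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("body", "generated document " + i);
            batch.add(doc);

            if (batch.size() == 500) {   // flush every 500 documents
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
        server.shutdown();
    }
}

Several such indexers can also run in parallel threads against the same
collection if a single one can't keep the leader busy on its own.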

CP




Re: Slow forwarding requests to collection leader

2014-10-29 Thread Daniel Collins
I kind of think this might be working as designed, but I'll be happy to
be corrected by others :)

We had a similar issue, which we discovered by accident: we had 2 or 3
collections spread across some machines, and we accidentally sent an indexing
request to a node in the cloud that didn't have a replica of collection1 (but
had other collections). We saw an instant jump in indexing latency to 5s,
which, given that previous latencies had been ~20ms, was rather obvious!

Querying seems to be fine with this kind of forwarding approach, but indexing
would logically require ZK information (to find the right shard for the
destination collection and the leader of that shard). So I'm wondering whether
a node in the cloud that has a replica of collection1 has that information
cached, whereas a node in the (same) cloud that only has a collection2 replica
caches only collection2 information and has to go to ZK for every forwarded
request.

I haven't checked the code recently, but that seems plausible to me. Would you
really want all your collection2 nodes to be running ZK watches for all
collection1 updates as well as their own collection2 watches? That would clog
them up processing updates that, in all honesty, they shouldn't have to deal
with. Every node in the cloud would have to have a watch on everything else,
which, if you have a lot of independent collections, would be an unnecessary
burden on each of them.

If you use SolrJ as a client, it will route requests to a correct node in the
cloud (which is what we ended up using, through JNI, which was interesting),
but if you are using HTTP to index, that's something your application has to
take care of.
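
For Matt's unsharded case, the simplest thing an HTTP indexer can do is post
straight to a node that actually hosts the collection (node 2, the leader, in
his setup) rather than to localhost. A minimal sketch of that idea; the host,
port, collection and field names are made up:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DirectHttpIndexer {
    public static void main(String[] args) throws Exception {
        // Target a node that hosts collection A instead of a node that
        // would only forward the request.
        URL url = new URL("http://node2:8983/solr/collectionA/update/json?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        String docs = "[{\"id\": \"doc-1\", \"title\": \"example\"}]";
        OutputStream out = conn.getOutputStream();
        out.write(docs.getBytes("UTF-8"));
        out.close();

        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}

For sharded collections the application would additionally have to pick the
right shard leader, which is exactly the bookkeeping CloudSolrServer does for
you.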



Slow forwarding requests to collection leader

2014-10-28 Thread Matt Hilt
I have three equal machines, each running Solr Cloud (4.8). I have multiple
collections that are replicated but not sharded. I also have document-generation
processes running on these nodes, which involves querying the collection ~5
times per document generated.

Node 1 has a replica of collection A and is running document generation code
that pushes to the HTTP /update/json handler.
Node 2 is the leader of collection A.
Node 3 does not have a replica of collection A, but is running document
generation code for collection A.

The issue I see is that node 1 can push documents into Solr 3-5 times faster
than node 3 when they both talk to the Solr instance on their localhost. If
either of them talks directly to the Solr instance on node 2, the performance
is excellent (on par with node 1). To me it seems that the only difference in
these cases is the query/put request forwarding. Does this involve some slow
ZooKeeper communication that should be avoided? Any other insights?

Thanks
