Re: Sharding configuration
On 30 Oct 2014 14:49, "Shawn Heisey" wrote: > In order to see a gain in performance from multiple shards per server, > the server must have a lot of CPUs and the query rate must be fairly > low. If the query rate is high, then all the CPUs will be busy just > handling simultaneous queries, so putting multiple shards per server > will probably slow things down. When query rate is low, multiple CPUs > can handle each shard query simultaneously, speeding up the overall query. Except that your query latency isn't always CPU bound, there's a significant IO bound portion as well. I wouldn't go so far as to say that will large query volumes you shouldn't use multiple shards -- finally comes down to how many shards a machine can handle under peak load, it could depend on CPU/IO/GC pressure.. We have multiple shards on a machine under heavy query load for example. The only real way is to benchmark this and see.. > Thanks, > Shawn >
Re: Sharding configuration
On 30 Oct 2014 23:46, "Erick Erickson" wrote: > > This configuration deals with all > the replication, NRT processing, self-repair when nodes go up and > down and all that, but since there's no second trip to get the docs > from shards your query performance won't be affected. More or less.. Vaguely recall that you still would need to add a shortCircuit parameter to the url in such a case to avoid a second trip. I might be wrong here but I do recall wondering why that wasn't the default.. > > And using SolrCloud with a single shard will essentially scale linearly > as you add nodes for queries. > > Best, > Erick > > > On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz wrote: > > Hi, > > > > You are right, it is a mistake in my phrase, for the tests with 4 > > shards/ 4 instances, the latency was worse (therefore *bigger*) than > > for the tests with one shard. > > > > In our case, the query rate is high. > > > > Thanks, > > Anca > > > > > > On 10/30/2014 03:48 PM, Shawn Heisey wrote: > >> > >> On 10/30/2014 4:32 AM, Anca Kopetz wrote: > >>> > >>> We did some tests with 4 shards / 4 different tomcat instances on the > >>> same server and the average latency was smaller than the one when having > >>> only one shard. > >>> We tested also é shards on different servers and the performance results > >>> were also worse. > >>> > >>> It seems that the sharding does not make any difference for our index in > >>> terms of latency gains. > >> > >> That statement is confusing, because if latency goes down, that's good, > >> not worse. > >> > >> If you're going to put multiple shards on one server, it should be done > >> with one solr/tomcat instance, not multiple. One instance is perfectly > >> capable of dealing with many shards, and has a lot less overhead. The > >> SolrCloud collection create command would need the maxShardsPerNode > >> parameter. > >> > >> In order to see a gain in performance from multiple shards per server, > >> the server must have a lot of CPUs and the query rate must be fairly > >> low. If the query rate is high, then all the CPUs will be busy just > >> handling simultaneous queries, so putting multiple shards per server > >> will probably slow things down. When query rate is low, multiple CPUs > >> can handle each shard query simultaneously, speeding up the overall query. > >> > >> Thanks, > >> Shawn > >> > > > > Kelkoo SAS > > Société par Actions Simplifiée > > Au capital de € 4.168.964,30 > > Siège social : 8, rue du Sentier 75002 Paris > > 425 093 069 RCS Paris > > > > Ce message et les pièces jointes sont confidentiels et établis à l'attention > > exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce > > message, merci de le détruire et d'en avertir l'expéditeur.
Re: Sharding configuration
This is not too surprising. There are additional hops necessary for a cloud setup. This is the sequence, let's say there are 4 shards and the rows parameter on the query is 10 and you're sorting by score node1 receives request. node1 sends the request out to each shard node1 receives the top 10 doc Ids back with the score (note, not the _contents_). node1 sorts the 4 lists of 10 docs into the final top 10. node1 then requests the actual docs from the nodes that they reside on node1 then gets the results back and assembles them into a final list node1 then returns the list to the client. Contrast this with a single shard node1 receives the request node1 finds the top 10 docs locally node1 return the docs to the client You should only resort to sharding when you have too many docs to fit in a single shard (and give you acceptable search times). If all your docs fit comfortably on a single machine, you can _still_ use SolrCloud, just with a single shard. This configuration deals with all the replication, NRT processing, self-repair when nodes go up and down and all that, but since there's no second trip to get the docs from shards your query performance won't be affected. And using SolrCloud with a single shard will essentially scale linearly as you add nodes for queries. Best, Erick On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz wrote: > Hi, > > You are right, it is a mistake in my phrase, for the tests with 4 > shards/ 4 instances, the latency was worse (therefore *bigger*) than > for the tests with one shard. > > In our case, the query rate is high. > > Thanks, > Anca > > > On 10/30/2014 03:48 PM, Shawn Heisey wrote: >> >> On 10/30/2014 4:32 AM, Anca Kopetz wrote: >>> >>> We did some tests with 4 shards / 4 different tomcat instances on the >>> same server and the average latency was smaller than the one when having >>> only one shard. >>> We tested also é shards on different servers and the performance results >>> were also worse. >>> >>> It seems that the sharding does not make any difference for our index in >>> terms of latency gains. >> >> That statement is confusing, because if latency goes down, that's good, >> not worse. >> >> If you're going to put multiple shards on one server, it should be done >> with one solr/tomcat instance, not multiple. One instance is perfectly >> capable of dealing with many shards, and has a lot less overhead. The >> SolrCloud collection create command would need the maxShardsPerNode >> parameter. >> >> In order to see a gain in performance from multiple shards per server, >> the server must have a lot of CPUs and the query rate must be fairly >> low. If the query rate is high, then all the CPUs will be busy just >> handling simultaneous queries, so putting multiple shards per server >> will probably slow things down. When query rate is low, multiple CPUs >> can handle each shard query simultaneously, speeding up the overall query. >> >> Thanks, >> Shawn >> > > Kelkoo SAS > Société par Actions Simplifiée > Au capital de € 4.168.964,30 > Siège social : 8, rue du Sentier 75002 Paris > 425 093 069 RCS Paris > > Ce message et les pièces jointes sont confidentiels et établis à l'attention > exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce > message, merci de le détruire et d'en avertir l'expéditeur.
Re: Sharding configuration
Hi, You are right, it is a mistake in my phrase, for the tests with 4 shards/ 4 instances, the latency was worse (therefore *bigger*) than for the tests with one shard. In our case, the query rate is high. Thanks, Anca On 10/30/2014 03:48 PM, Shawn Heisey wrote: On 10/30/2014 4:32 AM, Anca Kopetz wrote: We did some tests with 4 shards / 4 different tomcat instances on the same server and the average latency was smaller than the one when having only one shard. We tested also é shards on different servers and the performance results were also worse. It seems that the sharding does not make any difference for our index in terms of latency gains. That statement is confusing, because if latency goes down, that's good, not worse. If you're going to put multiple shards on one server, it should be done with one solr/tomcat instance, not multiple. One instance is perfectly capable of dealing with many shards, and has a lot less overhead. The SolrCloud collection create command would need the maxShardsPerNode parameter. In order to see a gain in performance from multiple shards per server, the server must have a lot of CPUs and the query rate must be fairly low. If the query rate is high, then all the CPUs will be busy just handling simultaneous queries, so putting multiple shards per server will probably slow things down. When query rate is low, multiple CPUs can handle each shard query simultaneously, speeding up the overall query. Thanks, Shawn Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Re: Sharding configuration
On 10/30/2014 4:32 AM, Anca Kopetz wrote: > We did some tests with 4 shards / 4 different tomcat instances on the > same server and the average latency was smaller than the one when having > only one shard. > We tested also é shards on different servers and the performance results > were also worse. > > It seems that the sharding does not make any difference for our index in > terms of latency gains. That statement is confusing, because if latency goes down, that's good, not worse. If you're going to put multiple shards on one server, it should be done with one solr/tomcat instance, not multiple. One instance is perfectly capable of dealing with many shards, and has a lot less overhead. The SolrCloud collection create command would need the maxShardsPerNode parameter. In order to see a gain in performance from multiple shards per server, the server must have a lot of CPUs and the query rate must be fairly low. If the query rate is high, then all the CPUs will be busy just handling simultaneous queries, so putting multiple shards per server will probably slow things down. When query rate is low, multiple CPUs can handle each shard query simultaneously, speeding up the overall query. Thanks, Shawn
Re: Sharding configuration
Hi, We did some tests with 4 shards / 4 different tomcat instances on the same server and the average latency was smaller than the one when having only one shard. We tested also é shards on different servers and the performance results were also worse. It seems that the sharding does not make any difference for our index in terms of latency gains. Thanks for your response, Anca On 10/28/2014 08:44 PM, Ramkumar R. Aiyengar wrote: As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with a GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level like connection pools and ZK connections, but those are tunable and you probably aren't pushing them as well (I would imagine you are just trying to have only a handful of shards given that you aren't sharded at all currently). That leaves single vs multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has the sufficient amount of parallelism. Again, this advice is for a small number of shards, if you had a lot more (hundreds) of shards and significant volume of requests, things start to become a bit more fuzzy with other limits kicking in. On 28 Oct 2014 09:26, "Anca Kopetz" wrote: Hi, We have a SolrCloud configuration of 10 servers, no sharding, 20 millions of documents, the index has 26 GB. As the number of documents has increased recently, the performance of the cluster decreased. We thought of sharding the index, in order to measure the latency. What is the best approach ? - to use shard splitting and have several sub-shards on the same server and in the same tomcat instance - having several shards on the same server but on different tomcat instances - having one shard on each server (for example 2 shards / 5 replicas on 10 servers) What's the impact of these 3 configuration on performance ? Thanks, Anca -- Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur. Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
RE: Sharding configuration
Informational only. FYI Machine parallelism has been empirically proven to be application dependent. See DaCapo benchmarks (lucene indexing and lucene searching) use in http://dx.doi.org/10.1145/2479871.2479901 " Parallelism profiling and wall-time prediction for multi-threaded applications" 2013. FYI: -Original Message- From: Ramkumar R. Aiyengar [mailto:andyetitmo...@gmail.com] Sent: Tuesday, October 28, 2014 3:44 PM To: solr-user@lucene.apache.org Subject: Re: Sharding configuration As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with a GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level like connection pools and ZK connections, but those are tunable and you probably aren't pushing them as well (I would imagine you are just trying to have only a handful of shards given that you aren't sharded at all currently). That leaves single vs multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has the sufficient amount of parallelism. Again, this advice is for a small number of shards, if you had a lot more (hundreds) of shards and significant volume of requests, things start to become a bit more fuzzy with other limits kicking in. On 28 Oct 2014 09:26, "Anca Kopetz" < <mailto:anca.kop...@kelkoo.com> anca.kop...@kelkoo.com> wrote: > Hi, > > We have a SolrCloud configuration of 10 servers, no sharding, 20 > millions of documents, the index has 26 GB. > As the number of documents has increased recently, the performance of > the cluster decreased. > > We thought of sharding the index, in order to measure the latency. > What is the best approach ? > - to use shard splitting and have several sub-shards on the same > server and in the same tomcat instance > - having several shards on the same server but on different tomcat > instances > - having one shard on each server (for example 2 shards / 5 replicas > on > 10 servers) > > What's the impact of these 3 configuration on performance ? > > Thanks, > Anca > > -- > > Kelkoo SAS > Société par Actions Simplifiée > Au capital de € 4.168.964,30 > Siège social : 8, rue du Sentier 75002 Paris > 425 093 069 RCS Paris > > Ce message et les pièces jointes sont confidentiels et établis à > l'attention exclusive de leurs destinataires. Si vous n'êtes pas le > destinataire de ce message, merci de le détruire et d'en avertir > l'expéditeur. >
Re: Sharding configuration
As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with a GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level like connection pools and ZK connections, but those are tunable and you probably aren't pushing them as well (I would imagine you are just trying to have only a handful of shards given that you aren't sharded at all currently). That leaves single vs multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has the sufficient amount of parallelism. Again, this advice is for a small number of shards, if you had a lot more (hundreds) of shards and significant volume of requests, things start to become a bit more fuzzy with other limits kicking in. On 28 Oct 2014 09:26, "Anca Kopetz" wrote: > Hi, > > We have a SolrCloud configuration of 10 servers, no sharding, 20 > millions of documents, the index has 26 GB. > As the number of documents has increased recently, the performance of > the cluster decreased. > > We thought of sharding the index, in order to measure the latency. What > is the best approach ? > - to use shard splitting and have several sub-shards on the same server > and in the same tomcat instance > - having several shards on the same server but on different tomcat > instances > - having one shard on each server (for example 2 shards / 5 replicas on > 10 servers) > > What's the impact of these 3 configuration on performance ? > > Thanks, > Anca > > -- > > Kelkoo SAS > Société par Actions Simplifiée > Au capital de € 4.168.964,30 > Siège social : 8, rue du Sentier 75002 Paris > 425 093 069 RCS Paris > > Ce message et les pièces jointes sont confidentiels et établis à > l'attention exclusive de leurs destinataires. Si vous n'êtes pas le > destinataire de ce message, merci de le détruire et d'en avertir > l'expéditeur. >