Re: Sharding configuration

2014-11-01 Thread Ramkumar R. Aiyengar
On 30 Oct 2014 23:46, Erick Erickson erickerick...@gmail.com wrote:

 This configuration deals with all
 the replication, NRT processing, self-repair when nodes go up and
 down and all that, but since there's no second trip to get the docs
 from shards your query performance won't be affected.

More or less.. Vaguely recall that you still would need to add a
shortCircuit parameter to the url in such a case to avoid a second trip. I
might be wrong here but I do recall wondering why that wasn't the default..


 And using SolrCloud with a single shard will essentially scale linearly
 as you add nodes for queries.

 Best,
 Erick


 On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com
wrote:
  Hi,
 
  You are right, it is a mistake in my phrase, for the tests with 4
  shards/ 4 instances,  the latency was worse (therefore *bigger*) than
  for the tests with one shard.
 
  In our case, the query rate is high.
 
  Thanks,
  Anca
 
 
  On 10/30/2014 03:48 PM, Shawn Heisey wrote:
 
  On 10/30/2014 4:32 AM, Anca Kopetz wrote:
 
  We did some tests with 4 shards / 4 different tomcat instances on the
  same server and the average latency was smaller than the one when
having
  only one shard.
  We tested also é shards on different servers and the performance
results
  were also worse.
 
  It seems that the sharding does not make any difference for our index
in
  terms of latency gains.
 
  That statement is confusing, because if latency goes down, that's good,
  not worse.
 
  If you're going to put multiple shards on one server, it should be done
  with one solr/tomcat instance, not multiple.  One instance is perfectly
  capable of dealing with many shards, and has a lot less overhead.  The
  SolrCloud collection create command would need the maxShardsPerNode
  parameter.
 
  In order to see a gain in performance from multiple shards per server,
  the server must have a lot of CPUs and the query rate must be fairly
  low.  If the query rate is high, then all the CPUs will be busy just
  handling simultaneous queries, so putting multiple shards per server
  will probably slow things down.  When query rate is low, multiple CPUs
  can handle each shard query simultaneously, speeding up the overall
query.
 
  Thanks,
  Shawn
 
 
  Kelkoo SAS
  Société par Actions Simplifiée
  Au capital de € 4.168.964,30
  Siège social : 8, rue du Sentier 75002 Paris
  425 093 069 RCS Paris
 
  Ce message et les pièces jointes sont confidentiels et établis à
l'attention
  exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de
ce
  message, merci de le détruire et d'en avertir l'expéditeur.


Re: Sharding configuration

2014-11-01 Thread Ramkumar R. Aiyengar
On 30 Oct 2014 14:49, Shawn Heisey apa...@elyograg.org wrote:

 In order to see a gain in performance from multiple shards per server,
 the server must have a lot of CPUs and the query rate must be fairly
 low.  If the query rate is high, then all the CPUs will be busy just
 handling simultaneous queries, so putting multiple shards per server
 will probably slow things down.  When query rate is low, multiple CPUs
 can handle each shard query simultaneously, speeding up the overall query.

Except that your query latency isn't always CPU bound, there's a
significant IO bound portion as well. I wouldn't go so far as to say that
will large query volumes you shouldn't use multiple shards -- finally comes
down to how many shards a machine can handle under peak load, it could
depend on CPU/IO/GC pressure.. We have multiple shards on a machine under
heavy query load for example. The only real way is to benchmark this and
see..

 Thanks,
 Shawn



Re: Sharding configuration

2014-10-30 Thread Anca Kopetz

Hi,

We did some tests with 4 shards / 4 different tomcat instances on the
same server and the average latency was smaller than the one when having
only one shard.
We tested also é shards on different servers and the performance results
were also worse.

It seems that the sharding does not make any difference for our index in
terms of latency gains.

Thanks for your response,
Anca

On 10/28/2014 08:44 PM, Ramkumar R. Aiyengar wrote:

As far as the second option goes, unless you are using a large amount of
memory and you reach a point where a JVM can't sensibly deal with a GC
load, having multiple JVMs wouldn't buy you much. With a 26GB index, you
probably haven't reached that point. There are also other shared resources
at an instance level like connection pools and ZK connections, but those
are tunable and you probably aren't pushing them as well (I would imagine
you are just trying to have only a handful of shards given that you aren't
sharded at all currently).

That leaves single vs multiple machines. Assuming the network isn't a
bottleneck, and given the same amount of resources overall (number of
cores, amount of memory, IO bandwidth times number of machines), it
shouldn't matter between the two. If you are procuring new hardware, I
would say buy more, smaller machines, but if you already have the hardware,
you could serve as much as possible off a machine before moving to a
second. There's nothing which limits the number of shards as long as the
underlying machine has the sufficient amount of parallelism.

Again, this advice is for a small number of shards, if you had a lot more
(hundreds) of shards and significant volume of requests, things start to
become a bit more fuzzy with other limits kicking in.
On 28 Oct 2014 09:26, Anca Kopetzanca.kop...@kelkoo.com  wrote:


Hi,

We have a SolrCloud configuration of 10 servers, no sharding, 20
millions of documents, the index has 26 GB.
As the number of documents has increased recently, the performance of
the cluster decreased.

We thought of sharding the index, in order to measure the latency. What
is the best approach ?
- to use shard splitting and have several sub-shards on the same server
and in the same tomcat instance
- having several shards on the same server but on different tomcat
instances
- having one shard on each server (for example 2 shards / 5 replicas on
10 servers)

What's the impact of these 3 configuration on performance ?

Thanks,
Anca

--

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à
l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
destinataire de ce message, merci de le détruire et d'en avertir
l'expéditeur.



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Sharding configuration

2014-10-30 Thread Shawn Heisey
On 10/30/2014 4:32 AM, Anca Kopetz wrote:
 We did some tests with 4 shards / 4 different tomcat instances on the
 same server and the average latency was smaller than the one when having
 only one shard.
 We tested also é shards on different servers and the performance results
 were also worse.

 It seems that the sharding does not make any difference for our index in
 terms of latency gains.

That statement is confusing, because if latency goes down, that's good,
not worse.

If you're going to put multiple shards on one server, it should be done
with one solr/tomcat instance, not multiple.  One instance is perfectly
capable of dealing with many shards, and has a lot less overhead.  The
SolrCloud collection create command would need the maxShardsPerNode
parameter.

In order to see a gain in performance from multiple shards per server,
the server must have a lot of CPUs and the query rate must be fairly
low.  If the query rate is high, then all the CPUs will be busy just
handling simultaneous queries, so putting multiple shards per server
will probably slow things down.  When query rate is low, multiple CPUs
can handle each shard query simultaneously, speeding up the overall query.

Thanks,
Shawn



Re: Sharding configuration

2014-10-30 Thread Anca Kopetz

Hi,

You are right, it is a mistake in my phrase, for the tests with 4
shards/ 4 instances,  the latency was worse (therefore *bigger*) than
for the tests with one shard.

In our case, the query rate is high.

Thanks,
Anca

On 10/30/2014 03:48 PM, Shawn Heisey wrote:

On 10/30/2014 4:32 AM, Anca Kopetz wrote:

We did some tests with 4 shards / 4 different tomcat instances on the
same server and the average latency was smaller than the one when having
only one shard.
We tested also é shards on different servers and the performance results
were also worse.

It seems that the sharding does not make any difference for our index in
terms of latency gains.

That statement is confusing, because if latency goes down, that's good,
not worse.

If you're going to put multiple shards on one server, it should be done
with one solr/tomcat instance, not multiple.  One instance is perfectly
capable of dealing with many shards, and has a lot less overhead.  The
SolrCloud collection create command would need the maxShardsPerNode
parameter.

In order to see a gain in performance from multiple shards per server,
the server must have a lot of CPUs and the query rate must be fairly
low.  If the query rate is high, then all the CPUs will be busy just
handling simultaneous queries, so putting multiple shards per server
will probably slow things down.  When query rate is low, multiple CPUs
can handle each shard query simultaneously, speeding up the overall query.

Thanks,
Shawn



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Sharding configuration

2014-10-30 Thread Erick Erickson
This is not too surprising. There are additional hops necessary for a
cloud setup. This is the sequence, let's say there are 4 shards and the
rows parameter on the query is 10 and you're sorting by score

node1 receives request.
node1 sends the request out to each shard
node1 receives the top 10 doc Ids back with the  score (note, not the
_contents_).
node1 sorts the 4 lists of 10 docs into the final top 10.
node1 then requests the actual docs from the nodes that they reside on
node1 then gets the results back and assembles them into a final list
node1 then returns the list to the client.

Contrast this with a single shard
node1 receives the request
node1 finds the top 10 docs locally
node1 return the docs to the client

You should only resort to sharding when you have too many docs
to fit in a single shard (and give you acceptable search times). If
all your docs fit comfortably on a single machine, you can _still_ use
SolrCloud, just with a single shard. This configuration deals with all
the replication, NRT processing, self-repair when nodes go up and
down and all that, but since there's no second trip to get the docs
from shards your query performance won't be affected.

And using SolrCloud with a single shard will essentially scale linearly
as you add nodes for queries.

Best,
Erick


On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com wrote:
 Hi,

 You are right, it is a mistake in my phrase, for the tests with 4
 shards/ 4 instances,  the latency was worse (therefore *bigger*) than
 for the tests with one shard.

 In our case, the query rate is high.

 Thanks,
 Anca


 On 10/30/2014 03:48 PM, Shawn Heisey wrote:

 On 10/30/2014 4:32 AM, Anca Kopetz wrote:

 We did some tests with 4 shards / 4 different tomcat instances on the
 same server and the average latency was smaller than the one when having
 only one shard.
 We tested also é shards on different servers and the performance results
 were also worse.

 It seems that the sharding does not make any difference for our index in
 terms of latency gains.

 That statement is confusing, because if latency goes down, that's good,
 not worse.

 If you're going to put multiple shards on one server, it should be done
 with one solr/tomcat instance, not multiple.  One instance is perfectly
 capable of dealing with many shards, and has a lot less overhead.  The
 SolrCloud collection create command would need the maxShardsPerNode
 parameter.

 In order to see a gain in performance from multiple shards per server,
 the server must have a lot of CPUs and the query rate must be fairly
 low.  If the query rate is high, then all the CPUs will be busy just
 handling simultaneous queries, so putting multiple shards per server
 will probably slow things down.  When query rate is low, multiple CPUs
 can handle each shard query simultaneously, speeding up the overall query.

 Thanks,
 Shawn


 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

 Ce message et les pièces jointes sont confidentiels et établis à l'attention
 exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce
 message, merci de le détruire et d'en avertir l'expéditeur.


Sharding configuration

2014-10-28 Thread Anca Kopetz

Hi,

We have a SolrCloud configuration of 10 servers, no sharding, 20
millions of documents, the index has 26 GB.
As the number of documents has increased recently, the performance of
the cluster decreased.

We thought of sharding the index, in order to measure the latency. What
is the best approach ?
- to use shard splitting and have several sub-shards on the same server
and in the same tomcat instance
- having several shards on the same server but on different tomcat instances
- having one shard on each server (for example 2 shards / 5 replicas on
10 servers)

What's the impact of these 3 configuration on performance ?

Thanks,
Anca

--

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention 
exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce 
message, merci de le détruire et d'en avertir l'expéditeur.


Re: Sharding configuration

2014-10-28 Thread Ramkumar R. Aiyengar
As far as the second option goes, unless you are using a large amount of
memory and you reach a point where a JVM can't sensibly deal with a GC
load, having multiple JVMs wouldn't buy you much. With a 26GB index, you
probably haven't reached that point. There are also other shared resources
at an instance level like connection pools and ZK connections, but those
are tunable and you probably aren't pushing them as well (I would imagine
you are just trying to have only a handful of shards given that you aren't
sharded at all currently).

That leaves single vs multiple machines. Assuming the network isn't a
bottleneck, and given the same amount of resources overall (number of
cores, amount of memory, IO bandwidth times number of machines), it
shouldn't matter between the two. If you are procuring new hardware, I
would say buy more, smaller machines, but if you already have the hardware,
you could serve as much as possible off a machine before moving to a
second. There's nothing which limits the number of shards as long as the
underlying machine has the sufficient amount of parallelism.

Again, this advice is for a small number of shards, if you had a lot more
(hundreds) of shards and significant volume of requests, things start to
become a bit more fuzzy with other limits kicking in.
On 28 Oct 2014 09:26, Anca Kopetz anca.kop...@kelkoo.com wrote:

 Hi,

 We have a SolrCloud configuration of 10 servers, no sharding, 20
 millions of documents, the index has 26 GB.
 As the number of documents has increased recently, the performance of
 the cluster decreased.

 We thought of sharding the index, in order to measure the latency. What
 is the best approach ?
 - to use shard splitting and have several sub-shards on the same server
 and in the same tomcat instance
 - having several shards on the same server but on different tomcat
 instances
 - having one shard on each server (for example 2 shards / 5 replicas on
 10 servers)

 What's the impact of these 3 configuration on performance ?

 Thanks,
 Anca

 --

 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

 Ce message et les pièces jointes sont confidentiels et établis à
 l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
 destinataire de ce message, merci de le détruire et d'en avertir
 l'expéditeur.



RE: Sharding configuration

2014-10-28 Thread Will Martin
Informational only. FYI

 

Machine parallelism has been empirically proven to be application dependent.

See DaCapo benchmarks (lucene indexing and lucene searching) use in  
http://dx.doi.org/10.1145/2479871.2479901

 

 Parallelism profiling and wall-time prediction for multi-threaded 
applications 2013.

 

FYI:

 



 

 



 

-Original Message-
From: Ramkumar R. Aiyengar [mailto:andyetitmo...@gmail.com] 
Sent: Tuesday, October 28, 2014 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Sharding configuration

 

As far as the second option goes, unless you are using a large amount of memory 
and you reach a point where a JVM can't sensibly deal with a GC load, having 
multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't 
reached that point. There are also other shared resources at an instance level 
like connection pools and ZK connections, but those are tunable and you 
probably aren't pushing them as well (I would imagine you are just trying to 
have only a handful of shards given that you aren't sharded at all currently).

 

That leaves single vs multiple machines. Assuming the network isn't a 
bottleneck, and given the same amount of resources overall (number of cores, 
amount of memory, IO bandwidth times number of machines), it shouldn't matter 
between the two. If you are procuring new hardware, I would say buy more, 
smaller machines, but if you already have the hardware, you could serve as much 
as possible off a machine before moving to a second. There's nothing which 
limits the number of shards as long as the underlying machine has the 
sufficient amount of parallelism.

 

Again, this advice is for a small number of shards, if you had a lot more

(hundreds) of shards and significant volume of requests, things start to become 
a bit more fuzzy with other limits kicking in.

On 28 Oct 2014 09:26, Anca Kopetz  mailto:anca.kop...@kelkoo.com 
anca.kop...@kelkoo.com wrote:

 

 Hi,

 

 We have a SolrCloud configuration of 10 servers, no sharding, 20 

 millions of documents, the index has 26 GB.

 As the number of documents has increased recently, the performance of 

 the cluster decreased.

 

 We thought of sharding the index, in order to measure the latency. 

 What is the best approach ?

 - to use shard splitting and have several sub-shards on the same 

 server and in the same tomcat instance

 - having several shards on the same server but on different tomcat 

 instances

 - having one shard on each server (for example 2 shards / 5 replicas 

 on

 10 servers)

 

 What's the impact of these 3 configuration on performance ?

 

 Thanks,

 Anca

 

 --

 

 Kelkoo SAS

 Société par Actions Simplifiée

 Au capital de € 4.168.964,30

 Siège social : 8, rue du Sentier 75002 Paris

 425 093 069 RCS Paris

 

 Ce message et les pièces jointes sont confidentiels et établis à 

 l'attention exclusive de leurs destinataires. Si vous n'êtes pas le 

 destinataire de ce message, merci de le détruire et d'en avertir 

 l'expéditeur.