RE: huge shards (300GB each) and load balancing

2011-06-15 Thread Burton-West, Tom
Hi Dmitry,

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the Solr 1.4.1 config|schema. Are you 
using SOLR 3.1?

I'm pretty sure that the termIndexInterval (the ratio of the tii file to the tis file) 
is in the 1.4.1 example solrconfig.xml file, although I don't have a copy to check 
at the moment.  We are using a 3.1 dev version.  As for the 
termInfosIndexDivisor, I'm also pretty sure it works with 1.4.1, but you 
might have to ask the list to be sure.  As you can see from the blog posts, 
those settings really reduced our memory requirements.  We haven't been doing 
faceting, so we expect memory use to go up again once we add faceting, but at 
least we are starting at a 4GB baseline instead of a 20-32GB baseline.
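
Roughly, the relevant entries look something like the following (3.x-style 
syntax, and the values are illustrative only, so double-check them against your 
version):

  <!-- index time: write a terms-index (tii) entry every 1024 terms instead of every 128 -->
  <indexDefaults>
    <termIndexInterval>1024</termIndexInterval>
  </indexDefaults>

  <!-- search time: only load every 8th tii entry into memory -->
  <indexReaderFactory name="IndexReaderFactory"
                      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">8</int>
  </indexReaderFactory>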

Did you do logical sharding or document-hash-based sharding?

On the indexing side we just assign documents to a particular shard on a 
round-robin basis and use a database to keep track of which document is in 
which shard, so if we need to update a document we update the right shard.  
(See the "Forty days" article on the blog for a more detailed description and 
some diagrams.)  We hope that this distributes the documents evenly enough to 
avoid problems with Solr's lack of global idf.

Do you have a load balancer between the front SOLR (or front entity) and shards?

As far as load balancing which shard is the head shard/front shard: again, our 
app layer just randomly picks one of the shards to be the head shard.  We were 
originally going to run tests to determine whether it was better to have one 
dedicated machine configured as the head shard, but never got around to that.  
We have a very low query request rate, so we haven't had to look seriously at 
load balancing.
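
For reference, the distributed part is just Solr's shards parameter.  Something 
like the following in solrconfig.xml (hostnames made up for illustration) would 
hard-wire the shard list on whichever node receives the query; the same 
parameter can instead be passed with each request, which is effectively what an 
app layer that picks a head shard at query time does:

  <requestHandler name="/distrib" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- the node that receives the request fans the query out to these shards and merges the results -->
      <str name="shards">shard1.example.org:8983/solr,shard2.example.org:8983/solr,shard3.example.org:8983/solr</str>
    </lst>
  </requestHandler>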

do you do merging? 

I'm not sure what you mean by "do you do merging".  We are just using the 
default Solr distributed search.  In theory our documents should be randomly 
distributed among the shards, so the lack of global idf should not hurt the 
merging process.  Andrzej Bialecki gave a recent presentation on Solr 
distributed search that talks about less-than-optimal results merging and some 
ideas for dealing with it:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf

Each shard currently is allocated max 12GB memory. 
I'm curious about how much memory you leave to the OS for disk caching.  Can 
you give any details about the number of shards per machine and the total 
memory on the machine?


Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search




From: Dmitry Kan [dmitry@gmail.com]
Sent: Tuesday, June 14, 2011 2:15 PM
To: solr-user@lucene.apache.org
Subject: Re: huge shards (300GB each) and load balancing

Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte of total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard is currently allocated max 12GB of memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence a more focused search that hits a small portion of the Solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the Solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document-hash-based sharding? Do
you have a load balancer between the front SOLR (or front entity) and the shards,
and do you do merging?





Re: huge shards (300GB each) and load balancing

2011-06-14 Thread Dmitry Kan
Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte of total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard is currently allocated max 12GB of memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence a more focused search that hits a small portion of the Solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the Solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document-hash-based sharding? Do
you have a load balancer between the front SOLR (or front entity) and the shards,
and do you do merging?

On Wed, Jun 8, 2011 at 10:23 PM, Burton-West, Tom tburt...@umich.edu wrote:

 Hi Dmitry,

 I am assuming you are splitting one very large index over multiple shards
 rather than replicating an index multiple times.

 Just for a point of comparison, I thought I would describe our experience
 with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.
  This is split over 4 machines with 3 shards per machine and our shards are
 about 400-500GB.  We get average response times of around 200 ms with the
 99th percentile queries up around 1-2 seconds. We have a very low qps rate,
 i.e. less than 1 qps.  We also index offline on a separate machine and
 update the indexes nightly.

 Some of the issues we have found with very large shards are:
 1) Because of the very large shard size, I/O tends to be the bottleneck,
 with phrase queries containing common words being the slowest.
 2) Because of the I/O issues, running cache-warming queries to get postings
 into the OS disk cache is important, as is leaving significant free memory
 for the OS to use for disk caching.
 3) Because of the I/O issues, using stop words or CommonGrams produces a
 significant performance increase.
 4) We have a huge number of unique terms in our indexes.  In order to
 reduce the amount of memory needed by the in-memory terms index we set the
 termInfosIndexDivisor to 8, which causes Solr to only load every 8th term
 from the tii file into memory. This reduced memory use from over 18GB to
 below 3GB and got rid of 30-second stop-the-world Java garbage collections.
 (See
 http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for
 details.)  We later ran into memory problems when indexing, so we instead
 changed the index-time parameter termIndexInterval from 128 to 1024.

 (More details here: http://www.hathitrust.org/blogs/large-scale-search)

 Tom Burton-West




-- 
Regards,

Dmitry Kan


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hi Upayavira,

Thanks for sharing insights and experience on this.

As we have 6 shards' worth of data at the moment, it is pretty hard (almost
impossible) to keep it all on a single box, so that's why we decided to shard. On
the other hand, we have never tried a multicore architecture, so that's a good
point, thanks.

On the indexing side, we do it rather straightforwardly, that is, by updating
the online shards. This should hopefully be improved with an [offline update /
HTTP swap] system, as even now, updating online 200GB shards at times
produces OOMs, freezes, and other issues.



Does anyone have other experience with, or pointers to, load balancer software
that has been tried with SOLR?

Dmitry

On Wed, Jun 8, 2011 at 12:32 PM, Upayavira u...@odoko.co.uk wrote:



 On Wed, 08 Jun 2011 10:42 +0300, Dmitry Kan dmitry@gmail.com
 wrote:
  Hello list,
 
  Thanks for attending to my previous questions so far, have learnt a lot.
  Here is another one, I hope it will be interesting to answer.
 
 
 
  We run our SOLR shards and front end SOLR on the Amazon high-end
  machines.
  Currently we have 6 shards with around 200GB in each. Currently we have
  only
  one front end SOLR which, given a client query, redirects it to all the
  shards. Our shards are constantly growing, data is at times reindexed (in
  batches, which is done by removing a decent chunk before replacing it
  with
  updated data), constant stream of new data is coming every hour (usually
  hits the latest shard in time, but can also hit other shards, which have
  older data). Since the front end SOLR has started to be a SPOF, we are
  thinking about setting up some sort of load balancer.
 
  1) do you think ELB from Amazon is a good solution for starters? We don't
  need to maintain sessions between SOLR and client.
  2) What other load balancers have been used specifically with SOLR?
 
 
  Overall: does SOLR scale to such size (200GB in an index) and what can be
  recommended as next step -- resharding (cutting existing shards to
  smaller
  chunks), replication?

 Really, it is going to be up to you to work out what works in your
 situation. You may be reaching the limit of what a Lucene index can
 handle, don't know. If your query traffic is low, you might find that
 two 100Gb cores in a single instance performs better. But then, maybe
 not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
 :-)

 The principal issue with Amazon's load balancers (at least when I was
 using them last year) is that the ports that they balance need to be
 public. You can't use an Amazon load balancer as an internal service
 within a security group. For a service such as Solr, that can be a bit
 of a killer.

 If they've fixed that issue, then they'd work fine (I used them quite
 happily in another scenario).

 When looking at resolving single points of failure, handling search is
 pretty easy (as you say, stateless load balancer). You will need to give
 more attention though to how you handle it regarding indexing.

 Hope that helps a bit!

 Upayavira





 ---
 Enterprise Search Consultant at Sourcesense UK,
 Making Sense of Open Source




Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Bill Bell
Re: Amazon ELB.

This is not exactly true. The ELB does load balance internal IPs, but the ELB 
IP address must be external. Still a major issue unless you use 
authentication. Nginx and others can also do load balancing.

Bill Bell
Sent from mobile


On Jun 8, 2011, at 3:32 AM, Upayavira u...@odoko.co.uk wrote:

 
 
 On Wed, 08 Jun 2011 10:42 +0300, Dmitry Kan dmitry@gmail.com
 wrote:
 Hello list,
 
 Thanks for attending to my previous questions so far, have learnt a lot.
 Here is another one, I hope it will be interesting to answer.
 
 
 
 We run our SOLR shards and front end SOLR on the Amazon high-end
 machines.
 Currently we have 6 shards with around 200GB in each. Currently we have
 only
 one front end SOLR which, given a client query, redirects it to all the
 shards. Our shards are constantly growing, data is at times reindexed (in
 batches, which is done by removing a decent chunk before replacing it
 with
 updated data), constant stream of new data is coming every hour (usually
 hits the latest shard in time, but can also hit other shards, which have
 older data). Since the front end SOLR has started to be a SPOF, we are
 thinking about setting up some sort of load balancer.
 
 1) do you think ELB from Amazon is a good solution for starters? We don't
 need to maintain sessions between SOLR and client.
 2) What other load balancers have been used specifically with SOLR?
 
 
 Overall: does SOLR scale to such size (200GB in an index) and what can be
 recommended as next step -- resharding (cutting existing shards to
 smaller
 chunks), replication?
 
 Really, it is going to be up to you to work out what works in your
 situation. You may be reaching the limit of what a Lucene index can
 handle, don't know. If your query traffic is low, you might find that
 two 100Gb cores in a single instance performs better. But then, maybe
 not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
 :-)
 
 The principal issue with Amazon's load balancers (at least when I was
 using them last year) is that the ports that they balance need to be
 public. You can't use an Amazon load balancer as an internal service
 within a security group. For a service such as Solr, that can be a bit
 of a killer.
 
 If they've fixed that issue, then they'd work fine (I used them quite
 happily in another scenario).
 
 When looking at resolving single points of failure, handling search is
 pretty easy (as you say, stateless load balancer). You will need to give
 more attention though to how you handle it regarding indexing.
 
 Hope that helps a bit!
 
 Upayavira
 
 
 
 
 
 --- 
 Enterprise Search Consultant at Sourcesense UK, 
 Making Sense of Open Source
 


Re: huge shards (300GB each) and load balancing

2011-06-08 Thread Dmitry Kan
Hi, Bill. Thanks, always nice to have options!

Dmitry

On Wed, Jun 8, 2011 at 4:47 PM, Bill Bell billnb...@gmail.com wrote:

 Re: Amazon ELB.

 This is not exactly true. The ELB does load balance internal IPs, but the
 ELB IP address must be external. Still a major issue unless you use
 authentication. Nginx and others can also do load balancing.

 Bill Bell
 Sent from mobile


 On Jun 8, 2011, at 3:32 AM, Upayavira u...@odoko.co.uk wrote:

 
 
  On Wed, 08 Jun 2011 10:42 +0300, Dmitry Kan dmitry@gmail.com
  wrote:
  Hello list,
 
  Thanks for attending to my previous questions so far, have learnt a lot.
  Here is another one, I hope it will be interesting to answer.
 
 
 
  We run our SOLR shards and front end SOLR on the Amazon high-end
  machines.
  Currently we have 6 shards with around 200GB in each. Currently we have
  only
  one front end SOLR which, given a client query, redirects it to all the
  shards. Our shards are constantly growing, data is at times reindexed
 (in
  batches, which is done by removing a decent chunk before replacing it
  with
  updated data), constant stream of new data is coming every hour (usually
  hits the latest shard in time, but can also hit other shards, which have
  older data). Since the front end SOLR has started to be a SPOF, we are
  thinking about setting up some sort of load balancer.
 
  1) do you think ELB from Amazon is a good solution for starters? We
 don't
  need to maintain sessions between SOLR and client.
  2) What other load balancers have been used specifically with SOLR?
 
 
  Overall: does SOLR scale to such size (200GB in an index) and what can
 be
  recommended as next step -- resharding (cutting existing shards to
  smaller
  chunks), replication?
 
  Really, it is going to be up to you to work out what works in your
  situation. You may be reaching the limit of what a Lucene index can
  handle, don't know. If your query traffic is low, you might find that
  two 100Gb cores in a single instance performs better. But then, maybe
  not! Or two 100Gb shards on smaller Amazon hosts. But then, maybe not!
  :-)
 
  The principal issue with Amazon's load balancers (at least when I was
  using them last year) is that the ports that they balance need to be
  public. You can't use an Amazon load balancer as an internal service
  within a security group. For a service such as Solr, that can be a bit
  of a killer.
 
  If they've fixed that issue, then they'd work fine (I used them quite
  happily in another scenario).
 
  When looking at resolving single points of failure, handling search is
  pretty easy (as you say, stateless load balancer). You will need to give
  more attention though to how you handle it regarding indexing.
 
  Hope that helps a bit!
 
  Upayavira
 
 
 
 
 
  ---
  Enterprise Search Consultant at Sourcesense UK,
  Making Sense of Open Source
 




-- 
Regards,

Dmitry Kan


RE: huge shards (300GB each) and load balancing

2011-06-08 Thread Burton-West, Tom
Hi Dmitry,

I am assuming you are splitting one very large index over multiple shards 
rather than replicating an index multiple times.

Just for a point of comparison, I thought I would describe our experience with 
large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.  This is 
split over 4 machines with 3 shards per machine and our shards are about 
400-500GB.  We get average response times of around 200 ms with the 99th 
percentile queries up around 1-2 seconds. We have a very low qps rate, i.e. 
less than 1 qps.  We also index offline on a separate machine and update the 
indexes nightly.

Some of the issues we have found with very large shards are:
1) Because of the very large shard size, I/O tends to be the bottleneck, with 
phrase queries containing common words being the slowest.
2) Because of the I/O issues, running cache-warming queries to get postings into 
the OS disk cache is important, as is leaving significant free memory for the OS 
to use for disk caching (a config sketch follows below).
3) Because of the I/O issues, using stop words or CommonGrams produces a 
significant performance increase (a schema sketch follows below).
4) We have a huge number of unique terms in our indexes.  In order to reduce 
the amount of memory needed by the in-memory terms index we set the 
termInfosIndexDivisor to 8, which causes Solr to only load every 8th term from 
the tii file into memory. This reduced memory use from over 18GB to below 3GB 
and got rid of 30-second stop-the-world Java garbage collections. (See 
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for 
details.)  We later ran into memory problems when indexing, so we instead 
changed the index-time parameter termIndexInterval from 128 to 1024.
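
For item 2, the warming can be done with the standard query-sender listener in 
solrconfig.xml.  A rough sketch (the query text is only an example; the idea is 
to run phrase queries with common words so the big postings lists get pulled 
into the OS disk cache):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- example warming query: common words pull large postings lists into the OS cache -->
      <lst><str name="q">"the history of the united states"</str><str name="rows">10</str></lst>
    </arr>
  </listener>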

(More details here: http://www.hathitrust.org/blogs/large-scale-search)
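
For item 3, a CommonGrams field type in schema.xml looks roughly like the 
following (the word-list filename is just a placeholder; the list should hold 
your own most frequent terms):

  <fieldType name="text_common_grams" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- at index time, build bigrams for terms that appear in the common-words list -->
      <filter class="solr.CommonGramsFilterFactory" words="common-words.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- at query time, keep only the bigrams for phrases that contain common words -->
      <filter class="solr.CommonGramsQueryFilterFactory" words="common-words.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>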

Tom Burton-West