On 6/4/2014 12:45 AM, Vineet Mishra wrote:
> Thanks all for your response.
> I presume this conversation concludes that indexing around 1Billion
> documents per shard won't be a problem, as I have 10 Billion docs to index,
> so approx 10 shards with 1 Billion each should be fine with it and how
> about Memory, what size of RAM should be fine for this amount of data?

Figure out the memory requirements of the operating system and every
program on the machine (Solr's Java heap especially).  Then add that
number to the total size of the index data on the machine.  That is the
ideal minimum RAM, because it leaves enough room for the OS disk cache
to hold the entire index.

http://wiki.apache.org/solr/SolrPerformanceProblems
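
As a rough sketch of that arithmetic: the function name is only
illustrative, and every number below is a made-up placeholder, not a
measurement from any real machine.

```python
def ideal_min_ram_gb(os_and_other_gb, solr_heap_gb, index_size_gb):
    """Ideal minimum RAM: memory for the OS and all programs, plus the index size."""
    return os_and_other_gb + solr_heap_gb + index_size_gb

# Hypothetical machine: 2 GB for the OS and other programs,
# an 8 GB Solr heap, and 200 GB of index data on disk.
ideal = ideal_min_ram_gb(2, 8, 200)
print(ideal)  # 210
```

With several shard replicas on one machine, index_size_gb would be the
sum of all of their on-disk sizes.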

Unfortunately, if you are dealing with a huge index with billions of
documents, it is likely to be prohibitively expensive to buy that much
RAM.  If you are running Solr on Amazon's cloud, the cost for that much
RAM would be astronomical.

Exactly how much RAM would actually be required is very difficult to
predict.  If you had only 25% of the ideal, your index might have
perfectly acceptable performance, or it might not.  It might do fine
under a light query load, but if you increase to 50 queries per second,
performance may drop significantly ... or it might be good.  It's
generally not possible to know how your hardware will perform until you
actually build and use your index.

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

A general rule of thumb for RAM that I have found to be useful is that
if you've got less than half of the ideal memory size, you might have
performance problems.
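
That rule of thumb can be written as a quick check.  This is only a
sketch: the function name and the sample figures are hypothetical, and
a False result is no guarantee of good performance.

```python
def below_half_of_ideal(installed_ram_gb, ideal_ram_gb):
    """True when installed RAM is under half of the ideal size."""
    return installed_ram_gb < 0.5 * ideal_ram_gb

# Hypothetical ideal size of 210 GB (OS + programs + index data).
print(below_half_of_ideal(64, 210))   # True: performance problems are possible
print(below_half_of_ideal(128, 210))  # False: above the half-way mark
```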

> Moreover what should be the indexing technique for this huge data set, as
> currently I am indexing with EmbeddedSolrServer but its going pathetically
> slow after some 20Gb of indexing. Comparatively SolrHttpPost was slow due
> to network delays and response but after this long running the indexing
> with EmbeddedSolrServer I am getting a different notion.
> Any good indexing technique for this huge dataset would be highly
> appreciated.

EmbeddedSolrServer is not recommended.  Run Solr in the traditional way
with HTTP connectivity.  HTTP overhead on a LAN is usually quite small.
Solr is fully thread-safe, so you can have several indexing threads all
going at the same time.
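
A minimal sketch of several indexing threads sending batches over HTTP,
using only the Python standard library.  The URL, collection name,
batch size, and thread count are assumptions for illustration, and the
helper names are hypothetical.  It assumes the /update handler accepts
a JSON array of documents sent with Content-Type application/json.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def batches(docs, size):
    """Yield consecutive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def post_batch(update_url, batch):
    """POST one batch of documents to Solr as a JSON array."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

def index_parallel(docs, send, batch_size=1000, threads=4):
    """Send batches from several threads; returns the number of batches sent.

    Note: pool.map queues every batch up front, which is fine for a
    sketch but a real loader for billions of documents would need to
    limit how many batches are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return len(list(pool.map(send, batches(docs, batch_size))))

# Example call (hypothetical host and collection, not executed here):
#   url = "http://localhost:8983/solr/collection1/update"
#   index_parallel(my_docs, lambda b: post_batch(url, b), threads=8)
```

Passing the send function in as a parameter keeps the threading logic
separate from the HTTP call, so the batching can be tried out without a
running Solr instance.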

Indexes at this scale should normally be built with SolrCloud, with
enough servers so that each machine is only handling one shard replica.
The ideal indexing program would be written in Java, using SolrJ's
CloudSolrServer.
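
As a back-of-the-envelope sketch of the machine count that "one shard
replica per machine" implies: the document counts come from the
question above, while the replication factor of 2 and the function name
are assumptions made only for this example.

```python
def servers_needed(total_docs, docs_per_shard, replication_factor):
    """Return (shards, servers) with one shard replica per machine."""
    shards = -(-total_docs // docs_per_shard)  # integer ceiling division
    return shards, shards * replication_factor

# 10 billion documents at 1 billion per shard, two copies of each shard.
shards, servers = servers_needed(10_000_000_000, 1_000_000_000, 2)
print(shards, servers)  # 10 20
```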

Thanks,
Shawn
