On 10/2/2018 9:33 AM, Rekha wrote:
Dear Solr Team, I need the following clarifications from you; please check
and advise me.
1. I want to store and search 200 billion documents (each document contains
16 fields). Can I achieve this with SolrCloud?
2. How many shards and nodes will be needed for my case?
3. Can I increase the number of nodes and shards in the future?
Thanks, Rekha Karthick
In a nutshell: it's not possible to give generic advice. The contents of
the fields, the nature of the queries you send, the query rate, and the
overall size of the index (disk space as well as document count) will all
affect exactly what you need.
In the "not very helpful" department, but I promise this is absolute
truth, there's this blog post:
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
To handle 200 billion documents *in a single collection*, you're
probably going to want at least 200 shards, and there are good reasons
to go with even more shards than that. But you need to be warned that
there can be serious scalability problems when SolrCloud must keep track
of that many different indexes. Here's an issue I filed for scalability
problems with thousands of collections ... there can be similar problems
with lots of shards as well. This issue says it is fixed, but no code
changes that I am aware of were ever made related to the issue, and as
far as I can tell, it's still a problem even in the latest version:
https://issues.apache.org/jira/browse/SOLR-7191
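To see why 200 shards is a floor rather than a comfortable target, here is
a quick back-of-envelope calculation (a sketch only: the 2,147,483,647
figure is Lucene's hard per-index document limit, and the 200-million
per-shard target is an arbitrary illustration, not a recommendation):

```python
# Back-of-envelope shard math for 200 billion documents.
TOTAL_DOCS = 200_000_000_000
LUCENE_MAX_DOCS_PER_INDEX = 2_147_483_647  # hard Lucene limit per shard core

docs_per_shard_at_200 = TOTAL_DOCS // 200
print(docs_per_shard_at_200)  # 1,000,000,000 docs in every shard

# One billion docs fits under the Lucene limit, but such shards are enormous.
# Pick a more comfortable per-shard target and see how many shards that needs:
target_per_shard = 200_000_000  # assumed target, purely for illustration
shards_needed = -(-TOTAL_DOCS // target_per_shard)  # ceiling division
print(shards_needed)  # 1000 shards
```

Even the minimal 200-shard layout leaves every shard holding a billion
documents, which is why there are good reasons to go higher, and why the
SolrCloud bookkeeping overhead described above becomes a real concern.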
With that many shards/replicas in one collection, you will likely need to
increase ZooKeeper's maximum znode size (the jute.maxbuffer system
property), because the JSON structure describing the collection will
probably exceed the default limit of about one megabyte.
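As an illustration of what that change looks like: jute.maxbuffer is a real
JVM system property, but the 10 MB value and the file names below are just
examples. It must be set to the same value on every ZooKeeper server and
every Solr node, or the two sides will disagree about the size limit:

```shell
# On each ZooKeeper server, e.g. in conf/zookeeper-env.sh:
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=10485760"  # 10 MB vs ~1 MB default

# On each Solr node, e.g. in solr.in.sh:
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=10485760"
```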
As for how many machines you'll need ... absolutely no idea. If the query
rate is insanely high, you'll want a dedicated machine for each shard
replica, and you may need many replicas, which is going to mean hundreds,
possibly thousands, of servers. If the query rate is really low and/or
each document is very small, you might be able to house more than one
shard per server. But you should know that handling 200 billion documents
is going to require a lot of hardware even if it turns out that you're not
handling tons of data (per document) or tons of queries.
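The server math above can be sketched like this (every number here is an
assumed input for illustration, not a sizing recommendation):

```python
import math

num_shards = 200           # from the shard discussion above
replication_factor = 3     # assumed: replicas needed to absorb the query load
cores_per_server = 2       # assumed: shard replicas one machine can host

total_cores = num_shards * replication_factor
servers_needed = math.ceil(total_cores / cores_per_server)
print(total_cores, servers_needed)  # 600 cores across 300 servers
```

Change any one input and the server count moves dramatically, which is
exactly why no generic answer is possible.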
Thanks,
Shawn