John:
The MapReduceIndexerTool (in contrib) is intended for bulk indexing in
a Hadoop ecosystem. This doesn't preclude home-grown setups, of course,
but it's available OOB. The only tricky bit is at the end: either you
have your Solr indexes on HDFS, in which case MRIT can merge them into
a live
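The message is cut off above, but the "tricky bit at the end" it refers to is MRIT's go-live merge. A hedged sketch of such a run follows; the jar name and class path vary by Solr/CDH distribution, and all hosts, paths, and the collection name here are placeholders, not values from this thread:

```shell
# Build index shards on the cluster, then merge them into the live
# SolrCloud collection (--go-live uses --zk-host to find live shards).
# morphline.conf describes how to parse/transform the input records.
hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1/tmp/outdir \
  --zk-host zk1:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs://nameservice1/tmp/indir
```

Without --go-live, the tool just leaves the built shards under --output-dir for you to merge or install yourself.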
Regarding a 12TB index:
Yago Riveiro wrote:
> Our cluster is small for the data we hold (12 machines with SSD and 32G of
> RAM), but we don't need sub-second queries; we need faceting with high
> cardinality (in worst-case scenarios we aggregate 5M unique string values)
>
John Bickerstaff wrote:
> As an aside - I just spoke with someone the other day who is using Hadoop
> for re-indexing in order to save a lot of time.
If you control which documents go into which shards, then that is certainly a
possibility. We have a collection with long
As an aside - I just spoke with someone the other day who is using Hadoop
for re-indexing in order to save a lot of time. I don't know the details, but
I assume they're using Hadoop to call Lucene code and index documents using
the map-reduce approach...
This was made in their own shop - I don't
"LucidWorks achieved 150k docs/second"
This is only valid if you don't have replication. I don't know your use case,
but a realistic use case normally uses some type of redundancy so that data is
not lost on a hardware failure - at least 2 replicas, and more replicas imply
a further reduction in throughput. Also
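For reference, the redundancy being discussed is set when the collection is created. A minimal sketch using the Collections API (host, collection name, and shard count are placeholders, not taken from this thread):

```shell
# Create a collection with 2 replicas per shard, so a single node
# failure loses no data. Every document is then indexed twice, which
# is the throughput cost described above.
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=bigcollection&numShards=12&replicationFactor=2&maxShardsPerNode=2'
```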
Hey Yago,
12 TB is very impressive.
Can you also share some numbers about the shards, replicas, machine
count/specs and docs/second for your case?
I assume you are not running a single 12 TB index either, so some
details on that would be really helpful too.
In my company we have a SolrCloud cluster with 12 TB.
My advice:
Be generous with CPU - you will need it at some point (very important if you
have no control over the kinds of queries hitting the cluster; clients are
greedy, they want all results at the same time).
SSD and memory (as much as you can afford).
Solr is RAM hungry. Make sure that you have enough RAM to keep most of the
index of a core in RAM itself.
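The RAM advice above can be turned into a rough back-of-envelope check. This is only a sketch with hypothetical numbers (the 8 GB JVM heap is my assumption, not from the thread); Solr/Lucene lean on the OS page cache, so what matters is the RAM left over after the heap:

```python
def cache_coverage(index_tb, nodes, ram_gb_per_node, heap_gb_per_node):
    """Fraction of each node's index share that fits in the OS page cache."""
    share_gb = index_tb * 1024 / nodes             # index data per node
    cache_gb = ram_gb_per_node - heap_gb_per_node  # RAM left for the page cache
    return min(1.0, cache_gb / share_gb)

# The 12 TB / 12 node / 32 GB case from this thread, assuming an 8 GB heap:
# each node holds ~1 TB of index but can cache only ~24 GB of it.
print(round(cache_coverage(12, 12, 32, 8), 3))  # -> 0.023, i.e. ~2% cached
```

That ~2% figure is consistent with Yago's remark that the cluster is small for the data and cannot deliver sub-second queries.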
You should also consider using really good SSDs.
That would be a good start. Like others said, test and verify your setup.
--Pushkar Raste
On Sep 23, 2016 4:58 PM, "Jeffery Yuan"
Thanks so much for your prompt reply.
We are definitely going to use SolrCloud.
I am just wondering whether SolrCloud can scale even at the TB data level and
what kind of hardware configuration it would need.
Thanks.