It sort of depends on your purpose and the amount of data. I currently have a single Solr instance (~1GB of memory, 2 processors on the server) serving ~3,700,000 records from Nutch, and it's still working great for me. If you have around that amount, I'd say a single Solr instance is fine, depending on whether you're planning to make your data publicly available or not.
If you're building something larger, Solr 4.0, which supports sharding natively, would be a great option (I think it's still in beta, but if you're feeling brave...). This is especially true if you're building a search engine, or want your data to be easily searchable. I would imagine serving this directly from HBase would not be a great option, as Nutch stores the data in a format that is convenient for Nutch itself, not in a format that is friendly for you to reuse for your own purposes. IMO your best bet is going to be to try out Solr 4.0.

Alex

On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <[email protected]> wrote:
> Dear All,
> What would be the best practice to index a large crawl using Solr? The
> crawl is performed on a multi-node Hadoop cluster using HBase as the back
> end. Would Solr become a bottleneck if we use just a single Solr instance?
> Is it possible to store the indexed data in HBase and to serve it from
> HBase itself?
>
> thanks a lot,
> Thilina
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org

--
___
Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO
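P.S. For what it's worth, here's a rough sketch of how either route looks on the command line. This assumes Nutch 1.x with its bundled solrindex job, and the stock Solr 4.0 example distribution; the URLs, paths, and collection/config names are placeholders for whatever your setup actually uses:

```shell
# Single-instance route: push an existing Nutch crawl into one Solr
# instance (the Solr URL and crawl paths below are placeholders).
bin/nutch solrindex http://localhost:8983/solr \
    crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

# Solr 4.0 route: start the example server with native sharding enabled,
# e.g. two shards using the embedded ZooKeeper (-DzkRun). "myconf" is a
# hypothetical config name -- substitute your own.
cd solr-4.0.0/example
java -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
```

Once the sharded instance is up, the same solrindex call pointed at it will distribute documents across the shards for you.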

