It sort of depends on your purpose and the amount of data. I currently have a single Solr instance (~1GB of memory, 2 processors on the server) serving ~3,700,000 records from Nutch, and it's still working great for me. If you have around that amount, I'd say a single Solr instance is fine, depending on whether you're planning to make your data publicly available or not.
If you're building something larger, Solr 4.0, which supports sharding natively, would be a great option (I think it's still in beta, but if you're feeling brave...). This is especially true if you're building a search engine, or want your data to be easily searchable. I would imagine serving this directly from HBase would not be a great option, as Nutch stores the data in a format that is convenient for Nutch itself, not in a format that is friendly for you to reuse for your own purposes. IMO your best bet is going to be to try out Solr 4.0.

Alex

On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <[email protected]> wrote:
> Dear All,
> What would be the best practice to index a large crawl using Solr? The
> crawl is performed on a multi-node Hadoop cluster using HBase as the back
> end. Would Solr become a bottleneck if we use just a single Solr instance?
> Is it possible to store the indexed data in HBase and to serve it from
> HBase itself?
>
> thanks a lot,
> Thilina
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org

--
___
Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO
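P.S. For what it's worth, here's a rough sketch of how either route looks on the command line. This assumes Nutch 1.x with its bundled solrindex job, and the stock Solr 4.0 example distribution; the URLs, paths, and collection/config names are placeholders for whatever your setup actually uses:

```shell
# Single-instance route: push an existing Nutch crawl into one Solr
# instance (the Solr URL and crawl paths below are placeholders).
bin/nutch solrindex http://localhost:8983/solr \
    crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

# Solr 4.0 route: start the example server with native sharding enabled,
# e.g. two shards using the embedded ZooKeeper (-DzkRun). "myconf" is a
# hypothetical config name -- substitute your own.
cd solr-4.0.0/example
java -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
```

Once the sharded instance is up, the same solrindex call pointed at it will distribute documents across the shards for you.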

