Hi Alex,
Thanks again for the information. My current requirement is to implement a simple search application for a publication. Our current data size probably will not exceed the number of records you mentioned, so for now we should be fine with a single Solr instance. I'm going to check out SolrCloud for our future needs.

> Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> sound pretty crazy.

I agree :).. Unfortunately (or maybe luckily) I do not have much time to invest in this, and I'll probably have to rely on the existing tools rather than trying to reinvent the wheel :)..

thanks,
Thilina

On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres <[email protected]> wrote:

> No problem. Wrt your first question, Solr would actually be storing
> this data locally. Solr sharding actually uses its own mechanism
> called SolrCloud. I'd recommend checking it out here:
> http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not
> used it myself.
>
> Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> sound pretty crazy. You can most definitely find a more efficient way
> to do this, either by going to HBase directly from the start (I
> wouldn't do so personally) or just using Solr. It might be good to
> know what kind of application you are looking to build and asking more
> specifically.
>
> Alex
>
> On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne <[email protected]> wrote:
> > Hi Alex,
> > Thanks for the very fast response :)..
> >
> >> It sort of depends on your purpose and the amount of data. I currently
> >> have a single Solr instance (~1GB of memory, 2 processors on the
> >> server) serving almost 3,700,000 records from Nutch and it's still
> >> working great for me. If you have around that I'd say a single Solr
> >> instance is OK, depending on if you are planning on making your data
> >> publicly available or not.
> >>
> > This is very useful information. In this case, would the Solr instance be
> > retrieving and storing all the data locally or is it still using the Nutch
> > data store to retrieve the actual content while serving the queries?
> >
> >> If you're creating something larger of some sort, Solr 4.0, which
> >> supports sharding natively would be a great option (I think it's still
> >> in Beta, but if you're feeling brave...). This is especially true if
> >> you are creating a search engine of some sort, or would like easily
> >> searchable data.
> >>
> > That's interesting. I'll check that out. By any chance, do you know whether
> > the Solr sharding is using the HDFS to store the data or is it using its
> > own infrastructure?
> >
> >> I would imagine doing this directly from HBase would not be a great
> >> option, as Nutch is storing the data in the format that is convenient
> >> for Nutch itself to use, and not so much in a format that it is
> >> friendly for you to reuse for your own purposes.
> >>
> > I was actually thinking of a scenario where we would use Solr to index the
> > data and store the resultant index in HBase, then use HBase directly to
> > perform simple index lookups.. Please pardon my lack of knowledge on
> > Nutch and Solr, if the above sounds ludicrous :)..
> >
> > thanks,
> > Thilina
> >
> >> IMO your best bet is going to try out Solr 4.0.
> >>
> >> Alex
> >>
> >> On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <[email protected]> wrote:
> >> > Dear All,
> >> > What would be the best practice to index a large crawl using Solr? The
> >> > crawl is performed on a multi node Hadoop cluster using HBase as the back
> >> > end.. Would Solr become a bottleneck if we use just a single Solr instance?
> >> > Is it possible to store the indexed data on HBase and to serve them from
> >> > the HBase itself?
> >> >
> >> > thanks a lot,
> >> > Thilina
>
> --
> ___
>
> Alejandro Caceres
> Hyperion Gray, LLC
> Owner/CTO

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

