Re: Best practice to index a large crawl through Solr?

Thilina Gunarathne Mon, 22 Oct 2012 12:48:57 -0700

Hi Alex,
Thanks for the very fast response :)..

> It sort of depends on your purpose and the amount of data. I currently
> have a single Solr instance (~1GB of memory, 2 processors on the
> server) serving almost ~3,700,000 records from Nutch and it's still
> working great for me. If you have around that I'd say a single Solr
> instance is OK, depending on if you are planning on making your data
> publicly available or not.
>
This is very useful information. In this case, would the Solr instance be
retrieving and storing all the data locally or is it still using the Nutch
data store to retrieve the actual content while serving the queries?



> If you're creating something larger of some sort, Solr 4.0, which
> supports sharding natively would be a great option (I think it's still
> in Beta, but if you're feeling brave...). This is especially true if
> you are creating a search engine of some sort, or would like easily
> searchable data.
>
That's interesting. I'll check that out. By any chance, do you know whether
Solr sharding uses HDFS to store the data, or does it use its own
infrastructure?


> I would imagine doing this directly from HBase would not be a great
> option, as Nutch is storing the data in the format that is convenient
> for Nutch itself to use, and not so much in a format that it is
> friendly for you to reuse for your own purposes.
>
I was actually thinking of a scenario where we would use Solr to index the
data and store the resulting index in HBase, then use HBase directly to
perform simple index lookups. Please pardon my lack of knowledge of Nutch
and Solr if the above sounds ludicrous :)..
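
To make the idea concrete, this is roughly the kind of lookup I have in
mind. It is only a sketch under loud assumptions: a plain Python dict stands
in for an HBase table keyed by term, and build_inverted_index, lookup, and
the example URLs are made-up names for illustration, not anything provided
by Nutch, Solr, or HBase. I realize Solr's actual Lucene index holds much
more than a term-to-document map, so this would only cover exact-term
lookups.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # In the imagined setup, each (term -> doc ids) pair would be one row
    # in a key-value store; here a dict plays that role.
    return {term: sorted(ids) for term, ids in index.items()}


def lookup(index, term):
    """Single-key read: return the doc ids for one exact term, or []."""
    return index.get(term.lower(), [])


# Hypothetical crawled pages, keyed by URL.
docs = {
    "http://example.org/a": "Nutch crawls the web",
    "http://example.org/b": "Solr indexes the crawl",
}
index = build_inverted_index(docs)
print(lookup(index, "the"))
```

Anything beyond a single exact-term read (ranking, phrase queries,
stemming) is not covered by this sketch and is presumably where Solr itself
would still be needed.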

thanks,
Thilina


> IMO your best bet is going to try out Solr 4.0.
>
> Alex
>
> On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <[email protected]>
> wrote:
> > Dear All,
> > What would be the best practice to index a large crawl using Solr? The
> > crawl is performed on a multi node Hadoop cluster using HBase as the back
> > end. Would Solr become a bottleneck if we use just a single Solr
> > instance?
> > Is it possible to store the indexed data on HBase and to serve them from
> > HBase itself?
> >
> > thanks a lot,
> > Thilina
> >
> > --
> > https://www.cs.indiana.edu/~tgunarat/
> > http://www.linkedin.com/in/thilina
> > http://thilina.gunarathne.org
>
>
>
> --
> ___
>
> Alejandro Caceres
> Hyperion Gray, LLC
> Owner/CTO
>



-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
