Re: Best practice to index a large crawl through Solr?

Thilina Gunarathne Mon, 22 Oct 2012 13:31:18 -0700

Hi Alex,
Thanks again for the information.

My current requirement is to implement a simple search application for
a publication. Our current data size probably will not exceed the number
of records you mentioned, so for now we should be fine with a single Solr
instance. I'm going to check out SolrCloud for our future needs.


>Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
>sound pretty crazy.
I agree :).. Unfortunately (or maybe luckily) I do not have much time to
invest in this, so I'll probably have to rely on the existing tools rather
than trying to reinvent the wheel :)..

thanks,
Thilina


On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres <
[email protected]> wrote:

> No problem. Wrt your first question, Solr would actually be storing
> this data locally. Solr sharding actually uses its own mechanism
> called SolrCloud. I'd recommend checking it out here:
> http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not
> used it myself.
>
> Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> sound pretty crazy. You can most definitely find a more efficient way
> to do this, either by going to HBase directly from the start (I
> wouldn't do so personally) or just using Solr. It might be good to
> know what kind of application you are looking to build and asking more
> specifically.
>
> Alex
>
> On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne <[email protected]>
> wrote:
> > Hi Alex,
> > Thanks for the very fast response :)..
> >
> >> It sort of depends on your purpose and the amount of data. I currently
> >> have a single Solr instance (~1GB of memory, 2 processors on the
> >> server) serving almost ~3,700,000 records from Nutch and it's still
> >> working great for me. If you have around that I'd say a single Solr
> >> instance is OK, depending on if you are planning on making your data
> >> publicly available or not.
> >>
> > This is very useful information. In this case, would the Solr instance be
> > retrieving and storing all the data locally or is it still using the Nutch
> > data store to retrieve the actual content while serving the queries?
> >
> >
> >> If you're creating something larger of some sort, Solr 4.0, which
> >> supports sharding natively would be a great option (I think it's still
> >> in Beta, but if you're feeling brave...). This is especially true if
> >> you are creating a search engine of some sort, or would like easily
> >> searchable data.
> >>
> > That's interesting. I'll check that out. By any chance, do you know whether
> > the Solr sharding is using HDFS to store the data or is it using its
> > own infrastructure?
> >
> >
> >> I would imagine doing this directly from HBase would not be a great
> >> option, as Nutch is storing the data in the format that is convenient
> >> for Nutch itself to use, and not so much in a format that it is
> >> friendly for you to reuse for your own purposes.
> >>
> > I was actually thinking of a scenario where we would use Solr to index the
> > data and store the resultant index in HBase, then use HBase
> > directly to perform simple index lookups. Please pardon my lack of
> > knowledge on Nutch and Solr, if the above sounds ludicrous :)..
> >
> > thanks,
> > Thilina
> >
> >
> >> IMO your best bet is going to try out Solr 4.0.
> >>
> >> Alex
> >>
> >> On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <[email protected]>
> >> wrote:
> >> > Dear All,
> >> > What would be the best practice to index a large crawl using Solr? The
> >> > crawl is performed on a multi-node Hadoop cluster using HBase as the back
> >> > end. Would Solr become a bottleneck if we use just a single Solr instance?
> >> > Is it possible to store the indexed data on HBase and to serve it from
> >> > HBase itself?
> >> >
> >> > thanks a lot,
> >> > Thilina
> >> >
> >> > --
> >> > https://www.cs.indiana.edu/~tgunarat/
> >> > http://www.linkedin.com/in/thilina
> >> > http://thilina.gunarathne.org
> >>
> >>
> >>
> >> --
> >> ___
> >>
> >> Alejandro Caceres
> >> Hyperion Gray, LLC
> >> Owner/CTO
> >>
> >
> >
> >
> > --
> > https://www.cs.indiana.edu/~tgunarat/
> > http://www.linkedin.com/in/thilina
> > http://thilina.gunarathne.org
>
>
>
> --
> ___
>
> Alejandro Caceres
> Hyperion Gray, LLC
> Owner/CTO
>



-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
