Re: Best practice to index a large crawl through Solr?

Alejandro Caceres Mon, 22 Oct 2012 13:01:16 -0700

No problem. W.r.t. your first question, Solr would actually be storing
this data locally. As for sharding, Solr uses its own mechanism,
called SolrCloud. I'd recommend checking it out here:
http://wiki.apache.org/solr/SolrCloud
It seems cool, though I have not used it myself.
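
To illustrate the "storing this data locally" point: once Nutch has
pushed documents into Solr, queries are answered from Solr's own index
over HTTP, without going back to the Nutch data store. Below is a
minimal sketch in Python, not a definitive client. The base URL
(localhost:8983), the /select path without a core name, and the field
names "url", "title" and "content" are all assumptions for
illustration; check them against your own Solr setup and schema.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed location of the Solr instance; adjust for your deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, fields=("url", "title"), rows=10, base=SOLR_BASE):
    """Build a URL for Solr's standard select handler.

    q, fl, rows and wt are standard Solr query parameters; the field
    names passed in `fields` are assumed to exist in the schema.
    """
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return base + "/select?" + urlencode(params)


def extract_docs(raw_json):
    """Pull the list of matching documents out of a JSON select response."""
    return json.loads(raw_json)["response"]["docs"]


def search(query, **kwargs):
    """Run the query against Solr (needs a running instance)."""
    with urlopen(build_select_url(query, **kwargs)) as resp:
        return extract_docs(resp.read().decode("utf-8"))
```

Calling search("content:hadoop") against a running instance would
return the stored "url" and "title" values for matching documents, all
served by Solr itself.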


Hm, so you are thinking Nutch -> HBase -> Solr -> HBase? That does
sound pretty crazy. You can most definitely find a more efficient way
to do this, either by going to HBase directly from the start (I
wouldn't do so personally) or by just using Solr. It would help to
know what kind of application you are looking to build, so you can
ask a more specific question.

Alex

On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne <[email protected]> wrote:
> Hi Alex,
> Thanks for the very fast response :)..
>
>> It sort of depends on your purpose and the amount of data. I currently
>> have a single Solr instance (~1GB of memory, 2 processors on the
>> server) serving ~3,700,000 records from Nutch and it's still
>> working great for me. If you have around that I'd say a single Solr
>> instance is OK, depending on if you are planning on making your data
>> publicly available or not.
>>
> This is very useful information. In this case, would the Solr instance be
> retrieving and storing all the data locally or is it still using the Nutch
> data store to retrieve the actual content while serving the queries?
>
>
>> If you're creating something larger of some sort, Solr 4.0, which
>> supports sharding natively, would be a great option (I think it's still
>> in Beta, but if you're feeling brave...). This is especially true if
>> you are creating a search engine of some sort, or would like easily
>> searchable data.
>>
> That's interesting. I'll check that out. By any chance, do you know whether
> Solr sharding uses HDFS to store the data, or does it use its
> own infrastructure?
>
>
>> I would imagine doing this directly from HBase would not be a great
>> option, as Nutch is storing the data in the format that is convenient
>> for Nutch itself to use, and not so much in a format that is
>> friendly for you to reuse for your own purposes.
>>
> I was actually thinking of a scenario where we would use Solr to index the
> data and store the resultant index in HBase, then use HBase
> directly to perform simple index lookups. Please pardon my lack of
> knowledge on Nutch and Solr, if the above sounds ludicrous :)..
>
> thanks,
> Thilina
>
>
>> IMO your best bet is going to try out Solr 4.0.
>>
>> Alex
>>
>> On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne <[email protected]>
>> wrote:
>> > Dear All,
>> > What would be the best practice to index a large crawl using Solr? The
>> > crawl is performed on a multi-node Hadoop cluster using HBase as the back
>> > end. Would Solr become a bottleneck if we use just a single Solr
>> > instance?
>> > Is it possible to store the indexed data in HBase and to serve it from
>> > HBase itself?
>> >
>> > thanks a lot,
>> > Thilina
>> >
>> > --
>> > https://www.cs.indiana.edu/~tgunarat/
>> > http://www.linkedin.com/in/thilina
>> > http://thilina.gunarathne.org
>>
>>
>>
>> --
>> ___
>>
>> Alejandro Caceres
>> Hyperion Gray, LLC
>> Owner/CTO
>>
>
>
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org



-- 
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO
