On 8 February 2011 11:23, Julien Nioche <[email protected]> wrote:
> Hi Marco
>
> Nutch now delegates indexing and searching to SOLR; all the steps you
> described (tokenization, lowercasing, etc.) are implemented there, and
> Nutch does nothing special about them. From a Nutch point of view,
> indexing consists in gathering data from various sources (crawldb,
> segments, linkdb), applying some simple transformations (indexing
> filters), then sending the result to SOLR.
>
> You can of course write a custom map-reduce function with SOLR embedded,
> but that is not what we do in Nutch. Have a look at the SOLR mailing
> lists; you'll probably find more info there.
>
> HTH
>
> Julien
>
> PS: (shameful self-promotion for one of my pet projects) Behemoth
> (https://github.com/jnioche/behemoth) is about doing large-scale text
> processing on Hadoop. There is a component which delegates the indexing
> of documents to SOLR, but it could be modified to do what you described
> and have SOLR instances within the map/reduce functions.
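For what it's worth, the indexing flow Julien describes (gather records from the crawldb/segments/linkdb, run them through indexing filters, then send the results to SOLR) can be sketched in plain Java. This is only an illustration: the `Map<String, String>` documents, the `applyFilters` lower-casing step, and the collected output list are stand-ins for Nutch's real document class, filter chain, and SolrJ client call, not the actual Nutch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a "document" is just field-name -> value here,
// and the filter/send steps mimic the shape of the Nutch pipeline
// (indexing filters, then a SOLR client call) without the Hadoop or
// SolrJ dependencies.
public class IndexingFlowSketch {

    // Stand-in for one indexing filter: normalise field values to lower case.
    static Map<String, String> applyFilters(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey(), e.getValue().toLowerCase());
        }
        return out;
    }

    // Stand-in for the indexer: filter each gathered document, then collect
    // the results (where Nutch would instead hand them to a SOLR client).
    static List<Map<String, String>> index(List<Map<String, String>> gathered) {
        List<Map<String, String>> toSolr = new ArrayList<>();
        for (Map<String, String> doc : gathered) {
            toSolr.add(applyFilters(doc));
        }
        return toSolr;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://Example.COM/");
        doc.put("title", "Hello World");
        List<Map<String, String>> out = index(List.of(doc));
        System.out.println(out.get(0).get("title")); // hello world
    }
}
```

In the real pipeline the "send" step would be a SOLR client add/commit inside a map-reduce job rather than an in-memory list.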
I did download the Behemoth source and was looking for some documentation to understand what I could re-use, but I found none, and no script to generate it. Could you point me in the right direction? I would appreciate it.

MD

