On 8 February 2011 11:23, Julien Nioche <[email protected]> wrote:
> Hi Marco
>
> Nutch now delegates indexing and searching to SOLR; all the steps you
> described (tokenization, lowercasing, etc.) are implemented there, and
> Nutch does nothing special about them. From a Nutch point of view,
> indexing consists in gathering data from various sources (crawldb,
> segments, linkdb), applying some simple transformations (indexing
> filters), then sending the result to SOLR.
>
> You can of course write a custom map-reduce function with SOLR embedded,
> but that is not what we do in Nutch. Have a look at the SOLR mailing
> lists; you'll probably find more info there.
>
> HTH
>
> Julien
>
> PS: (shameful self-promotion for one of my pet projects) Behemoth
> (https://github.com/jnioche/behemoth) is about doing large-scale text
> processing on Hadoop. There is a component which delegates the indexing
> of documents to SOLR, but it could be modified to do what you described
> and have SOLR instances within the map/reduce functions.
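For what it's worth, the indexing flow Julien describes (gather records from the crawldb/segments/linkdb, run them through indexing filters, then send the results to SOLR) can be sketched in plain Java. This is only an illustration: the `Map<String, String>` documents, the `applyFilters` lower-casing step, and the collected output list are stand-ins for Nutch's real document class, filter chain, and SolrJ client call, not the actual Nutch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a "document" is just field-name -> value here,
// and the filter/send steps mimic the shape of the Nutch pipeline
// (indexing filters, then a SOLR client call) without the Hadoop or
// SolrJ dependencies.
public class IndexingFlowSketch {

    // Stand-in for one indexing filter: normalise field values to lower case.
    static Map<String, String> applyFilters(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey(), e.getValue().toLowerCase());
        }
        return out;
    }

    // Stand-in for the indexer: filter each gathered document, then collect
    // the results (where Nutch would instead hand them to a SOLR client).
    static List<Map<String, String>> index(List<Map<String, String>> gathered) {
        List<Map<String, String>> toSolr = new ArrayList<>();
        for (Map<String, String> doc : gathered) {
            toSolr.add(applyFilters(doc));
        }
        return toSolr;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://Example.COM/");
        doc.put("title", "Hello World");
        List<Map<String, String>> out = index(List.of(doc));
        System.out.println(out.get(0).get("title")); // hello world
    }
}
```

In the real pipeline the "send" step would be a SOLR client add/commit inside a map-reduce job rather than an in-memory list.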
I did download the Behemoth source and was looking for some documentation to understand what I could re-use, but I found none, and no script to generate it. Could you point me in the right direction? I would appreciate it.

MD

