On 8 February 2011 11:46, Claudio Martella <[email protected]> wrote:
> Hi Marco,
>
> as Julien suggests, nutch is probably not the right place to look at.
> My personal advice is to have a look at Jimmy Lin's Cloud9, and
> specifically:
>
> http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/exercises/indexing.html

Actually, my implementation is based on Jimmy Lin's examples, because I used his book as an introduction to MapReduce algorithm design :) However, I also take into account term positions and the section (title, body, abstract) in which a term occurs.

That said, I think my approach is a little dirty, as I use Lucene's class to get the positional term vector: too many objects are created and memory is wasted, IMHO. So I was looking for something a little more elegant and "hadoopish".

Thanks for your answer.

MD
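
For readers of the thread, here is a minimal sketch of the idea described above: positional postings that also record the section a term occurs in, built without Lucene's term vector classes. It is plain Python that simulates map, shuffle and reduce in memory, not Hadoop code. The function names, the sample documents and the naive whitespace tokenizer are assumptions made for illustration, not the implementation discussed in this message.

```python
from collections import defaultdict


def tokenize(text):
    # Naive lowercase whitespace tokenizer; a stand-in for a real analyzer.
    return text.lower().split()


def map_document(doc_id, sections):
    """Emit (term, (doc_id, section, position)) for every token occurrence."""
    for section, text in sections.items():
        for position, term in enumerate(tokenize(text)):
            yield term, (doc_id, section, position)


def reduce_postings(term, values):
    """Group one term's occurrences into {doc_id: {section: [positions]}}."""
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, section, position in sorted(values):
        postings[doc_id][section].append(position)
    return term, {doc: dict(secs) for doc, secs in postings.items()}


def build_index(docs):
    """Simulate the shuffle phase in memory, then reduce each term."""
    grouped = defaultdict(list)
    for doc_id, sections in docs.items():
        for term, value in map_document(doc_id, sections):
            grouped[term].append(value)
    return dict(reduce_postings(term, values) for term, values in grouped.items())


# Hypothetical sample documents, each split into named sections.
docs = {
    "d1": {"title": "inverted index",
           "body": "building an inverted index with hadoop"},
    "d2": {"title": "hadoop notes",
           "body": "index size matters"},
}
index = build_index(docs)
```

The mapper emits only small tuples, so no per-document term vector object is needed; in a real Hadoop job the tuple would presumably become a custom Writable, and positions are counted per section here, which is one possible design choice among several.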

