On 8 February 2011 11:46, Claudio Martella <[email protected]> wrote:
> Hi Marco,
>
> as Julien suggests, nutch is probably not the right place to look at.
> My personal advice is to have a look at Jimmy Lin's Cloud9, and
> specifically:
>
> http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/exercises/indexing.html
>
>

Actually, my implementation is based on Jimmy Lin's examples, because I
used his book as an introduction to MapReduce algorithm design :)
However, I also take into account term positions and the section (title,
body, abstract) in which a term occurs. That said, I think my approach
is a little dirty, as I use Lucene's classes to get the positional term
vector: too many objects are created and memory is wasted, IMHO. So I
was looking for something a little more elegant and "hadoopish".
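
For illustration, a minimal sketch of a positional, section-aware index
built without Lucene's term-vector classes might look like the following.
It is plain Java that only mimics the map and reduce steps in memory; it
is not actual Hadoop code and is not taken from the implementation
described above. The names PositionalIndexSketch, mapSection and
reduceInto, the "docId/section" key format, and the simple \W+ tokenizer
are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PositionalIndexSketch {

    // "Map" step (hypothetical): for one section of one document,
    // collect term -> list of token positions within that section.
    public static Map<String, List<Integer>> mapSection(String text) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        int pos = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
        return positions;
    }

    // "Reduce" step (hypothetical): merge one section's output into
    // term -> ("docId/section" -> positions).
    public static void reduceInto(Map<String, Map<String, List<Integer>>> index,
                                  String docId, String section,
                                  Map<String, List<Integer>> sectionPositions) {
        for (Map.Entry<String, List<Integer>> e : sectionPositions.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                 .put(docId + "/" + section, e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        reduceInto(index, "doc1", "title", mapSection("Inverted index design"));
        reduceInto(index, "doc1", "body",
                   mapSection("An inverted index maps a term to an index of postings"));
        // Postings for the term "index", grouped by document and section.
        System.out.println(index.get("index"));
    }
}
```

Keeping one position list per term and section means only one list is
allocated per distinct term, rather than one object per occurrence. In a
real job the same grouping would be expressed through the key and value
types emitted by the mapper.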

Thanks for your answer.

MD
