I see within your nutch-site.xml file that you have set an http.content.limit
value of 340,671. Is there any reason for this value? I'm assuming you are
not indexing this page just so you can search for the term phenomena, and
that there is other textual content within the page that you are interested
in...would this assumption be right?

As Markus explained, the page has an HTTP content length of some 600,000
bytes, and from looking at where the first occurrence of the term phenomena
is, it is located roughly halfway through the page.

When crawling large sites such as Wikipedia (which, as we all know, serves
large HTTP content within its webpages), I have found that a safeguard
to ensure we get all page content is to set the http.content.limit
to a negative value, e.g. -1. This way we are guaranteed to get all page
content. Another widely used and useful tool is Luke [1], which will
enable you to search your Lucene index and confirm whether or not Nutch has
fetched and stored the content you wish to have in your index.
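For reference, here is a sketch of what that property would look like in
nutch-site.xml (the surrounding <configuration> element and your other
properties are omitted; the description text is illustrative):

    <property>
      <name>http.content.limit</name>
      <!-- -1 disables truncation so the entire page body is fetched -->
      <value>-1</value>
      <description>The length limit for downloaded content, in bytes.
      A negative value means no truncation at all.</description>
    </property>

Note that removing the limit means very large documents will be fetched in
full, so keep an eye on fetcher memory and crawl time on big sites.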

[1] http://code.google.com/p/luke/

On Sat, Jun 25, 2011 at 7:42 AM, Jefferson <[email protected]> wrote:

> The problem is that it returns the beginning of the text section of the
> website. The correct behavior would be to return the passage in which the
> word <phenomena> is found.
> Sorry my english...
>
>
> Jefferson
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3107810.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*
