I see in your nutch-site.xml file that you have set an http.content.limit value of 340,671. Is there a reason for this value? I'm assuming you are not indexing this page merely so you can search for the term "phenomena", and that there is other textual content on the page that you are interested in... would that assumption be right?
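For reference, the property lives in conf/nutch-site.xml (overriding nutch-default.xml); a minimal sketch of the entry might look like the following, where the value -1 disables truncation entirely:

```xml
<!-- nutch-site.xml: http.content.limit controls how many bytes of a
     fetched page Nutch will keep. A negative value (e.g. -1) disables
     the limit so the full page content is stored. -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is non-negative, content longer than it will be
  truncated; otherwise no truncation is applied.</description>
</property>
```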
As Markus explained, the page has an HTTP content length of some 600,000 bytes, and the first occurrence of the term "phenomena" is located roughly halfway through the page. When crawling large sites such as Wikipedia (whose pages, as we all know, can contain a large amount of HTTP content), I have found that a safeguard measure to ensure we get all page content is to set http.content.limit to a negative value, e.g. -1. This way we are guaranteed to get all page content.

Another useful and widely used tool is Luke [1], which will enable you to search your Lucene index and confirm whether or not Nutch has fetched and stored the content you wish to have in your index.

[1] http://code.google.com/p/luke/

On Sat, Jun 25, 2011 at 7:42 AM, Jefferson <[email protected]> wrote:
> The problem is that it returns the beginning of the text section of the
> website. The correct behaviour would be to return the passage in which
> the word <phenomena> is found.
> Sorry for my English...
>
>
> Jefferson
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Problem-in-search-tp3104461p3107810.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*

