I have found the solution to my problem. I'm posting it in case others are stuck on the same issue. :)
Nutch can store the whole text content of the HTML pages. For Nutch 1.3:

Step 1: In nutch/runtime/local/conf/nutch-site.xml, add the following (a value of -1 removes the content size limit, so pages are not truncated):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

Step 2: In Solr /example/solr/conf/schema.xml, set the content field to stored so that Solr returns it with search results:

<field name="content" type="text" stored="true" indexed="true"/>

From: Kelvin <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, 20 July 2011 11:41 AM
Subject: How to get the original html file that is crawled by Nutch?

Dear all,

I have used both Nutch 1.2 and 1.3. Both work fine for crawling and indexing. When I search using some keywords, it returns the results, showing snippets of the HTML pages that contain the keywords.

Is there a way to retrieve or access the full original HTML pages that contain the keywords?

Thank you for your help.
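
Once both steps from the reply are in place and the pages have been re-crawled and re-indexed, the stored text should come back with each search hit. The following is only a rough sketch of how one might check that from Python. It assumes Solr is reachable at http://localhost:8983/solr/select (adjust to your install) and that the index has a "url" field next to "content"; the function names are made up for illustration. Note that the field holds the extracted page text, as described in the reply, not necessarily the raw HTML markup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default address of a local Solr example instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(keyword, rows=10, base_url=SOLR_SELECT_URL):
    """Build a Solr select URL that asks for the stored content field."""
    params = {
        "q": f"content:{keyword}",  # search within the content field
        "fl": "url,content",        # "url" is an assumed field name
        "rows": rows,
        "wt": "json",               # ask Solr for a JSON response
    }
    return f"{base_url}?{urlencode(params)}"


def fetch_stored_content(keyword, rows=10, base_url=SOLR_SELECT_URL):
    """Return (url, content) pairs for documents matching the keyword."""
    with urlopen(build_query_url(keyword, rows, base_url)) as resp:
        data = json.load(resp)
    return [(doc.get("url"), doc.get("content"))
            for doc in data["response"]["docs"]]
```

For example, fetch_stored_content("nutch") would return the full stored text for each match instead of only a snippet, provided the schema change from Step 2 was applied before indexing.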

