Re: How to get the original html file that is crawled by Nutch?

Kelvin Wed, 20 Jul 2011 08:24:01 -0700

I have found a solution to my problem. I'm posting it in case others are 
stuck on the same issue. :)


Nutch can store the whole text content of the HTML pages. The steps below are 
for Nutch 1.3.


Step 1: In nutch/runtime/local/conf/nutch-site.xml, add the following property. 
A value of -1 means fetched content is not truncated.

<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

Step 2: In Solr, in /example/solr/conf/schema.xml, set the content field to be 
stored:

<field name="content" type="text" stored="true" indexed="true"/>
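
Once the content field is stored, the full text can be read back from Solr at 
query time. Below is a minimal Python sketch of that, not a definitive client. 
It assumes the example Solr instance is listening on 
http://localhost:8983/solr/select, that the index has the "url" and "content" 
fields, and that pages were re-crawled and re-indexed after the two steps 
above. The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of the example Solr instance; adjust to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(keywords, rows=10, base_url=SOLR_SELECT_URL):
    """Build a Solr select URL asking for the stored url and content fields."""
    params = {"q": keywords, "fl": "url,content", "rows": rows, "wt": "json"}
    return base_url + "?" + urlencode(params)


def extract_documents(response_text):
    """Return (url, content) pairs from a Solr JSON response body."""
    docs = json.loads(response_text)["response"]["docs"]
    return [(doc.get("url"), doc.get("content")) for doc in docs]


def search(keywords, rows=10):
    """Query Solr and return the stored text of each matching page."""
    with urlopen(build_query_url(keywords, rows)) as resp:
        return extract_documents(resp.read().decode("utf-8"))
```

Calling search("some keywords") against a running Solr should then return the 
stored content of each hit rather than only a snippet.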


From: Kelvin <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, 20 July 2011 11:41 AM
Subject: How to get the original html file that is crawled by Nutch?

Dear all,

I have used both Nutch 1.2 and 1.3. Both work fine for crawling and indexing. 
When I search using some keywords, it returns results showing snippets of the 
HTML pages that contain the keywords. Is there a way to retrieve or access the 
full original HTML pages that contain the keywords?

Thank you for your help.
