Re: How to get the original html file that is crawled by Nutch?

Chris Alexander Wed, 20 Jul 2011 03:48:49 -0700

One way I have seen work is to edit Solr's schema file,
{SOLR_HOME}/conf/schema.xml. Change the field named "content" so that
its "stored" attribute is set to "true". Something like this:

<field name="content" type="text" stored="true" .....

You will need to re-index existing pages for this to take effect, either
by emptying Solr and deleting Nutch's crawl directory, or by re-crawling
each page once it is due to be fetched again. Pages indexed after the
change will have their content stored automatically.
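
Once the field is stored, you can ask Solr to return it alongside each
hit. Below is a minimal sketch in Python, not a tested recipe for your
setup. It assumes Solr is reachable at http://localhost:8983/solr/select,
that the schema also has a stored "url" field, and that the JSON response
writer (wt=json) is available; the function names are made up for
illustration. Keep in mind that you get back whatever Nutch put into
"content" at index time.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a single-core Solr on the default port; adjust for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(keywords, rows=10, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks Solr for the stored url and content fields."""
    params = {
        "q": keywords,         # searched against the schema's default field
        "fl": "url,content",   # only return these stored fields
        "rows": rows,
        "wt": "json",          # JSON response writer
    }
    return base_url + "?" + urlencode(params)


def extract_content(response_text):
    """Map each hit's url to its stored content from a Solr JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    # A document indexed before the schema change has no stored content yet.
    return {doc["url"]: doc.get("content", "") for doc in docs}


def fetch_content(keywords, rows=10):
    """Query Solr over HTTP and return {url: content} for the matching pages."""
    with urlopen(build_query_url(keywords, rows)) as response:
        return extract_content(response.read().decode("utf-8"))
```

Calling fetch_content("some keywords") against a running Solr should then
give you a dictionary keyed by page URL. Pages that have not been
re-indexed since the schema change come back with an empty string.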

Hope this helps

Chris

On 20 July 2011 04:41, Kelvin <[email protected]> wrote:

> Dear all,
>
> I have used both Nutch 1.2 and 1.3. Both work fine for crawling and
> indexing. When I search using some keywords, it returns the results,
> showing snippets of the HTML pages that contain the keywords. Is there a
> way to retrieve or access the full original HTML pages that contain the
> keywords?
>
> Thank you for your help.
>
