The original content (e.g. the raw HTML) is not sent for indexing; only the
extracted text is. What you are describing would store that extracted text,
which should be sufficient for generating snippets in Solr.
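For reference, the relevant field definition in {SOLR_HOME}/conf/schema.xml
would then look something like the sketch below. This is based on the default
schema shipped with Nutch 1.x; attributes other than stored="true" are
assumptions and should be checked against your own schema:

```xml
<!-- Extracted page text. indexed="true" makes it searchable;
     stored="true" additionally keeps the full text retrievable
     at query time, which is what snippet generation needs. -->
<field name="content" type="text" indexed="true" stored="true"/>
```

Once re-indexed, a query with highlighting enabled, such as
q=keywords&hl=true&hl.fl=content&fl=url,content, should return both the
stored text and highlighted snippets.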

On 20 July 2011 11:47, Chris Alexander <[email protected]> wrote:

> One way I have seen this working is to edit the schema.xml file
> {SOLR_HOME}/conf/schema.xml. Modify the field with name "content" to have
> its "stored" parameter set to "true". Something like this:
>
> <field name="content" type="text" stored="true" .....
>
> You will need to re-index pages (either by emptying Solr and deleting the
> crawl directory for Nutch, or by re-crawling a page once it has timed out)
> for this to take effect; new pages will have their content stored
> automatically.
>
> Hope this helps
>
> Chris
>
> On 20 July 2011 04:41, Kelvin <[email protected]> wrote:
>
> > Dear all,
> >
> > I have used both Nutch 1.2 and 1.3. Both work fine for crawling and
> > indexing. When I search using some keywords, the results are returned,
> > showing snippets of the HTML pages that contain the keywords. Is there
> > a way to retrieve or access the full original HTML pages that contain
> > the keywords?
> >
> > Thank you for your help.
> >
>
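To tie the two replies together: after flipping stored to true, the re-index
is the usual solrindex step in Nutch 1.x. A rough sketch (the Solr URL and
crawl/ paths are assumptions, and the exact argument order differs between
1.2 and 1.3, so check the usage string printed by bin/nutch solrindex):

```shell
# Delete the old documents from Solr (they were indexed without stored
# content), then push the crawl segments again so "content" is stored.
curl "http://localhost:8983/solr/update?commit=true" \
     -H "Content-Type: text/xml" \
     --data-binary "<delete><query>*:*</query></delete>"
bin/nutch solrindex http://localhost:8983/solr \
     crawl/crawldb crawl/linkdb crawl/segments/*
```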



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
