Re: Crawl Page, Store full HTML content

Markus Jelsma Wed, 10 Aug 2011 05:20:26 -0700

I'm not sure exactly how to do this, but I think creating a parse filter and
an indexing filter will do the trick. First you write the parse filter, which
reads the byte[] content from the Content object that is available there. You
then add that raw data to the parse data in the same filter.


In your indexing filter you simply read that field and add it to the document.
See the plugin-writing example on the wiki for a basic introduction to writing
plugins.
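
To illustrate the data flow, here is a minimal sketch in plain Java. It does
not use the Nutch API: the two maps only stand in for the parse metadata and
the index document, and the key name "rawcontent", the class name and both
method names are made up for the example. In a real plugin the first method
would live in your parse filter and the second in your indexing filter. It
also assumes the page is UTF-8, which will not hold for every site.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch only: plain maps replace Nutch's parse metadata and
// index document. All names here are hypothetical.
class RawHtmlSketch {

    // Hypothetical key shared by the two steps.
    static final String RAW_KEY = "rawcontent";

    // Step 1 (parse filter): copy the raw page bytes into the parse metadata.
    // Assumption: the bytes are UTF-8 encoded.
    static void storeRawContent(byte[] content, Map<String, String> parseMeta) {
        parseMeta.put(RAW_KEY, new String(content, StandardCharsets.UTF_8));
    }

    // Step 2 (indexing filter): read the value back and add it to the
    // document as a field, skipping pages where step 1 stored nothing.
    static void addRawField(Map<String, String> parseMeta, Map<String, Object> doc) {
        String raw = parseMeta.get(RAW_KEY);
        if (raw != null) {
            doc.put(RAW_KEY, raw);
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><body><h1>Hi</h1></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        Map<String, String> parseMeta = new HashMap<>();
        Map<String, Object> doc = new HashMap<>();

        storeRawContent(page, parseMeta);
        addRawField(parseMeta, doc);

        System.out.println(doc.get(RAW_KEY));
    }
}
```

On the Solr side you would also need a matching field in your schema, stored
so that you can get the markup back later.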

On Wednesday 10 August 2011 14:12:13 Christopher Gross wrote:
> I have Nutch 1.3 running, and have it connected to a Solr 3.3
> instance.  Right now the data comes over from Nutch to Solr just fine,
> but I'd like it to send the "content" field to Solr as the raw HTML,
> so that I can have all the original markup to work with later.
> 
> I've tried digging around on Google and I can't seem to find anything.
>  Can someone please push me in the right direction?
> 
> Thanks!
> 
> -- Christopher Gross

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
