Re: Crawl Page, Store full HTML content

Way Cool Thu, 11 Aug 2011 00:04:48 -0700

You can try declaring the field with the string type in Solr's schema.xml, as below:
<field name="content" type="string" stored="true" indexed="true"/>



On Wed, Aug 10, 2011 at 6:20 AM, Markus Jelsma
<[email protected]> wrote:

> I'm not sure exactly how to do this, but I think creating a parse filter
> and an indexing filter will do the trick. First, write a parse filter that
> reads the byte[] content from the Content object that is available in the
> parse filter, and add that raw data to the parse data.
>
> In your indexing filter, simply read that field and add it to the
> document. See the plugin-writing example on the wiki for a basic
> introduction to writing plugins.
>
> On Wednesday 10 August 2011 14:12:13 Christopher Gross wrote:
> > I have Nutch 1.3 running, and have it connected to a Solr 3.3
> > instance.  Right now the data comes over from Nutch to Solr just fine,
> > but I'd like it to send the "content" field to Solr as the raw HTML,
> > so that I can have all the original markup to work with later.
> >
> > I've tried digging around on Google and I can't seem to find anything.
> >  Can someone please push me in the right direction?
> >
> > Thanks!
> >
> > -- Christopher Gross
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>

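The two-step flow Markus describes above (a parse filter stashes the raw bytes, an indexing filter copies them into the document) can be sketched roughly as below. This is only a plain-Java illustration of the data flow, not the Nutch plugin API: the class RawHtmlSketch, the methods parseFilter and indexingFilter, the key "rawcontent", and the use of Maps in place of Nutch's parse data and document objects are all hypothetical stand-ins. It also assumes the page is UTF-8, which a real plugin would need to detect rather than assume.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustration only: plain Maps stand in for Nutch's parse data and
// index document. None of these names come from the Nutch API.
class RawHtmlSketch {
    // Hypothetical key under which the raw markup travels between the two steps.
    static final String RAW_KEY = "rawcontent";

    // Step 1 (parse filter stand-in): read the fetched bytes and store
    // them, undecorated, in the parse data. Assumes UTF-8 encoding.
    static void parseFilter(byte[] fetchedContent, Map<String, String> parseData) {
        parseData.put(RAW_KEY, new String(fetchedContent, StandardCharsets.UTF_8));
    }

    // Step 2 (indexing filter stand-in): read that field back and add it
    // to the document under the "content" field that is sent to Solr.
    static void indexingFilter(Map<String, String> parseData, Map<String, String> doc) {
        String raw = parseData.get(RAW_KEY);
        if (raw != null) {
            doc.put("content", raw);
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><body><h1>Hi</h1></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Map<String, String> parseData = new HashMap<>();
        Map<String, String> doc = new HashMap<>();

        parseFilter(page, parseData);
        indexingFilter(parseData, doc);

        // The document now carries the original markup, tags included.
        System.out.println(doc.get("content"));
    }
}
```

In a real plugin the two methods would live in separate classes implementing Nutch's parse-filter and indexing-filter extension points; see the plugin-writing example on the wiki mentioned above for the actual interfaces and plugin.xml wiring.
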