Re: Crawl Page, Store full HTML content

Markus Jelsma Thu, 11 Aug 2011 02:16:14 -0700

Nutch doesn't put raw HTML in NutchDocument objects.


> You can try using the string type as below:
> <field name="content" type="string" stored="true" indexed="true"/>
> 
> 
> On Wed, Aug 10, 2011 at 6:20 AM, Markus Jelsma <[email protected]> wrote:
> > I'm not sure exactly how to do this, but I think creating a parse filter
> > and an indexing filter will do the trick. First, write a parse filter that
> > reads the byte[] content from the Content object available in the parse
> > filter, and add that raw data to the parse data.
> > 
> > In your indexing filter, simply read that field and add it to the
> > document. See the plugin-writing example on the wiki for a basic
> > introduction to writing plugins.
> > 
> > On Wednesday 10 August 2011 14:12:13 Christopher Gross wrote:
> > > I have Nutch 1.3 running, and have it connected to a Solr 3.3
> > > instance.  Right now the data comes over from Nutch to Solr just fine,
> > > but I'd like it to send the "content" field to Solr as the raw HTML,
> > > so that I can have all the original markup to work with later.
> > > 
> > > I've tried digging around on Google and I can't seem to find anything.
> > > 
> > > Can someone please push me in the right direction?
> > > 
> > > Thanks!
> > > 
> > > -- Christopher Gross
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
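
The two-step hand-off described in the quoted reply (a parse filter stores the raw bytes in the parse data, then an indexing filter copies that value into the document) can be sketched roughly as below. This is only a model of the data flow in plain Java, not Nutch code: the class RawHtmlSketch, the methods parseStage and indexStage, and the metadata key "rawcontent" are all hypothetical, and plain maps stand in for the parse metadata and the NutchDocument. In a real plugin the same logic would presumably live in implementations of Nutch's HtmlParseFilter and IndexingFilter extension points. The sketch also assumes the page is UTF-8, which a real filter would need to check.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class RawHtmlSketch {
    // Hypothetical metadata key shared by both stages.
    static final String RAW_KEY = "rawcontent";

    // Stand-in for the parse filter: takes the fetched bytes (what
    // Content.getContent() would return in Nutch) and stores them,
    // decoded as text, in the parse metadata.
    // Assumption: the bytes are UTF-8 encoded.
    static Map<String, String> parseStage(byte[] fetched, Map<String, String> parseMeta) {
        parseMeta.put(RAW_KEY, new String(fetched, StandardCharsets.UTF_8));
        return parseMeta;
    }

    // Stand-in for the indexing filter: reads the value saved by the
    // parse stage and adds it to the outgoing document under "content".
    // If the parse stage stored nothing, the document is left unchanged.
    static Map<String, String> indexStage(Map<String, String> parseMeta, Map<String, String> doc) {
        String raw = parseMeta.get(RAW_KEY);
        if (raw != null) {
            doc.put("content", raw);
        }
        return doc;
    }

    public static void main(String[] args) {
        byte[] fetched = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Map<String, String> parseMeta = parseStage(fetched, new HashMap<>());
        Map<String, String> doc = indexStage(parseMeta, new HashMap<>());
        // The markup survives intact because nothing strips the tags.
        System.out.println(doc.get("content"));
    }
}
```

Keeping the raw markup under its own metadata key between the two stages means the normal parsed text is untouched until the indexing step decides which value goes into the "content" field.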
