Nutch doesn't put raw HTML in NutchDocument objects.
> You can try using the string type as below:
>
> <field name="content" type="string" stored="true" indexed="true"/>
>
> On Wed, Aug 10, 2011 at 6:20 AM, Markus Jelsma <[email protected]> wrote:
> > I'm not sure how to do this, but I think creating a parse filter and an
> > indexing filter will do the trick. First you make the parse filter, which
> > reads the byte[] content from the Content object that is available in the
> > parse filter. You then add the raw data in that parse filter to the parse
> > data.
> >
> > In your indexing filter you simply read that field and add it to the
> > document.
> >
> > See the plugin-writing example on the wiki for a basic introduction to
> > writing plugins.
> >
> > On Wednesday 10 August 2011 14:12:13 Christopher Gross wrote:
> > > I have Nutch 1.3 running, and have it connected to a Solr 3.3
> > > instance. Right now the data comes over from Nutch to Solr just fine,
> > > but I'd like it to send the "content" field to Solr as the raw HTML,
> > > so that I can have all the original markup to work with later.
> > >
> > > I've tried digging around on Google and I can't seem to find anything.
> > >
> > > Can someone please push me in the right direction?
> > >
> > > Thanks!
> > >
> > > -- Christopher Gross
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

