Thank you, Markus. I knew about this option, but wanted to check with more experienced Nutch developers whether I was over-complicating things. SegmentReader it is, then. I had hoped it might be possible to get this working just by changing the Nutch and Solr config files. If not, I'll have to use SegmentReader to extract the HTML content and put it into Solr after each crawl.
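
For reference, the SegmentReader route could be sketched as follows. This is a minimal sketch, not a tested recipe. It assumes the segment was first dumped to text with something like `bin/nutch readseg -dump <segment_dir> <output_dir> -nofetch -nogenerate -noparse -noparsedata -noparsetext`, that the resulting dump file marks each record with `Recno::` and `URL::` lines and puts the raw page after a line reading `Content:`, and that the Solr schema has a stored field for the HTML. The field name `rawcontent` and the function names are hypothetical, and the dump layout should be checked against a real dump from your Nutch version.

```python
import json
import re

# ASSUMPTION: each record in the readseg text dump starts with a line such as
# "Recno:: 0". Verify this against a dump produced by your own Nutch version.
RECORD_START = re.compile(r"^Recno:: \d+[ \t]*$", re.MULTILINE)
URL_LINE = re.compile(r"^URL:: (.+)$", re.MULTILINE)
# ASSUMPTION: the raw page body begins right after a line reading exactly
# "Content:" (single colon), which follows the "Content::" section header.
BODY_START = re.compile(r"^Content:\n", re.MULTILINE)


def parse_dump(text):
    """Turn the text of a readseg dump into a list of Solr-style documents."""
    docs = []
    # Everything before the first "Recno::" line is not a record, so skip it.
    for chunk in RECORD_START.split(text)[1:]:
        url_match = URL_LINE.search(chunk)
        body_match = BODY_START.search(chunk)
        if url_match is None or body_match is None:
            continue  # record has no URL or no content section
        docs.append({
            "id": url_match.group(1).strip(),
            # "rawcontent" is a hypothetical field name; it must exist
            # in the Solr schema as a stored field.
            "rawcontent": chunk[body_match.end():].strip(),
        })
    return docs


def to_solr_json(docs):
    """Serialize the documents as a JSON array for a Solr update request."""
    return json.dumps(docs)
```

The JSON string returned by `to_solr_json` could then be sent to Solr's update handler after each crawl. The exact update URL and the JSON shape it accepts depend on the Solr version.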
Thank you for the help!

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: 15 August 2012 17:38
To: user@nutch.apache.org; Max Dzyuba
Subject: RE: how to add raw HTML field to Solr

The easiest non-Java approach would be to use Nutch's SegmentReader tool to extract the HTML from your segments and store it somewhere you can access it easily.

-----Original message-----
> From: Max Dzyuba <max.dzy...@comintelli.com>
> Sent: Wed 15-Aug-2012 17:00
> To: user@nutch.apache.org
> Subject: how to add raw HTML field to Solr
>
> Hello everyone,
>
> I have Nutch installed and running just fine. Nutch submits the crawl
> results to Solr for indexing. I need a separate field in the Solr
> document that holds the raw HTML. At the moment, the "content" field
> holds only the parsed text from the page.
>
> From what I read, it's impossible to do what I need without writing
> your own plugin. I don't know Java that well. What would be the
> easiest way to approach this task?
>
> Thank you in advance,
> Max