RE: how to add raw HTML field to Solr

Max Dzyuba Thu, 16 Aug 2012 00:19:14 -0700

Thank you, Markus.

I knew about this option, but wanted to check with more experienced Nutch 
developers whether I was over-complicating things. SegmentReader it is then. My 
idea was that it might be possible to make this work just by changing the Nutch 
and Solr config files. If not, I'll have to use SegmentReader to extract 
the HTML content and put it into Solr after each crawl.
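
A rough sketch of that post-crawl step, in Python rather than Java. It assumes 
the segment was dumped to text with something like 
"bin/nutch readseg -dump <segment> <outdir> -nofetch -nogenerate -noparse 
-noparsedata -noparsetext", so that only the raw content is written, and that 
each record in the dump starts with "Recno:: N" and "URL:: ..." lines with the 
page bytes following a "Content:" line. Check a real dump before relying on 
that layout. The field name raw_html is made up for illustration, and the 
{"set": ...} form is Solr's atomic-update syntax, which needs a Solr version 
and schema that support it. It also assumes the Solr document id is the page URL.

```python
import json
import re

# Assumed layout of a `readseg -dump` text file (verify against a real dump):
# every record begins with "Recno:: N", then "URL:: ...", and the raw page
# follows a line that contains only "Content:".
RECORD_START = re.compile(r"^Recno:: \d+$", re.MULTILINE)
CONTENT_MARKER = "\nContent:\n"


def parse_dump(dump_text):
    """Yield (url, raw_html) pairs from the text of a segment dump."""
    # Splitting on the record header leaves one chunk per record; the first
    # chunk is whatever precedes record 0 and is skipped.
    for chunk in RECORD_START.split(dump_text)[1:]:
        url_match = re.search(r"^URL:: (.+)$", chunk, re.MULTILINE)
        marker = chunk.find(CONTENT_MARKER)
        if url_match is None or marker == -1:
            continue  # record has no URL line or no content section
        html = chunk[marker + len(CONTENT_MARKER):].strip()
        yield url_match.group(1).strip(), html


def build_solr_update(pairs, field="raw_html"):
    """Build a JSON body of atomic updates that set `field` per document.

    `field` is a hypothetical field name; it must exist in the Solr schema.
    """
    docs = [{"id": url, field: {"set": html}} for url, html in pairs]
    return json.dumps(docs)
```

The resulting JSON body would then be sent to the Solr update handler with a 
Content-Type of application/json, followed by a commit. If atomic updates are 
not available, posting plain documents with the same id would replace what 
Nutch indexed instead of adding to it, so the raw HTML would be better kept in 
a separate core or store.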


Thank you for the help!


-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: 15 August 2012 17:38
To: user@nutch.apache.org; Max Dzyuba
Subject: RE: how to add raw HTML field to Solr

The easiest non-Java approach would be to use Nutch's SegmentReader tool to 
extract the HTML from your segments and store it somewhere you can access 
it easily.

 
 
-----Original message-----
> From:Max Dzyuba <max.dzy...@comintelli.com>
> Sent: Wed 15-Aug-2012 17:00
> To: user@nutch.apache.org
> Subject: how to add raw HTML field to Solr
> 
> Hello everyone,
> 
> I have Nutch installed and running just fine. Nutch submits the crawl 
> results to Solr for indexing. I need a separate field in the Solr 
> document that holds the raw HTML. At the moment, the "content" field 
> holds only the parsed text from the page.
> 
> From what I read, it's impossible to do what I need without writing 
> your own plugin. I don't know Java that well. What would be the 
> easiest way to approach this task?
> 
> Thank you in advance,
> 
> Max
