Why would you want to index raw HTML documents? That's not really efficient at 
all.

What do you mean by `it is cleaner`?

On Thursday 17 November 2011 02:18:32 codegigabyte wrote:
> Hey guys.
> 
> Over the past few weeks I have learned a lot about Nutch with Solr, and
> there is a lot more to learn.
> 
> I am thinking of using Nutch as a pure web crawler to extract the raw
> HTML (maybe including headers) and the URL, solely to pass them to Solr.
> 
> I know I can modify Nutch's index-basic filter, but I am wondering
> whether there is an easier and cleaner way to do this, maybe by modifying
> the schema etc. without changing any Nutch source code?
> 
> The reason I want to do it this way is that it is cleaner: I can then
> focus on Solr plugin customization rather than trying to modify Nutch
> and Solr at the same time. Indexing would be done at the Solr level.
> Anyone, any ideas?
> 
> Thanks in advance. =)
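To make the "index at the Solr level" idea concrete, below is a minimal sketch of the kind of JSON document a crawler could hand to Solr's `/update` handler. The field names `url`, `raw_html`, and `headers` are hypothetical; they would have to be defined in your Solr schema, and actually posting the payload (e.g. with an HTTP client) is left out here:

```python
import json

def to_solr_doc(url, raw_html, headers=None):
    """Build one Solr update document from a crawled page.

    Field names `url`, `raw_html`, and `headers` are placeholders --
    whatever names you use must exist in your Solr schema.
    """
    doc = {"id": url, "url": url, "raw_html": raw_html}
    if headers:
        # Store response headers as a JSON string in a single field.
        doc["headers"] = json.dumps(headers)
    return doc

# Solr's JSON update handler accepts a list of documents as the body:
payload = json.dumps([
    to_solr_doc(
        "http://example.com/",
        "<html><body>Hello</body></html>",
        {"Content-Type": "text/html"},
    )
])
```

With a payload like this, all parsing and field extraction would happen in Solr (e.g. via update request processors), which matches the "customize only on the Solr side" goal described above.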

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
