Why would you want to index raw HTML documents? That's not really efficient at all.
What do you mean by "it is cleaner"?

On Thursday 17 November 2011 02:18:32 codegigabyte wrote:
> Hey guys.
>
> Over the past few weeks I have learned a lot about Nutch with Solr, and
> there is a lot more to learn.
>
> I am thinking of using Nutch as a pure web crawler, extracting only the
> raw HTML (maybe including headers) and the URL, and passing them to Solr.
>
> I know I can modify the index-basic filter of Nutch. But I am wondering
> if there is an easier and cleaner way to do it, maybe via modification
> of the schema etc., without modifying any Nutch source code.
>
> The reason I want to do it this way is because it is cleaner: I would
> just need to focus on Solr plugin customization rather than trying to
> modify Nutch and Solr at the same time. Indexing would be done at the
> Solr level. Anyone, any ideas?
>
> Thanks in advance. =)

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
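Since the stated goal is to do all the indexing work on the Solr side, one option is to keep the Nutch side untouched and simply give Solr a stored field for the raw page content. A minimal sketch of the relevant schema.xml entries, assuming hypothetical field names `url` and `raw_html` (Nutch's default Solr field mapping uses different names, so the mapping configuration would need to be adjusted to match):

```xml
<!-- The crawled page's URL: indexed so it can be queried, stored for retrieval. -->
<field name="url" type="string" indexed="true" stored="true"/>

<!-- The raw HTML body: stored verbatim but not indexed here; any analysis or
     extraction happens later in custom Solr plugins, not in Nutch. -->
<field name="raw_html" type="string" indexed="false" stored="true"/>
```

Whether this is advisable is a separate question, as noted above: storing full raw HTML for every page inflates the index size considerably.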

