Hey guys.

Over the past few weeks I have learned a lot about Nutch with Solr, and I still have a lot more to learn.

I am thinking of using Nutch purely as a web crawler: it would fetch each page's raw HTML (maybe including the HTTP headers) along with its URL, and pass only those to Solr.
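To make the idea concrete, here is a rough sketch of what I imagine reaching Solr for each page. This is not how Nutch does it; it is just plain Python posting JSON to Solr's update handler. The field names `id`, `url` and `raw_html`, the helper names, and the update URL are all my own assumptions, and the Solr schema would need matching fields.

```python
import json
import urllib.request


def build_solr_update(pages):
    """Turn (url, raw_html) pairs into a JSON body for Solr's update handler.

    The field names (id, url, raw_html) are assumptions; they must exist
    in the Solr schema for the update to be accepted.
    """
    docs = [{"id": url, "url": url, "raw_html": html} for url, html in pages]
    return json.dumps(docs).encode("utf-8")


def post_to_solr(body, solr_update_url):
    """POST the JSON body to a Solr update URL and return the HTTP status.

    solr_update_url is hypothetical, e.g. the /update handler of a core
    with commit=true appended.
    """
    req = urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

With something like this, all parsing and field extraction would happen on the Solr side, which is the split I am after.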

I know I can modify Nutch's index-basic filter. But I am wondering whether there is an easier and cleaner way to do this, maybe by modifying the schema or similar, without changing any of the Nutch source code?

The reason I want to do it this way is that it is cleaner: I would only need to focus on customizing Solr plugins rather than modifying Nutch and Solr at the same time. All indexing would be done at the Solr level. Does anyone have any ideas?

Thanks in advance. =)
