Hi,

I want to index content from (selected) web sites into Solr. For that, I want to extract data from each document's DOM and put this information into corresponding fields of the index. In other words: I want to use Nutch as a crawler and content extractor only.

I read that I would have to write a custom HtmlParseFilter for this (http://wiki.apache.org/nutch/WritingPluginExample-0.9). The filter would add the extracted information to the parse object, which can be accessed later when indexing.
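Based on that plugin example, I picture the extraction part roughly like this (an untested sketch against the 0.9 plugin API; the class name, the <h1> extraction, and the "my.headline" key are just made-up examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MyExtractingFilter implements HtmlParseFilter {

  private Configuration conf;

  // Called for every parsed HTML page; 'doc' is the page's DOM.
  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    String headline = findHeadline(doc);
    if (headline != null) {
      // Stash the extracted value in the parse metadata so an
      // indexing filter can pick it up later.
      parse.getData().getParseMeta().set("my.headline", headline);
    }
    return parse;
  }

  // Depth-first search for the first <h1> element; return its text.
  private String findHeadline(Node node) {
    if ("h1".equalsIgnoreCase(node.getNodeName())) {
      StringBuffer text = new StringBuffer();
      collectText(node, text);
      return text.toString().trim();
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findHeadline(children.item(i));
      if (found != null) return found;
    }
    return null;
  }

  // Concatenate all text nodes below 'node' into 'buffer'.
  private void collectText(Node node, StringBuffer buffer) {
    if (node.getNodeType() == Node.TEXT_NODE)
      buffer.append(node.getNodeValue());
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++)
      collectText(children.item(i), buffer);
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}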
So far, so good. But how do I post my data to Solr?

The article mentioned above suggests writing a custom "Indexer Extension". This extension would index the new custom information in addition to the standard values.
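As far as I understand the article, such an extension would look roughly like this (again an untested sketch; the class and field names are made up to match the parse filter above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // Copy the value the parse filter stored earlier into the index document.
    String headline = parse.getData().getParseMeta().get("my.headline");
    if (headline != null) {
      doc.add(new Field("headline", headline,
                        Field.Store.YES, Field.Index.TOKENIZED));
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}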
But note that I do _not_ want to index the standard outcome of Nutch (host, content, ...). The SolrIndexer that comes with Nutch therefore seems impractical for my needs: I can't see how to convince it not to send the standard field values to the Solr server. Besides that, I would not actually need any information from the linkdb, yet the need to provide a linkdb to SolrIndexer is hard-coded.

Do I have to write a new indexer from scratch? Is a custom HtmlParseFilter the right choice for my needs? Is there anything else I am not aware of? Any hints on how to get ahead are appreciated.

Thanks
Guido

By the way: where does nutch/conf/schema.xml come into play? I assume that it is just a template to replace solr/conf/schema.xml.
