Hi,

I want to index content from (selected) web sites into Solr. To do
that, I want to extract the data from each document's DOM and put this
information into the corresponding fields of the index. In other words:
I want to use Nutch as a crawler and content extractor only.

I read that I would have to write a custom HtmlParseFilter
(http://wiki.apache.org/nutch/WritingPluginExample-0.9). This would add
the extracted information to the Parse object, which can be accessed
later during indexing.
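
To make it concrete, here is roughly what I have in mind (an untested
sketch, assuming the Nutch 1.x plugin API; the field name
"product_title" and the naive DOM lookup are just placeholders for my
real extraction logic):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MyExtractingParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    // Walk the DOM and pull out the piece of content I care about.
    String title = findFirstText(doc, "h1");  // placeholder extraction
    if (title != null) {
      Parse parse = parseResult.get(content.getUrl());
      // Stash the value in the parse metadata so an indexing filter
      // can pick it up later.
      parse.getData().getParseMeta().set("product_title", title);
    }
    return parseResult;
  }

  // Naive helper: text content of the first element with the given tag.
  private String findFirstText(Node node, String tag) {
    if (tag.equalsIgnoreCase(node.getNodeName())) {
      return node.getTextContent().trim();
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String text = findFirstText(children.item(i), tag);
      if (text != null) return text;
    }
    return null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}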

So far, so good. But how do I post my data to Solr?

The article mentioned above suggests writing a custom "Indexer
Extension". This extension would index the new custom information in
addition to the standard values. But note that I _don't_ want to
index the standard Nutch output (host, content, ...).
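
If I understand the article correctly, such an indexing filter would
look roughly like this (again an untested sketch against the Nutch 1.x
API; depending on the version, IndexingFilter may declare further
methods that need stubs, and "product_title" is my placeholder field):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class MyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // Copy the value stashed by the parse filter into the document
    // that is handed to the indexing backend.
    String title = parse.getData().getParseMeta().get("product_title");
    if (title != null) {
      doc.add("product_title", title);
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

But as far as I can tell, this only _adds_ fields; it does not stop the
standard ones from being indexed.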

The SolrIndexer that ships with Nutch therefore seems impractical for
my needs. I can't see how to convince it not to send the standard field
values to the Solr server. Besides that, I don't actually need any
information from the linkdb, but the need to pass a linkdb to
SolrIndexer is hard-coded.
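
For illustration, this is how I understand the invocation (from the
Nutch 1.0 command line; note the mandatory linkdb argument):

  bin/nutch solrindex <solr url> <crawldb> <linkdb> <segment> ...

e.g.

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
      crawl/linkdb crawl/segments/*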

Do I have to write a new Indexer from scratch?
Is a custom HtmlParseFilter the right choice for my needs?
Anything else I am not aware of?

Any hints on how to proceed are appreciated.


Thanks

Guido

By the way:
Where does nutch/conf/schema.xml come into play? I assume it is just a
template to replace solr/conf/schema.xml.
