Using solrindex-mapping.xml you can map the fields you don't want to an `ignored` field in Solr, or declare the corresponding Solr fields with an `ignored` fieldType.
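For example, a mapping along these lines (a sketch only; the `title`/`content`/`host` entries mirror the stock file, the `ignored_*` destination names are just illustrative):

```xml
<!-- conf/solrindex-mapping.xml: keep the fields you want, and route the
     standard Nutch fields you don't want into ignored_* fields. -->
<mapping>
  <fields>
    <!-- a field you do want indexed -->
    <field source="title" dest="title"/>
    <!-- standard fields you don't want: send them to ignored fields -->
    <field source="content" dest="ignored_content"/>
    <field source="host" dest="ignored_host"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

On the Solr side, the example schema.xml already declares the matching pieces, so documents carrying `ignored_*` fields are accepted but those values are neither indexed nor stored:

```xml
<fieldType name="ignored" stored="false" indexed="false"
           multiValued="true" class="solr.StrField"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```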
On Thursday 18 November 2010 22:08:15 Guido wrote:
> Hi,
>
> I want to index content from (selected) web sites to Solr. I therefore
> want to extract the data from the document's DOM and put this
> information into the corresponding fields of the index. In other words:
> I want to use Nutch as a crawler and content extractor only.
>
> I read that I would have to write a custom HtmlParseFilter
> (http://wiki.apache.org/nutch/WritingPluginExample-0.9).
> This would add the extracted information to the parse object, which can
> be accessed later when indexing.
>
> So far, so good. But how do I post my data to Solr?
>
> The article mentioned above suggests writing a custom "Indexer
> Extension". This extension would index new custom information in
> addition to the standard values. But note that I _don't_ want to index
> the standard output of Nutch (host, content, ...).
>
> The SolrIndexer that comes with Nutch therefore seems to be impractical
> for my needs. I can't see how to convince it not to send the standard
> field values to the Solr server. Besides that, I wouldn't actually need
> any information from the linkdb, but the need to provide a linkdb to
> SolrIndexer is hard-coded.
>
> Do I have to write a new Indexer from scratch?
> Is a custom HtmlParseFilter the right choice for my needs?
> Is there anything else I am not aware of?
>
> Any hints on how to get ahead are appreciated.
>
> Thanks
>
> Guido
>
> By the way:
> Where does nutch/conf/schema.xml come into play? I assume that it is
> just a template to replace solr/conf/schema.xml.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
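As for the HtmlParseFilter route from the wiki example: a minimal sketch against the Nutch 1.x plugin interfaces might look like the following. The exact method set varies a little between releases, and the `PriceExtractorFilter` / `siteprice` names are made up for illustration.

```java
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class PriceExtractorFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Walk the DOM fragment and extract whatever you need;
    // extractPrice() stands in for your own traversal logic.
    String price = extractPrice(doc);
    if (price != null) {
      Parse parse = parseResult.get(content.getUrl());
      // Stash the value in the parse metadata so it survives
      // until indexing time.
      parse.getData().getParseMeta().set("siteprice", price);
    }
    return parseResult;
  }

  private String extractPrice(DocumentFragment doc) {
    return null; // your DOM extraction goes here
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}
```

A matching indexing filter then copies the value from the parse metadata into the NutchDocument at indexing time (again a sketch; some 1.x releases declare an extra method or two on the interface):

```java
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class PriceIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String price = parse.getData().getParseMeta().get("siteprice");
    if (price != null) {
      doc.add("siteprice", price); // ends up as a field sent to Solr
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}
```

Both classes still have to be registered in the plugin's plugin.xml and enabled via plugin.includes, as the wiki example shows; suppressing the standard fields is then handled by the mapping file described above, not by the filters themselves.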

