Sebastian,

I have been calling solrindex with -normalize (and have also modified the crawl script to use -normalize at the index stage). I have also been deleting the documents from Solr before reindexing.
The problem is occurring with all documents to be normalized, although there are only four so far in my testing.

I am using Nutch 1.6 on Ubuntu and Solr 1.4.1. I am using the url as the id field in Solr, if that makes a difference.

The following is a slightly longer log extract:

2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, instantiating a new object cache
2013-09-06 21:30:01,286 INFO indexer.IndexingFilters - Adding com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
2013-09-06 21:30:01,337 INFO indexer.IndexingFilters - Adding com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
2013-09-06 21:30:01,338 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.staticfield.StaticFieldIndexer
2013-09-06 21:30:01,394 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: subcollection dest: company
2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: name dest: name
2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: address dest: address
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: email dest: email
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: education dest: education
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: practice dest: practice
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: vcflink dest: vcflink
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: imageurl dest: imageurl
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: role dest: role
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: profile dest: profile
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: title dest: title
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: host dest: host
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: segment dest: segment
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: boost dest: boost
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: digest dest: digest
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url dest: id
2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url dest: url
2013-09-06 21:30:01,845 INFO collection.CollectionManager - Instantiating CollectionManager
2013-09-06 21:30:01,845 INFO collection.CollectionManager - initializing CollectionManager
2013-09-06 21:30:01,884 INFO collection.CollectionManager - file has 16 elements
2013-09-06 21:30:02,065 INFO solr.SolrWriter - Indexing 4 documents
2013-09-06 21:30:22,411 INFO solr.SolrIndexer - SolrIndexer: finished at 2013-09-06 21:30:22, elapsed: 00:01:09

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Friday, September 06, 2013 3:03 PM
To: [email protected]
Subject: Re: Changing Host Name for Solr Index

Hi Iain,

I assume that the indexer is called with -normalize

% bin/nutch solrindex ... -normalize

and that the Solr index is emptied beforehand, because adding or changing normalization rules will not cause "old" documents to be updated or deleted: their id field is still filled with unnormalized URLs.

Does this happen for some or for all documents with URLs to be normalized?
Can you provide more information (Nutch version, more logs)?

Thanks,
Sebastian

On 09/06/2013 03:34 AM, Iain Lopata wrote:
> I am attempting to crawl the mobile subdomain of a site: m.example.com.
>
> Instead of indexing these pages in Solr as m.example.com/page1.html, I want
> to add them as www.example.com/page1.html
>
> I have used the regex-urlnormalizer at the indexing phase and specified:
>
> <regex>
>   <pattern>m\.example\.com</pattern>
>   <substitution>www\.example\.com</substitution>
> </regex>
>
> I can see in the hadoop log that the configuration file for the
> indexer scope is being correctly read:
>
> 2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer -
> resource for scope 'indexer': regex-normalize-indexer.xml
>
> I have also confirmed that the substitution is working with
> URLNormalizerChecker.
>
> However, when the indexing to Solr completes, both the url and host
> fields contain the m.example.com host name.
>
> Any ideas on how I can correct this?
>
> Thanks
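
For anyone following the thread, the rewrite that the quoted rule is meant to produce can be sketched outside Nutch. This is only an illustration in Python, not Nutch code: the helper names normalize_for_index and host_of are hypothetical, and the substitution is written without backslashes because Python's re.sub would keep "\." literally in the output (the assumption, consistent with the URLNormalizerChecker result reported above, is that Nutch's Java regex handling treats it as a plain dot).

```python
import re
from urllib.parse import urlsplit

# Pattern copied from the quoted <pattern> element.
PATTERN = re.compile(r"m\.example\.com")
# Substitution from the quoted rule, written without backslashes for Python.
SUBSTITUTION = "www.example.com"


def normalize_for_index(url: str) -> str:
    """Hypothetical helper: apply the single rewrite rule to one URL."""
    return PATTERN.sub(SUBSTITUTION, url)


def host_of(url: str) -> str:
    """Host part of a URL, for comparison with the value expected in the host field."""
    return urlsplit(url).hostname or ""


if __name__ == "__main__":
    original = "http://m.example.com/page1.html"
    rewritten = normalize_for_index(original)
    print(rewritten)           # http://www.example.com/page1.html
    print(host_of(rewritten))  # www.example.com
```

These are the values one would expect to see in the url (id) and host fields if the rule were applied at index time. Note that the pattern is unanchored, so in this sketch a host such as forum.example.com would also be rewritten (to foruwww.example.com); anchoring the pattern to the start of the host may be worth considering if other subdomains are crawled.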

