Hi Iain,

I assume that the indexer is called with -normalize

  % bin/nutch solrindex ... -normalize

and that the Solr index was emptied beforehand. Adding or changing
normalization rules does not cause "old" documents to be updated or
deleted, because their id field still holds the unnormalized URLs.

Does this happen for some or for all documents with URLs to be normalized?
Can you provide more information (Nutch version, more logs)?

Thanks,
Sebastian

On 09/06/2013 03:34 AM, Iain Lopata wrote:
> I am attempting to crawl the mobile subdomain of a site: m.example.com.
>
> Instead of indexing these pages in Solr as m.example.com/page1.html, I want
> to add them as www.example.com/page1.html
>
> I have used the regex-urlnormalizer at the indexing phase and specified:
>
> <regex>
>   <pattern>m\.example\.com</pattern>
>   <substitution>www\.example\.com</substitution>
> </regex>
>
> I can see in the hadoop log that the configuration file for the indexer
> scope is being correctly read:
>
> 2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - resource for
> scope 'indexer': regex-normalize-indexer.xml
>
> I have also confirmed that the substitution is working with
> URLNormalizerChecker.
>
> However, when the indexing to Solr completes, both the url and host fields
> contain the m.example.com host name.
>
> Any ideas on how I can correct this?
>
> Thanks
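
To illustrate why documents indexed before a normalization rule was added stay in the index, here is a rough sketch. It is not Nutch or Solr code: the dict is a hypothetical stand-in for an index keyed on the id field, and normalize() and index_documents() are made-up helpers that only mimic the m.example.com to www.example.com rule quoted in this thread.

```python
import re

# Hypothetical stand-in for the indexer-scope rule in regex-normalize-indexer.xml.
PATTERN = re.compile(r"m\.example\.com")
SUBSTITUTION = "www.example.com"


def normalize(url: str) -> str:
    """Apply the single regex rule to a URL."""
    return PATTERN.sub(SUBSTITUTION, url)


def index_documents(index: dict, urls, use_normalizer: bool) -> None:
    """Add one document per URL; the id is the (optionally normalized) URL."""
    for url in urls:
        doc_id = normalize(url) if use_normalizer else url
        # Writing under a new id never touches a document stored under another id.
        index[doc_id] = {"id": doc_id, "url": doc_id}


urls = ["http://m.example.com/page1.html"]

index = {}
index_documents(index, urls, use_normalizer=False)  # earlier run, no normalization
index_documents(index, urls, use_normalizer=True)   # later run, with normalization

# Both ids are now present: the old m.example.com document was not replaced.
stale = sorted(index)
```

In this toy model the unnormalized document only disappears if the index is emptied (or the old id is deleted) before indexing again with normalization switched on.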

