I am attempting to crawl the mobile subdomain of a site: m.example.com. 

Instead of indexing these pages in Solr as m.example.com/page1.html, I want
to add them as www.example.com/page1.html.

I have used the regex-urlnormalizer at the indexing phase and specified:

<regex>
  <pattern>m\.example\.com</pattern>
  <substitution>www\.example\.com</substitution>
</regex>
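
For clarity, this is the rewrite I expect the rule to perform, written as a minimal standalone Java sketch. It is not Nutch's code: the class and method names are hypothetical, and it only assumes that the rule behaves like an ordinary java.util.regex replace-all.

```java
import java.util.regex.Pattern;

public class NormalizeSketch {
    // Same pattern as in the rule above; the dots are escaped so they match literally.
    private static final Pattern MOBILE_HOST = Pattern.compile("m\\.example\\.com");

    // Replace every occurrence of the mobile host with the www host.
    // Dots need no escaping in a Java replacement string.
    static String normalize(String url) {
        return MOBILE_HOST.matcher(url).replaceAll("www.example.com");
    }

    public static void main(String[] args) {
        // Prints the rewritten URL for one sample page.
        System.out.println(normalize("http://m.example.com/page1.html"));
    }
}
```

The pattern is unanchored, so in this sketch it would also rewrite a host that merely ends in m.example.com, such as forum.example.com. That does not matter for my crawl, which only covers the mobile subdomain.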

I can see in the hadoop log that the configuration file for the indexer
scope is being correctly read:

       2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
       
I have also confirmed that the substitution works with
URLNormalizerChecker.

However, when the indexing to Solr completes, both the url and host fields
contain the m.example.com host name.

Any ideas on how I can correct this?

Thanks
