Re: Changing Host Name for Solr Index

Sebastian Nagel Fri, 06 Sep 2013 13:30:46 -0700

Hi Iain,

I assume that the indexer is called with -normalize:


% bin/nutch solrindex ... -normalize

and that the Solr index was emptied beforehand. Adding or
changing normalization rules does not cause "old" documents
to be updated or deleted, because their id field still holds
the unnormalized URLs.
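
To illustrate why the old documents stay around, here is a minimal
Python sketch. It is not Nutch or Solr code: the dict only stands in
for an index keyed by the id field, and normalize() and index_docs()
are made-up names for illustration. The regex is the rule from your
mail.

```python
import re

# Same rule as in regex-normalize-indexer.xml (pattern -> substitution).
PATTERN = re.compile(r"m\.example\.com")
SUBSTITUTION = "www.example.com"


def normalize(url):
    """Apply the host rewrite to a single URL."""
    return PATTERN.sub(SUBSTITUTION, url)


def index_docs(index, urls, use_normalizer):
    """Add one document per URL to a toy index keyed by the id field."""
    for url in urls:
        doc_id = normalize(url) if use_normalizer else url
        # Same id overwrites; a different id creates a second document.
        index[doc_id] = {"id": doc_id, "url": doc_id}
    return index


crawled = ["http://m.example.com/page1.html"]

# First run without normalization, second run with it, index not emptied.
index = index_docs({}, crawled, use_normalizer=False)
index = index_docs(index, crawled, use_normalizer=True)
print(sorted(index))  # both the m. and the www. document are present

# Emptying the index first leaves only the normalized document.
clean = index_docs({}, crawled, use_normalizer=True)
print(sorted(clean))
```

In the first case a search can still return the document with the
m.example.com id, which would look as if normalization had no effect.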

Does this happen for some or for all documents with URLs to
be normalized?

Can you provide more information (Nutch version,
more logs)?

Thanks,
Sebastian

On 09/06/2013 03:34 AM, Iain Lopata wrote:
> I am attempting to crawl the mobile subdomain of a site: m.example.com. 
> 
> Instead of indexing these pages in Solr as m.example.com/page1.html,  I want
> to add them as www.example.com/page1.html   
> 
> I have used the regex-urlnormalizer at the indexing phase and specified:
> 
> <regex>
>   <pattern>m\.example\.com</pattern>
>   <substitution>www\.example\.com</substitution>
> </regex>
> 
> I can see in the hadoop log that the configuration file for the indexer
> scope is being correctly read:
> 
>        2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
>        
> I have also confirmed that the substitution is working with
> URLNormalizerChecker.
> 
> However, when the indexing to Solr completes, both the url and host fields
> contain the m.example.com host name.
> 
> Any ideas on how I can correct this?
> 
> Thanks
>
