RE: Changing Host Name for Solr Index

Iain Lopata Fri, 06 Sep 2013 19:43:35 -0700

Sebastian,

I have been calling solrindex with -normalize (and have also modified the 
crawl script to pass -normalize at the index stage).  I have also been 
deleting the documents from Solr before reindexing.


The problem is occurring with all documents to be normalized -- although there 
are only four so far in my testing.

I am using Nutch 1.6 on Ubuntu and Solr 1.4.1.

I am using the url as the id field in Solr, if that makes a difference.

The following is a slightly longer log extract:

2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for scope 
'indexer': regex-normalize-indexer.xml
2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found for 
conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, 
mapred-site.xml, 
file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, 
instantiating a new object cache
2013-09-06 21:30:01,286 INFO  indexer.IndexingFilters - Adding 
com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
2013-09-06 21:30:01,337 INFO  indexer.IndexingFilters - Adding 
com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
2013-09-06 21:30:01,338 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.staticfield.StaticFieldIndexer
2013-09-06 21:30:01,394 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: subcollection 
dest: company
2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: name dest: name
2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: address dest: 
address
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: email dest: email
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: education dest: 
education
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: practice dest: 
practice
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: vcflink dest: 
vcflink
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: imageurl dest: 
imageurl
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: role dest: role
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: profile dest: 
profile
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: title dest: title
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: host dest: host
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: segment dest: 
segment
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: boost dest: boost
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: digest dest: 
digest
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: tstamp dest: 
tstamp
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url dest: id
2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url dest: url
2013-09-06 21:30:01,845 INFO  collection.CollectionManager - Instantiating 
CollectionManager
2013-09-06 21:30:01,845 INFO  collection.CollectionManager - initializing 
CollectionManager
2013-09-06 21:30:01,884 INFO  collection.CollectionManager - file has 16 
elements
2013-09-06 21:30:02,065 INFO  solr.SolrWriter - Indexing 4 documents
2013-09-06 21:30:22,411 INFO  solr.SolrIndexer - SolrIndexer: finished at 
2013-09-06 21:30:22, elapsed: 00:01:09

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Friday, September 06, 2013 3:03 PM
To: [email protected]
Subject: Re: Changing Host Name for Solr Index

Hi Iain,

I assume that the indexer is called with -normalize

% bin/nutch solrindex ... -normalize

and that the Solr index is emptied beforehand, because adding or changing 
normalization rules does not cause "old" documents to be updated or deleted: 
their id field is still filled with unnormalized URLs.
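
To illustrate the point about "old" documents, here is a minimal sketch that 
uses a plain Java map as a stand-in for an index keyed by a unique id. It is 
not the Solr API; the class and method names are hypothetical, and the 
http:// scheme on the example URLs is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

final class StaleIdSketch {
    /** Adds a document the way a unique-key index does: the same id
     *  overwrites the existing entry, a different id creates a new one. */
    static void add(Map<String, String> index, String id, String host) {
        index.put(id, host);
    }

    /** Simulates one indexing run without normalization followed by one
     *  run with it, optionally emptying the index in between. */
    static Map<String, String> indexTwice(boolean emptyBeforeSecondRun) {
        Map<String, String> index = new TreeMap<>();
        // First run: id holds the unnormalized URL.
        add(index, "http://m.example.com/page1.html", "m.example.com");
        if (emptyBeforeSecondRun) {
            index.clear();
        }
        // Second run: id holds the normalized URL, so it is a different key.
        add(index, "http://www.example.com/page1.html", "www.example.com");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(indexTwice(false).keySet());
        System.out.println(indexTwice(true).keySet());
    }
}
```

Without emptying, the map ends up with both the m.example.com and the 
www.example.com entry; with emptying, only the www.example.com entry remains.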

Does this happen for some or for all documents with URLs to be normalized?

Can you provide more information (Nutch version, more logs)?

Thanks,
Sebastian

On 09/06/2013 03:34 AM, Iain Lopata wrote:
> I am attempting to crawl the mobile subdomain of a site: m.example.com. 
> 
> Instead of indexing these pages in Solr as m.example.com/page1.html,  I want
> to add them as www.example.com/page1.html   
> 
> I have used the regex-urlnormalizer at the indexing phase and specified:
> 
> <regex>
>   <pattern>m\.example\.com</pattern>
>   <substitution>www\.example\.com</substitution>
> </regex>
> 
> I can see in the hadoop log that the configuration file for the 
> indexer scope is being correctly read:
> 
>        2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - 
> resource for scope 'indexer': regex-normalize-indexer.xml
>        
> I have also confirmed that the substitution is working with 
> URLNormalizerChecker.
> 
> However, when the indexing to Solr completes, both the url and host 
> fields contain the m.example.com host name.
> 
> Any ideas on how I can correct this?
> 
> Thanks
>
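
The rule quoted above can also be tried outside Nutch. The following is a 
rough sketch, assuming the rule is applied with java.util.regex replaceAll 
semantics; the class and method names are hypothetical, and the http:// 
scheme on the test URLs is an assumption. It only exercises the pattern and 
substitution, not the indexing job itself.

```java
import java.util.regex.Pattern;

final class HostRewriteSketch {
    // Pattern and substitution copied from the quoted <regex> rule.
    static final Pattern PATTERN = Pattern.compile("m\\.example\\.com");
    // In a Java replacement string a backslash escapes the next character,
    // so "\." yields a literal dot.
    static final String SUBSTITUTION = "www\\.example\\.com";

    /** Applies the rule to every match in the URL. */
    static String normalize(String url) {
        return PATTERN.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://m.example.com/page1.html"));
        System.out.println(normalize("http://forum.example.com/page1.html"));
    }
}
```

The first URL comes out as http://www.example.com/page1.html. The second 
shows a side effect of the unanchored pattern: any host that merely ends in 
m.example.com is rewritten too, so forum.example.com becomes 
foruwww.example.com. If other subdomains are crawled, anchoring the pattern 
to the start of the host may be worth considering.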
