Sebastian,

That was exactly the problem -- and the fix you posted resolved it. Thank you so much!

FYI for others -- the patch file doesn’t apply to 1.6, but it is very easy to see where the changes need to be made.

Iain

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Saturday, September 07, 2013 5:08 PM
To: [email protected]
Subject: Re: Changing Host Name for Solr Index

Hi Iain,

I set up a similar project to check normalizing in the indexer. In general, it works (for current trunk, I didn't check 1.6), but there is a problem if documents are the target of a redirect (there has been only one such document). In this case, there is a second "representation" URL (e.g. if the source of a temporary redirect is simpler than the target, it's taken as the more representative URL). NUTCH-1636 is opened to address this bug.

Are all 4 URLs you tried targets of redirects, or, put differently, is the repr URL set for them in CrawlDb? (You can use "bin/nutch readdb" and look for _repr_.)

Sebastian

On 09/07/2013 04:42 AM, Iain Lopata wrote:
> Sebastian,
>
> I have been calling Solrindex with -normalize (and have modified the crawl
> script to use -normalize at the index stage also). I have also been deleting
> the documents from Solr before reindexing.
>
> The problem is occurring with all documents to be normalized -- although
> there are only four so far in my testing.
>
> I am using Nutch 1.6 on Ubuntu and Solr 1.4.1
>
> I am using the url as the id field in solr if that makes a difference.
>
> The following is a slightly longer log extract
>
> 2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
> 2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, instantiating a new object cache
> 2013-09-06 21:30:01,286 INFO indexer.IndexingFilters - Adding com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
> 2013-09-06 21:30:01,337 INFO indexer.IndexingFilters - Adding com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
> 2013-09-06 21:30:01,338 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.staticfield.StaticFieldIndexer
> 2013-09-06 21:30:01,394 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: subcollection dest: company
> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: name dest: name
> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: address dest: address
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: email dest: email
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: education dest: education
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: practice dest: practice
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: vcflink dest: vcflink
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: imageurl dest: imageurl
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: role dest: role
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: profile dest: profile
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: title dest: title
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: host dest: host
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: segment dest: segment
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: boost dest: boost
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: digest dest: digest
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url dest: id
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url dest: url
> 2013-09-06 21:30:01,845 INFO collection.CollectionManager - Instantiating CollectionManager
> 2013-09-06 21:30:01,845 INFO collection.CollectionManager - initializing CollectionManager
> 2013-09-06 21:30:01,884 INFO collection.CollectionManager - file has 16 elements
> 2013-09-06 21:30:02,065 INFO solr.SolrWriter - Indexing 4 documents
> 2013-09-06 21:30:22,411 INFO solr.SolrIndexer - SolrIndexer: finished at 2013-09-06 21:30:22, elapsed: 00:01:09
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Friday, September 06, 2013 3:03 PM
> To: [email protected]
> Subject: Re: Changing Host Name for Solr Index
>
> Hi Iain,
>
> I assume that the indexer is called with -normalize
>
> % bin/nutch solrindex ... -normalize
>
> and that the Solr index is emptied beforehand, because adding or changing
> normalization rules will not cause "old" documents to be updated or deleted:
> their id field is still filled with unnormalized URLs.
>
> Does this happen for some or for all documents with URLs to be normalized?
>
> Can you provide more information (Nutch version, more logs)?
>
> Thanks,
> Sebastian
>
> On 09/06/2013 03:34 AM, Iain Lopata wrote:
>> I am attempting to crawl the mobile subdomain of a site: m.example.com.
>>
>> Instead of indexing these pages in Solr as m.example.com/page1.html, I want
>> to add them as www.example.com/page1.html
>>
>> I have used the regex-urlnormalizer at the indexing phase and specified:
>>
>> <regex>
>>   <pattern>m\.example\.com</pattern>
>>   <substitution>www\.example\.com</substitution>
>> </regex>
>>
>> I can see in the hadoop log that the configuration file for the
>> indexer scope is being correctly read:
>>
>> 2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
>>
>> I have also confirmed that the substitution is working with
>> URLNormalizerChecker.
>>
>> However, when the indexing to Solr completes, both the url and host
>> fields contain the m.example.com host name.
>>
>> Any ideas on how I can correct this?
>>
>> Thanks
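
For readers who want to sanity-check a rule like the one quoted above outside of Nutch, here is a minimal sketch in Python. It is not Nutch code: Python's `re` module stands in for the Java regex engine, the function names are made up for illustration, and the substitution is written without backslashes because Python treats replacement escapes differently from Java. The anchored variant is only a suggestion and assumes http/https URLs; it is not something proposed in the thread.

```python
import re

# Stand-in for the <pattern>/<substitution> pair from the thread.
# Assumption: plain Python regex, not the Nutch normalizer itself.
PATTERN = re.compile(r"m\.example\.com")
SUBSTITUTION = "www.example.com"


def normalize(url: str) -> str:
    """Apply the unanchored rule exactly as written in the thread."""
    return PATTERN.sub(SUBSTITUTION, url)


# Hypothetical stricter variant: match only when m.example.com is the
# whole host, directly after the scheme and before a slash, port, or end.
ANCHORED = re.compile(r"^(https?://)m\.example\.com(?=[/:]|$)")


def normalize_anchored(url: str) -> str:
    """Rewrite the host only when it is exactly m.example.com."""
    return ANCHORED.sub(r"\1www.example.com", url)


if __name__ == "__main__":
    for u in ("http://m.example.com/page1.html",
              "http://forum.example.com/page1.html"):
        print(u, "->", normalize(u), "|", normalize_anchored(u))
```

The unanchored pattern also matches inside longer host names that end in "m.example.com", so if the crawl ever includes such hosts the anchored form may be the safer choice.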

