Hi Iain,

I set up a similar project to check normalizing in the indexer. In general, it works (on current trunk; I didn't check 1.6), but there is a problem if a document is the target of a redirect (there was only one such document in my test). In this case there is a second "representation" URL (e.g., if the source of a temporary redirect is simpler than the target, it is taken as the more representative URL). NUTCH-1636 has been opened to address this bug.
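
A rough sketch of how such a representation URL could be looked up for a single page: the snippet pipes the CrawlDb record for one URL through a small helper that only checks for the _repr_ key. The helper name has_repr, the crawl/crawldb path and the example URL are assumptions for illustration, not taken from the setup discussed here, and the exact layout of the readdb output may differ between versions.

```shell
# Hypothetical helper: reads the text printed by "bin/nutch readdb" on stdin
# and reports whether a _repr_ metadata key appears anywhere in it.
has_repr() {
  if grep -q '_repr_'; then
    echo "repr URL set"
  else
    echo "no repr URL"
  fi
}

# Usage (the CrawlDb path and the URL are placeholders):
# bin/nutch readdb crawl/crawldb -url http://m.example.com/page1.html | has_repr
```

If the helper reports a repr URL for the four pages, the redirect case described above may be what is happening.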

Are all 4 URLs you tried targets of redirects, or, put differently, is the repr URL set in CrawlDb? (You can use "bin/nutch readdb" and look for _repr_.)

Sebastian

On 09/07/2013 04:42 AM, Iain Lopata wrote:
> Sebastian,
>
> I have been calling Solrindex with -normalize (and have modified the crawl
> script to use -normalize at the index stage also). I have also been deleting
> the documents from Solr before reindexing.
>
> The problem is occurring with all documents to be normalized -- although
> there are only four so far in my testing.
>
> I am using Nutch 1.6 on Ubuntu and Solr 1.4.1
>
> I am using the url as the id field in solr if that makes a difference.
>
> The following is a slightly longer log extract
>
> 2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
> 2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, instantiating a new object cache
> 2013-09-06 21:30:01,286 INFO indexer.IndexingFilters - Adding com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
> 2013-09-06 21:30:01,337 INFO indexer.IndexingFilters - Adding com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
> 2013-09-06 21:30:01,338 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.staticfield.StaticFieldIndexer
> 2013-09-06 21:30:01,394 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: subcollection dest: company
> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: name dest: name
> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: address dest: address
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: email dest: email
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: education dest: education
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: practice dest: practice
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: vcflink dest: vcflink
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: imageurl dest: imageurl
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: role dest: role
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: profile dest: profile
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: title dest: title
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: host dest: host
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: segment dest: segment
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: boost dest: boost
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: digest dest: digest
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url dest: id
> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url dest: url
> 2013-09-06 21:30:01,845 INFO collection.CollectionManager - Instantiating CollectionManager
> 2013-09-06 21:30:01,845 INFO collection.CollectionManager - initializing CollectionManager
> 2013-09-06 21:30:01,884 INFO collection.CollectionManager - file has 16 elements
> 2013-09-06 21:30:02,065 INFO solr.SolrWriter - Indexing 4 documents
> 2013-09-06 21:30:22,411 INFO solr.SolrIndexer - SolrIndexer: finished at 2013-09-06 21:30:22, elapsed: 00:01:09
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Friday, September 06, 2013 3:03 PM
> To: [email protected]
> Subject: Re: Changing Host Name for Solr Index
>
> Hi Iain,
>
> I assume that indexer is called with -normalize
>
> % bin/nutch solrindex ... -normalize
>
> and that Solr index is emptied before because adding or changing
> normalization rules will not cause that "old" documents are updated/deleted
> because their id field is still filled with unnormalized URLs.
>
> Does this happen for some or for all documents with URLs to be normalized?
>
> Can you provide more information (Nutch version, more logs)?
>
> Thanks,
> Sebastian
>
> On 09/06/2013 03:34 AM, Iain Lopata wrote:
>> I am attempting to crawl the mobile subdomain of a site: m.example.com.
>>
>> Instead of indexing these pages in Solr as m.example.com/page1.html, I want
>> to add them as www.example.com/page1.html
>>
>> I have used the regex-urlnormalizer at the indexing phase and specified:
>>
>> <regex>
>>   <pattern>m\.example\.com</pattern>
>>   <substitution>www\.example\.com</substitution>
>> </regex>
>>
>> I can see in the hadoop log that the configuration file for the
>> indexer scope is being correctly read:
>>
>> 2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - resource for scope 'indexer': regex-normalize-indexer.xml
>>
>> I have also confirmed that the substitution is working with
>> URLNormalizerChecker.
>>
>> However, when the indexing to Solr completes, both the url and host
>> fields contain the m.example.com host name.
>>
>> Any ideas on how I can correct this?
>>
>> Thanks

