Hi Iain, that's good news. Thank you for reporting!
Sebastian

On 09/08/2013 02:10 AM, Iain Lopata wrote:
> Sebastian,
>
> That was exactly the problem -- and the fix you have posted fixed it. Thank
> you so much!
>
> FYI for others -- The patch file doesn’t apply to 1.6, but it is very easy to
> see where the changes need to be made.
>
> Iain
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Saturday, September 07, 2013 5:08 PM
> To: [email protected]
> Subject: Re: Changing Host Name for Solr Index
>
> Hi Iain,
>
> I set up a similar project to check normalizing in indexer.
> In general, it works (for current trunk, I didn't check 1.6) but there is a
> problem if documents are target of a redirect (there has been only one such
> document).
> In this case, there is a second "representation" URL (eg.
> if the source of a temporary redirect is simpler than the target it's taken
> as more representative URL).
> NUTCH-1636 is opened to address this bug.
>
> Are all 4 URLs you tried target of redirects resp. is the repr URL set in
> CrawlDb?
> (you can use "bin/nutch readdb" and look for _repr_)
>
> Sebastian
>
> On 09/07/2013 04:42 AM, Iain Lopata wrote:
>> Sebastian,
>>
>> I have been calling Solrindex with -normalize (and have modified the crawl
>> script to use -normalize at the index stage also). I have also been
>> deleting the documents from Solr before reindexing.
>>
>> The problem is occurring with all documents to be normalized -- although
>> there are only four so far in my testing.
>>
>> I am using Nutch 1.6 on Ubuntu and Solr 1.4.1
>>
>> I am using the url as the id field in solr if that makes a difference.
>>
>> The following is a slightly longer log extract
>>
>> 2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for
>> scope 'indexer': regex-normalize-indexer.xml
>> 2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found
>> for conf=Configuration: core-default.xml, core-site.xml,
>> mapred-default.xml, mapred-site.xml,
>> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml,
>> instantiating a new object cache
>> 2013-09-06 21:30:01,286 INFO indexer.IndexingFilters - Adding
>> com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
>> 2013-09-06 21:30:01,337 INFO indexer.IndexingFilters - Adding
>> com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
>> 2013-09-06 21:30:01,338 INFO indexer.IndexingFilters - Adding
>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding
>> org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
>> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding
>> org.apache.nutch.indexer.staticfield.StaticFieldIndexer
>> 2013-09-06 21:30:01,394 INFO anchor.AnchorIndexingFilter - Anchor
>> deduplication is: off
>> 2013-09-06 21:30:01,394 INFO indexer.IndexingFilters - Adding
>> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source:
>> subcollection dest: company
>> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: name
>> dest: name
>> 2013-09-06 21:30:01,804 INFO solr.SolrMappingReader - source: address
>> dest: address
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: email
>> dest: email
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source:
>> education dest: education
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source:
>> practice dest: practice
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: vcflink
>> dest: vcflink
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source:
>> imageurl dest: imageurl
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: role
>> dest: role
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: profile
>> dest: profile
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: title
>> dest: title
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: host
>> dest: host
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: segment
>> dest: segment
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: boost
>> dest: boost
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: digest
>> dest: digest
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: tstamp
>> dest: tstamp
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url
>> dest: id
>> 2013-09-06 21:30:01,805 INFO solr.SolrMappingReader - source: url
>> dest: url
>> 2013-09-06 21:30:01,845 INFO collection.CollectionManager -
>> Instantiating CollectionManager
>> 2013-09-06 21:30:01,845 INFO collection.CollectionManager -
>> initializing CollectionManager
>> 2013-09-06 21:30:01,884 INFO collection.CollectionManager - file has
>> 16 elements
>> 2013-09-06 21:30:02,065 INFO solr.SolrWriter - Indexing 4 documents
>> 2013-09-06 21:30:22,411 INFO solr.SolrIndexer - SolrIndexer: finished
>> at 2013-09-06 21:30:22, elapsed: 00:01:09
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:[email protected]]
>> Sent: Friday, September 06, 2013 3:03 PM
>> To: [email protected]
>> Subject: Re: Changing Host Name for Solr Index
>>
>> Hi Iain,
>>
>> I assume that indexer is called with -normalize
>>
>> % bin/nutch solrindex ... -normalize
>>
>> and that Solr index is emptied before because adding or changing
>> normalization rules will not cause that "old" documents are updated/deleted
>> because their id field is still filled with unnormalized URLs.
>>
>> Does this happen for some or for all documents with URLs to be normalized?
>>
>> Can you provide more information (Nutch version, more logs)?
>>
>> Thanks,
>> Sebastian
>>
>> On 09/06/2013 03:34 AM, Iain Lopata wrote:
>>> I am attempting to crawl the mobile subdomain of a site: m.example.com.
>>>
>>> Instead of indexing these pages in Solr as m.example.com/page1.html, I want
>>> to add them as www.example.com/page1.html
>>>
>>> I have used the regex-urlnormalizer at the indexing phase and specified:
>>>
>>> <regex>
>>>   <pattern>m\.example\.com</pattern>
>>>   <substitution>www\.example\.com</substitution>
>>> </regex>
>>>
>>> I can see in the hadoop log that the configuration file for the
>>> indexer scope is being correctly read:
>>>
>>> 2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer -
>>> resource for scope 'indexer': regex-normalize-indexer.xml
>>>
>>> I have also confirmed that the substitution is working with
>>> URLNormalizerChecker.
>>>
>>> However, when the indexing to Solr completes, both the url and host
>>> fields contain the m.example.com host name.
>>>
>>> Any ideas on how I can correct this?
>>>
>>> Thanks
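A side note for anyone reusing the rule from the original question: the pattern m\.example\.com is unanchored, so it would also match inside any other host name that ends in "m.example.com". The Python sketch below only illustrates that regex behaviour. It is not Nutch code, and the host forum.example.com, the anchored pattern, and the function names are assumptions invented for the illustration.

```python
import re

# Pattern as given in the original question (unanchored).
UNANCHORED = re.compile(r"m\.example\.com")

# Hypothetical stricter variant: rewrite only when the host is exactly
# m.example.com (scheme prefix required, host followed by ':' or '/' or end).
ANCHORED = re.compile(r"^(https?://)m\.example\.com(?=[:/]|$)")


def normalize_unanchored(url: str) -> str:
    """Apply the substitution from the question anywhere in the URL."""
    return UNANCHORED.sub("www.example.com", url)


def normalize_anchored(url: str) -> str:
    """Apply the substitution only to the exact mobile host."""
    return ANCHORED.sub(r"\g<1>www.example.com", url)


if __name__ == "__main__":
    for url in ("http://m.example.com/page1.html",
                "http://forum.example.com/"):  # made-up second host
        print(url, "|", normalize_unanchored(url), "|", normalize_anchored(url))
```

If the crawl only ever sees m.example.com, the unanchored rule behaves the same as the anchored one. Any change to the <pattern> element should be checked with URLNormalizerChecker, as done in the thread, before relying on it.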

