Sebastian,

That was exactly the problem, and the fix you posted resolved it.  Thank 
you so much!

FYI for others: the patch file doesn't apply to Nutch 1.6, but it is very easy 
to see where the changes need to be made.

Iain

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Saturday, September 07, 2013 5:08 PM
To: [email protected]
Subject: Re: Changing Host Name for Solr Index

Hi Iain,

I set up a similar project to check normalization in the indexer.
In general it works (on current trunk; I didn't check 1.6), but there is a 
problem if a document is the target of a redirect (I had only one such 
document).
In that case there is a second "representation" (repr) URL: e.g., if the 
source of a temporary redirect is simpler than the target, the source is taken 
as the more representative URL.
NUTCH-1636 has been opened to address this bug.
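
A toy sketch of how a repr URL could defeat the normalization. It assumes (as a reading of this thread, not as a statement about the actual Nutch code) that the indexer normalizes the document URL but fills the url and host fields from the repr URL whenever one is set. The names `normalize` and `build_fields` and the example URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit


def normalize(url):
    """Hypothetical stand-in for the indexer-scope regex rule."""
    return re.sub(r"m\.example\.com", "www.example.com", url)


def build_fields(url, repr_url=None, normalize_repr=False):
    """Toy indexer step (an assumption, not Nutch code).

    The document URL is always normalized. If a repr URL is set, it is
    used for the url and host fields instead, and it is only normalized
    when normalize_repr is True.
    """
    shown = normalize(url)
    if repr_url is not None:
        shown = normalize(repr_url) if normalize_repr else repr_url
    return {"url": shown, "host": urlsplit(shown).hostname}


# Hypothetical temporary redirect: SOURCE redirects to TARGET, and the
# simpler SOURCE is kept as the repr URL of the TARGET document.
SOURCE = "http://m.example.com/page1"
TARGET = "http://m.example.com/page1.html?id=42"

print(build_fields(TARGET))                               # no repr URL
print(build_fields(TARGET, SOURCE))                       # repr URL bypasses the rule
print(build_fields(TARGET, SOURCE, normalize_repr=True))  # repr URL normalized too
```

In this model only the second call leaves m.example.com in the host field, which matches the symptom reported for redirect targets.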

Are all 4 URLs you tried targets of redirects, or rather, is the repr URL set 
for them in the CrawlDb?
(You can use "bin/nutch readdb" and look for _repr_.)

Sebastian

On 09/07/2013 04:42 AM, Iain Lopata wrote:
> Sebastian,
> 
> I have been calling Solrindex with -normalize (and have modified the crawl 
> script to use -normalize at the index stage also).  I have also been deleting 
> the documents from Solr before reindexing.
> 
> The problem is occurring with all documents to be normalized -- although 
> there are only four so far in my testing.
> 
> I am using Nutch 1.6 on Ubuntu and Solr 1.4.1.
> 
> I am using the url as the id field in Solr, if that makes a difference.
> 
> The following is a slightly longer log extract:
> 
> 2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for 
> scope 'indexer': regex-normalize-indexer.xml
> 2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found 
> for conf=Configuration: core-default.xml, core-site.xml, 
> mapred-default.xml, mapred-site.xml, 
> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, 
> instantiating a new object cache
> 2013-09-06 21:30:01,286 INFO  indexer.IndexingFilters - Adding 
> com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
> 2013-09-06 21:30:01,337 INFO  indexer.IndexingFilters - Adding 
> com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
> 2013-09-06 21:30:01,338 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.staticfield.StaticFieldIndexer
> 2013-09-06 21:30:01,394 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: 
> subcollection dest: company
> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: name 
> dest: name
> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: address 
> dest: address
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: email 
> dest: email
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: 
> education dest: education
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: 
> practice dest: practice
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: vcflink 
> dest: vcflink
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: 
> imageurl dest: imageurl
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: role 
> dest: role
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: profile 
> dest: profile
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: title 
> dest: title
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: host 
> dest: host
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: segment 
> dest: segment
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: boost 
> dest: boost
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: digest 
> dest: digest
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: tstamp 
> dest: tstamp
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url 
> dest: id
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url 
> dest: url
> 2013-09-06 21:30:01,845 INFO  collection.CollectionManager - 
> Instantiating CollectionManager
> 2013-09-06 21:30:01,845 INFO  collection.CollectionManager - 
> initializing CollectionManager
> 2013-09-06 21:30:01,884 INFO  collection.CollectionManager - file has 
> 16 elements
> 2013-09-06 21:30:02,065 INFO  solr.SolrWriter - Indexing 4 documents
> 2013-09-06 21:30:22,411 INFO  solr.SolrIndexer - SolrIndexer: finished 
> at 2013-09-06 21:30:22, elapsed: 00:01:09
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Friday, September 06, 2013 3:03 PM
> To: [email protected]
> Subject: Re: Changing Host Name for Solr Index
> 
> Hi Iain,
> 
> I assume that the indexer is called with -normalize
> 
> % bin/nutch solrindex ... -normalize
> 
> and that the Solr index is emptied beforehand: adding or changing 
> normalization rules does not cause "old" documents to be updated or deleted, 
> because their id field still contains the unnormalized URLs.
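
The point about stale documents can be illustrated with a minimal sketch, assuming an index that behaves like a map keyed on the unique id field. The `index` dict and `add_document` helper are hypothetical stand-ins, not Solr or Nutch APIs.

```python
# Toy stand-in for a Solr core: documents keyed by their unique id field.
index = {}


def add_document(doc):
    """Add a document; an existing entry is replaced only if the id is identical."""
    index[doc["id"]] = doc


# First indexing run, before the normalization rule existed.
add_document({"id": "http://m.example.com/page1.html", "host": "m.example.com"})

# Second run with the new rule: the id differs, so nothing is overwritten
# and the old document stays in the index.
add_document({"id": "http://www.example.com/page1.html", "host": "www.example.com"})

for doc_id in sorted(index):
    print(doc_id)
```

Both ids are listed at the end, which is why the index needs to be emptied (or the old ids deleted) before reindexing with changed rules.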
> 
> Does this happen for some or for all documents with URLs to be normalized?
> 
> Can you provide more information (Nutch version, more logs)?
> 
> Thanks,
> Sebastian
> 
> On 09/06/2013 03:34 AM, Iain Lopata wrote:
>> I am attempting to crawl the mobile subdomain of a site: m.example.com. 
>>
>> Instead of indexing these pages in Solr as m.example.com/page1.html,  I want
>> to add them as www.example.com/page1.html   
>>
>> I have used the regex-urlnormalizer at the indexing phase and specified:
>>
>> <regex>
>>   <pattern>m\.example\.com</pattern>
>>   <substitution>www\.example\.com</substitution>
>> </regex>
>>
>> I can see in the hadoop log that the configuration file for the 
>> indexer scope is being correctly read:
>>
>>        2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - 
>> resource for scope 'indexer': regex-normalize-indexer.xml
>>        
>> I have also confirmed that the substitution is working with 
>> URLNormalizerChecker.
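
As a rough cross-check of the rule itself, here is a minimal Python sketch that applies an equivalent pattern and substitution to a URL. It is not the Nutch normalizer, only an approximation of the regex step; `normalize`, `host_of` and the sample URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in the rule above. Python's replacement string needs
# no backslash before the dots.
PATTERN = re.compile(r"m\.example\.com")
SUBSTITUTION = "www.example.com"


def normalize(url):
    """Replace every match of the pattern in the URL."""
    return PATTERN.sub(SUBSTITUTION, url)


def host_of(url):
    """Return the host part of an absolute URL."""
    return urlsplit(url).hostname or ""


normalized = normalize("http://m.example.com/page1.html")
print(normalized)
print(host_of(normalized))

# The pattern is not anchored, so it also matches inside longer host names.
print(normalize("http://forum.example.com/"))
```

The last call shows a side effect worth keeping in mind: because the pattern is unanchored, a host such as forum.example.com would also be rewritten. Anchoring the pattern to the scheme and the full host name would avoid that, assuming other subdomains can show up in the crawl.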
>>
>> However, when the indexing to Solr completes, both the url and host 
>> fields contain the m.example.com host name.
>>
>> Any ideas on how I can correct this?
>>
>> Thanks
>>
> 
> 

